refactor(normalizer): drop file column now PDFs resolve by index
The import corpus is uniform: every PDF is named <index>.pdf, so the file column (the spreadsheet's datei value) is redundant.

Remove file from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS, plus the now-moot index_file_mismatch review flag/CSV/stat and the datei header mapping. date_end and the tree person_id are kept.

Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -22,7 +22,7 @@ def _csv_safe(value):
     return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r", "\n") else s
 
 
-DOC_COLUMNS = ["index", "file", "box", "folder", "sender_person_id", "sender_name",
+DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
                "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
                "date_precision", "date_end", "location", "tags", "summary",
                "source_row", "needs_review"]
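
For illustration only, here is a minimal sketch of why the file column becomes redundant once every PDF is named <index>.pdf. The helpers pdf_path_for and to_row, and the corpus_dir parameter, are hypothetical names introduced for this example and are not taken from the normalizer's code; only DOC_COLUMNS is copied from the diff.

```python
from pathlib import Path

# Column order after this commit, copied from the hunk in the diff.
DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
               "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
               "date_precision", "date_end", "location", "tags", "summary",
               "source_row", "needs_review"]


def pdf_path_for(index, corpus_dir):
    """Hypothetical helper: derive the PDF path from the document index alone.

    Assumes the naming rule stated in the commit message (<index>.pdf).
    """
    return Path(corpus_dir) / f"{index}.pdf"


def to_row(doc):
    """Hypothetical helper: project a document dict onto DOC_COLUMNS.

    Keys that are no longer columns (such as a leftover "file") are ignored;
    missing columns become empty strings.
    """
    return [doc.get(col, "") for col in DOC_COLUMNS]
```

With a rule like this, a consumer that previously read the file column could call pdf_path_for(doc["index"], corpus_dir) instead, and rows exported through to_row would no longer carry a file value even if the input still has one.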