refactor(normalizer): drop file column now PDFs resolve by index

The import corpus is uniform: every PDF is named <index>.pdf, so the file column (the spreadsheet's datei value) is redundant.

Remove file from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS, plus the now-moot index_file_mismatch review flag/CSV/stat and the datei header mapping. date_end and the tree person_id are kept.

Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:54:37 +02:00
parent 929acf6964
commit 09ba7e74e3
7 changed files with 17 additions and 46 deletions
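
For consumers that previously read the file column, the following is a minimal sketch of resolving a PDF from the index alone, assuming the <index>.pdf naming described in the commit message. The helper name pdf_path_for and the pdf_dir argument are hypothetical and are not part of this commit.

```python
from pathlib import Path


def pdf_path_for(index: str, pdf_dir: Path) -> Path:
    """Hypothetical helper: build the PDF path for a document index.

    Assumes every PDF in the corpus is named "<index>.pdf", as the
    commit message states. It does not check that the file exists.
    """
    return pdf_dir / f"{index}.pdf"
```

Because the name is derived rather than stored, there is no second value that can disagree with the index, which is why the index_file_mismatch check no longer has anything to compare.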
--- a/tools/import-normalizer/writers.py
+++ b/tools/import-normalizer/writers.py
@@ -22,7 +22,7 @@ def _csv_safe(value):
    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r", "\n") else s


-DOC_COLUMNS = ["index", "file", "box", "folder", "sender_person_id", "sender_name",
+DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
               "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
               "date_precision", "date_end", "location", "tags", "summary",
               "source_row", "needs_review"]