refactor(normalizer): drop file column now PDFs resolve by index
The import corpus is uniform: every PDF is named <index>.pdf, so the file column (the spreadsheet's datei value) is redundant.

Remove file from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS, plus the now-moot index_file_mismatch review flag/CSV/stat and the datei header mapping. date_end and the tree person_id are kept.

Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -18,7 +18,6 @@ OVERRIDES_DIR = BASE_DIR / "overrides"
 # --- Header text (lowercased, whitespace-collapsed) -> canonical field ---
 DOCUMENT_HEADER_MAP = {
     "index": "index",
-    "datei": "file",
     "box": "box",
     "mappe": "folder",
     "briefeschreiberin": "sender",
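For illustration, the effect of this hunk can be sketched as follows. This is a minimal sketch, not the repository's code: the map holds only the entries visible in the hunk, and `normalize_header`, `canonical_field`, and `pdf_path` are hypothetical helper names. It assumes headers are matched after lowercasing and whitespace collapsing, as the comment above the map states, and that a PDF is located purely from its index.

```python
import re
from pathlib import Path
from typing import Optional

# Excerpt of the map after this commit: only the entries visible in the hunk.
# The "datei" -> "file" entry is gone.
DOCUMENT_HEADER_MAP = {
    "index": "index",
    "box": "box",
    "mappe": "folder",
    "briefeschreiberin": "sender",
}


def normalize_header(text: str) -> str:
    """Hypothetical helper: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def canonical_field(header: str) -> Optional[str]:
    """Hypothetical helper: canonical field for a header, or None if unmapped."""
    return DOCUMENT_HEADER_MAP.get(normalize_header(header))


def pdf_path(pdf_dir: Path, index: str) -> Path:
    """Hypothetical helper: every PDF is named <index>.pdf, so the index suffices."""
    return pdf_dir / f"{index}.pdf"
```

Under these assumptions a `Datei` header no longer maps to any canonical field, and the PDF location comes from the index alone, which is why the separate file column carries no extra information.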