The four files in tools/import-normalizer/out/ contain real names, addresses, and attribution prose for ~163 living/deceased family members and were committed by mistake. They are now removed from the index (kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The contract between normalizer and importer is the header schema, not the file contents: CanonicalSheetReader fails closed on a missing header, which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Import Normalizer
Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical
dataset (out/) plus review reports (review/). See the spec:
../../docs/import-migration/02-normalization-spec.md.
Setup
Requires Python 3.12 (uses StrEnum, which the standard library has provided since 3.11).
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
Run
.venv/bin/python normalize.py
Outputs:
- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate)
Iteration loop
- Run. Read `review/summary.txt` for the health snapshot.
- Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.
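To see between runs how much residue is left, one option is to count the data rows in each review CSV. The sketch below is not part of the normalizer: the function name `residue_counts` is hypothetical, and it assumes every file in `review/` starts with a single header row.

```python
import csv
from pathlib import Path


def residue_counts(review_dir: str = "review") -> dict[str, int]:
    """Return {file name: number of data rows} for every CSV in review_dir."""
    counts = {}
    for path in sorted(Path(review_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row is the header, so it is not counted.
        counts[path.name] = max(len(rows) - 1, 0)
    return counts
```

Calling `residue_counts()` from the tool directory after each run gives a quick per-file number to compare against the previous iteration; `review/summary.txt` remains the authoritative snapshot.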
| Review file | What to do |
|---|---|
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unresolved-names.csv` | Names whose value is itself problematic, grouped by category: `unknown` (?/illegible), `single_token` (first OR last name only), `relational` (Tante …), `collective` (Familie …), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv` (look up valid ids in `out/canonical-persons.xlsx`). |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |
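The copy step for `unparsed-dates.csv` can be scripted instead of pasted by hand. This is a minimal sketch, not part of the normalizer. It assumes the review file has columns named `raw`, `suggested_iso` and `suggested_precision`, and that `overrides/dates.csv` already exists with its `raw,iso,precision` header. The function name `append_date_overrides` is hypothetical.

```python
import csv
import io
from pathlib import Path


def append_date_overrides(review_csv: str, overrides_csv: str) -> int:
    """Append filled-in rows from the review file to the overrides file.

    Rows without a suggested_iso are skipped, as are raw values already
    present in the overrides file. Returns the number of rows added.
    """
    overrides_path = Path(overrides_csv)
    text = overrides_path.read_text(encoding="utf-8")
    known = {row["raw"] for row in csv.DictReader(io.StringIO(text))}

    added = 0
    with open(review_csv, newline="", encoding="utf-8") as src, \
            overrides_path.open("a", newline="", encoding="utf-8") as dst:
        # Avoid gluing the first new row onto an unterminated last line.
        if text and not text.endswith("\n"):
            dst.write("\n")
        writer = csv.writer(dst)
        for row in csv.DictReader(src):
            raw = row["raw"]
            iso = (row.get("suggested_iso") or "").strip()
            precision = (row.get("suggested_precision") or "").strip()
            if not iso or raw in known:
                continue
            writer.writerow([raw, iso, precision])
            known.add(raw)
            added += 1
    return added
```

Because already-known `raw` values are skipped, running it twice on the same review file adds nothing the second time, which fits the run, fix, re-run loop.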
`unresolved-names.csv` is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported; they just become provisional persons in `out/canonical-persons.xlsx` (the `unmatched_name_strings` count in `summary.txt` tracks how many). The given-name set that drives `ambiguous_pair` detection is the register's first names plus `config.EXTRA_GIVEN_NAMES`; add names there if a real two-person cell isn't being flagged.
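To make the role of the given-name set concrete, here is a rough sketch of how an `ambiguous_pair` check could work, assuming the rule is "exactly two tokens, both known given names". It is an illustration only, not the normalizer's actual code. `REGISTER_FIRST_NAMES`, the sample names, and the helper `is_ambiguous_pair` are hypothetical stand-ins, and `EXTRA_GIVEN_NAMES` is modelled here as a plain set.

```python
# Hypothetical stand-ins: the real sets come from the register and config.py.
REGISTER_FIRST_NAMES = {"Anna", "Karl"}
EXTRA_GIVEN_NAMES = {"Liesel"}


def is_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """True if the cell is exactly two tokens and both are known given names."""
    known = {name.casefold() for name in given_names}
    tokens = cell.split()
    return len(tokens) == 2 and all(t.casefold() in known for t in tokens)


given_names = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES

print(is_ambiguous_pair("Anna Liesel", given_names))   # True: two given names
print(is_ambiguous_pair("Anna Schmidt", given_names))  # False: second token is not a given name
print(is_ambiguous_pair("Anna Greta", given_names))    # False until "Greta" is added to the set
```

The last call shows the failure mode described above: a genuine two-person cell goes unflagged when one of the names is missing from the set.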
Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
Tests
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)