Import Normalizer
Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical
dataset (out/) plus review reports (review/). See the spec:
../../docs/import-migration/02-normalization-spec.md.
Setup
Requires Python 3.12 (uses StrEnum).
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
Run
.venv/bin/python normalize.py
Outputs:
- out/canonical-documents.xlsx, out/canonical-persons.xlsx
- review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)
Iteration loop
- Run. Read review/summary.txt for the health snapshot.
- Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.
| Review file | What to do |
|---|---|
| unparsed-dates.csv | For each raw (sorted by frequency), fill suggested_iso + suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision). |
| unmatched-names.csv | If suggested_id is right, copy raw,suggested_id into overrides/names.csv; else look up the correct id in out/canonical-persons.xlsx (the person_id column). |
| unresolved-names.csv | Names whose value is itself problematic, grouped by category: unknown (?/illegible), single_token (first OR last name only), relational (Tante …), collective (Familie …), prose (a description landed in a name column), ambiguous_pair (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to overrides/names.csv. |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
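
The unparsed-dates.csv step can also be done with a small helper instead of copy and paste. The following is only a sketch, not part of normalize.py: it assumes the review file has raw, suggested_iso and suggested_precision columns (any other columns, such as the hypothetical count used here, are ignored), and the precision values year and day are made-up examples. Rows that are still unfilled are skipped, and overrides that already exist are never overwritten.

```python
import csv
import io

OVERRIDE_HEADER = ["raw", "iso", "precision"]  # header of overrides/dates.csv


def promote_filled_dates(review_csv, overrides_csv):
    """Return existing override rows plus every reviewed (filled-in) row."""
    # Existing overrides win: keyed by raw so they are never replaced.
    merged = {row["raw"]: row for row in csv.DictReader(overrides_csv)}
    for row in csv.DictReader(review_csv):
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if not iso or not precision:
            continue  # not reviewed yet, leave it in the residue
        merged.setdefault(
            row["raw"], {"raw": row["raw"], "iso": iso, "precision": precision}
        )
    return list(merged.values())


def write_overrides(rows, out):
    """Write rows in the raw,iso,precision layout."""
    writer = csv.DictWriter(out, fieldnames=OVERRIDE_HEADER)
    writer.writeheader()
    writer.writerows(rows)


# Demo with in-memory files; all sample values are invented.
review = io.StringIO(
    "raw,count,suggested_iso,suggested_precision\n"
    "um 1900,4,1900,year\n"
    "Ostern,2,,\n"
)
existing = io.StringIO("raw,iso,precision\n3.5.12,1912-05-03,day\n")
rows = promote_filled_dates(review, existing)
out = io.StringIO()
write_overrides(rows, out)
print(out.getvalue())
```

With real files, the same functions would take open handles for review/unparsed-dates.csv and overrides/dates.csv (opened with newline="" as the csv module recommends).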
unresolved-names.csv is the focused "names that need a human" list, distinct from unmatched-names.csv (which is just non-family correspondents that got provisional persons). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES; add names there if a real two-person cell isn't being flagged.
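
To illustrate why a missing given name stops a cell from being flagged, here is a simplified sketch of the idea. It is an assumption about the rule, not the normalizer's actual code: the name sets are invented, EXTRA_GIVEN_NAMES stands in for config.EXTRA_GIVEN_NAMES, and the hypothetical is_ambiguous_pair flags a cell only when it consists of exactly two tokens that are both known given names.

```python
# Invented sample data; the real sets come from the register and config.
REGISTER_FIRST_NAMES = {"Anna", "Karl", "Maria"}
EXTRA_GIVEN_NAMES = {"Lotte"}  # stand-in for config.EXTRA_GIVEN_NAMES


def is_ambiguous_pair(cell, given_names):
    """True if the cell is exactly two tokens and both are given names."""
    known = {name.casefold() for name in given_names}
    tokens = cell.split()
    return len(tokens) == 2 and all(t.casefold() in known for t in tokens)


given = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES

# "Lotte" is not in the register, so the pair is only caught once the
# extra names are included.
flagged_without_extra = is_ambiguous_pair("Anna Lotte", REGISTER_FIRST_NAMES)
flagged_with_extra = is_ambiguous_pair("Anna Lotte", given)

# A given name followed by a surname is not a pair of given names.
flagged_full_name = is_ambiguous_pair("Anna Müller", given)
```

In this sketch, "Anna Lotte" is flagged only when the combined set is used, which mirrors the advice to extend the extra-names list when a two-person cell slips through.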
Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
Tests
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
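
If running every test file by hand gets tedious, a small wrapper can start one pytest process per file, which keeps to the rule of never running the whole suite at once. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the test files live in tests/ and are named test_*.py.

```python
import subprocess
from pathlib import Path


def per_file_commands(test_dir="tests", python=".venv/bin/python"):
    """Build one pytest command per test file, in sorted order."""
    files = sorted(Path(test_dir).glob("test_*.py"))
    return [[python, "-m", "pytest", f.as_posix(), "-v"] for f in files]


def run_each(test_dir="tests", python=".venv/bin/python"):
    """Run each file in its own process; return the files that failed."""
    failed = []
    for cmd in per_file_commands(test_dir, python):
        # A non-zero return code means that file's tests failed or errored.
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd[3])
    return failed
```

Calling run_each() from the project root would then return the list of test files that did not pass.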