# Import Normalizer

Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical dataset (`out/`) plus review reports (`review/`). See the spec: `../../docs/import-migration/02-normalization-spec.md`.

## Setup

Requires **Python 3.12** (uses `StrEnum`).

```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```

## Run

```bash
.venv/bin/python normalize.py
```

Outputs:

- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate)

## Iteration loop

1. **Run.** Read `review/summary.txt` for the health snapshot.
2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. |
| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |

**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.

## Tests

```bash
.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)
```
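
## Optional: scripting the date-override paste

The manual paste step for `unparsed-dates.csv` in the iteration loop could be scripted along the following lines. This is a minimal sketch, not part of the normalizer: the function name `promote_date_overrides` is hypothetical, and it assumes that `review/unparsed-dates.csv` has a header row containing at least the `raw`, `suggested_iso`, and `suggested_precision` columns, and that an existing `overrides/dates.csv` ends with a newline.

```python
import csv
from pathlib import Path

# Header of overrides/dates.csv as stated in the iteration-loop table.
OVERRIDE_HEADER = ["raw", "iso", "precision"]


def promote_date_overrides(review_csv: Path, overrides_csv: Path) -> int:
    """Append fully reviewed rows to the overrides file; return how many were added.

    Hypothetical helper: column names in the review file are an assumption.
    """
    # Collect raw values that already have an override so re-runs add nothing twice.
    existing: set[str] = set()
    if overrides_csv.exists():
        with overrides_csv.open(newline="", encoding="utf-8") as fh:
            existing = {row["raw"] for row in csv.DictReader(fh)}

    new_rows = []
    with review_csv.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            raw = row.get("raw") or ""
            iso = (row.get("suggested_iso") or "").strip()
            precision = (row.get("suggested_precision") or "").strip()
            # Only promote rows where the reviewer filled in both suggestion fields.
            if raw and iso and precision and raw not in existing:
                new_rows.append([raw, iso, precision])
                existing.add(raw)

    # Write the header only when starting a new (or empty) overrides file.
    write_header = not overrides_csv.exists() or overrides_csv.stat().st_size == 0
    overrides_csv.parent.mkdir(parents=True, exist_ok=True)
    with overrides_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(OVERRIDE_HEADER)
        writer.writerows(new_rows)
    return len(new_rows)
```

Calling `promote_date_overrides(Path("review/unparsed-dates.csv"), Path("overrides/dates.csv"))` from the project directory, then re-running `normalize.py`, would mirror the manual loop. Appending rather than rewriting keeps the diff of the version-controlled overrides file small, and rows with an empty suggestion are left in the review file for a later pass.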