From d314fd9338e0004e263cb9491bbae4c595b5f89a Mon Sep 17 00:00:00 2001
From: Marcel
Date: Mon, 25 May 2026 14:51:20 +0200
Subject: [PATCH] docs(normalizer): README + seed overrides

Co-Authored-By: Claude Opus 4.7
---
 tools/import-normalizer/README.md           | 38 +++++++++++++++++++++
 tools/import-normalizer/overrides/dates.csv |  1 +
 tools/import-normalizer/overrides/names.csv |  1 +
 3 files changed, 40 insertions(+)
 create mode 100644 tools/import-normalizer/README.md
 create mode 100644 tools/import-normalizer/overrides/dates.csv
 create mode 100644 tools/import-normalizer/overrides/names.csv

diff --git a/tools/import-normalizer/README.md b/tools/import-normalizer/README.md
new file mode 100644
index 00000000..98ac5b8d
--- /dev/null
+++ b/tools/import-normalizer/README.md
@@ -0,0 +1,38 @@
+# Import Normalizer
+
+Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical
+dataset (`out/`) plus review reports (`review/`). See the spec:
+`../../docs/import-migration/02-normalization-spec.md`.
+
+## Setup
+Requires **Python 3.12** (uses `StrEnum`).
+```bash
+python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
+```
+
+## Run
+```bash
+.venv/bin/python normalize.py
+```
+Outputs:
+- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
+- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate)
+
+## Iteration loop
+1. **Run.** Read `review/summary.txt` for the health snapshot.
+2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat.
+
+| Review file | What to do |
+| --- | --- |
+| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
+| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
+| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. |
+| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
+| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |
+
+**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.
+
+## Tests
+```bash
+.venv/bin/python -m pytest tests/test_dates.py -v  # run files individually (never the whole suite at once)
+```
diff --git a/tools/import-normalizer/overrides/dates.csv b/tools/import-normalizer/overrides/dates.csv
new file mode 100644
index 00000000..f4ace38f
--- /dev/null
+++ b/tools/import-normalizer/overrides/dates.csv
@@ -0,0 +1 @@
+raw,iso,precision
diff --git a/tools/import-normalizer/overrides/names.csv b/tools/import-normalizer/overrides/names.csv
new file mode 100644
index 00000000..445b0cb1
--- /dev/null
+++ b/tools/import-normalizer/overrides/names.csv
@@ -0,0 +1 @@
+raw,person_id
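
Note on the `unparsed-dates.csv` row of the README's review table: the copy step it describes is manual, but it could plausibly be scripted. The following is a minimal sketch, not part of this patch, assuming the review file really carries the column names `raw`, `suggested_iso` and `suggested_precision` that the README mentions. The function name `promote_date_overrides` and the sample precision value `month` are hypothetical and used only for illustration.

```python
import csv
import io


def promote_date_overrides(review_csv: str, overrides_csv: str) -> str:
    """Return overrides CSV text with the filled-in review rows appended.

    Hypothetical helper: column names are taken from the README, the rest
    is an assumption about how the files are laid out.
    """
    existing = list(csv.DictReader(io.StringIO(overrides_csv)))
    seen = {row["raw"] for row in existing}

    for row in csv.DictReader(io.StringIO(review_csv)):
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        # Skip rows the reviewer has not filled in yet, and raw values
        # that already have an override.
        if not iso or not precision or row["raw"] in seen:
            continue
        existing.append({"raw": row["raw"], "iso": iso, "precision": precision})
        seen.add(row["raw"])

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["raw", "iso", "precision"], lineterminator="\n"
    )
    writer.writeheader()  # emits the `raw,iso,precision` header from dates.csv
    writer.writerows(existing)
    return out.getvalue()
```

Working on strings rather than file paths keeps the sketch easy to test; a caller would read `review/unparsed-dates.csv` and `overrides/dates.csv`, pass their contents in, and write the result back to the overrides file before re-running the normalizer.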