Remove the 5th unauthorized index key (_norm_tree(first)) from _build_index.

The spec requires exactly 4 keys per person:
1. forward (first last)
2. reversed (last first)
3. maiden name (first maiden) if maiden set
4. lastName only (last)

Update test data to use full names in Bemerkung fields (e.g., 'Clara Cram' instead of 'Clara') since single first names alone are no longer resolvable. All 52 tests pass.
Import Normalizer
Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical
dataset (out/) plus review reports (review/). See the spec:
../../docs/import-migration/02-normalization-spec.md.
Setup
Requires Python 3.12 (uses StrEnum).
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
Run
.venv/bin/python normalize.py
Outputs:
- out/canonical-documents.xlsx, out/canonical-persons.xlsx
- review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)
Iteration loop
- Run. Read review/summary.txt for the health snapshot.
- Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.
| Review file | What to do |
|---|---|
| unparsed-dates.csv | For each raw (sorted by frequency), fill suggested_iso + suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision). |
| unresolved-names.csv | Names whose value is itself problematic, grouped by category: unknown (?/illegible), single_token (first OR last name only), relational (Tante …), collective (Familie …), prose (a description landed in a name column), ambiguous_pair (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to overrides/names.csv (look up valid ids in out/canonical-persons.xlsx). |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
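To make the overrides/dates.csv format concrete, here is a minimal sketch of how such a file could be read into a lookup table. It is not the normalizer's actual loader: `load_date_overrides` is a hypothetical name, and stripping whitespace and skipping rows without an `iso` value are assumptions. Only the header `raw,iso,precision` comes from the table above.

```python
import csv
from pathlib import Path


def load_date_overrides(path: Path) -> dict[str, tuple[str, str]]:
    """Map each raw date string to its (iso, precision) override."""
    overrides: dict[str, tuple[str, str]] = {}
    with path.open(newline="", encoding="utf-8") as handle:
        # DictReader takes the column names from the header row: raw,iso,precision
        for row in csv.DictReader(handle):
            raw = (row.get("raw") or "").strip()  # assumption: surrounding spaces are noise
            iso = (row.get("iso") or "").strip()
            if not raw or not iso:
                continue  # assumption: rows not filled in yet are ignored
            overrides[raw] = (iso, (row.get("precision") or "").strip())
    return overrides
```

A call such as `load_date_overrides(Path("overrides/dates.csv"))` then gives a dictionary keyed by the raw string, so a later row for the same raw value replaces an earlier one.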
unresolved-names.csv is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported; they just become provisional persons in out/canonical-persons.xlsx (the unmatched_name_strings count in summary.txt tracks how many). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES. Add names there if a real two-person cell isn't being flagged.
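The effect of extending the given-name set can be shown with a small sketch. This is an illustration only, not the normalizer's detection code: `is_ambiguous_pair` is a hypothetical function, the rule "exactly two tokens, both known given names" is an assumption, and the name sets below are made-up stand-ins for the register and for `config.EXTRA_GIVEN_NAMES`.

```python
# Made-up stand-ins (assumption); the real sets come from the person register
# and from config.EXTRA_GIVEN_NAMES.
REGISTER_FIRST_NAMES = {"clara", "walter"}
EXTRA_GIVEN_NAMES = {"herbert"}


def is_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """True when a cell is exactly two tokens and both are known given names."""
    tokens = cell.lower().split()
    return len(tokens) == 2 and all(token in given_names for token in tokens)


given_names = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES
```

Under these assumptions `is_ambiguous_pair("Clara Walter", given_names)` is true, while `"Clara Cram"` is read as first plus last name. A cell like `"Clara Eugenie"` only gets flagged once `eugenie` is added to the extra given names.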
Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
Tests
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
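If typing one command per file gets tedious, a small helper along these lines could run the files one at a time. It is a sketch, not part of the repository: `pytest_commands` and `run_each` are hypothetical names, and it assumes that all test files match `tests/test_*.py` and that the script is started with the virtualenv's interpreter.

```python
import subprocess
import sys
from pathlib import Path


def pytest_commands(test_dir: Path) -> list[list[str]]:
    """Build one pytest invocation per test file, in sorted order."""
    return [
        # sys.executable is whichever interpreter runs this script (assumed: the venv's)
        [sys.executable, "-m", "pytest", str(path), "-v"]
        for path in sorted(test_dir.glob("test_*.py"))
    ]


def run_each(test_dir: Path) -> int:
    """Run the test files strictly one after another; return how many failed."""
    failures = 0
    for command in pytest_commands(test_dir):
        # Each file gets its own pytest process, so the suite never runs at once.
        if subprocess.run(command).returncode != 0:
            failures += 1
    return failures
```

Calling `run_each(Path("tests"))` from this directory would then start a separate pytest process for every matching file and report the number of files with failures.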