Marcel ace41ad209 fix(normalizer): remove unauthorized first-name index key from _build_index
Remove the 5th unauthorized index key (_norm_tree(first)) from _build_index.
The spec requires exactly 4 keys per person:
1. forward (first last)
2. reversed (last first)
3. maiden name (first maiden) if maiden set
4. lastName only (last)

Update test data to use full names in Bemerkung fields (e.g., 'Clara Cram'
instead of 'Clara') since single first names alone are no longer resolvable.
All 52 tests pass.
2026-05-25 21:08:49 +02:00

Import Normalizer

Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical dataset (out/) plus review reports (review/). See the spec: ../../docs/import-migration/02-normalization-spec.md.

Setup

Requires Python 3.12. The code uses enum.StrEnum, which has been part of the standard library since Python 3.11.

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt

Run

.venv/bin/python normalize.py

Outputs:

  • out/canonical-documents.xlsx, out/canonical-persons.xlsx
  • review/*.csv (residue to fix), review/summary.txt (grouped run stats, including the unknown-date rate)

Iteration loop

  1. Run. Read review/summary.txt for the health snapshot.
  2. Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| unparsed-dates.csv | For each raw value (sorted by frequency), fill in suggested_iso and suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header: raw,iso,precision). |
| unresolved-names.csv | Names whose value is itself problematic, grouped by category: unknown (?/illegible), single_token (first or last name only), relational (Tante …), collective (Familie …), prose (a description landed in a name column), ambiguous_pair (two given names, likely two people, not auto-split). Review the highest-impact categories first; add decisions to overrides/names.csv (look up valid ids in out/canonical-persons.xlsx). |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
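
The copy step for unparsed-dates.csv can be scripted rather than pasted by hand. The sketch below is not part of the tool: `filled_rows` and `append_overrides` are hypothetical helpers, and it assumes both files are comma-separated UTF-8 with the column names given above, and that an existing overrides file ends with a newline.

```python
import csv
from pathlib import Path

OVERRIDE_HEADER = ["raw", "iso", "precision"]


def filled_rows(review_rows):
    """Yield (raw, iso, precision) for review rows whose suggestion is filled in."""
    for row in review_rows:
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if iso and precision:
            yield row["raw"], iso, precision


def append_overrides(review_csv: Path, overrides_csv: Path) -> int:
    """Append filled suggestions to the overrides file; return how many were added."""
    # Collect raw values that already have an override so re-runs add nothing twice.
    known = set()
    if overrides_csv.exists():
        with overrides_csv.open(newline="", encoding="utf-8") as fh:
            known = {r["raw"] for r in csv.DictReader(fh)}

    with review_csv.open(newline="", encoding="utf-8") as fh:
        new_rows = [r for r in filled_rows(csv.DictReader(fh)) if r[0] not in known]

    # Write the header only when starting a new (or empty) overrides file.
    write_header = not overrides_csv.exists() or overrides_csv.stat().st_size == 0
    with overrides_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(OVERRIDE_HEADER)
        writer.writerows(new_rows)
    return len(new_rows)
```

Calling `append_overrides(Path("review/unparsed-dates.csv"), Path("overrides/dates.csv"))` after filling in suggestions would then leave only the still-unresolved rows for the next run. Rows with an empty suggestion are skipped, so partially reviewed files are safe to process.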

unresolved-names.csv is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported — they just become provisional persons in out/canonical-persons.xlsx (the unmatched_name_strings count in summary.txt tracks how many). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES — add names there if a real two-person cell isn't being flagged.
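
To make the ambiguous_pair rule concrete, here is a minimal sketch of how such a check could work. It is an assumption about the logic, not the tool's actual code: `build_given_names` and `is_ambiguous_pair` are hypothetical names, and the real detection may tokenize or normalize differently.

```python
def build_given_names(register_first_names, extra_given_names) -> set[str]:
    """Combine the register's first names with the configured extras (case-insensitive)."""
    return {n.casefold() for n in register_first_names} | {
        n.casefold() for n in extra_given_names
    }


def is_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """True when the cell is exactly two tokens and both are known given names."""
    tokens = cell.split()
    return len(tokens) == 2 and all(t.casefold() in given_names for t in tokens)
```

In this sketch a cell is only flagged when both tokens are in the set, which illustrates why a given name missing from both the register and config.EXTRA_GIVEN_NAMES lets a real two-person cell slip through as an ordinary first-name/last-name pair.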

Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.

Tests

.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)
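
To cover every test file while still running them one at a time, a small driver can loop over the files. This is a sketch only: it assumes the test files match `tests/test_*.py` (as `test_dates.py` does), and `pytest_commands` and `run_each` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def pytest_commands(tests_dir: Path) -> list[list[str]]:
    """Build one pytest command per test file, in sorted order."""
    return [
        [sys.executable, "-m", "pytest", str(path), "-v"]
        for path in sorted(tests_dir.glob("test_*.py"))
    ]


def run_each(tests_dir: Path) -> int:
    """Run the files one at a time; return the number of files that failed."""
    failures = 0
    for cmd in pytest_commands(tests_dir):
        # Each file gets its own pytest process, so no two files share a session.
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures
```

Run with the virtualenv's interpreter (for example by calling `run_each(Path("tests"))` from a script started with .venv/bin/python) so that `sys.executable` points at the environment where pytest is installed.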