Import Normalizer

Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical dataset (out/) plus review reports (review/). See the spec: ../../docs/import-migration/02-normalization-spec.md.

Setup

Requires Python 3.12 (the code uses enum.StrEnum, which was added to the standard library in Python 3.11).

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt

Run

.venv/bin/python normalize.py

Outputs:

  • out/canonical-documents.xlsx, out/canonical-persons.xlsx
  • review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)

Iteration loop

  1. Run. Read review/summary.txt for the health snapshot.
  2. Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| unparsed-dates.csv | For each raw (sorted by frequency), fill suggested_iso + suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision). |
| unmatched-names.csv | If suggested_id is right, copy raw,suggested_id into overrides/names.csv; else look up the correct id in out/canonical-persons.xlsx (the person_id column). |
| ambiguous-receivers.csv | A space-joined pair we refused to auto-split (e.g. Ella Anita). Decide and add a names override if it is really two people. |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
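
To make the shape of overrides/dates.csv concrete, here is a minimal sketch of how such a file could be read into a lookup table. It is not the code in normalize.py: the function name load_date_overrides, the sample rows, and the precision values (month, year) are assumptions made for illustration. Only the header raw,iso,precision comes from this README.

```python
import csv
import io


def load_date_overrides(fileobj):
    """Map each raw date string to its (iso, precision) override.

    Hypothetical helper: assumes the header is exactly raw,iso,precision.
    """
    reader = csv.DictReader(fileobj)
    expected = ["raw", "iso", "precision"]
    if reader.fieldnames != expected:
        raise ValueError(f"expected header {expected}, got {reader.fieldnames}")

    overrides = {}
    for row in reader:
        raw = (row["raw"] or "").strip()
        if not raw:
            continue  # skip blank lines left over from editing
        iso = (row["iso"] or "").strip()
        precision = (row["precision"] or "").strip()
        overrides[raw] = (iso, precision)
    return overrides


# Illustrative rows only; the real file lives at overrides/dates.csv.
sample = io.StringIO(
    "raw,iso,precision\n"
    "Mai 1923,1923-05,month\n"
    "um 1910,1910,year\n"
)
date_overrides = load_date_overrides(sample)
print(date_overrides["Mai 1923"])
```

Checking the header up front is a deliberate choice in this sketch: a row pasted with the review-file column names (suggested_iso, suggested_precision) would otherwise fail silently.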

Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
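
The rule above can be checked mechanically before a re-run. The sketch below is an assumption-laden illustration, not part of the tool: find_invalid_name_overrides is a hypothetical helper, and the id format (P0001) and sample names are invented. In practice the set of valid ids would be read from the person_id column of out/canonical-persons.xlsx with a spreadsheet reader, which is left out here to keep the example self-contained.

```python
def find_invalid_name_overrides(override_rows, valid_person_ids):
    """Return the (raw, person_id) pairs whose id is not a known person_id."""
    valid = set(valid_person_ids)
    return [(raw, pid) for raw, pid in override_rows if pid not in valid]


# Invented sample data; the id format is an assumption.
valid_ids = {"P0001", "P0002"}
name_overrides = [
    ("Oma Ella", "P0001"),
    ("Tante Anita", "P9999"),  # not present in valid_ids
]

invalid = find_invalid_name_overrides(name_overrides, valid_ids)
for raw, pid in invalid:
    print(f"override for {raw!r} points at unknown person_id {pid!r}")
```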

Tests

.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)