Files

Marcel f10b80a03f feat(normalizer): build_given_names from register + supplement

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 15:51:23 +02:00

Name	Last commit	Date
overrides	docs(normalizer): README + seed overrides	2026-05-25 14:51:20 +02:00
tests	feat(normalizer): build_given_names from register + supplement	2026-05-25 15:51:23 +02:00
.gitignore	feat(normalizer): scaffold tool + config tables	2026-05-25 13:18:52 +02:00
config.py	feat(normalizer): config tables for name classification	2026-05-25 15:43:31 +02:00
dates.py	fix(normalizer): require day-dot in English month-first matcher (structural anti-shadow)	2026-05-25 13:53:05 +02:00
documents.py	feat(normalizer): person resolution context + to_canonical	2026-05-25 14:18:09 +02:00
ingest.py	fix(normalizer): don't coerce boolean cells to 1/0	2026-05-25 14:11:19 +02:00
normalize.py	feat(normalizer): orchestrator + end-to-end integration test	2026-05-25 14:46:13 +02:00
overrides.py	feat(normalizer): overrides loader + xlsx/csv writers	2026-05-25 14:39:28 +02:00
persons.py	feat(normalizer): build_given_names from register + supplement	2026-05-25 15:51:23 +02:00
README.md	docs(normalizer): README + seed overrides	2026-05-25 14:51:20 +02:00
requirements.txt	feat(normalizer): scaffold tool + config tables	2026-05-25 13:18:52 +02:00
writers.py	fix(normalizer): defang leading LF in CSV + assert pinned workbook timestamp	2026-05-25 14:43:45 +02:00

README.md

Import Normalizer

Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical dataset (out/) plus review reports (review/). See the spec: ../../docs/import-migration/02-normalization-spec.md.

Setup

Requires Python 3.12. The code uses enum.StrEnum, which the standard library provides from Python 3.11 onward.

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt

Run

.venv/bin/python normalize.py

Outputs:

out/canonical-documents.xlsx, out/canonical-persons.xlsx
review/*.csv (residue to fix), review/summary.txt (grouped run statistics, including the unknown-date rate)

Iteration loop

Run the normalizer, then read review/summary.txt for the health snapshot.
Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.

Review file	What to do
`unparsed-dates.csv`	For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`).
`unmatched-names.csv`	If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column).
`ambiguous-receivers.csv`	A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people.
`index-file-mismatch.csv`	The `Datei` path disagrees with the index-derived filename. Reconcile when the PDFs arrive.
`duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv`	Inspect; fix in the source spreadsheet if needed.
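
The copy step for `unparsed-dates.csv` can be scripted. The following is a minimal sketch, not the code in overrides.py: the function names are hypothetical, and it assumes the review file carries the columns `raw`, `suggested_iso` and `suggested_precision`, with any further columns ignored. Rows that are not fully filled in are skipped so they stay in the review file for the next pass.

```python
import csv
import io


def promote_date_rows(review_csv_text: str) -> list[dict[str, str]]:
    """Collect fully filled-in review rows as override rows (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(io.StringIO(review_csv_text)):
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if not iso or not precision:
            continue  # still unresolved: leave it for the next review pass
        rows.append({"raw": row["raw"], "iso": iso, "precision": precision})
    return rows


def to_overrides_csv(rows: list[dict[str, str]]) -> str:
    """Render rows with the header the README names: raw,iso,precision."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["raw", "iso", "precision"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The output includes the header line, so when appending to an existing `overrides/dates.csv`, drop the first line of the result.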

Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
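
A quick pre-flight check can catch a mistyped id before the next run. This is a sketch under assumptions: `find_unknown_ids` is a hypothetical helper, the column names `raw` and `person_id` for `overrides/names.csv` are assumed (the README does not state that file's header), and the set of valid ids is expected to have been read from `out/canonical-persons.xlsx` beforehand.

```python
import csv
import io


def find_unknown_ids(
    names_csv_text: str,
    valid_ids: set[str],
    raw_column: str = "raw",          # assumed header name
    id_column: str = "person_id",     # assumed header name
) -> list[tuple[str, str]]:
    """Return (raw, person_id) pairs whose id is not a known person_id."""
    unknown = []
    for row in csv.DictReader(io.StringIO(names_csv_text)):
        person_id = (row.get(id_column) or "").strip()
        if person_id not in valid_ids:
            unknown.append((row.get(raw_column) or "", person_id))
    return unknown
```

An empty result means every override points at an id that exists in the canonical persons sheet.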

Tests

.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)
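
To run every test file while still keeping each in its own pytest process, a small driver can loop over them. This is a sketch with hypothetical helper names; it assumes the test files match `tests/test_*.py` and that the interpreter lives at `.venv/bin/python` as in Setup.

```python
import subprocess
from pathlib import Path


def per_file_commands(test_dir: str, python: str = ".venv/bin/python") -> list[list[str]]:
    """Build one pytest command per test file, in sorted order."""
    return [
        [python, "-m", "pytest", str(path), "-v"]
        for path in sorted(Path(test_dir).glob("test_*.py"))
    ]


def run_each(test_dir: str = "tests") -> list[str]:
    """Run each file in a separate process; return the files that exited non-zero."""
    failed = []
    for cmd in per_file_commands(test_dir):
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd[3])  # index 3 is the test file path
    return failed
```

Calling `run_each()` from the tool's directory returns an empty list when every file passes.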