familienarchiv/tools/import-normalizer
Marcel 5efe3b8a7c
feat(normalizer): parse Spanish month names + Month DD-YYYY hyphen form
Add Spanish month names (Mexican-branch letters) to config.MONTHS and let
the month-first matcher accept a hyphen (not just a dot) before the year, so
"Mayo 18-1929"/"Junio 7-904" parse without manual overrides. Also bound
4-digit years to 1700-2100 so gross typos ("23-9003") stay in review instead
of producing a bogus year. Cuts unknown-date rate 9.2% -> 7.9%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 17:00:33 +02:00
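
The month-first matching described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the normalizer's actual code: the MONTHS table here is a tiny hypothetical stand-in for config.MONTHS, and the regex and the function name parse_month_first are assumptions. The sketch covers only 4-digit years; how short forms such as "7-904" are expanded is not shown in the commit message, so the sketch leaves them unparsed.

```python
import re
from datetime import date

# Hypothetical subset for illustration; the real table is config.MONTHS.
MONTHS = {"mayo": 5, "junio": 6, "mai": 5, "juni": 6}

# Month name, day, then a dot OR a hyphen before a 4-digit year.
MONTH_FIRST = re.compile(r"^\s*([A-Za-zÀ-ÿ]+)\s+(\d{1,2})\s*[.-]\s*(\d{4})\s*$")


def parse_month_first(raw):
    """Return an ISO date string, or None so the value stays in review."""
    m = MONTH_FIRST.match(raw)
    if not m:
        return None
    month = MONTHS.get(m.group(1).lower())
    day, year = int(m.group(2)), int(m.group(3))
    # Bound the year so gross typos such as "9003" are not accepted.
    if month is None or not 1700 <= year <= 2100:
        return None
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. day 31 in a 30-day month
        return None
```

Returning None rather than raising keeps the unparseable value available for the review reports instead of aborting the run.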

Import Normalizer

Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical dataset (out/) plus review reports (review/). See the spec: ../../docs/import-migration/02-normalization-spec.md.

Setup

Requires Python 3.12 (the code uses enum.StrEnum, which was added in Python 3.11).

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt

Run

.venv/bin/python normalize.py

Outputs:

  • out/canonical-documents.xlsx, out/canonical-persons.xlsx
  • review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)

Iteration loop

  1. Run. Read review/summary.txt for the health snapshot.
  2. Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.

What to do with each review file:

  • unparsed-dates.csv: For each raw value (sorted by frequency), fill in suggested_iso and suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision).
  • unresolved-names.csv: Names whose value is itself problematic, grouped by category: unknown (?/illegible), single_token (first OR last name only), relational (Tante …), collective (Familie …), prose (a description landed in a name column), ambiguous_pair (two given names → likely two people, not auto-split). Review the highest-impact categories first; add decisions to overrides/names.csv (look up valid ids in out/canonical-persons.xlsx).
  • index-file-mismatch.csv: The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive.
  • duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv: Inspect; fix in the source spreadsheet if needed.
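
The copy step for unparsed-dates.csv can also be scripted. The following is a minimal sketch, assuming the review file has the columns raw, suggested_iso and suggested_precision named above; the function name, the extra count column, and the precision value "day" in the example are hypothetical. It returns rows in the raw,iso,precision order expected by overrides/dates.csv and skips rows that have not been filled in yet.

```python
import csv
import io


def filled_rows_to_overrides(review_csv_text):
    """Turn filled-in review rows into raw,iso,precision lines (no header)."""
    reader = csv.DictReader(io.StringIO(review_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        # Only rows where a human supplied both values are copied over.
        if iso and precision:
            writer.writerow([row["raw"], iso, precision])
    return out.getvalue()


# Example input; the "count" column and the value "day" are made up.
example = (
    "raw,count,suggested_iso,suggested_precision\n"
    "Mayo 18-1929,3,1929-05-18,day\n"
    "illegible scrawl,1,,\n"
)
print(filled_rows_to_overrides(example), end="")
```

The returned text can be appended to overrides/dates.csv below its existing header line before the next run.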

unresolved-names.csv is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported — they just become provisional persons in out/canonical-persons.xlsx (the unmatched_name_strings count in summary.txt tracks how many). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES — add names there if a real two-person cell isn't being flagged.
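
To illustrate why adding a name to config.EXTRA_GIVEN_NAMES changes what gets flagged, here is a rough sketch of a two-given-names check. The actual detection rule is not shown in this README, so the function name, the two-token rule, and all sample names below are assumptions.

```python
# Stand-ins for illustration only: the real sets come from the person
# register and from config.EXTRA_GIVEN_NAMES.
REGISTER_FIRST_NAMES = {"anna", "karl"}
EXTRA_GIVEN_NAMES = {"lupe"}
GIVEN_NAMES = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES


def looks_like_ambiguous_pair(cell, given_names=GIVEN_NAMES):
    """True when a cell holds exactly two tokens and both are known given names."""
    tokens = cell.replace(",", " ").split()
    return len(tokens) == 2 and all(t.lower() in given_names for t in tokens)
```

Under these assumptions, "Anna Lupe" is flagged only because "lupe" is in the extra set; with the register names alone it would pass as a single first name plus surname.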

Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.

Tests

.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)