Import Normalizer
Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical
dataset (out/) plus review reports (review/). See the spec:
../../docs/import-migration/02-normalization-spec.md.
Setup
Requires Python 3.12 (uses StrEnum).
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
Run
.venv/bin/python normalize.py
Outputs:
- out/canonical-documents.xlsx, out/canonical-persons.xlsx
- review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)
Iteration loop
- Run. Read review/summary.txt for the health snapshot.
- Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.
| Review file | What to do |
|---|---|
| unparsed-dates.csv | For each raw (sorted by frequency), fill suggested_iso + suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision). |
| unresolved-names.csv | Names whose value is itself problematic, grouped by category: unknown (?/illegible), single_token (first OR last name only), relational (Tante …), collective (Familie …), prose (a description landed in a name column), ambiguous_pair (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to overrides/names.csv (look up valid ids in out/canonical-persons.xlsx). |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
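The copy step for unparsed-dates.csv can also be scripted. The following is a minimal sketch, not part of the normalizer: it assumes the review file carries the columns raw, suggested_iso and suggested_precision (any other columns are ignored), and the helper names, the sample rows and the precision value month are hypothetical placeholders. Use the precision vocabulary from the spec.

```python
import csv
import io

OVERRIDE_HEADER = ["raw", "iso", "precision"]  # header of overrides/dates.csv


def review_rows_to_overrides(review_csv_text: str) -> list[dict[str, str]]:
    """Turn filled-in review rows into override rows (raw, iso, precision).

    Rows where suggested_iso or suggested_precision is still empty are
    skipped, so a partially reviewed file can be processed safely.
    """
    reader = csv.DictReader(io.StringIO(review_csv_text))
    overrides = []
    for row in reader:
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if not iso or not precision:
            continue  # not reviewed yet
        overrides.append({"raw": row["raw"], "iso": iso, "precision": precision})
    return overrides


def merge_overrides(existing: list[dict[str, str]],
                    new: list[dict[str, str]]) -> list[dict[str, str]]:
    """Merge override rows keyed by raw; a newer decision replaces an older one."""
    merged = {row["raw"]: row for row in existing}
    for row in new:
        merged[row["raw"]] = row
    return list(merged.values())


def to_csv_text(rows: list[dict[str, str]]) -> str:
    """Serialize override rows with the raw,iso,precision header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=OVERRIDE_HEADER, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical review content; the "count" column is an assumption.
    sample = (
        "raw,count,suggested_iso,suggested_precision\n"
        "Maerz 1923,4,1923-03,month\n"
        "ca. 1910,2,,\n"
    )
    print(to_csv_text(review_rows_to_overrides(sample)), end="")
```

Keying the merge on raw keeps overrides/dates.csv free of duplicate entries when the loop is repeated over several runs.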
unresolved-names.csv is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported; they just become provisional persons in out/canonical-persons.xlsx (the unmatched_name_strings count in summary.txt tracks how many). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES. Add names there if a real two-person cell isn't being flagged.
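To illustrate why extending config.EXTRA_GIVEN_NAMES changes what gets flagged, here is a rough sketch of an ambiguous_pair check. It is an assumption about the shape of the rule, not the code in normalize.py: the function name, the "exactly two tokens, both known given names" condition and all sample names are hypothetical.

```python
# Hypothetical stand-ins: the real sets come from the register and from
# config.EXTRA_GIVEN_NAMES.
REGISTER_FIRST_NAMES = {"Anna", "Karl"}
EXTRA_GIVEN_NAMES = {"Liesel"}


def is_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """Return True when a name cell consists of exactly two known given names.

    Such a cell probably names two people ("Anna Liesel") rather than one
    person with a first and a last name, so it is flagged for review
    instead of being split automatically.
    """
    known = {name.casefold() for name in given_names}
    tokens = cell.split()
    return len(tokens) == 2 and all(t.casefold() in known for t in tokens)


if __name__ == "__main__":
    given = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES
    for cell in ["Anna Liesel", "Anna Mustermann", "Karl"]:
        print(cell, is_ambiguous_pair(cell, given))
```

Under this sketch, "Anna Liesel" is only flagged once "Liesel" is in the given-name set, which is the situation the EXTRA_GIVEN_NAMES hint above addresses.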
Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
Tests
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)