Import Normalizer
Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical
dataset (out/) plus review reports (review/). See the spec:
../../docs/import-migration/02-normalization-spec.md.
Setup
Requires Python 3.12 (uses StrEnum).
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
Run
.venv/bin/python normalize.py
Outputs:
- out/canonical-documents.xlsx, out/canonical-persons.xlsx
- review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)
Iteration loop
- Run. Read review/summary.txt for the health snapshot.
- Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.
| Review file | What to do |
|---|---|
| unparsed-dates.csv | For each raw (sorted by frequency), fill suggested_iso + suggested_precision, then paste raw,suggested_iso,suggested_precision into overrides/dates.csv (header raw,iso,precision). |
| unmatched-names.csv | If suggested_id is right, copy raw,suggested_id into overrides/names.csv; else look up the correct id in out/canonical-persons.xlsx (the person_id column). |
| ambiguous-receivers.csv | A space-joined pair we refused to auto-split (e.g. Ella Anita). Decide and add a names override if it is really two people. |
| index-file-mismatch.csv | The Datei path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| duplicate-index.csv, blank-index-rows.csv, skipped-x-suffix.csv | Inspect; fix in the source spreadsheet if needed. |
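The copy step for unparsed-dates.csv can also be scripted. The following is a minimal sketch, not part of the normalizer: it assumes the review file has the columns raw, suggested_iso and suggested_precision as described in the table, and it maps them to the raw,iso,precision header of overrides/dates.csv. The function names, the sample values, the extra count column and the precision labels (day, year) are invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of review/unparsed-dates.csv after a reviewer filled in
# two rows. Column names follow the README; the values, the "count" column and
# the precision labels are made up for this example.
REVIEW_CSV = """raw,count,suggested_iso,suggested_precision
Weihnachten 1923,4,1923-12-25,day
ca. 1910,3,1910,year
unleserlich,2,,
"""


def promote_date_overrides(review_rows, existing_raws):
    """Turn filled-in review rows into override rows (raw, iso, precision).

    Rows with an empty suggestion, and raws that already have an override,
    are skipped so the overrides file never gets duplicates.
    """
    seen = set(existing_raws)
    promoted = []
    for row in review_rows:
        raw = row["raw"]
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if not iso or not precision or raw in seen:
            continue
        seen.add(raw)
        promoted.append({"raw": raw, "iso": iso, "precision": precision})
    return promoted


def render_overrides(rows):
    """Render override rows as CSV text with the header raw,iso,precision."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["raw", "iso", "precision"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In real use the review rows would come from `csv.DictReader` over review/unparsed-dates.csv, and `existing_raws` from the raw column of the current overrides/dates.csv. Reviewing the rendered text before committing keeps the overrides file a deliberate, version-controlled decision rather than an automatic dump.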
Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
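A quick check before re-running can catch override rows that point at an id that does not exist. This is a sketch under assumptions: the set of valid ids is passed in directly (in practice it would be read from the person_id column of out/canonical-persons.xlsx, for example with a spreadsheet library), overrides/names.csv is assumed to have a header row followed by two columns, and the id format (P001 and so on), the names and the function names are all invented.

```python
import csv
import io

# Hypothetical contents of overrides/names.csv. The header line and the id
# format are assumptions; the README only states which two values to copy.
NAMES_OVERRIDES_CSV = """raw,person_id
Oma Ella,P001
Tante Anni,P999
"""

# Stand-in for the person_id column of out/canonical-persons.xlsx.
VALID_PERSON_IDS = {"P001", "P002", "P003"}


def read_name_overrides(text):
    """Parse override CSV text into (raw, person_id) pairs, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # assumed header row
    return [(row[0], row[1].strip()) for row in reader if len(row) >= 2]


def find_invalid_name_overrides(pairs, valid_ids):
    """Return the (raw, person_id) pairs whose id is not a known person_id."""
    return [(raw, pid) for raw, pid in pairs if pid not in valid_ids]
```

Calling `find_invalid_name_overrides(read_name_overrides(NAMES_OVERRIDES_CSV), VALID_PERSON_IDS)` returns only the rows that need fixing, so an empty list means every override resolves to a real person.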
Tests
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
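To run every test file while still keeping one pytest invocation per file, a small driver script is one option. This is a sketch only: the helper names are hypothetical, and it assumes the test files match tests/test_*.py and that the interpreter lives at .venv/bin/python as in the commands above.

```python
import subprocess
from pathlib import Path


def per_file_commands(test_dir, python=".venv/bin/python"):
    """Build one pytest command per test file, in sorted order."""
    files = sorted(Path(test_dir).glob("test_*.py"))
    return [[python, "-m", "pytest", str(path), "-v"] for path in files]


def run_each(test_dir="tests"):
    """Run the files one after another and return the paths that failed."""
    failed = []
    for cmd in per_file_commands(test_dir):
        # Each file gets its own process, so the whole suite never runs at once.
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[3])
    return failed
```

Calling `run_each()` from the directory that contains tests/ runs the files sequentially and returns the list of failing files, which is empty when everything passes.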