familienarchiv/tools/import-normalizer/README.md
Marcel 8cac63e938
feat(normalizer): drop unmatched-names.csv; unresolved-names is the names report
The unmatched list was just non-family correspondents (expected noise);
their count stays in summary.txt and they remain in canonical-persons.xlsx.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:46:08 +02:00


# Import Normalizer
Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical
dataset (`out/`) plus review reports (`review/`). See the spec:
`../../docs/import-migration/02-normalization-spec.md`.
## Setup
Requires **Python 3.12** (the code uses `enum.StrEnum`, which the standard library has provided since 3.11).
```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```
## Run
```bash
.venv/bin/python normalize.py
```
Outputs:
- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix) and `review/summary.txt` (grouped run statistics, including the unknown-date rate)
## Iteration loop
1. **Run.** Read `review/summary.txt` for the health snapshot.
2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` value (sorted by frequency), fill in `suggested_iso` and `suggested_precision`, then copy the three fields `raw,suggested_iso,suggested_precision` as a row into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv` (look up valid ids in `out/canonical-persons.xlsx`). |
| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename — reconcile when the PDFs arrive. |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |
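
The `unparsed-dates.csv` step in the table can also be done with a small helper instead of copy and paste. The following is only a sketch, not part of the normalizer: the function name `promote_date_suggestions` is hypothetical, and skipping `raw` values that already have an override is an assumption about what is convenient, not documented behaviour. It relies only on the column names given in the table.

```python
import csv
from pathlib import Path


def promote_date_suggestions(review_csv: Path, overrides_csv: Path) -> int:
    """Append filled-in suggestions from the review file to the overrides file.

    Hypothetical helper. Returns the number of rows appended.
    """
    # Collect raw values that already have an override (assumption: one row per raw).
    seen: set[str] = set()
    if overrides_csv.exists():
        with overrides_csv.open(newline="", encoding="utf-8") as fh:
            seen = {row["raw"] for row in csv.DictReader(fh)}

    # Keep only review rows where both suggestion columns were filled in by hand.
    ready: list[list[str]] = []
    with review_csv.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            iso = (row.get("suggested_iso") or "").strip()
            precision = (row.get("suggested_precision") or "").strip()
            if iso and precision and row["raw"] not in seen:
                ready.append([row["raw"], iso, precision])
                seen.add(row["raw"])

    # Write the documented header only when the overrides file is new or empty.
    needs_header = not overrides_csv.exists() or overrides_csv.stat().st_size == 0
    with overrides_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if needs_header:
            writer.writerow(["raw", "iso", "precision"])
        writer.writerows(ready)
    return len(ready)
```

Called as `promote_date_suggestions(Path("review/unparsed-dates.csv"), Path("overrides/dates.csv"))` from this directory, it leaves rows without a complete suggestion in the review file for the next pass.
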
> `unresolved-names.csv` is the focused "names that need a human" list. Non-family
> correspondents that simply aren't in the register are NOT reported — they just become
> provisional persons in `out/canonical-persons.xlsx` (the `unmatched_name_strings` count in
> `summary.txt` tracks how many). The given-name set that drives `ambiguous_pair` detection is
> the register's first names plus `config.EXTRA_GIVEN_NAMES` — add names there if a real
> two-person cell isn't being flagged.

**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.
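
To make the `ambiguous_pair` rule from the note above more concrete, here is a minimal sketch of how such a check might look. It is an illustration under assumptions, not the code in `normalize.py`: the function names, the case-insensitive matching, the "exactly two tokens" rule, and the sample names are all hypothetical. Only the idea that the given-name set is the register's first names plus `config.EXTRA_GIVEN_NAMES` comes from this README.

```python
from collections.abc import Iterable

# Illustrative stand-ins only: the real values come from the register
# and from config.EXTRA_GIVEN_NAMES.
REGISTER_FIRST_NAMES = {"Anna", "Karl"}
EXTRA_GIVEN_NAMES = {"Liesel"}


def build_given_names(register_first_names: Iterable[str], extra: Iterable[str]) -> set[str]:
    """Union of both sources, case-folded so matching ignores capitalisation."""
    return {name.casefold() for name in set(register_first_names) | set(extra)}


def is_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """Assumed rule: exactly two tokens, and both are known given names."""
    tokens = cell.split()
    return len(tokens) == 2 and all(t.casefold() in given_names for t in tokens)


given = build_given_names(REGISTER_FIRST_NAMES, EXTRA_GIVEN_NAMES)
print(is_ambiguous_pair("Anna Karl", given))     # both tokens are given names
print(is_ambiguous_pair("Anna Schmidt", given))  # second token is not a given name
```

Under this sketch, a cell holding two people is missed whenever one of the first names is absent from the set, which is the situation where adding the name to `config.EXTRA_GIVEN_NAMES` helps.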
## Tests
```bash
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
```