Files

Marcel 46d1f5c6d8 chore(import): stop tracking real family PII canonical artifacts

The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 10:20:38 +02:00

overrides

feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

2026-05-25 19:47:36 +02:00

tests

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

.gitignore

chore(import): stop tracking real family PII canonical artifacts

2026-05-28 10:20:38 +02:00

config.py

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

dates.py

feat(normalizer): flag half-resolved RANGE for review

2026-05-27 08:18:36 +02:00

documents.py

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

ingest.py

fix(normalizer): don't coerce boolean cells to 1/0

2026-05-25 14:11:19 +02:00

normalize.py

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

overrides.py

feat(normalizer): overrides loader + xlsx/csv writers

2026-05-25 14:39:28 +02:00

persons_tree.py

fix(normalizer): fail-closed on person_id zip length divergence

2026-05-27 08:16:06 +02:00

persons.py

feat(normalizer): unresolved-names report + fix ambiguous-pair over-flagging

2026-05-25 15:54:37 +02:00

README.md

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

requirements.txt

feat(normalizer): scaffold tool + config tables

2026-05-25 13:18:52 +02:00

tags.py

feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

2026-05-25 19:47:36 +02:00

writers.py

refactor(normalizer): drop file column now PDFs resolve by index

2026-05-27 22:08:45 +02:00

README.md

Import Normalizer

Transforms the raw family-archive spreadsheets in ../../import/ into a clean canonical dataset (out/) plus review reports (review/). See the spec: ../../docs/import-migration/02-normalization-spec.md.

Setup

Requires Python 3.12. The code uses enum.StrEnum, which has been part of the standard library since Python 3.11.

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt

Run

.venv/bin/python normalize.py

Outputs:

out/canonical-documents.xlsx, out/canonical-persons.xlsx
review/*.csv (residue to fix), review/summary.txt (grouped run stats incl. unknown-date rate)

Iteration loop

1. Run. Read review/summary.txt for the health snapshot.
2. Fix the residue by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv` (look up valid ids in `out/canonical-persons.xlsx`). |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |
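The copy step for `unparsed-dates.csv` can be scripted. The following is a minimal sketch, not part of the tool: it assumes the review file has at least the columns `raw`, `suggested_iso` and `suggested_precision`, and the function name `promote_date_suggestions` and the precision values in the usage example are made up for illustration.

```python
import csv
import io

# Header of overrides/dates.csv as stated in the table above.
OVERRIDE_HEADER = ["raw", "iso", "precision"]


def promote_date_suggestions(review_csv_text: str, existing_overrides_text: str = "") -> str:
    """Return new overrides/dates.csv content with filled-in suggestions merged in.

    Hypothetical helper: only rows where a human filled BOTH suggested_iso and
    suggested_precision are promoted; unresolved rows stay in the review file.
    """
    # Keep existing overrides keyed by raw, so re-running does not duplicate rows.
    overrides: dict[str, tuple[str, str]] = {}
    if existing_overrides_text.strip():
        for row in csv.DictReader(io.StringIO(existing_overrides_text)):
            overrides[row["raw"]] = (row["iso"], row["precision"])

    for row in csv.DictReader(io.StringIO(review_csv_text)):
        iso = (row.get("suggested_iso") or "").strip()
        precision = (row.get("suggested_precision") or "").strip()
        if iso and precision:
            overrides[row["raw"]] = (iso, precision)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(OVERRIDE_HEADER)
    for raw, (iso, precision) in overrides.items():
        writer.writerow([raw, iso, precision])
    return out.getvalue()
```

Using the csv module instead of pasting by hand keeps quoting correct when a `raw` value itself contains a comma. To apply it, read `review/unparsed-dates.csv` and `overrides/dates.csv` as text, call the function, and write the result back to `overrides/dates.csv`.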

unresolved-names.csv is the focused "names that need a human" list. Non-family correspondents that simply aren't in the register are NOT reported — they just become provisional persons in out/canonical-persons.xlsx (the unmatched_name_strings count in summary.txt tracks how many). The given-name set that drives ambiguous_pair detection is the register's first names plus config.EXTRA_GIVEN_NAMES — add names there if a real two-person cell isn't being flagged.
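To make the role of config.EXTRA_GIVEN_NAMES concrete, here is a deliberately simplified sketch of ambiguous_pair detection. It is not the code in persons.py: the connector list, the "two or more known given names" rule, and all example names are assumptions for illustration, and the real rule is probably stricter.

```python
# Stand-in for config.EXTRA_GIVEN_NAMES; the entry is a made-up example.
EXTRA_GIVEN_NAMES = {"Lieselotte"}

# Words that join two names in a cell; this list is an assumption.
CONNECTORS = {"und", "u.", "&"}


def build_given_names(register_first_names, extra=EXTRA_GIVEN_NAMES):
    """Union of the register's first names and the extra set, case-folded."""
    return {n.casefold() for n in register_first_names} | {n.casefold() for n in extra}


def looks_like_ambiguous_pair(cell: str, given_names: set[str]) -> bool:
    """Flag a cell that contains at least two known given names."""
    tokens = [t for t in cell.replace(",", " ").split() if t.casefold() not in CONNECTORS]
    known = [t for t in tokens if t.casefold() in given_names]
    return len(known) >= 2


given = build_given_names(["Anna", "Karl"])
print(looks_like_ambiguous_pair("Anna und Karl", given))    # both names are in the register
print(looks_like_ambiguous_pair("Anna Lieselotte", given))  # second name only via the extra set
print(looks_like_ambiguous_pair("Anna Schmidt", given))     # one given name plus a surname
```

Under this sketch, "Anna Lieselotte" is flagged only because "Lieselotte" is in the extra set. That is the situation the paragraph above describes: a real two-person cell goes unflagged until the missing given name is added.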

Valid person_id values all come from the person_id column of out/canonical-persons.xlsx.
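A quick pre-flight check can catch mistyped ids before a re-run. This sketch assumes that overrides/names.csv has a `person_id` column (its actual layout is not documented here) and that the set of valid ids has already been read from out/canonical-persons.xlsx by some other means. The helper name and the id format in the example are hypothetical.

```python
import csv
import io


def unknown_person_ids(overrides_csv_text: str, valid_ids: set[str]) -> list[tuple[int, str]]:
    """Return (line_number, person_id) for every override row with an unknown id.

    Rows with an empty person_id are skipped; whether an empty id is allowed
    is a question for the spec, not for this sketch.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(overrides_csv_text))
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(reader, start=2):
        pid = (row.get("person_id") or "").strip()
        if pid and pid not in valid_ids:
            problems.append((line_no, pid))
    return problems


# Made-up example data: one valid id, one typo, one row without an id.
example = "raw,person_id\nTante Anna,p-0001\nOnkel Karl,p-9999\nFamilie X,\n"
print(unknown_person_ids(example, {"p-0001", "p-0002"}))
```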

Tests

.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)
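To run every test file without invoking the whole suite in one pytest process, a small driver can launch one process per file. This is a sketch under assumptions: it presumes the test files match `tests/test_*.py` (only `tests/test_dates.py` is confirmed above), and the helper names are invented.

```python
import subprocess
import sys
from pathlib import Path


def pytest_commands(test_dir: Path) -> list[list[str]]:
    """Build one pytest command per test file, in sorted order."""
    return [
        [sys.executable, "-m", "pytest", str(path), "-v"]
        for path in sorted(test_dir.glob("test_*.py"))
    ]


def run_each(test_dir: Path) -> int:
    """Run each file in its own process and return the number of failing files."""
    failures = 0
    for cmd in pytest_commands(test_dir):
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures
```

Calling `run_each(Path("tests"))` with the interpreter from `.venv` runs each file the same way as the command above, one process at a time.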