Marcel 6f7aa643c9 docs(import): add normalizer implementation plan + apply persona review

17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 12:55:50 +02:00


Spec — Import Normalizer

Authored in the voice of "Elicit", requirements engineer (see .claude/personas/req_engineer.md). This is a requirements artifact: it states what the normalizer must do and how we'll know it's done, in problem/behaviour language. Technology choices already made during brainstorming (Python, openpyxl, overrides-and-rerun) are recorded as constraints, not re-litigated here.

Status: Draft for review
Date: 2026-05-25
Related: 01-findings-spreadsheet-analysis.md (issues IMP-01..12), README.md
Scope boundary: This spec covers the offline normalizer that turns the raw spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical contract into the Java MassImportService and the Document/Person model is Phase 2 and gets its own spec. This spec only defines the contract Phase 2 must satisfy.

1. Project Brief

Vision. Turn the family's human-curated, free-form archive spreadsheets into a clean, canonical dataset that imports deterministically — without hand-editing thousands of rows and without losing the historical nuance of how things were originally written.

Problem. The real archive (…aktuell…xlsx, 7,943 rows) and the person register (Personendatei 2.xlsx, 163 people) were authored for humans to read, not machines to import. Dates are written as they appeared in each letter (≈90% unparseable by the current importer), the column layout differs from what the importer expects, and the same person appears under many names. Importing as-is produces garbage (see IMP-01..12).

Goal (measurable).

G1 — After the automated pass, ≤ 5% of dated rows remain UNKNOWN; after the overrides-iteration loop, ≤ 0.5%.
G2 — 100% of source rows are represented in the canonical output or in a review file — zero silent drops.
G3 — 100% of original values (raw date string, raw name string, source row number) are preserved.
G4 — A full run over the current inputs completes in < 60 s on the dev laptop and is content-deterministic when re-run with unchanged inputs+overrides: identical canonical cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx byte-identity is not guaranteed because the zip container stores entry metadata.)

Primary actor. Marcel — solo owner & data steward (tech comfort 4/5). Also: a future agent re-running the pipeline; and the MassImportService as the downstream consumer.

Non-Goals (explicitly out of scope).

NG1 — Changing MassImportService or the DB schema (that is Phase 2).
NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by index).
NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the long tail stays as provisional persons.
NG5 — OCR/transcription content (the new xlsx has no transcription column).

Key assumptions. (A1) Sheet Familienarchiv is the document source of truth. (A2) Archive date range is 1873–1957 (drives the 2-digit-year century rule). (A3) index is the stable document key and the basis for future PDF matching. (A4) Schlagwort is a broad tag; Inhalt is a short summary/topic.

Risks. (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag + overrides. (R2) Name matching false-positives merge distinct people → mitigated by conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with layout drift → mitigated by header-name-based mapping, not fixed indices.

2. Personas

Marcel — Data Steward. Role: solo owner of Familienarchiv. Context: holds the complete raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently, not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts. Frustrations: dates in ~20 formats; one ancestor under 4 name variants. JTBD: "When I have raw, human-curated archive spreadsheets, I want to transform them into a clean importable dataset without losing how things were originally written, so I can load the archive and keep correcting edge cases as they surface."

The Returning Agent. Role: a future assistant session resuming the work. Goal: re-run the pipeline deterministically and understand exactly what still needs human input. JTBD: "When I pick this up cold, I want one command and a clear residue report, so I can continue without re-deriving context."

3. Constraints & Decisions Already Made

These were settled during brainstorming and are fixed inputs to the requirements below.

#	Decision	Rationale
C1	New canonical layout with explicit headers (not the old positional ODS shape).	Fits the new data; importer becomes header-driven in Phase 2.
C2	Dates stored as parsed (nullable) + raw + precision.	Historical archive; never lose the original; enable "ca. 1916".
C3	Include person resolution (register + alias/marriage map → canonical persons) in this effort.	Maiden-name dedup needs the register.
C4	Overrides-file + re-run loop for residue.	Deterministic, diffable, repeatable.
C5	Implementation: Python 3.12 + openpyxl, standalone tool at `tools/import-normalizer/`.	Fast iteration; no Spring rebuild / coverage gate on transform code.
C6	Century rule for archive 1873–1957: 2-digit `00–57`→`19YY`, `73–99`→`18YY`, `58–72`→flag; 3-digit `DDD`→`1DDD`; never 20xx.	Stated by Marcel. Boundaries live in config.
C7	`Schlagwort`→tag, `Inhalt`→summary.	Matches importer's existing semantics.
C8	Non-register correspondents become provisional persons.	~945 distinct sender strings vs 163 register people.
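
To make the century rule in C6 concrete, here is a minimal sketch. The function name `expand_year` and its parameter names are hypothetical; the boundary values come from C6 and would live in `config.py` per NFR-MAINT-01.

```python
from typing import Optional


def expand_year(token: str, last_19xx: int = 57, first_18xx: int = 73) -> Optional[int]:
    """Expand a 2-, 3- or 4-digit year token per C6.

    Returns None when the token must be flagged for review instead of guessed.
    The function and parameter names are illustrative, not part of the spec.
    """
    if not (token.isascii() and token.isdigit()):
        return None
    value = int(token)
    if len(token) == 4:
        return value                 # already a full year
    if len(token) == 3:
        return 1000 + value          # DDD -> 1DDD
    if len(token) == 2:
        if value <= last_19xx:
            return 1900 + value      # 00-57 -> 19YY
        if value >= first_18xx:
            return 1800 + value      # 73-99 -> 18YY
        return None                  # 58-72 is ambiguous: flag, never guess
    return None
```

Because the boundaries are parameters, a later change to the archive range (A2) only touches configuration, not parsing logic.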

4. Functional Requirements

Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules use EARS. Traceability to findings in §8.

4.1 Ingest & layout (`FR-INGEST`, `FR-MAP`)

US-MAP-01 — As the data steward, I want each source column mapped to a named canonical field regardless of its position, so a re-exported spreadsheet with shifted columns still imports correctly.

AC1 — Given the Familienarchiv sheet, when the normalizer reads the header row, then it maps columns by header name (not fixed index) to the canonical fields.
AC2 — Given a header the normalizer does not recognise, when it runs, then it records the unknown header in review/summary.txt and continues (does not crash).
AC3 — Given a required source header is absent, when it runs, then it aborts with a clear message naming the missing header (fail loud, before producing partial output).
REQ-INGEST-01 — The normalizer shall read only the Familienarchiv sheet of the document workbook and the Tabelle1 sheet of the person workbook.
REQ-MAP-01 — Header matching shall be case-insensitive and tolerant of internal multiple spaces (e.g. "Datum des Briefes").
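
The header rules above (AC1 to AC3, REQ-MAP-01) can be sketched as follows. This is an illustration under assumptions: `COLUMN_MAP`, `REQUIRED`, and both function names are hypothetical, and which headers are actually required is not stated in this spec.

```python
import re

# Hypothetical excerpt of the column-name map (config.py in the real tool).
# Keys are normalized source headers, values are canonical field names.
COLUMN_MAP = {
    "index": "index",
    "datum des briefes": "date_raw",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
REQUIRED = {"index", "date_raw"}  # assumption: the spec does not list required headers


def normalize_header(text: str) -> str:
    """Case-insensitive form with runs of whitespace collapsed to one space."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def map_headers(header_row):
    """Return ({canonical_field: column_position}, [unknown headers]).

    Raises ValueError naming every missing required header before any output
    is produced (AC3); unknown headers are collected, not fatal (AC2).
    """
    mapping, unknown = {}, []
    for position, cell in enumerate(header_row):
        if cell is None or not str(cell).strip():
            continue
        field = COLUMN_MAP.get(normalize_header(str(cell)))
        if field is None:
            unknown.append(str(cell))
        else:
            mapping[field] = position
    missing_fields = REQUIRED - mapping.keys()
    if missing_fields:
        names = sorted(src for src, f in COLUMN_MAP.items() if f in missing_fields)
        raise ValueError("missing required source header(s): " + ", ".join(names))
    return mapping, unknown
```

Positions are looked up per run, so a re-exported sheet with shifted columns yields the same canonical mapping.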

4.2 Row triage (`FR-TRIAGE`) — resolves IMP-06, IMP-07, IMP-08

US-TRIAGE-01 — As the data steward, I want rows that have data but no index surfaced rather than dropped, so I never lose a letter silently.

AC1 — Given a row whose index is blank but which has any other non-empty cell, when the normalizer runs, then that row is written to review/blank-index-rows.csv with its source row number and is not emitted as a canonical document.
AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not reported as an anomaly).
REQ-TRIAGE-01 — If two or more rows resolve to the same index, then the normalizer shall emit all of them to review/duplicate-index.csv and mark each canonical row needs_review = duplicate_index (it shall not silently drop any of them).
REQ-TRIAGE-02 — Where a row is identified as a section/banner row (blank index, text only in a name column), the normalizer shall classify it as such in the blank-index report.
REQ-TRIAGE-03 — Rows whose index ends in x (a transcription/back-side of the base letter, not yet independently mappable) shall be skipped — not emitted as a canonical document — and written to review/skipped-x-suffix.csv with their source row and base index (index minus the trailing x), so they can be linked in a later pass. (Resolves IMP-10.)
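
A sketch of the triage decisions above, assuming each row arrives as a dict keyed by canonical field name. The category labels, the `NAME_COLUMNS` set, and the function names are hypothetical.

```python
from collections import Counter

NAME_COLUMNS = {"sender", "receiver"}  # hypothetical names for the name columns


def _index_of(cells) -> str:
    raw = cells.get("index")
    return "" if raw is None else str(raw).strip()


def classify_row(cells) -> str:
    """Return one of: empty, banner, blank_index, x_suffix, document."""
    index = _index_of(cells)
    filled = {
        key for key, value in cells.items()
        if key != "index" and value is not None and str(value).strip()
    }
    if not index:
        if not filled:
            return "empty"          # skipped and counted, not an anomaly
        # Text only in a name column marks a section/banner row (REQ-TRIAGE-02).
        return "banner" if filled <= NAME_COLUMNS else "blank_index"
    if index.endswith("x"):
        return "x_suffix"           # skipped this pass, logged for later linking
    return "document"


def base_index(index: str) -> str:
    """Index minus the trailing x, as logged in review/skipped-x-suffix.csv."""
    return index.strip()[:-1]


def duplicate_indices(rows) -> set:
    """Indices shared by two or more document rows (REQ-TRIAGE-01)."""
    counts = Counter(_index_of(r) for r in rows if classify_row(r) == "document")
    return {index for index, n in counts.items() if n > 1}
```

Every row lands in exactly one category, which is what makes the zero-silent-drops goal (G2) checkable by counting.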

4.3 Date normalization (`FR-DATE`) — resolves IMP-02, IMP-03

US-DATE-01 — As the data steward, I want every date interpreted as precisely as the source allows, with the original always kept, so I can sort the archive and still see what the letter actually said.

AC1 — Given a parseable date, when normalized, then date_iso holds the best-effort ISO date, date_raw holds the verbatim source string, and date_precision ∈ {DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}.
AC2 — Given an unparseable date, when normalized, then date_iso is empty, date_precision = UNKNOWN, date_raw is preserved, and the value appears in review/unparsed-dates.csv.
AC3 — Given the same date_raw appears in overrides/dates.csv, when normalized, then the override's (iso, precision) wins over the automatic parse.
REQ-DATE-01 — The parser shall accept, at minimum, these forms (see §10 examples): Excel/ISO; D.M.YYYY/D.M.YY; D/M. YY[YY] (slash treated as dot); Roman-numeral months I–XII; German + English month names, full and abbreviated, with or without a separating space; Month YYYY; season/holiday + year; bare YYYY; and start-anchored ranges.
REQ-DATE-02 — Precision shall be assigned by what is known: full day → DAY; month+year → MONTH (day = 1); a named feast/holiday + year → resolved to its actual calendar date for that year → DAY; a season + year → representative mid-season month (day = 1) → SEASON; year only → YEAR (month = Jan, day = 1); a range → start date + RANGE; a value carrying an uncertainty marker (?, um, ca, circa) → APPROX with best-effort date.
REQ-DATE-03 — Two-digit and three-digit years shall be expanded per C6; a 2-digit year in 58–72 shall yield UNKNOWN + a review entry rather than a guess.
REQ-DATE-04 — Trailing editorial notes (e.g. ", 2. Brief") shall be stripped before parsing but remain part of date_raw; they shall not influence the parsed date.
REQ-DATE-05 — The parser shall be pure and side-effect-free so it can be unit-tested in isolation (see NFR-TEST-01).
REQ-DATE-06 — Movable feasts are never mapped to a fixed month; they shall be computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern = Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag = Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25, Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in config.py (NFR-MAINT-01).
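
The movable-feast rule in REQ-DATE-06 can be sketched with the anonymous Gregorian (Meeus/Jones/Butcher) computus. The offsets are the ones listed above; the function names and the `MOVABLE_OFFSETS` table name are hypothetical.

```python
from datetime import date, timedelta

# Day offsets from Easter Sunday, as listed in REQ-DATE-06.
MOVABLE_OFFSETS = {
    "karfreitag": -2,
    "ostern": 0,
    "himmelfahrt": 39,
    "pfingsten": 49,
    "pfingstmontag": 50,
    "fronleichnam": 60,
}


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def movable_feast(name: str, year: int) -> date:
    return easter_sunday(year) + timedelta(days=MOVABLE_OFFSETS[name.casefold()])


def advent_sunday(n: int, year: int) -> date:
    """n-th Advent (1..4): the 4th Advent is the last Sunday strictly before 25 Dec."""
    christmas = date(year, 12, 25)
    days_back = (christmas.weekday() + 1) % 7 or 7  # Monday=0 ... Sunday=6
    fourth = christmas - timedelta(days=days_back)
    return fourth - timedelta(weeks=4 - n)
```

For 1922 this gives Easter on 16 April and Pfingsten on 4 June, matching the worked example in §10.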

4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11

US-PERS-01 — As the data steward, I want the genealogical register turned into canonical people with all their known facts, so documents can link to real persons.

AC1 — Given a register row, when parsed, then a canonical person is produced with person_id, name parts, maiden_name, birth/death (parsed + raw + place), spouse, generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
AC2 — Given multi-value given names ("Charlotte,Meta,Jacobi"), when parsed, then the primary given name is the first; the remainder are retained as additional names/aliases.

US-PERS-02 — As the data steward, I want each sender/receiver string matched to a canonical person where possible and never dropped otherwise, so the correspondence graph is complete.

AC1 — Given a sender/receiver string, when resolved, then it maps to a register person_id via the alias index (exact → normalized/casefold → conservative fuzzy).
AC2 — Given no confident match, when resolved, then a provisional person is created from the cleaned string, linked, and listed in review/unmatched-names.csv (occurrence count + example source rows).
AC3 — Given the string appears in overrides/names.csv, when resolved, then it maps to the specified person_id (override wins).
AC4 — Given a multi-person receiver cell ("Eugenie u Walter de Gruyter", "Herbert u Clara", "…//…", "Hedi und Tutu (Gruber)"), when resolved, then it is split into individual people, each resolved independently; ambiguous space-joined pairs ("Ella Anita") are emitted to review/ambiguous-receivers.csv rather than guessed.
REQ-DEDUP-01 — The alias index shall be derived from the register: canonical "First Last", maiden form (geb als), spouse-surname married form, nickname, and first-name-only only when unambiguous across the register.
REQ-DEDUP-02 — The normalizer shall never merge two distinct strings into one person silently: every merge based on fuzzy similarity (at or above the configured threshold) shall be reported, so that all merges are auditable.
REQ-PERS-01 — Sender cells shall be parsed for multi-person content using the same rules as receiver cells (today the importer parses only receivers — IMP-11).
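
One way to approach the multi-person split in US-PERS-02 AC4 and REQ-PERS-01 is sketched below. The separator set, the two-first-names ambiguity heuristic, and the function name are assumptions for illustration. Distributing a surname that is known only from context (as in "Herbert u Clara") is left to the later resolution step and not shown.

```python
import re

# Separators seen in the examples: "u", "u.", "und" between spaces, and "//".
SEPARATORS = re.compile(r"\s*//\s*|\s+(?:und|u\.|u)\s+")
TRAILING_SURNAME = re.compile(r"\(([^)]+)\)\s*$")


def split_people(cell: str, known_first_names=frozenset()):
    """Return (names, ambiguous).

    A trailing "(Surname)" is distributed to every part. A single
    two-word cell whose words are both known first names is reported as
    ambiguous instead of being split (heuristic, an assumption).
    """
    text = cell.strip()
    surname = None
    match = TRAILING_SURNAME.search(text)
    if match:
        surname = match.group(1).strip()
        text = text[:match.start()].strip()
    parts = [p.strip() for p in SEPARATORS.split(text) if p.strip()]
    if len(parts) == 1:
        words = parts[0].split()
        if len(words) == 2 and all(w in known_first_names for w in words):
            return [], True
    if surname:
        parts = [f"{p} {surname}" for p in parts]
    return parts, False
```

The same function serves sender and receiver cells, which is the point of REQ-PERS-01.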

4.5 Overrides & idempotency (`FR-OVR`) — supports the iteration loop

REQ-OVR-01 — When the normalizer runs, it shall load overrides/dates.csv and overrides/names.csv if present and apply them; absence of either file shall not be an error.
REQ-OVR-02 — While overrides are unchanged and inputs are unchanged, re-running shall produce content-identical canonical outputs (identical cell matrices) and identical review-file contents; literal xlsx byte-identity is not required (NFR-IDEM-01).
REQ-OVR-03 — Each override application shall be counted in review/summary.txt (how many dates/names were resolved by override vs automatically).
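
The override loop can be sketched with the standard `csv` module. The column names `date_raw`, `iso`, and `precision` and both function names are assumptions; this spec does not fix the layout of the overrides files.

```python
import csv
from collections import Counter
from pathlib import Path


def load_overrides(path, key_column: str) -> dict:
    """Load an overrides CSV keyed by one column; a missing file yields {}."""
    path = Path(path)
    if not path.exists():
        return {}  # REQ-OVR-01: absence is not an error
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[key_column]: row for row in csv.DictReader(handle)}


def resolve_date(date_raw: str, auto_parse, overrides: dict, stats: Counter):
    """Override wins over the automatic parse; both paths are counted."""
    hit = overrides.get(date_raw)
    if hit is not None:
        stats["dates_by_override"] += 1
        return hit["iso"], hit["precision"]
    stats["dates_automatic"] += 1
    return auto_parse(date_raw)
```

The `stats` counter holds the numbers REQ-OVR-03 asks to report in review/summary.txt.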

4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12

REQ-OUT-01 — The normalizer shall write out/canonical-documents.xlsx and out/canonical-persons.xlsx with the headered schemas in §6.
REQ-PROV-01 — Every canonical document row shall carry source_row (1-based row number in the source sheet) so any value can be traced back to the original.
REQ-PROV-02 — Every canonical row shall carry a needs_review field listing zero or more flags (duplicate_index, unparsed_date, unmatched_sender, unmatched_receiver, index_file_mismatch, …) so the import and the UI can foreground uncertain data.
REQ-OUT-02 — Where the source Datei path disagrees with the index-derived filename (IMP-09), the normalizer shall record the discrepancy in review/index-file-mismatch.csv and flag the row; it shall not alter the index (the stable key).

5. Non-Functional Requirements

ID	Category	Requirement (measurable)
NFR-DATA-01	Data integrity	100% of source rows are accounted for in output or a review file; 100% of original date/name strings preserved verbatim.
NFR-IDEM-01	Determinism	Identical inputs + overrides ⇒ identical logical output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content.
NFR-PERF-01	Performance	Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop.
NFR-ACCUR-01	Date accuracy	After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%.
NFR-ACCUR-02	Name coverage	Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped.
NFR-I18N-01	Encoding	UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output.
NFR-TEST-01	Testability	`dates.py` and `persons.py` have pytest tests covering every format/alias category in §10 with real examples from the archive.
NFR-MAINT-01	Maintainability	Column-name map, century boundaries, season→month map, and fuzzy threshold live in `config.py`, not inline in logic.
NFR-OBSERV-01	Observability	`review/summary.txt` reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type.
NFR-SAFETY-01	Source safety	Source workbooks are opened read-only and never written.
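
Since NFR-IDEM-01 asserts determinism on content rather than on xlsx bytes, a test could compare a fingerprint of the cell matrix between two runs. The sketch below is one possible approach; `content_fingerprint` and `join_stable` are hypothetical helpers, and reading the cell matrix back from the workbook is not shown.

```python
import hashlib
import json


def content_fingerprint(rows) -> str:
    """SHA-256 over a canonical JSON rendering of a cell matrix.

    `rows` is a list of rows, each a list of cell values. Non-JSON values
    (for example dates) are rendered with str().
    """
    payload = json.dumps(rows, ensure_ascii=False, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def join_stable(values) -> str:
    """Pipe-join in sorted order so set iteration order never leaks into output."""
    return "|".join(sorted(values))
```

Two runs are considered equal when their fingerprints match, regardless of how the zip container stored the workbook entries.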

6. Data Dictionary (canonical contract)

This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a DB schema.

6.1 `canonical-documents.xlsx`

Field	Required	Format / values	Notes
`index`	yes	string	Stable key; basis for PDF matching.
`box`	no	string	from `Box`.
`folder`	no	string	from `Mappe`.
`sender_person_id`	no	person_id	resolved; empty if no sender.
`sender_name`	no	string	canonical display name (or cleaned raw if provisional).
`receiver_person_ids`	no	`id\|id\|…`	pipe-separated.
`receiver_names`	no	`name\|name\|…`	pipe-separated, aligned with ids.
`date_iso`	no	`YYYY-MM-DD`	best-effort; empty if `UNKNOWN`.
`date_raw`	no	string	verbatim source date.
`date_precision`	yes	enum	`DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`.
`location`	no	string	from `Ort`.
`tags`	no	`tag\|tag`	from `Schlagwort`.
`summary`	no	string	from `Inhalt`.
`source_row`	yes	int	provenance: 1-based row number in the source sheet (REQ-PROV-01).
`needs_review`	yes	`flag\|flag` or empty	review flags (REQ-PROV-02).

6.2 `canonical-persons.xlsx`

Field	Required	Format	Notes
`person_id`	yes	slug	stable id (e.g. `de-gruyter-eugenie`); collisions suffixed.
`last_name`	yes	string	from `Familienname`.
`first_name`	no	string	primary given name.
`maiden_name`	no	string	from `geb als` — drives dedup.
`title`	no	string	e.g. honorifics if present.
`nickname`	no	string	from quoted `Bemerkung`/spouse field.
`birth_date` / `birth_date_raw` / `birth_place`	no	ISO / string / string	§4.3 rules.
`death_date` / `death_date_raw` / `death_place`	no	ISO / string / string	§4.3 rules.
`spouse`	no	person_id or name	from `verheiratet mit`.
`generation`	no	string	`G 1`..`G 4`.
`notes`	no	string	from `Bemerkung`.
`aliases`	no	`a\|b\|c`	every surface form that maps here.
`provisional`	yes	bool	true if created from a document string, not the register.
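
A possible shape for the `person_id` slug with a collision suffix is sketched below. The transliteration (umlauts reduced to their base letter, ß to ss) and the `-2`, `-3` suffix format are assumptions; the spec only fixes `lastname-firstname` with a numeric suffix on collision.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase ASCII slug; diacritics are dropped (assumption: Müller -> muller)."""
    text = text.replace("ß", "ss")
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def make_person_id(last_name: str, first_name: str, taken: set) -> str:
    """Build lastname-firstname and append -2, -3, ... while the id is taken."""
    base = "-".join(part for part in (slugify(last_name), slugify(first_name)) if part)
    candidate, n = base, 2
    while candidate in taken:
        candidate = f"{base}-{n}"
        n += 1
    taken.add(candidate)
    return candidate
```

Suffixes depend on the order in which persons are created, so ids stay stable across re-runs only if that order is itself deterministic (NFR-IDEM-01).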

7. Prioritized Backlog (MoSCoW)

ID	Item	MoSCoW	Effort	Depends on
B1	Project scaffolding + read both workbooks (`FR-INGEST`, header map `FR-MAP`)	Must	S	—
B2	Row triage + blank/duplicate/empty reports (`FR-TRIAGE`)	Must	S	B1
B3	Date parser + precision + century rule + Easter/feast computus + season map + tests (`FR-DATE`)	Must	L	B1
B4	Person register parser → canonical persons (`FR-PERS` US-PERS-01)	Must	M	B1
B5	Alias index + name resolution + multi-person split (`FR-DEDUP`, US-PERS-02)	Must	L	B4
B6	Overrides load + apply + idempotency (`FR-OVR`)	Must	S	B3,B5
B7	Canonical writers + provenance + review summary (`FR-OUT`, `FR-PROV`)	Must	M	B2,B3,B5
B8	Index↔Datei mismatch report (`REQ-OUT-02`)	Should	XS	B1
B9	Ambiguous-receiver review path (US-PERS-02 AC4)	Should	S	B5
B10	Comma-split `Inhalt` into extra tags	Could	XS	B7
B11	Phase-2 importer wiring (separate spec)	Won't (this spec)	—	B7

8. Traceability — Findings → Requirements

Finding	Severity	Addressed by
IMP-01 layout mismatch	blocker	C1, FR-MAP, REQ-OUT-01
IMP-02 free-text dates	blocker	FR-DATE (all), C2, C6
IMP-03 no ISO/normalized cols	blocker	FR-DATE, FR-PERS
IMP-04 register unimported	major	C3, US-PERS-01, §6.2
IMP-05 name variants → dupes	major	C3, FR-DEDUP
IMP-06 blank-index dropped	major	US-TRIAGE-01
IMP-07 duplicate indices	minor	REQ-TRIAGE-01
IMP-08 section rows / tags vs summary	minor	REQ-TRIAGE-02, C7
IMP-09 index↔file mismatch	minor	REQ-OUT-02, B8
IMP-10 `x`-suffix rows	minor	REQ-TRIAGE-03 (skip + log this pass)
IMP-11 sender not split / `u` sep	minor	REQ-PERS-01, US-PERS-02 AC4
IMP-12 first-sheet, no validation	minor	REQ-INGEST-01, US-MAP-01 AC2/AC3

9. Open Questions / TBD Register

ID	Question	Why it matters	Ref	Resolution
OQ-01 ✅	Season/holiday → date.	Accuracy of ~70 SEASON/feast rows.	REQ-DATE-06	Resolved (2026-05-25): movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) computed per year from Easter — never a fixed month; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan).
OQ-02 ✅	Date ranges: start only, or start+end?	Sorting/display of ~315 range values.	REQ-DATE-02	Confirmed: store start in `date_iso`, precision `RANGE`, full text in `date_raw`.
OQ-03 ✅	`person_id` format.	Stability across re-runs; diffability.	§6	Confirmed: readable slug `lastname-firstname`, numeric suffix on collision.
OQ-04 ✅	`x`-suffix row handling.	42 rows.	REQ-TRIAGE-03	Resolved (2026-05-25): `x` rows are transcriptions of the base letter but not yet mappable → skip this pass, log to `review/skipped-x-suffix.csv` for later linking.
OQ-05 ✅	Importer output format.	Phase-2 reader.	B11	Confirmed: `.xlsx` (openpyxl-native, headered).
OQ-06 ✅	Fuzzy-match policy.	False-positive person merges (R2).	REQ-DEDUP-02	Confirmed: conservative — report all fuzzy matches; no silent merge.

All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.

10. Glossary & Worked Examples

Precision — how exactly a date is known (DAY … UNKNOWN). Provisional person — a person created from a document name string with no register match. Alias index — map from every known surface form of a name to a canonical person_id. Override — a human-supplied correction applied deterministically on each run.

Date examples → expected outcome:

`date_raw`	`date_iso`	`date_precision`
`15.2.1888`	1888-02-15	DAY
`6.März 1888`	1888-03-06	DAY
`22.III.18`	1918-03-22	DAY
`13.5.09`	1909-05-13	DAY
`10.Oct.95`	1895-10-10	DAY
`17/6. 1916`	1916-06-17	DAY
`Mai 1895`	1895-05-01	MONTH
`Pfingsten 1922`	1922-06-04	DAY (computed: Easter 1922 = Apr 16, +49 days)
`Herbst 1913`	1913-10-01	SEASON
`1905`	1905-01-01	YEAR
`8.1.1916 - 15.3.1916`	1916-01-08	RANGE
`17.Nov (?) 1887`	1887-11-17	APPROX
`?`	(empty)	UNKNOWN
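
Some of the rows above can be reproduced with a small pure function, in the spirit of REQ-DATE-05. This sketch covers only month names, seasons, and bare years with 4-digit years; the lookup tables are deliberately partial and the function name `parse_named` is hypothetical.

```python
import re
from datetime import date

# Partial lookup tables for illustration; the full ones belong in config.py.
MONTHS = {"januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
          "juli": 7, "august": 8, "september": 9, "oktober": 10, "oct": 10,
          "november": 11, "nov": 11, "dezember": 12}
SEASONS = {"frühling": 4, "frühjahr": 4, "sommer": 7, "herbst": 10, "winter": 1}

# Optional "D." prefix, a word, an optional dot, optional space, 4-digit year.
NAMED = re.compile(r"^(?:(\d{1,2})\.\s*)?([^\W\d_]+)\.?\s*(\d{4})$")


def parse_named(date_raw: str):
    """Return (date_iso, date_precision); ("", "UNKNOWN") if not recognised."""
    text = date_raw.strip()
    if re.fullmatch(r"\d{4}", text):
        return date(int(text), 1, 1).isoformat(), "YEAR"
    match = NAMED.match(text)
    if match:
        day, word, year = match.group(1), match.group(2).casefold(), int(match.group(3))
        if word in MONTHS:
            if day is None:
                return date(year, MONTHS[word], 1).isoformat(), "MONTH"
            try:
                return date(year, MONTHS[word], int(day)).isoformat(), "DAY"
            except ValueError:  # e.g. 31.Februar: never guess
                return "", "UNKNOWN"
        if word in SEASONS and day is None:
            return date(year, SEASONS[word], 1).isoformat(), "SEASON"
    return "", "UNKNOWN"
```

The function has no side effects, so each table row handled here can become a one-line pytest case.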

Name examples → expected outcome:

raw cell	resolves to
`Eugenie Müller` (+ register `geb Müller`)	`de-gruyter-eugenie` (matched via maiden alias)
`Eugenie de Gruyter`	`de-gruyter-eugenie`
`Herbert u Clara`	`cram-herbert` + `cram-clara` (split, surname distributed)
`Hedi und Tutu (Gruber)`	`gruber-hedi` + `gruber-tutu`
`Ella Anita`	→ `review/ambiguous-receivers.csv` (not auto-split)
`Hans Wittkopf` (not in register)	provisional `wittkopf-hans`
