familienarchiv/docs/import-migration/02-normalization-spec.md
Marcel 6f7aa643c9 docs(import): add normalizer implementation plan + apply persona review
17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:55:50 +02:00


Spec — Import Normalizer

Authored in the voice of "Elicit", requirements engineer (see .claude/personas/req_engineer.md). This is a requirements artifact: it states what the normalizer must do and how we'll know it's done, in problem/behaviour language. Technology choices already made during brainstorming (Python, openpyxl, overrides-and-rerun) are recorded as constraints, not re-litigated here.

  • Status: Draft for review
  • Date: 2026-05-25
  • Related: 01-findings-spreadsheet-analysis.md (issues IMP-01..12), README.md
  • Scope boundary: This spec covers the offline normalizer that turns the raw spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical contract into the Java MassImportService and the Document/Person model is Phase 2 and gets its own spec. This spec only defines the contract Phase 2 must satisfy.

1. Project Brief

Vision. Turn the family's human-curated, free-form archive spreadsheets into a clean, canonical dataset that imports deterministically — without hand-editing thousands of rows and without losing the historical nuance of how things were originally written.

Problem. The real archive (…aktuell…xlsx, 7,943 rows) and the person register (Personendatei 2.xlsx, 163 people) were authored for humans to read, not machines to import. Dates are written as they appeared in each letter (≈90% unparseable by the current importer), the column layout differs from what the importer expects, and the same person appears under many names. Importing as-is produces garbage (see IMP-01..12).

Goal (measurable).

  • G1 — After the automated pass, ≤ 5% of dated rows remain UNKNOWN; after the overrides-iteration loop, ≤ 0.5%.
  • G2 — 100% of source rows are represented in the canonical output or in a review file — zero silent drops.
  • G3 — 100% of original values (raw date string, raw name string, source row number) are preserved.
  • G4 — A full run over the current inputs completes in < 60 s on the dev laptop and is content-deterministic when re-run with unchanged inputs+overrides: identical canonical cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx byte-identity is not guaranteed because the zip container stores entry metadata.)

Primary actor. Marcel — solo owner & data steward (tech comfort 4/5). Also: a future agent re-running the pipeline; and the MassImportService as the downstream consumer.

Non-Goals (explicitly out of scope).

  • NG1 — Changing MassImportService or the DB schema (that is Phase 2).
  • NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by index).
  • NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
  • NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the long tail stays as provisional persons.
  • NG5 — OCR/transcription content (the new xlsx has no transcription column).

Key assumptions. (A1) Sheet Familienarchiv is the document source of truth. (A2) Archive date range is 1873–1957 (drives the 2-digit-year century rule). (A3) index is the stable document key and the basis for future PDF matching. (A4) Schlagwort is a broad tag; Inhalt is a short summary/topic.

Risks. (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag + overrides. (R2) Name matching false-positives merge distinct people → mitigated by conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with layout drift → mitigated by header-name-based mapping, not fixed indices.

2. Personas

Marcel — Data Steward. Role: solo owner of Familienarchiv. Context: holds the complete raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently, not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts. Frustrations: dates in ~20 formats; one ancestor under 4 name variants. JTBD: "When I have raw, human-curated archive spreadsheets, I want to transform them into a clean importable dataset without losing how things were originally written, so I can load the archive and keep correcting edge cases as they surface."

The Returning Agent. Role: a future assistant session resuming the work. Goal: re-run the pipeline deterministically and understand exactly what still needs human input. JTBD: "When I pick this up cold, I want one command and a clear residue report, so I can continue without re-deriving context."


3. Constraints & Decisions Already Made

These were settled during brainstorming and are fixed inputs to the requirements below.

| # | Decision | Rationale |
|---|---|---|
| C1 | New canonical layout with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. |
| C2 | Dates stored as parsed (nullable) + raw + precision. | Historical archive; never lose the original; enable "ca. 1916". |
| C3 | Include person resolution (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. |
| C4 | Overrides-file + re-run loop for residue. | Deterministic, diffable, repeatable. |
| C5 | Implementation: Python 3.12 + openpyxl, standalone tool at tools/import-normalizer/. | Fast iteration; no Spring rebuild / coverage gate on transform code. |
| C6 | Century rule for archive 1873–1957: 2-digit 00–57 → 19YY, 73–99 → 18YY, 58–72 → flag; 3-digit DDD → 1DDD; never 20xx. | Stated by Marcel. Boundaries live in config. |
| C7 | Schlagwort→tag, Inhalt→summary. | Matches importer's existing semantics. |
| C8 | Non-register correspondents become provisional persons. | ~945 distinct sender strings vs 163 register people. |

4. Functional Requirements

Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules use EARS. Traceability to findings in §8.

4.1 Ingest & layout (FR-INGEST, FR-MAP)

US-MAP-01: As the data steward, I want each source column mapped to a named canonical field regardless of its position, so a re-exported spreadsheet with shifted columns still imports correctly.

  • AC1 — Given the Familienarchiv sheet, when the normalizer reads the header row, then it maps columns by header name (not fixed index) to the canonical fields.

  • AC2 — Given a header the normalizer does not recognise, when it runs, then it records the unknown header in review/summary.txt and continues (does not crash).

  • AC3 — Given a required source header is absent, when it runs, then it aborts with a clear message naming the missing header (fail loud, before producing partial output).

  • REQ-INGEST-01 — The normalizer shall read only the Familienarchiv sheet of the document workbook and the Tabelle1 sheet of the person workbook.

  • REQ-MAP-01 — Header matching shall be case-insensitive and tolerant of internal multiple spaces (e.g. "Datum des Briefes").
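
The header-driven mapping in AC1 to AC3 and REQ-MAP-01 could look roughly like the sketch below. The map contents and the choice of `index` as the only required header are assumptions for illustration; `HEADER_MAP`, `REQUIRED_HEADERS`, `normalize_header` and `map_headers` are hypothetical names.

```python
import re

# Hypothetical header -> canonical-field map (keys are normalized headers).
# The real map would live in config.py and cover every source column.
HEADER_MAP = {
    "index": "index",
    "box": "box",
    "mappe": "folder",
    "datum des briefes": "date_raw",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
REQUIRED_HEADERS = {"index"}  # assumption: only the stable key is mandatory


def normalize_header(header: object) -> str:
    """Casefold and collapse internal whitespace runs to one space."""
    text = "" if header is None else str(header)
    return re.sub(r"\s+", " ", text).strip().casefold()


def map_headers(header_row: list) -> tuple[dict[str, int], list[str]]:
    """Return ({canonical_field: column_position}, [unknown headers]).

    Raises ValueError naming any missing required header, before any
    output is produced.
    """
    columns: dict[str, int] = {}
    unknown: list[str] = []
    seen: set[str] = set()
    for position, header in enumerate(header_row):
        key = normalize_header(header)
        if not key:
            continue  # unnamed column: nothing to map
        seen.add(key)
        if key in HEADER_MAP:
            columns[HEADER_MAP[key]] = position
        else:
            unknown.append(str(header))  # reported, not fatal
    missing = sorted(REQUIRED_HEADERS - seen)
    if missing:
        raise ValueError(f"missing required header(s): {', '.join(missing)}")
    return columns, unknown
```

Because lookup goes through the normalized key, a re-export that reorders columns or changes header casing still maps to the same canonical fields.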

4.2 Row triage (FR-TRIAGE) — resolves IMP-06, IMP-07, IMP-08

US-TRIAGE-01: As the data steward, I want rows that have data but no index surfaced rather than dropped, so I never lose a letter silently.

  • AC1 — Given a row whose index is blank but which has any other non-empty cell, when the normalizer runs, then that row is written to review/blank-index-rows.csv with its source row number and is not emitted as a canonical document.

  • AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not reported as an anomaly).

  • REQ-TRIAGE-01 — If two or more rows resolve to the same index, then the normalizer shall emit all of them to review/duplicate-index.csv and mark each canonical row needs_review = duplicate_index (it shall not silently drop either).

  • REQ-TRIAGE-02 — Where a row is identified as a section/banner row (blank index, text only in a name column), the normalizer shall classify it as such in the blank-index report.

  • REQ-TRIAGE-03 — Rows whose index ends in x (a transcription/back-side of the base letter, not yet independently mappable) shall be skipped — not emitted as a canonical document — and written to review/skipped-x-suffix.csv with their source row and base index (index minus the trailing x), so they can be linked in a later pass. (Resolves IMP-10.)
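
The triage rules above can be read as a small classifier plus a duplicate check. The sketch below assumes each row has already been mapped to a dict keyed by canonical field name; `RowKind`, `classify_row`, `base_index` and `duplicate_indices` are hypothetical names, and banner-row detection (REQ-TRIAGE-02) is left out.

```python
from collections import Counter
from enum import Enum


class RowKind(Enum):
    EMPTY = "empty"              # skipped and counted
    BLANK_INDEX = "blank_index"  # goes to review/blank-index-rows.csv
    X_SUFFIX = "x_suffix"        # goes to review/skipped-x-suffix.csv
    DOCUMENT = "document"        # emitted as a canonical document


def _is_blank(value: object) -> bool:
    return value is None or str(value).strip() == ""


def classify_row(row: dict) -> RowKind:
    """Classify one mapped source row according to section 4.2."""
    if all(_is_blank(value) for value in row.values()):
        return RowKind.EMPTY
    if _is_blank(row.get("index")):
        return RowKind.BLANK_INDEX
    if str(row["index"]).strip().endswith("x"):
        return RowKind.X_SUFFIX
    return RowKind.DOCUMENT


def base_index(index: str) -> str:
    """Index minus the trailing x, used to link x-suffix rows later."""
    return index.strip().removesuffix("x")


def duplicate_indices(rows: list[dict]) -> list[str]:
    """Indices shared by two or more document rows, in sorted order."""
    counts = Counter(
        str(row["index"]).strip()
        for row in rows
        if classify_row(row) is RowKind.DOCUMENT
    )
    return sorted(index for index, count in counts.items() if count > 1)
```

Returning a sorted list rather than a set keeps the duplicate report ordering stable across runs.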

4.3 Date normalization (FR-DATE) — resolves IMP-02, IMP-03

US-DATE-01: As the data steward, I want every date interpreted as precisely as the source allows, with the original always kept, so I can sort the archive and still see what the letter actually said.

  • AC1 — Given a parseable date, when normalized, then date_iso holds the best-effort ISO date, date_raw holds the verbatim source string, and date_precision ∈ {DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}.

  • AC2 — Given an unparseable date, when normalized, then date_iso is empty, date_precision = UNKNOWN, date_raw is preserved, and the value appears in review/unparsed-dates.csv.

  • AC3 — Given the same date_raw appears in overrides/dates.csv, when normalized, then the override's (iso, precision) wins over the automatic parse.

  • REQ-DATE-01 — The parser shall accept, at minimum, these forms (see §10 examples): Excel/ISO; D.M.YYYY/D.M.YY; D/M. YY[YY] (slash treated as dot); Roman-numeral months I–XII; German + English month names, full and abbreviated, with or without a separating space; Month YYYY; season/holiday + year; bare YYYY; and start-anchored ranges.

  • REQ-DATE-02 — Precision shall be assigned by what is known: full day → DAY; month+year → MONTH (day = 1); a named feast/holiday + year → resolved to its actual calendar date for that year → DAY; a season + year → representative mid-season month (day = 1) → SEASON; year only → YEAR (month = Jan, day = 1); a range → start date + RANGE; a value carrying an uncertainty marker (?, um, ca, circa) → APPROX with best-effort date.

  • REQ-DATE-03 — Two-digit and three-digit years shall be expanded per C6; a 2-digit year in 58–72 shall yield UNKNOWN + a review entry rather than a guess.

  • REQ-DATE-04 — Trailing editorial notes (e.g. ", 2. Brief") shall be stripped before parsing and preserved (kept within date_raw; not invented into the date).

  • REQ-DATE-05 — The parser shall be pure and side-effect-free so it can be unit-tested in isolation (see NFR-TEST-01).

REQ-DATE-06: Movable feasts are never mapped to a fixed month; they shall be computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern = Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag = Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25, Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in config.py (NFR-MAINT-01).
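
As a sketch of the movable-feast rule, the block below uses the anonymous Gregorian (Meeus/Jones/Butcher) computus with the offsets listed in REQ-DATE-06. The function and table names (`easter_sunday`, `EASTER_OFFSETS`, `movable_feast`, `advent_sunday`) are illustrative assumptions, and the feast table is limited to the names given in this spec.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday (anonymous Meeus/Jones/Butcher algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    weekday_term = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * weekday_term) // 451
    month, day = divmod(h + weekday_term - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday, as listed in REQ-DATE-06.
EASTER_OFFSETS = {
    "Karfreitag": -2,
    "Ostern": 0,
    "Himmelfahrt": 39,
    "Pfingsten": 49,
    "Pfingstmontag": 50,
    "Fronleichnam": 60,
}


def movable_feast(name: str, year: int) -> date:
    """Calendar date of an Easter-anchored feast in the given year."""
    return easter_sunday(year) + timedelta(days=EASTER_OFFSETS[name])


def advent_sunday(n: int, year: int) -> date:
    """n-th Advent Sunday (1-4); the 4th is the last Sunday before 25 Dec."""
    if n not in (1, 2, 3, 4):
        raise ValueError("n must be 1..4")
    dec24 = date(year, 12, 24)
    # date.weekday(): Monday=0 ... Sunday=6, so step back to the Sunday
    # on or before 24 Dec.
    fourth = dec24 - timedelta(days=(dec24.weekday() + 1) % 7)
    return fourth - timedelta(weeks=4 - n)
```

For 1922 this yields Easter on 16 April and Pfingsten on 4 June, which is the value used in the §10 examples.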

4.4 Person resolution & dedup (FR-PERS, FR-DEDUP) — resolves IMP-04, IMP-05, IMP-11

US-PERS-01: As the data steward, I want the genealogical register turned into canonical people with all their known facts, so documents can link to real persons.

  • AC1 — Given a register row, when parsed, then a canonical person is produced with person_id, name parts, maiden_name, birth/death (parsed + raw + place), spouse, generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
  • AC2 — Given multi-value given names ("Charlotte,Meta,Jacobi"), when parsed, then the primary given name is the first; the remainder are retained as additional names/aliases.

US-PERS-02: As the data steward, I want each sender/receiver string matched to a canonical person where possible and never dropped otherwise, so the correspondence graph is complete.

  • AC1 — Given a sender/receiver string, when resolved, then it maps to a register person_id via the alias index (exact → normalized/casefold → conservative fuzzy).

  • AC2 — Given no confident match, when resolved, then a provisional person is created from the cleaned string, linked, and listed in review/unmatched-names.csv (occurrence count + example source rows).

  • AC3 — Given the string appears in overrides/names.csv, when resolved, then it maps to the specified person_id (override wins).

  • AC4 — Given a multi-person receiver cell ("Eugenie u Walter de Gruyter", "Herbert u Clara", "…//…", "Hedi und Tutu (Gruber)"), when resolved, then it is split into individual people, each resolved independently; ambiguous space-joined pairs ("Ella Anita") are emitted to review/ambiguous-receivers.csv rather than guessed.

  • REQ-DEDUP-01 — The alias index shall be derived from the register: canonical "First Last", maiden form (geb als), spouse-surname married form, nickname, and first-name-only only when unambiguous across the register.

  • REQ-DEDUP-02 — The normalizer shall not merge two distinct strings into one person on fuzzy similarity alone above a configured threshold without the match being reported; merges must be auditable.

  • REQ-PERS-01 — Sender cells shall be parsed for multi-person content using the same rules as receiver cells (today the importer parses only receivers — IMP-11).

4.5 Overrides & idempotency (FR-OVR) — supports the iteration loop

  • REQ-OVR-01 — When the normalizer runs, then it shall load overrides/dates.csv and overrides/names.csv if present and apply them; absence of either file shall not be an error.
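  • REQ-OVR-02 — While overrides are unchanged and inputs are unchanged, re-running shall produce content-identical canonical outputs and review files: identical canonical cell matrices and identical review-file contents (NFR-IDEM-01). Literal xlsx byte-identity is not required.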
  • REQ-OVR-03 — Each override application shall be counted in review/summary.txt (how many dates/names were resolved by override vs automatically).
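
A minimal sketch of the override behaviour for dates follows. The CSV column names (`date_raw`, `iso`, `precision`) are assumptions, since this spec does not fix the file layout, and `load_date_overrides` and `resolve_date` are hypothetical names.

```python
import csv
from pathlib import Path


def load_date_overrides(path: Path) -> dict[str, tuple[str, str]]:
    """Load overrides/dates.csv if present; a missing file is not an error.

    Assumed columns: date_raw, iso, precision.
    """
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as handle:
        return {
            row["date_raw"]: (row["iso"], row["precision"])
            for row in csv.DictReader(handle)
        }


def resolve_date(date_raw, overrides, parse):
    """Return (iso, precision, source); an override wins over the parser.

    `parse` is any callable mapping date_raw to (iso, precision). The
    source label ("override" or "auto") is what a summary would count.
    """
    if date_raw in overrides:
        iso, precision = overrides[date_raw]
        return iso, precision, "override"
    iso, precision = parse(date_raw)
    return iso, precision, "auto"
```

Counting the third tuple element over a run gives the "resolved by override vs automatically" figures that REQ-OVR-03 asks for.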

4.6 Canonical output & provenance (FR-OUT, FR-PROV) — resolves IMP-01, IMP-09, IMP-12

  • REQ-OUT-01 — The normalizer shall write out/canonical-documents.xlsx and out/canonical-persons.xlsx with the headered schemas in §6.
  • REQ-PROV-01 — Every canonical document row shall carry source_row (1-based row number in the source sheet) so any value can be traced back to the original.
  • REQ-PROV-02 — Every canonical row shall carry a needs_review field listing zero or more flags (duplicate_index, unparsed_date, unmatched_sender, unmatched_receiver, index_file_mismatch, …) so the import and the UI can foreground uncertain data.
  • REQ-OUT-02 — Where the source Datei path disagrees with the index-derived filename (IMP-09), the normalizer shall record the discrepancy in review/index-file-mismatch.csv and flag the row; it shall not alter the index (the stable key).

5. Non-Functional Requirements

| ID | Category | Requirement (measurable) |
|---|---|---|
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output or a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical logical output across runs/machines: identical canonical cell matrices and review-file contents. Workbook created/modified metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, UNKNOWN dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. |
| NFR-TEST-01 | Testability | dates.py and persons.py have pytest tests covering every format/alias category in §10 with real examples from the archive. |
| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in config.py, not inline in logic. |
| NFR-OBSERV-01 | Observability | review/summary.txt reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. |
| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. |
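
Since NFR-IDEM-01 asserts determinism on content rather than on xlsx bytes, a test could compare a digest of the cell matrix between two runs. The sketch below is one possible way to do that; `content_digest` is a hypothetical helper, and reading the rows back from the workbook is left to the caller.

```python
import hashlib
import json
from collections.abc import Iterable, Sequence


def content_digest(rows: Iterable[Sequence[object]]) -> str:
    """SHA-256 over a cell matrix, independent of container metadata.

    Each row is serialized as a compact JSON array; non-JSON cell values
    such as dates are rendered with str(). Row order matters, so any
    set-iteration leakage in the writer shows up as a digest change.
    """
    digest = hashlib.sha256()
    for row in rows:
        line = json.dumps(
            list(row), ensure_ascii=False, separators=(",", ":"), default=str
        )
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Two runs with unchanged inputs and overrides would then be expected to produce equal digests for each canonical sheet and each review file.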

6. Data Dictionary (canonical contract)

This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a DB schema.

6.1 canonical-documents.xlsx

| Field | Required | Format / values | Notes |
|---|---|---|---|
| index | yes | string | Stable key; basis for PDF matching. |
| box | no | string | from Box. |
| folder | no | string | from Mappe. |
| sender_person_id | no | person_id | resolved; empty if no sender. |
| sender_name | no | string | canonical display name (or cleaned raw if provisional). |
| receiver_person_ids | no | id\|id\|… | pipe-separated. |
| receiver_names | no | name\|name\|… | pipe-separated, aligned with ids. |
| date_iso | no | YYYY-MM-DD | best-effort; empty if UNKNOWN. |
| date_raw | no | string | verbatim source date. |
| date_precision | yes | enum | DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN. |
| location | no | string | from Ort. |
| tags | no | tag\|tag | from Schlagwort. |
| summary | no | string | from Inhalt. |
| source_row | yes | int | provenance (NFR-DATA-01). |
| needs_review | yes | flag\|flag or empty | review flags (REQ-PROV-02). |

6.2 canonical-persons.xlsx

| Field | Required | Format | Notes |
|---|---|---|---|
| person_id | yes | slug | stable id (e.g. de-gruyter-eugenie); collisions suffixed. |
| last_name | yes | string | from Familienname. |
| first_name | no | string | primary given name. |
| maiden_name | no | string | from geb als — drives dedup. |
| title | no | string | e.g. honorifics if present. |
| nickname | no | string | from quoted Bemerkung/spouse field. |
| birth_date / birth_date_raw / birth_place | no | ISO / string / string | §4.3 rules. |
| death_date / death_date_raw / death_place | no | ISO / string / string | §4.3 rules. |
| spouse | no | person_id or name | from verheiratet mit. |
| generation | no | string | G 1..G 4. |
| notes | no | string | from Bemerkung. |
| aliases | no | a\|b\|c | every surface form that maps here. |
| provisional | yes | bool | true if created from a document string, not the register. |
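
The person_id format (readable lastname-firstname slug, suffix on collision) might be generated along these lines. Two details are assumptions not stated in this spec: umlauts are transliterated (ä → ae, and so on), and collision suffixes start at 2. `slugify` and `make_person_id` are hypothetical names.

```python
import re
import unicodedata

# ASSUMPTION: umlauts are transliterated rather than stripped; the spec
# does not state the policy. casefold() already turns ß into "ss".
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def slugify(text: str) -> str:
    """Lowercase ASCII slug: 'de Gruyter' -> 'de-gruyter'."""
    text = text.casefold().translate(_UMLAUTS)
    # Drop any remaining diacritics (e.g. é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def make_person_id(last_name: str, first_name: str, taken: set[str]) -> str:
    """Build 'lastname-firstname' and add a numeric suffix on collision.

    `taken` is mutated so later calls see ids that were already issued.
    ASSUMPTION: the first colliding id gets the suffix -2.
    """
    base = "-".join(
        part for part in (slugify(last_name), slugify(first_name)) if part
    )
    candidate, counter = base, 2
    while candidate in taken:
        candidate = f"{base}-{counter}"
        counter += 1
    taken.add(candidate)
    return candidate
```

Suffix assignment depends on the order in which people are processed, so ids stay stable across re-runs only if that order is itself deterministic (for example, register order).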

7. Prioritized Backlog (MoSCoW)

| ID | Item | MoSCoW | Effort | Depends on |
|---|---|---|---|---|
| B1 | Project scaffolding + read both workbooks (FR-INGEST, header map FR-MAP) | Must | S | |
| B2 | Row triage + blank/duplicate/empty reports (FR-TRIAGE) | Must | S | B1 |
| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (FR-DATE) | Must | L | B1 |
| B4 | Person register parser → canonical persons (FR-PERS US-PERS-01) | Must | M | B1 |
| B5 | Alias index + name resolution + multi-person split (FR-DEDUP, US-PERS-02) | Must | L | B4 |
| B6 | Overrides load + apply + idempotency (FR-OVR) | Must | S | B3,B5 |
| B7 | Canonical writers + provenance + review summary (FR-OUT, FR-PROV) | Must | M | B2,B3,B5 |
| B8 | Index↔Datei mismatch report (REQ-OUT-02) | Should | XS | B1 |
| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 |
| B10 | Comma-split Inhalt into extra tags | Could | XS | B7 |
| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | | B7 |

8. Traceability — Findings → Requirements

| Finding | Severity | Addressed by |
|---|---|---|
| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 |
| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 |
| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS |
| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 |
| IMP-05 name variants → dupes | major | C3, FR-DEDUP |
| IMP-06 blank-index dropped | major | US-TRIAGE-01 |
| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 |
| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 |
| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 |
| IMP-10 x-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) |
| IMP-11 sender not split / u sep | minor | REQ-PERS-01, US-PERS-02 AC4 |
| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, FR-MAP AC2/AC3 |

9. Open Questions / TBD Register

| ID | Question | Why it matters | Ref | Resolution |
|---|---|---|---|---|
| OQ-01 | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | Resolved (2026-05-25): movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) computed per year from Easter — never a fixed month; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
| OQ-02 | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | Confirmed: store start in date_iso, precision RANGE, full text in date_raw. |
| OQ-03 | person_id format. | Stability across re-runs; diffability. | §6 | Confirmed: readable slug lastname-firstname, numeric suffix on collision. |
| OQ-04 | x-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | Resolved (2026-05-25): x rows are transcriptions of the base letter but not yet mappable → skip this pass, log to review/skipped-x-suffix.csv for later linking. |
| OQ-05 | Importer output format. | Phase-2 reader. | B11 | Confirmed: .xlsx (openpyxl-native, headered). |
| OQ-06 | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | Confirmed: conservative — report all fuzzy matches; no silent merge. |

All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.


10. Glossary & Worked Examples

Precision — how exactly a date is known (DAY through UNKNOWN). Provisional person — a person created from a document name string with no register match. Alias index — map from every known surface form of a name to a canonical person_id. Override — a human-supplied correction applied deterministically on each run.

Date examples → expected outcome:

| date_raw | date_iso | date_precision |
|---|---|---|
| 15.2.1888 | 1888-02-15 | DAY |
| 6.März 1888 | 1888-03-06 | DAY |
| 22.III.18 | 1918-03-22 | DAY |
| 13.5.09 | 1909-05-13 | DAY |
| 10.Oct.95 | 1895-10-10 | DAY |
| 17/6. 1916 | 1916-06-17 | DAY |
| Mai 1895 | 1895-05-01 | MONTH |
| Pfingsten 1922 | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) |
| Herbst 1913 | 1913-10-01 | SEASON |
| 1905 | 1905-01-01 | YEAR |
| 8.1.1916 - 15.3.1916 | 1916-01-08 | RANGE |
| 17.Nov (?) 1887 | 1887-11-17 | APPROX |
| ? | (empty) | UNKNOWN |
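
The partial-date rows in the table above (month, season, bare year) follow REQ-DATE-02 and can be illustrated with a deliberately small sketch. It covers only full German month names, the seasons named in REQ-DATE-06, and 4-digit years; `parse_partial` and the two lookup tables are hypothetical names, and everything else falls through to UNKNOWN.

```python
import re

# Full German month names only; abbreviations and English names are omitted
# from this sketch.
MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10, "november": 11,
    "dezember": 12,
}
# Representative mid-season months from REQ-DATE-06.
SEASON_MONTH = {"frühling": 4, "frühjahr": 4, "sommer": 7, "herbst": 10, "winter": 1}

_WORD_YEAR = re.compile(r"([^\W\d_]+)\.?\s*(\d{4})")


def parse_partial(date_raw: str) -> tuple[str, str]:
    """Return (date_iso, date_precision) for year-only, month or season forms."""
    text = date_raw.strip()
    if re.fullmatch(r"\d{4}", text):
        return f"{text}-01-01", "YEAR"  # month = Jan, day = 1
    match = _WORD_YEAR.fullmatch(text)
    if match:
        word, year = match.group(1).casefold(), match.group(2)
        if word in MONTHS:
            return f"{year}-{MONTHS[word]:02d}-01", "MONTH"
        if word in SEASON_MONTH:
            return f"{year}-{SEASON_MONTH[word]:02d}-01", "SEASON"
    return "", "UNKNOWN"  # date_raw is still preserved by the caller
```

The function is pure, which matches REQ-DATE-05: the caller keeps `date_raw` and decides whether an UNKNOWN result goes to review.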

Name examples → expected outcome:

| raw cell | resolves to |
|---|---|
| Eugenie Müller (+ register geb Müller) | de-gruyter-eugenie (matched via maiden alias) |
| Eugenie de Gruyter | de-gruyter-eugenie |
| Herbert u Clara | cram-herbert + cram-clara (split, surname distributed) |
| Hedi und Tutu (Gruber) | gruber-hedi + gruber-tutu |
| Ella Anita | review/ambiguous-receivers.csv (not auto-split) |
| Hans Wittkopf (not in register) | provisional wittkopf-hans |
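
The splitting step behind rows such as "Herbert u Clara" and "Hedi und Tutu (Gruber)" could be sketched as follows. This is an illustration under assumptions: the separator set is taken from the examples in US-PERS-02 AC4 (u, und, //), only a trailing parenthesised surname is distributed, and neither context-based surname distribution nor ambiguous-pair detection ("Ella Anita") is attempted. `split_people` is a hypothetical name.

```python
import re

# Separators seen in the spec's examples: " u ", " und ", "//".
_SEPARATOR = re.compile(r"\s*//\s*|\s+(?:u\.?|und)\s+", re.IGNORECASE)
# A trailing "(Surname)" that applies to everyone named before it.
_TRAILING_SURNAME = re.compile(r"^(?P<body>.*?)\s*\((?P<surname>[^()]+)\)\s*$")


def split_people(cell: str) -> list[str]:
    """Split a sender/receiver cell into individual name strings.

    Each returned string is then resolved independently; a cell without
    any separator comes back as a single-element list.
    """
    text = cell.strip()
    surname = None
    match = _TRAILING_SURNAME.match(text)
    if match:
        text = match.group("body")
        surname = match.group("surname").strip()
    parts = [part.strip() for part in _SEPARATOR.split(text) if part.strip()]
    if surname:
        parts = [f"{part} {surname}" for part in parts]
    return parts
```

Applying the same function to sender and receiver cells would satisfy REQ-PERS-01, since both then go through identical splitting rules.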