Spec — Import Normalizer
Authored in the voice of "Elicit", requirements engineer (see
.claude/personas/req_engineer.md). This is a requirements artifact: it states what the normalizer must do and how we'll know it's done, in problem/behaviour language. Technology choices already made during brainstorming (Python, openpyxl, overrides-and-rerun) are recorded as constraints, not re-litigated here.
- Status: Draft for review
- Date: 2026-05-25
- Related: 01-findings-spreadsheet-analysis.md (issues IMP-01..12), README.md
- Scope boundary: This spec covers the offline normalizer that turns the raw spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical contract into the Java MassImportService and the Document/Person model is Phase 2 and gets its own spec. This spec only defines the contract Phase 2 must satisfy.
1. Project Brief
Vision. Turn the family's human-curated, free-form archive spreadsheets into a clean, canonical dataset that imports deterministically — without hand-editing thousands of rows and without losing the historical nuance of how things were originally written.
Problem. The real archive (…aktuell…xlsx, 7,943 rows) and the person register
(Personendatei 2.xlsx, 163 people) were authored for humans to read, not machines to
import. Dates are written as they appeared in each letter (≈90% unparseable by the current
importer), the column layout differs from what the importer expects, and the same person
appears under many names. Importing as-is produces garbage (see IMP-01..12).
Goal (measurable).
- G1 — After the automated pass, ≤ 5% of dated rows remain UNKNOWN; after the overrides-iteration loop, ≤ 0.5%.
- G2 — 100% of source rows are represented in the canonical output or in a review file — zero silent drops.
- G3 — 100% of original values (raw date string, raw name string, source row number) are preserved.
- G4 — A full run over the current inputs completes in < 60 s on the dev laptop and is content-deterministic when re-run with unchanged inputs+overrides: identical canonical cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx byte-identity is not guaranteed because the zip container stores entry metadata.)
Primary actor. Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the MassImportService as the downstream consumer.
Non-Goals (explicitly out of scope).
- NG1 — Changing MassImportService or the DB schema (that is Phase 2).
- NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by index).
- NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
- NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the long tail stays as provisional persons.
- NG5 — OCR/transcription content (the new xlsx has no transcription column).
Key assumptions. (A1) Sheet Familienarchiv is the document source of truth.
(A2) Archive date range is 1873–1957 (drives the 2-digit-year century rule).
(A3) index is the stable document key and the basis for future PDF matching.
(A4) Schlagwort is a broad tag; Inhalt is a short summary/topic.
Risks. (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag + overrides. (R2) Name matching false-positives merge distinct people → mitigated by conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with layout drift → mitigated by header-name-based mapping, not fixed indices.
2. Personas
Marcel — Data Steward. Role: solo owner of Familienarchiv. Context: holds the complete raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently, not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts. Frustrations: dates in ~20 formats; one ancestor under 4 name variants. JTBD: "When I have raw, human-curated archive spreadsheets, I want to transform them into a clean importable dataset without losing how things were originally written, so I can load the archive and keep correcting edge cases as they surface."
The Returning Agent. Role: a future assistant session resuming the work. Goal: re-run the pipeline deterministically and understand exactly what still needs human input. JTBD: "When I pick this up cold, I want one command and a clear residue report, so I can continue without re-deriving context."
3. Constraints & Decisions Already Made
These were settled during brainstorming and are fixed inputs to the requirements below.
| # | Decision | Rationale |
|---|---|---|
| C1 | New canonical layout with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. |
| C2 | Dates stored as parsed (nullable) + raw + precision. | Historical archive; never lose the original; enable "ca. 1916". |
| C3 | Include person resolution (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. |
| C4 | Overrides-file + re-run loop for residue. | Deterministic, diffable, repeatable. |
| C5 | Implementation: Python 3.12 + openpyxl, standalone tool at tools/import-normalizer/. | Fast iteration; no Spring rebuild / coverage gate on transform code. |
| C6 | Century rule for archive 1873–1957: 2-digit 00–57→19YY, 73–99→18YY, 58–72→flag; 3-digit DDD→1DDD; never 20xx. | Stated by Marcel. Boundaries live in config. |
| C7 | Schlagwort→tag, Inhalt→summary. | Matches importer's existing semantics. |
| C8 | Non-register correspondents become provisional persons. | ~945 distinct sender strings vs 163 register people. |
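The century rule in C6 is compact enough to sketch directly. The following is a minimal illustration, not the tool's implementation: the function name `expand_year` and the constant names are assumptions, and the boundaries are inlined here only for readability (C6 and NFR-MAINT-01 place them in config).

```python
from typing import Optional

# Boundaries from C6. Hypothetical constant names; the real ones belong in config.py.
TWENTIETH_CENTURY_MAX = 57   # 00-57 -> 19YY
NINETEENTH_CENTURY_MIN = 73  # 73-99 -> 18YY


def expand_year(text: str) -> Optional[int]:
    """Expand a 2-, 3- or 4-digit year string per C6.

    Returns None when the value must be flagged for review instead of guessed.
    """
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    if len(text) == 4:
        return value                 # already a full year
    if len(text) == 3:
        return 1000 + value          # DDD -> 1DDD
    if len(text) == 2:
        if value <= TWENTIETH_CENTURY_MAX:
            return 1900 + value
        if value >= NINETEENTH_CENTURY_MIN:
            return 1800 + value
        return None                  # 58-72 is ambiguous: flag, never guess
    return None
```

Returning `None` for the 58–72 band keeps the "flag" decision visible to the caller, which can then emit UNKNOWN plus a review entry.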
4. Functional Requirements
Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules use EARS. Traceability to findings in §8.
4.1 Ingest & layout (FR-INGEST, FR-MAP)
US-MAP-01 — As the data steward, I want each source column mapped to a named canonical field regardless of its position, so a re-exported spreadsheet with shifted columns still imports correctly.
- AC1 — Given the Familienarchiv sheet, when the normalizer reads the header row, then it maps columns by header name (not fixed index) to the canonical fields.
- AC2 — Given a header the normalizer does not recognise, when it runs, then it records the unknown header in review/summary.txt and continues (does not crash).
- AC3 — Given a required source header is absent, when it runs, then it aborts with a clear message naming the missing header (fail loud, before producing partial output).
- REQ-INGEST-01 — The normalizer shall read only the Familienarchiv sheet of the document workbook and the Tabelle1 sheet of the person workbook.
- REQ-MAP-01 — Header matching shall be case-insensitive and tolerant of internal multiple spaces (e.g. "Datum des Briefes").
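A minimal sketch of header-name mapping as described in US-MAP-01 and REQ-MAP-01. The names `HEADER_MAP`, `REQUIRED_HEADERS`, `normalize_header` and `map_headers` are hypothetical, and which headers count as required is an assumption made for illustration; the spec does not list them here.

```python
import re

# Hypothetical mapping: normalized source header -> canonical field.
HEADER_MAP = {
    "index": "index",
    "box": "box",
    "mappe": "folder",
    "datum des briefes": "date_raw",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
# Assumption: only the index header is treated as required in this sketch.
REQUIRED_HEADERS = {"index"}


def normalize_header(raw) -> str:
    """Collapse runs of whitespace and compare case-insensitively."""
    return re.sub(r"\s+", " ", str(raw)).strip().casefold()


def map_headers(header_row):
    """Return ({canonical field: column position}, [unknown headers])."""
    columns, unknown, seen = {}, [], set()
    for position, raw in enumerate(header_row):
        if raw is None or not str(raw).strip():
            continue                      # empty header cell: nothing to map
        key = normalize_header(raw)
        seen.add(key)
        if key in HEADER_MAP:
            columns[HEADER_MAP[key]] = position
        else:
            unknown.append(str(raw))      # reported, but not fatal (AC2)
    missing = sorted(REQUIRED_HEADERS - seen)
    if missing:
        # Fail loud before any output is produced (AC3).
        raise ValueError(f"missing required header(s): {', '.join(missing)}")
    return columns, unknown
```

Because positions are looked up from the header row on every run, a re-exported sheet with shifted columns maps to the same canonical fields.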
4.2 Row triage (FR-TRIAGE) — resolves IMP-06, IMP-07, IMP-08
US-TRIAGE-01 — As the data steward, I want rows that have data but no index surfaced rather than dropped, so I never lose a letter silently.
- AC1 — Given a row whose index is blank but which has any other non-empty cell, when the normalizer runs, then that row is written to review/blank-index-rows.csv with its source row number and is not emitted as a canonical document.
- AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not reported as an anomaly).
- REQ-TRIAGE-01 — If two or more rows resolve to the same index, then the normalizer shall emit all of them to review/duplicate-index.csv and mark each canonical row needs_review = duplicate_index (it shall not silently drop any of them).
- REQ-TRIAGE-02 — Where a row is identified as a section/banner row (blank index, text only in a name column), the normalizer shall classify it as such in the blank-index report.
- REQ-TRIAGE-03 — Rows whose index ends in x (a transcription/back-side of the base letter, not yet independently mappable) shall be skipped — not emitted as a canonical document — and written to review/skipped-x-suffix.csv with their source row and base index (index minus the trailing x), so they can be linked in a later pass. (Resolves IMP-10.)
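The triage rules above can be read as a small classifier over one row. This is a sketch under assumptions: rows are modelled as dicts keyed by canonical field name, the name columns are called `sender` and `receiver`, and the class labels and sample index values are invented for illustration.

```python
from collections import Counter

# Hypothetical canonical names of the two name columns.
NAME_COLUMNS = {"sender", "receiver"}


def _text(value) -> str:
    return "" if value is None else str(value).strip()


def classify_row(row: dict) -> str:
    """Classify one source row per US-TRIAGE-01 and REQ-TRIAGE-02/03."""
    index = _text(row.get("index"))
    filled = {key for key, value in row.items() if key != "index" and _text(value)}
    if not index:
        if not filled:
            return "empty"            # skipped and counted, not an anomaly
        if filled <= NAME_COLUMNS:
            return "section_banner"   # text only in a name column
        return "blank_index"          # has data but no index: goes to review
    if index.endswith("x"):
        return "x_suffix"             # skipped this pass, logged with base index
    return "document"


def duplicate_indices(rows) -> set:
    """Indices shared by two or more document rows (REQ-TRIAGE-01)."""
    counts = Counter(
        _text(row.get("index")) for row in rows if classify_row(row) == "document"
    )
    return {index for index, count in counts.items() if count > 1}
```

Every row ends up in exactly one class, which is what makes the "zero silent drops" goal checkable by counting.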
4.3 Date normalization (FR-DATE) — resolves IMP-02, IMP-03
US-DATE-01 — As the data steward, I want every date interpreted as precisely as the source allows, with the original always kept, so I can sort the archive and still see what the letter actually said.
- AC1 — Given a parseable date, when normalized, then date_iso holds the best-effort ISO date, date_raw holds the verbatim source string, and date_precision ∈ {DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}.
- AC2 — Given an unparseable date, when normalized, then date_iso is empty, date_precision = UNKNOWN, date_raw is preserved, and the value appears in review/unparsed-dates.csv.
- AC3 — Given the same date_raw appears in overrides/dates.csv, when normalized, then the override's (iso, precision) wins over the automatic parse.
- REQ-DATE-01 — The parser shall accept, at minimum, these forms (see §10 examples): Excel/ISO; D.M.YYYY / D.M.YY; D/M. YY[YY] (slash treated as dot); Roman-numeral months I–XII; German + English month names, full and abbreviated, with or without a separating space; Month YYYY; season/holiday + year; bare YYYY; and start-anchored ranges.
- REQ-DATE-02 — Precision shall be assigned by what is known: full day → DAY; month+year → MONTH (day = 1); a named feast/holiday + year → resolved to its actual calendar date for that year → DAY; a season + year → representative mid-season month (day = 1) → SEASON; year only → YEAR (month = Jan, day = 1); a range → start date + RANGE; a value carrying an uncertainty marker (?, um, ca, circa) → APPROX with best-effort date.
- REQ-DATE-03 — Two-digit and three-digit years shall be expanded per C6; a 2-digit year in 58–72 shall yield UNKNOWN + a review entry rather than a guess.
- REQ-DATE-04 — Trailing editorial notes (e.g. ", 2. Brief") shall be stripped before parsing but preserved: they stay in date_raw and are never folded into the parsed date.
- REQ-DATE-05 — The parser shall be pure and side-effect-free so it can be unit-tested in isolation (see NFR-TEST-01).
- REQ-DATE-06 — Movable feasts are never mapped to a fixed month; they shall be computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern = Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag = Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25, Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in config.py (NFR-MAINT-01).
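REQ-DATE-06 can be illustrated with the anonymous Gregorian (Meeus/Jones/Butcher) Easter algorithm plus the day offsets listed above. This is a sketch, not the tool's code: the function names and the `MOVABLE_OFFSETS` table name are assumptions, and fixed feasts and seasons are left out.

```python
from datetime import date, timedelta

# Day offsets from Easter Sunday, as listed in REQ-DATE-06.
MOVABLE_OFFSETS = {
    "Karfreitag": -2,
    "Ostern": 0,
    "Himmelfahrt": 39,
    "Pfingsten": 49,
    "Pfingstmontag": 50,
    "Fronleichnam": 60,
}


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the Meeus/Jones/Butcher algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def movable_feast(name: str, year: int) -> date:
    """Calendar date of an Easter-relative feast in the given year."""
    return easter_sunday(year) + timedelta(days=MOVABLE_OFFSETS[name])


def advent_sunday(year: int, n: int) -> date:
    """n-th Advent Sunday (n = 1..4); the 4th is the last Sunday before 25 Dec."""
    christmas = date(year, 12, 25)
    # Monday is weekday 0, Sunday is 6; step back to the Sunday strictly before.
    back = (christmas.weekday() + 1) % 7 or 7
    fourth = christmas - timedelta(days=back)
    return fourth - timedelta(weeks=4 - n)
```

For 1922 this gives Easter on 16 April and Pfingsten on 4 June, matching the worked example in §10.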
4.4 Person resolution & dedup (FR-PERS, FR-DEDUP) — resolves IMP-04, IMP-05, IMP-11
US-PERS-01 — As the data steward, I want the genealogical register turned into canonical people with all their known facts, so documents can link to real persons.
- AC1 — Given a register row, when parsed, then a canonical person is produced with person_id, name parts, maiden_name, birth/death (parsed + raw + place), spouse, generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
- AC2 — Given multi-value given names ("Charlotte,Meta,Jacobi"), when parsed, then the primary given name is the first; the remainder are retained as additional names/aliases.
US-PERS-02 — As the data steward, I want each sender/receiver string matched to a canonical person where possible and never dropped otherwise, so the correspondence graph is complete.
- AC1 — Given a sender/receiver string, when resolved, then it maps to a register person_id via the alias index (exact → normalized/casefold → conservative fuzzy).
- AC2 — Given no confident match, when resolved, then a provisional person is created from the cleaned string, linked, and listed in review/unmatched-names.csv (occurrence count + example source rows).
- AC3 — Given the string appears in overrides/names.csv, when resolved, then it maps to the specified person_id (override wins).
- AC4 — Given a multi-person receiver cell ("Eugenie u Walter de Gruyter", "Herbert u Clara", "…//…", "Hedi und Tutu (Gruber)"), when resolved, then it is split into individual people, each resolved independently; ambiguous space-joined pairs ("Ella Anita") are emitted to review/ambiguous-receivers.csv rather than guessed.
- REQ-DEDUP-01 — The alias index shall be derived from the register: canonical "First Last", maiden form (geb als), spouse-surname married form, nickname, and first-name-only only when unambiguous across the register.
- REQ-DEDUP-02 — The normalizer shall not merge two distinct strings into one person on fuzzy similarity alone unless the match is reported: every fuzzy match at or above the configured threshold shall be listed, so that merges are auditable.
- REQ-PERS-01 — Sender cells shall be parsed for multi-person content using the same rules as receiver cells (today the importer parses only receivers — IMP-11).
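The resolution order in US-PERS-02 (override, then exact, then normalized, then conservative fuzzy, else provisional) might look like the sketch below. The function names, the tuple return shape and the 0.92 cutoff are assumptions; `difflib` stands in for whatever fuzzy matcher the tool ends up using.

```python
import difflib
import re


def normalize_name(text: str) -> str:
    """Collapse whitespace and casefold for tolerant comparison."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def resolve(raw, aliases, overrides=None, cutoff=0.92):
    """Return (person_id or None, method) for one cleaned name string.

    aliases:   {surface form: person_id}, derived from the register.
    overrides: {raw string: person_id}, hand-maintained.
    method is one of: override, exact, normalized, fuzzy, provisional.
    """
    overrides = overrides or {}
    if raw in overrides:
        return overrides[raw], "override"
    if raw in aliases:
        return aliases[raw], "exact"
    # Sorted iteration keeps the derived index stable across runs.
    normalized = {normalize_name(a): pid for a, pid in sorted(aliases.items())}
    key = normalize_name(raw)
    if key in normalized:
        return normalized[key], "normalized"
    close = difflib.get_close_matches(key, sorted(normalized), n=1, cutoff=cutoff)
    if close:
        # Caller must report every "fuzzy" result (REQ-DEDUP-02).
        return normalized[close[0]], "fuzzy"
    return None, "provisional"
```

Returning the method alongside the id lets the caller count matches by tier and write every fuzzy hit to a review file instead of merging silently.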
4.5 Overrides & idempotency (FR-OVR) — supports the iteration loop
- REQ-OVR-01 — When the normalizer runs, it shall load overrides/dates.csv and overrides/names.csv if present and apply them; absence of either file shall not be an error.
- REQ-OVR-02 — While overrides are unchanged and inputs are unchanged, re-running shall produce content-identical canonical outputs (identical cell matrices) and identical review-file contents; literal xlsx byte-identity is not required (NFR-IDEM-01).
- REQ-OVR-03 — Each override application shall be counted in review/summary.txt (how many dates/names were resolved by override vs automatically).
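A minimal sketch of the override loop for dates, covering "absent file is not an error", "override wins" and the per-run counters. The CSV column names (`date_raw`, `iso`, `precision`), the function names and the shape of the `stats` dict are assumptions; the spec does not fix the file layout.

```python
import csv
from pathlib import Path


def read_date_overrides(lines) -> dict:
    """Parse override rows from any iterable of CSV lines (e.g. an open file).

    Assumed columns: date_raw, iso, precision.
    """
    return {
        row["date_raw"]: (row["iso"], row["precision"])
        for row in csv.DictReader(lines)
    }


def load_date_overrides(path) -> dict:
    """Load overrides/dates.csv if it exists; a missing file means no overrides."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as handle:
        return read_date_overrides(handle)


def apply_date_override(date_raw, parsed, overrides, stats):
    """Return the override for date_raw if there is one, else the automatic parse.

    stats is updated in place so the summary can report override vs automatic.
    """
    if date_raw in overrides:
        stats["override"] = stats.get("override", 0) + 1
        return overrides[date_raw]
    stats["automatic"] = stats.get("automatic", 0) + 1
    return parsed
```

Keying on the verbatim `date_raw` string means one override line fixes every row that carries the same raw value.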
4.6 Canonical output & provenance (FR-OUT, FR-PROV) — resolves IMP-01, IMP-09, IMP-12
- REQ-OUT-01 — The normalizer shall write out/canonical-documents.xlsx and out/canonical-persons.xlsx with the headered schemas in §6.
- REQ-PROV-01 — Every canonical document row shall carry source_row (1-based row number in the source sheet) so any value can be traced back to the original.
- REQ-PROV-02 — Every canonical row shall carry a needs_review field listing zero or more flags (duplicate_index, unparsed_date, unmatched_sender, unmatched_receiver, index_file_mismatch, …) so the import and the UI can foreground uncertain data.
- REQ-OUT-02 — Where the source Datei path disagrees with the index-derived filename (IMP-09), the normalizer shall record the discrepancy in review/index-file-mismatch.csv and flag the row; it shall not alter the index (the stable key).
5. Non-Functional Requirements
| ID | Category | Requirement (measurable) |
|---|---|---|
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output or a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical logical output across runs/machines: identical canonical cell matrices and review-file contents. Workbook created/modified metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, UNKNOWN dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. |
| NFR-TEST-01 | Testability | dates.py and persons.py have pytest tests covering every format/alias category in §10 with real examples from the archive. |
| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in config.py, not inline in logic. |
| NFR-OBSERV-01 | Observability | review/summary.txt reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. |
| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. |
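Since NFR-IDEM-01 asserts determinism on content rather than on xlsx bytes, one way to test it is to fingerprint the cell matrix itself. The sketch below is an assumption about how such a check could be written; `content_fingerprint` is a hypothetical helper, and treating `None` and the empty string as the same cell value is a choice made here, not something the spec states.

```python
import hashlib
import json


def content_fingerprint(rows) -> str:
    """SHA-256 over a cell matrix, independent of workbook container metadata.

    rows is any iterable of row sequences (header row included). Cells are
    compared by their text form; None and "" are treated as equal (assumption).
    """
    digest = hashlib.sha256()
    for row in rows:
        cells = ["" if cell is None else str(cell) for cell in row]
        digest.update(json.dumps(cells, ensure_ascii=False).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def stable_aliases(aliases) -> str:
    """Join a set of aliases in sorted order so set iteration never leaks out."""
    return "|".join(sorted(aliases))
```

A determinism test can then run the pipeline twice and compare fingerprints of both outputs, which sidesteps differences in zip entry metadata.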
6. Data Dictionary (canonical contract)
This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a DB schema.
6.1 canonical-documents.xlsx
| Field | Required | Format / values | Notes |
|---|---|---|---|
| index | yes | string | Stable key; basis for PDF matching. |
| box | no | string | from Box. |
| folder | no | string | from Mappe. |
| sender_person_id | no | person_id | resolved; empty if no sender. |
| sender_name | no | string | canonical display name (or cleaned raw if provisional). |
| receiver_person_ids | no | id\|id\|… | pipe-separated. |
| receiver_names | no | name\|name\|… | pipe-separated, aligned with ids. |
| date_iso | no | YYYY-MM-DD | best-effort; empty if UNKNOWN. |
| date_raw | no | string | verbatim source date. |
| date_precision | yes | enum | DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN. |
| location | no | string | from Ort. |
| tags | no | tag\|tag | from Schlagwort. |
| summary | no | string | from Inhalt. |
| source_row | yes | int | provenance (NFR-DATA-01). |
| needs_review | yes | flag\|flag or empty | review flags (REQ-PROV-02). |
6.2 canonical-persons.xlsx
| Field | Required | Format | Notes |
|---|---|---|---|
| person_id | yes | slug | stable id (e.g. de-gruyter-eugenie); collisions suffixed. |
| last_name | yes | string | from Familienname. |
| first_name | no | string | primary given name. |
| maiden_name | no | string | from geb als — drives dedup. |
| title | no | string | e.g. honorifics if present. |
| nickname | no | string | from quoted Bemerkung/spouse field. |
| birth_date / birth_date_raw / birth_place | no | ISO / string / string | §4.3 rules. |
| death_date / death_date_raw / death_place | no | ISO / string / string | §4.3 rules. |
| spouse | no | person_id or name | from verheiratet mit. |
| generation | no | string | G 1..G 4. |
| notes | no | string | from Bemerkung. |
| aliases | no | a\|b\|c | every surface form that maps here. |
| provisional | yes | bool | true if created from a document string, not the register. |
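The person_id slug with a collision suffix could be generated along these lines. This is a sketch under assumptions: the helper names are invented, the German transliteration (ü→ue and so on) is one possible choice that the spec does not confirm, and the suffix format `-2`, `-3` is a guess at "numeric suffix on collision".

```python
import re
import unicodedata

# Assumption: German umlauts are transliterated rather than stripped.
_GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def slugify(text: str) -> str:
    """Lowercase ASCII slug: 'de Gruyter' -> 'de-gruyter'."""
    text = text.casefold().translate(_GERMAN)
    # Drop any remaining diacritics, then keep only a-z and 0-9.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def make_person_id(last_name: str, first_name: str, taken: set) -> str:
    """Build 'lastname-firstname', adding -2, -3, ... until the id is unused.

    taken is updated in place so later calls see earlier ids.
    """
    base = "-".join(part for part in (slugify(last_name), slugify(first_name)) if part)
    candidate, counter = base, 2
    while candidate in taken:
        candidate = f"{base}-{counter}"
        counter += 1
    taken.add(candidate)
    return candidate
```

Suffixes depend on the order in which people are processed, so ids are only stable across runs if that order is itself stable (for example register order first, then provisional persons sorted by cleaned string).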
7. Prioritized Backlog (MoSCoW)
| ID | Item | MoSCoW | Effort | Depends on |
|---|---|---|---|---|
| B1 | Project scaffolding + read both workbooks (FR-INGEST, header map FR-MAP) | Must | S | — |
| B2 | Row triage + blank/duplicate/empty reports (FR-TRIAGE) | Must | S | B1 |
| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (FR-DATE) | Must | L | B1 |
| B4 | Person register parser → canonical persons (FR-PERS US-PERS-01) | Must | M | B1 |
| B5 | Alias index + name resolution + multi-person split (FR-DEDUP, US-PERS-02) | Must | L | B4 |
| B6 | Overrides load + apply + idempotency (FR-OVR) | Must | S | B3,B5 |
| B7 | Canonical writers + provenance + review summary (FR-OUT, FR-PROV) | Must | M | B2,B3,B5 |
| B8 | Index↔Datei mismatch report (REQ-OUT-02) | Should | XS | B1 |
| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 |
| B10 | Comma-split Inhalt into extra tags | Could | XS | B7 |
| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | — | B7 |
8. Traceability — Findings → Requirements
| Finding | Severity | Addressed by |
|---|---|---|
| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 |
| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 |
| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS |
| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 |
| IMP-05 name variants → dupes | major | C3, FR-DEDUP |
| IMP-06 blank-index dropped | major | US-TRIAGE-01 |
| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 |
| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 |
| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 |
| IMP-10 x-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) |
| IMP-11 sender not split / u sep | minor | REQ-PERS-01, US-PERS-02 AC4 |
| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, FR-MAP AC2/AC3 |
9. Open Questions / TBD Register
| ID | Question | Why it matters | Ref | Resolution |
|---|---|---|---|---|
| OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | Resolved (2026-05-25): movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) computed per year from Easter — never a fixed month; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | Confirmed: store start in date_iso, precision RANGE, full text in date_raw. |
| OQ-03 ✅ | person_id format. | Stability across re-runs; diffability. | §6 | Confirmed: readable slug lastname-firstname, numeric suffix on collision. |
| OQ-04 ✅ | x-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | Resolved (2026-05-25): x rows are transcriptions of the base letter but not yet mappable → skip this pass, log to review/skipped-x-suffix.csv for later linking. |
| OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | Confirmed: .xlsx (openpyxl-native, headered). |
| OQ-06 ✅ | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | Confirmed: conservative — report all fuzzy matches; no silent merge. |
All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.
10. Glossary & Worked Examples
- Precision — how exactly a date is known (DAY … UNKNOWN).
- Provisional person — a person created from a document name string with no register match.
- Alias index — map from every known surface form of a name to a canonical person_id.
- Override — a human-supplied correction applied deterministically on each run.
Date examples → expected outcome:
| date_raw | date_iso | date_precision |
|---|---|---|
| 15.2.1888 | 1888-02-15 | DAY |
| 6.März 1888 | 1888-03-06 | DAY |
| 22.III.18 | 1918-03-22 | DAY |
| 13.5.09 | 1909-05-13 | DAY |
| 10.Oct.95 | 1895-10-10 | DAY |
| 17/6. 1916 | 1916-06-17 | DAY |
| Mai 1895 | 1895-05-01 | MONTH |
| Pfingsten 1922 | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) |
| Herbst 1913 | 1913-10-01 | SEASON |
| 1905 | 1905-01-01 | YEAR |
| 8.1.1916 - 15.3.1916 | 1916-01-08 | RANGE |
| 17.Nov (?) 1887 | 1887-11-17 | APPROX |
| ? | (empty) | UNKNOWN |
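Several of the date examples above can be reproduced with a small pure function. The sketch below covers only day-level forms (numeric, Roman-numeral and named months), Month YYYY, bare YYYY and the uncertainty markers; feasts, seasons, ranges and trailing editorial notes are deliberately left out. All names and regular expressions here are assumptions for illustration, not the tool's parser.

```python
import re
from datetime import date

MONTHS = {"jan": 1, "feb": 2, "mär": 3, "mar": 3, "apr": 4, "mai": 5, "may": 5,
          "jun": 6, "jul": 7, "aug": 8, "sep": 9, "okt": 10, "oct": 10,
          "nov": 11, "dez": 12, "dec": 12}
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

UNCERTAIN_RE = re.compile(r"\(\s*\?\s*\)|\?|\b(?:circa|ca|um)\b\.?", re.IGNORECASE)
# Day, then '.' or '/', then a numeric/Roman/named month, then the year.
DAY_RE = re.compile(r"^(\d{1,2})\s*[./]\s*(\d{1,2}|[^\W\d_]+)\s*\.?\s*(\d{2,4})$")
MONTH_RE = re.compile(r"^([^\W\d_]+)\.?\s*(\d{2,4})$")
YEAR_RE = re.compile(r"^\d{4}$")


def _year(text):
    """Century rule from C6; None for the ambiguous 58-72 band."""
    number = int(text)
    if len(text) == 4:
        return number
    if len(text) == 3:
        return 1000 + number
    if number <= 57:
        return 1900 + number
    if number >= 73:
        return 1800 + number
    return None


def _month(token):
    if token.isdigit():
        return int(token)
    # Month names are matched on their first three letters.
    return ROMAN.get(token.upper()) or MONTHS.get(token.casefold()[:3])


def _iso(year, month, day):
    if year is None or month is None:
        return ""
    try:
        return date(year, month, day).isoformat()
    except ValueError:        # e.g. 31.2. or month 13
        return ""


def parse_date(raw):
    """Return (date_iso, date_precision) for one raw date string."""
    approx = bool(UNCERTAIN_RE.search(raw))
    text = re.sub(r"\s+", " ", UNCERTAIN_RE.sub(" ", raw)).strip()
    if m := DAY_RE.match(text):
        iso, precision = _iso(_year(m[3]), _month(m[2]), int(m[1])), "DAY"
    elif m := MONTH_RE.match(text):
        iso, precision = _iso(_year(m[2]), _month(m[1]), 1), "MONTH"
    elif YEAR_RE.match(text):
        iso, precision = _iso(int(text), 1, 1), "YEAR"
    else:
        iso, precision = "", "UNKNOWN"
    if not iso:
        return "", "UNKNOWN"
    return iso, ("APPROX" if approx else precision)
```

The function has no side effects and never touches `date_raw`, so each row of the table it covers can become one parametrized pytest case.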
Name examples → expected outcome:
| raw cell | resolves to |
|---|---|
| Eugenie Müller (+ register geb Müller) | de-gruyter-eugenie (matched via maiden alias) |
| Eugenie de Gruyter | de-gruyter-eugenie |
| Herbert u Clara | cram-herbert + cram-clara (split, surname distributed) |
| Hedi und Tutu (Gruber) | gruber-hedi + gruber-tutu |
| Ella Anita | → review/ambiguous-receivers.csv (not auto-split) |
| Hans Wittkopf (not in register) | provisional wittkopf-hans |
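The splitting step behind the multi-person examples can be sketched as follows. It handles the explicit separators (u, und, //) and a trailing parenthesised surname only; distributing a surname that comes from context, as in "Herbert u Clara", needs the register and is not attempted here. The function names and the given-names heuristic for ambiguous pairs are assumptions.

```python
import re

# Separators named in US-PERS-02 AC4: "u", "u.", "und", "//".
SPLIT_RE = re.compile(r"\s*//\s*|\s+(?:u\.?|und)\s+", re.IGNORECASE)
# Trailing "(Surname)" that applies to every person in the cell.
PAREN_SURNAME_RE = re.compile(r"^(.*?)\s*\(([^()]+)\)\s*$")


def split_people(cell: str) -> list:
    """Split a sender/receiver cell into individual name strings."""
    text, surname = cell.strip(), None
    match = PAREN_SURNAME_RE.match(text)
    if match:
        text, surname = match[1], match[2].strip()
    parts = [part.strip() for part in SPLIT_RE.split(text) if part.strip()]
    if surname:
        parts = [f"{part} {surname}" for part in parts]
    return parts


def is_ambiguous_pair(cell: str, given_names: set) -> bool:
    """True for space-joined cells whose tokens are all known given names.

    given_names is a casefolded set, assumed to come from the register.
    Such cells go to review instead of being split on a guess.
    """
    if len(split_people(cell)) != 1:
        return False
    tokens = cell.split()
    return len(tokens) >= 2 and all(t.casefold() in given_names for t in tokens)
```

Each returned string would then go through name resolution on its own, for senders as well as receivers (REQ-PERS-01).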