# Spec — Import Normalizer

> Authored in the voice of **"Elicit"**, requirements engineer (see
> `.claude/personas/req_engineer.md`). This is a requirements artifact: it states
> *what* the normalizer must do and *how we'll know it's done*, in problem/behaviour
> language. Technology choices already made during brainstorming (Python, openpyxl,
> overrides-and-rerun) are recorded as **constraints**, not re-litigated here.

- **Status:** Draft for review
- **Date:** 2026-05-25
- **Related:** [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) (issues `IMP-01..12`), [`README.md`](./README.md)
- **Scope boundary:** This spec covers the **offline normalizer** that turns the raw
  spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical
  contract into the Java `MassImportService` and the `Document`/`Person` model is **Phase 2**
  and gets its own spec. This spec only *defines the contract* Phase 2 must satisfy.

---

## 1. Project Brief

**Vision.** Turn the family's human-curated, free-form archive spreadsheets into a clean,
canonical dataset that imports deterministically — without hand-editing thousands of rows
and without losing the historical nuance of how things were originally written.

**Problem.** The real archive (`…aktuell…xlsx`, 7,943 rows) and the person register
(`Personendatei 2.xlsx`, 163 people) were authored for humans to read, not machines to
import. Dates are written as they appeared in each letter (≈90% unparseable by the current
importer), the column layout differs from what the importer expects, and the same person
appears under many names. Importing as-is produces garbage (see `IMP-01..12`).

**Goal (measurable).**
- G1 — After the automated pass, **≤ 5%** of dated rows remain `UNKNOWN`; after the
  overrides-iteration loop, **≤ 0.5%**.
- G2 — **100%** of source rows are represented in the canonical output or in a review file —
  *zero silent drops*.
- G3 — **100%** of original values (raw date string, raw name string, source row number)
  are preserved.
- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
  **content-deterministic** when re-run with unchanged inputs+overrides: identical canonical
  cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx
  byte-identity is not guaranteed because the zip container stores entry metadata.)

**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the `MassImportService` as the downstream consumer.

**Non-Goals (explicitly out of scope).**
- NG1 — Changing `MassImportService` or the DB schema (that is Phase 2).
- NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by `index`).
- NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
- NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the
  long tail stays as provisional persons.
- NG5 — OCR/transcription content (the new xlsx has no transcription column).

**Key assumptions.** (A1) Sheet `Familienarchiv` is the document source of truth.
(A2) Archive date range is **1873–1957** (drives the 2-digit-year century rule).
(A3) `index` is the stable document key and the basis for future PDF matching.
(A4) `Schlagwort` is a broad tag; `Inhalt` is a short summary/topic.

**Risks.** (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag +
overrides. (R2) Name matching false-positives merge distinct people → mitigated by
conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with
layout drift → mitigated by header-name-based mapping, not fixed indices.

---

## 2. Personas

**Marcel — Data Steward.** Role: solo owner of Familienarchiv. Context: holds the complete
raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently,
not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts.
Frustrations: dates in ~20 formats; one ancestor under 4 name variants. **JTBD:** *"When I
have raw, human-curated archive spreadsheets, I want to transform them into a clean importable
dataset without losing how things were originally written, so I can load the archive and keep
correcting edge cases as they surface."*

**The Returning Agent.** Role: a future assistant session resuming the work. Goal: re-run the
pipeline deterministically and understand exactly what still needs human input. **JTBD:**
*"When I pick this up cold, I want one command and a clear residue report, so I can continue
without re-deriving context."*

---

## 3. Constraints & Decisions Already Made

These were settled during brainstorming and are fixed inputs to the requirements below.

| # | Decision | Rationale |
| --- | --- | --- |
| C1 | **New canonical layout** with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. |
| C2 | Dates stored as **parsed (nullable) + raw + precision**. | Historical archive; never lose the original; enable "ca. 1916". |
| C3 | **Include person resolution** (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. |
| C4 | **Overrides-file + re-run** loop for residue. | Deterministic, diffable, repeatable. |
| C5 | Implementation: **Python 3.12 + openpyxl**, standalone tool at `tools/import-normalizer/`. | Fast iteration; no Spring rebuild / coverage gate on transform code. |
| C6 | Century rule for archive **1873–1957**: 2-digit `00–57`→`19YY`, `73–99`→`18YY`, `58–72`→**flag**; 3-digit `DDD`→`1DDD`; never 20xx. | Stated by Marcel. Boundaries live in config. |
| C7 | `Schlagwort`→tag, `Inhalt`→summary. | Matches importer's existing semantics. |
| C8 | Non-register correspondents become **provisional persons**. | ~945 distinct sender strings vs 163 register people. |

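
The century rule in C6 can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the tool's actual code: the name `expand_year` and the two constants are hypothetical, and `None` stands in for "flag for review". Four-digit tokens pass through unchanged, and the ambiguous `58–72` band is never guessed.

```python
from typing import Optional

# Hypothetical constant names; only the boundary values come from C6.
CENTURY_1900_MAX = 57  # 00-57 -> 19YY
CENTURY_1800_MIN = 73  # 73-99 -> 18YY


def expand_year(token: str) -> Optional[int]:
    """Expand a 2-, 3- or 4-digit year token per C6; None means 'flag for review'."""
    if not (token.isascii() and token.isdigit()):
        return None
    if len(token) == 4:
        return int(token)
    if len(token) == 3:  # DDD -> 1DDD, e.g. "916" -> 1916
        return 1000 + int(token)
    if len(token) == 2:
        yy = int(token)
        if yy <= CENTURY_1900_MAX:
            return 1900 + yy
        if yy >= CENTURY_1800_MIN:
            return 1800 + yy
        return None  # 58-72 is ambiguous for an 1873-1957 archive
    return None
```
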

---

## 4. Functional Requirements

Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules
use EARS. Traceability to findings is in §8.

### 4.1 Ingest & layout (`FR-INGEST`, `FR-MAP`)

**US-MAP-01** — *As the data steward, I want each source column mapped to a named canonical
field regardless of its position, so a re-exported spreadsheet with shifted columns still
imports correctly.*
- AC1 — Given the `Familienarchiv` sheet, when the normalizer reads the header row, then it
  maps columns by **header name** (not fixed index) to the canonical fields.
- AC2 — Given a header the normalizer does not recognise, when it runs, then it records the
  unknown header in `review/summary.txt` and continues (does not crash).
- AC3 — Given a required source header is **absent**, when it runs, then it aborts with a
  clear message naming the missing header (fail loud, before producing partial output).

- **REQ-INGEST-01** — The normalizer shall read only the `Familienarchiv` sheet of the
  document workbook and the `Tabelle1` sheet of the person workbook.
- **REQ-MAP-01** — Header matching shall be case-insensitive and tolerant of multiple
  internal spaces (e.g. `"Datum des Briefes"`).

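
A minimal sketch of header-name mapping as described in US-MAP-01 and REQ-MAP-01 follows. The `HEADER_MAP` excerpt (in particular mapping `Datum des Briefes` to `date_raw`), the choice of `index` as the only required header, and the function names are assumptions for illustration; the real column-name map belongs in `config.py` (NFR-MAINT-01).

```python
import re
from typing import Dict, List, Sequence, Tuple

# Hypothetical excerpt: normalized source header -> canonical field.
HEADER_MAP = {
    "index": "index",
    "datei": "file",
    "box": "box",
    "mappe": "folder",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
    "datum des briefes": "date_raw",
}
REQUIRED_HEADERS = {"index"}  # assumption: which headers are required is not fixed here


def normalize_header(cell: object) -> str:
    """Casefold and collapse whitespace runs so 'Datum  des Briefes' still matches."""
    text = "" if cell is None else str(cell)
    return re.sub(r"\s+", " ", text).strip().casefold()


def map_headers(header_row: Sequence[object]) -> Tuple[Dict[str, int], List[str]]:
    """Return ({canonical_field: column_index}, [unknown headers]).

    Raises ValueError naming any required header that is absent (AC3).
    """
    mapping: Dict[str, int] = {}
    unknown: List[str] = []
    seen = set()
    for col, cell in enumerate(header_row):
        key = normalize_header(cell)
        if not key:
            continue  # empty header cell: nothing to map
        seen.add(key)
        if key in HEADER_MAP:
            mapping[HEADER_MAP[key]] = col
        else:
            unknown.append(str(cell))  # reported, not fatal (AC2)
    missing = sorted(REQUIRED_HEADERS - seen)
    if missing:
        raise ValueError("missing required header(s): " + ", ".join(missing))
    return mapping, unknown
```

The caller would write the `unknown` list into `review/summary.txt` and let the `ValueError` abort the run before any output file is produced.
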
### 4.2 Row triage (`FR-TRIAGE`) — resolves IMP-06, IMP-07, IMP-08

**US-TRIAGE-01** — *As the data steward, I want rows that have data but no index surfaced
rather than dropped, so I never lose a letter silently.*
- AC1 — Given a row whose `index` is blank but which has any other non-empty cell, when the
  normalizer runs, then that row is written to `review/blank-index-rows.csv` with its source
  row number and is **not** emitted as a canonical document.
- AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not
  reported as an anomaly).

- **REQ-TRIAGE-01** — If two or more rows resolve to the same `index`, then the normalizer
  shall emit all of them to `review/duplicate-index.csv` and mark each canonical row
  `needs_review = duplicate_index` (it shall **not** silently drop either).
- **REQ-TRIAGE-02** — Where a row is identified as a section/banner row (blank index, text
  only in a name column), the normalizer shall classify it as such in the blank-index report.
- **REQ-TRIAGE-03** — Rows whose `index` ends in `x` (a transcription/back-side of the base
  letter, not yet independently mappable) shall be **skipped** — not emitted as a canonical
  document — and written to `review/skipped-x-suffix.csv` with their source row and base index
  (`index` minus the trailing `x`), so they can be linked in a later pass. (Resolves IMP-10.)

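
The triage rules above can be sketched as one classification function plus a duplicate check. This is an illustrative sketch: the `RowKind` enum, the `sender`/`receiver` field names, and the dict-per-row shape are assumptions, not the tool's actual data model. A banner row is still a blank-index row; the separate kind only records how it would be labelled in the blank-index report.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, List


class RowKind(Enum):
    EMPTY = "empty"                    # skipped and counted (AC2)
    BLANK_INDEX = "blank_index"        # review/blank-index-rows.csv (AC1)
    SECTION_BANNER = "section_banner"  # blank index, text only in a name column
    X_SUFFIX = "x_suffix"              # review/skipped-x-suffix.csv
    DOCUMENT = "document"              # emitted as a canonical document


NAME_FIELDS = ("sender", "receiver")  # hypothetical canonical names of the name columns


def classify_row(row: Dict[str, object]) -> RowKind:
    """Classify one source row, given as {field: cell value}."""
    values = {
        k: str(v).strip()
        for k, v in row.items()
        if v is not None and str(v).strip()
    }
    if not values:
        return RowKind.EMPTY
    index = values.get("index", "")
    if not index:
        if len(values) == 1 and next(iter(values)) in NAME_FIELDS:
            return RowKind.SECTION_BANNER
        return RowKind.BLANK_INDEX
    if index.endswith("x"):
        return RowKind.X_SUFFIX
    return RowKind.DOCUMENT


def base_index(index: str) -> str:
    """Index minus the trailing 'x', used to link an x-row to its base letter."""
    return index[:-1] if index.endswith("x") else index


def duplicate_indexes(indexes: Iterable[str]) -> List[str]:
    """Indexes occurring more than once, sorted so the report order is stable."""
    counts = Counter(indexes)
    return sorted(i for i, n in counts.items() if n > 1)
```
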
### 4.3 Date normalization (`FR-DATE`) — resolves IMP-02, IMP-03

**US-DATE-01** — *As the data steward, I want every date interpreted as precisely as the
source allows, with the original always kept, so I can sort the archive and still see what the
letter actually said.*
- AC1 — Given a parseable date, when normalized, then `date_iso` holds the best-effort ISO
  date, `date_raw` holds the verbatim source string, and `date_precision` ∈
  `{DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}`.
- AC2 — Given an unparseable date, when normalized, then `date_iso` is empty,
  `date_precision = UNKNOWN`, `date_raw` is preserved, and the value appears in
  `review/unparsed-dates.csv`.
- AC3 — Given the same `date_raw` appears in `overrides/dates.csv`, when normalized, then the
  override's `(iso, precision)` wins over the automatic parse.

- **REQ-DATE-01** — The parser shall accept, at minimum, these forms (see §10 examples):
  Excel/ISO; `D.M.YYYY`/`D.M.YY`; `D/M. YY[YY]` (slash treated as dot); Roman-numeral months
  `I–XII`; German + English month names, full and abbreviated, with or without a separating
  space; `Month YYYY`; season/holiday + year; bare `YYYY`; and start-anchored ranges.
- **REQ-DATE-02** — Precision shall be assigned by what is known: full day → `DAY`; month+year
  → `MONTH` (day = 1); a **named feast/holiday + year** → resolved to its **actual calendar
  date for that year** → `DAY`; a **season + year** → representative mid-season month (day = 1)
  → `SEASON`; year only → `YEAR` (month = Jan, day = 1); a range → start date + `RANGE`; a
  value carrying an uncertainty marker (`?`, `um`, `ca`, `circa`) → `APPROX` with best-effort date.
- **REQ-DATE-03** — Two-digit and three-digit years shall be expanded per **C6**; a 2-digit
  year in `58–72` shall yield `UNKNOWN` + a review entry rather than a guess.
- **REQ-DATE-04** — Trailing editorial notes (e.g. `", 2. Brief"`) shall be stripped before
  parsing and preserved (kept within `date_raw`; not invented into the date).
- **REQ-DATE-05** — The parser shall be pure and side-effect-free so it can be unit-tested in
  isolation (see NFR-TEST-01).
- **REQ-DATE-06** — **Movable feasts are never mapped to a fixed month**; they shall be
  computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern =
  Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag =
  Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed
  feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25,
  Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul,
  Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py`
  (NFR-MAINT-01).
- **REQ-DATE-07** — **Intra-month day ranges carry an end day; half-resolved ranges are
  flagged.** For a day range like `7./8. Sept.1923`, `date_iso` holds the start day, the end
  day is resolved against the shared month/year into `date_end`, and `date_precision` =
  `RANGE`. If the **start** parses but the **end day is impossible** (e.g. `10./40.1.1917`),
  the row keeps the start and `RANGE` precision, leaves `date_end` **empty**, and is flagged
  `needs_review = range_end_unparsed` — the unparseable end is dropped honestly (surfaced for
  review), never silently invented or clamped. A `RANGE` row **may** therefore legitimately
  have an empty `date_end`; the importer must treat `date_end` as optional even on a `RANGE`.

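
The feast computation in REQ-DATE-06 can be sketched as follows. The Easter function uses the anonymous Gregorian algorithm (often credited to Meeus/Jones/Butcher), which is one common way to implement the computus and may differ in form from what `config.py` actually contains. The function names, the lower-cased lookup keys, and the subset of feasts shown are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Optional


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday, as listed in REQ-DATE-06.
MOVABLE_FEASTS = {
    "karfreitag": -2,
    "ostern": 0,
    "himmelfahrt": 39,
    "pfingsten": 49,
    "pfingstmontag": 50,
    "fronleichnam": 60,
}
# (month, day) for fixed feasts.
FIXED_FEASTS = {
    "neujahr": (1, 1),
    "heiligabend": (12, 24),
    "weihnachten": (12, 25),
    "silvester": (12, 31),
}


def advent_sunday(year: int, n: int) -> date:
    """n-th Advent Sunday (n = 1..4); the 4th is the last Sunday before 25 Dec."""
    dec25 = date(year, 12, 25)
    # weekday(): Monday=0 ... Sunday=6. Step back to the Sunday strictly before 25 Dec.
    back = (dec25.weekday() + 1) % 7 or 7
    fourth = dec25 - timedelta(days=back)
    return fourth - timedelta(weeks=4 - n)


def feast_date(name: str, year: int) -> Optional[date]:
    """Resolve a feast name + year to a calendar date, or None if the name is unknown."""
    key = name.casefold()
    if key in MOVABLE_FEASTS:
        return easter_sunday(year) + timedelta(days=MOVABLE_FEASTS[key])
    if key in FIXED_FEASTS:
        month, day = FIXED_FEASTS[key]
        return date(year, month, day)
    return None
```

For the worked example in §10, `feast_date("Pfingsten", 1922)` gives 4 June 1922, because Easter Sunday 1922 fell on 16 April.
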
### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11

**US-PERS-01** — *As the data steward, I want the genealogical register turned into canonical
people with all their known facts, so documents can link to real persons.*
- AC1 — Given a register row, when parsed, then a canonical person is produced with
  `person_id`, name parts, `maiden_name`, birth/death (parsed + raw + place), spouse,
  generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
- AC2 — Given multi-value given names (`"Charlotte,Meta,Jacobi"`), when parsed, then the
  primary given name is the first; the remainder are retained as additional names/aliases.

**US-PERS-02** — *As the data steward, I want each sender/receiver string matched to a
canonical person where possible and never dropped otherwise, so the correspondence graph is
complete.*
- AC1 — Given a sender/receiver string, when resolved, then it maps to a register
  `person_id` via the alias index (exact → normalized/casefold → conservative fuzzy).
- AC2 — Given no confident match, when resolved, then a **provisional person** is created from
  the cleaned string, linked, and listed in `review/unmatched-names.csv` (occurrence count +
  example source rows).
- AC3 — Given the string appears in `overrides/names.csv`, when resolved, then it maps to the
  specified `person_id` (override wins).
- AC4 — Given a multi-person receiver cell (`"Eugenie u Walter de Gruyter"`,
  `"Herbert u Clara"`, `"…//…"`, `"Hedi und Tutu (Gruber)"`), when resolved, then it is split
  into individual people, each resolved independently; ambiguous space-joined pairs
  (`"Ella Anita"`) are emitted to `review/ambiguous-receivers.csv` rather than guessed.

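
The splitting step in AC4 could look roughly like the sketch below. It only covers the separators named above (` u `, ` und `, `//`) and a trailing parenthesised surname; the function name and regexes are hypothetical. It deliberately does not distribute a surname that is not in parentheses and does not split space-joined pairs, since both need register context or human review.

```python
import re
from typing import List

# Separators named in AC4: "//", " u " and " und ".
_SEPARATORS = re.compile(r"\s*//\s*|\s+(?:u|und)\s+")
# A trailing "(Surname)" shared by everyone in the cell, e.g. "Hedi und Tutu (Gruber)".
_PAREN_SURNAME = re.compile(r"^(.*?)\s*\(([^()]+)\)\s*$")


def split_people(cell: str) -> List[str]:
    """Split a multi-person cell into individual name strings."""
    text = " ".join(cell.split())  # collapse whitespace runs
    shared_surname = None
    match = _PAREN_SURNAME.match(text)
    if match:
        text, shared_surname = match.group(1), match.group(2).strip()
    parts = [p.strip() for p in _SEPARATORS.split(text) if p.strip()]
    if shared_surname:
        parts = [f"{p} {shared_surname}" for p in parts]
    return parts
```

Each returned string would then go through name resolution on its own. A cell such as `"Ella Anita"` comes back as a single element, which leaves the decision to the ambiguous-receivers review path rather than guessing.
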
- **REQ-DEDUP-01** — The alias index shall be derived from the register: canonical
  "First Last", maiden form (`geb als`), spouse-surname married form, nickname, and
  first-name-only **only when unambiguous** across the register.
- **REQ-DEDUP-02** — The normalizer shall merge two distinct strings into one person on fuzzy
  similarity alone only when the score meets the configured threshold **and** the match is
  reported; every merge must be auditable.
- **REQ-PERS-01** — Sender cells shall be parsed for multi-person content using the same rules
  as receiver cells (today the importer parses only receivers — IMP-11).

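
The resolution order from US-PERS-02 AC1 (exact, then normalized/casefold, then conservative fuzzy) together with the audit requirement of REQ-DEDUP-02 can be sketched with the standard-library `difflib`. The threshold value, the function names, and the shape of the audit list are assumptions; the spec only requires that the threshold is configurable and that fuzzy matches are reported.

```python
import difflib
from typing import Dict, List, Optional, Tuple

FUZZY_THRESHOLD = 0.92  # hypothetical value; the real one lives in config.py


def normalize_name(raw: str) -> str:
    """Collapse whitespace and casefold for the second matching tier."""
    return " ".join(raw.split()).casefold()


def resolve(
    raw: str,
    aliases: Dict[str, str],
    audit: List[Tuple[str, str, str]],
) -> Optional[str]:
    """Return a person_id for raw, or None if there is no confident match.

    aliases maps surface form -> person_id. Every fuzzy hit is appended to
    audit as (raw string, matched alias, person_id) so no merge is silent.
    """
    if raw in aliases:  # tier 1: exact
        return aliases[raw]
    by_norm = {normalize_name(a): pid for a, pid in aliases.items()}
    key = normalize_name(raw)
    if key in by_norm:  # tier 2: normalized/casefold
        return by_norm[key]
    close = difflib.get_close_matches(key, list(by_norm), n=3, cutoff=FUZZY_THRESHOLD)
    candidates = {by_norm[c] for c in close}
    if len(candidates) == 1:  # tier 3: fuzzy, only if it points at one person
        person_id = candidates.pop()
        audit.append((raw, close[0], person_id))
        return person_id
    return None  # caller creates a provisional person (AC2)
```

Refusing a fuzzy match when the close candidates belong to different people is one way to keep the matching conservative; whether the real tool does the same is not stated here.
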
### 4.5 Overrides & idempotency (`FR-OVR`) — supports the iteration loop

- **REQ-OVR-01** — When the normalizer runs, it shall load `overrides/dates.csv` and
  `overrides/names.csv` if present and apply them; absence of either file shall not be an error.
- **REQ-OVR-02** — While overrides are unchanged and inputs are unchanged, re-running shall
  produce **content-identical** canonical outputs and review files: identical canonical cell
  matrices and identical review-file contents (NFR-IDEM-01). Literal xlsx byte-identity is
  not required.
- **REQ-OVR-03** — Each override application shall be counted in `review/summary.txt` (how many
  dates/names were resolved by override vs automatically).

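
A minimal sketch of the date-override loop (REQ-OVR-01 and REQ-OVR-03) follows. The CSV column names `date_raw`, `iso`, and `precision` are an assumption derived from the wording of US-DATE-01 AC3; the actual file layout is not defined in this spec. The function names and the `stats` dict are likewise hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, Tuple

DateResult = Tuple[str, str]  # (date_iso, date_precision)


def load_date_overrides(path: Path) -> Dict[str, DateResult]:
    """Load overrides keyed by verbatim date_raw; a missing file is not an error."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as fh:
        return {
            row["date_raw"]: (row["iso"], row["precision"])
            for row in csv.DictReader(fh)
        }


def apply_date(
    raw: str,
    parsed: DateResult,
    overrides: Dict[str, DateResult],
    stats: Dict[str, int],
) -> DateResult:
    """Let an override win over the automatic parse and count which path was taken."""
    if raw in overrides:
        stats["override"] = stats.get("override", 0) + 1
        return overrides[raw]
    stats["automatic"] = stats.get("automatic", 0) + 1
    return parsed
```

The `stats` counters are what would feed the override-versus-automatic figures in `review/summary.txt`.
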
### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12

- **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
  `out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
  **gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
  import host alongside the PDFs out-of-band.
- **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
  in the source sheet) so any value can be traced back to the original.
- **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
  flags (`duplicate_index`, `unparsed_date`, `unmatched_sender`, `unmatched_receiver`,
  `index_file_mismatch`, …) so the import and the UI can foreground uncertain data.
- **REQ-OUT-02** — Where the source `Datei` path disagrees with the index-derived filename
  (IMP-09), the normalizer shall record the discrepancy in `review/index-file-mismatch.csv`
  and flag the row; it shall **not** alter the `index` (the stable key).

---

## 5. Non-Functional Requirements

| ID | Category | Requirement (measurable) |
| --- | --- | --- |
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. |
| NFR-TEST-01 | Testability | `dates.py` and `persons.py` have pytest tests covering every format/alias category in §10 with real examples from the archive. |
| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in `config.py`, not inline in logic. |
| NFR-OBSERV-01 | Observability | `review/summary.txt` reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. |
| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. |

---

## 6. Data Dictionary (canonical contract)

This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a
DB schema.

### 6.1 `canonical-documents.xlsx`

| Field | Required | Format / values | Notes |
| --- | --- | --- | --- |
| `index` | yes | string | Stable key; basis for PDF matching. |
| `file` | no | string | verbatim `Datei` value (e.g. `H-0730.pdf`); carried through for the importer to link the scanned PDF. |
| `box` | no | string | from `Box`. |
| `folder` | no | string | from `Mappe`. |
| `sender_person_id` | no | person_id | resolved; empty if no sender. |
| `sender_name` | no | string | canonical display name (or cleaned raw if provisional). |
| `receiver_person_ids` | no | `id\|id\|…` | pipe-separated. |
| `receiver_names` | no | `name\|name\|…` | pipe-separated, aligned with ids. |
| `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. |
| `date_raw` | no | string | verbatim source date. |
| `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. |
| `date_end` | no | `YYYY-MM-DD` or empty | RANGE end day (e.g. `7./8. Sept.1923` → `date_iso` = start, `date_end` = end). Empty for every non-RANGE precision **and** for a half-resolved RANGE whose end did not parse (see REQ-DATE-07). |
| `location` | no | string | from `Ort`. |
| `tags` | no | `tag\|tag` | from `Schlagwort`. |
| `summary` | no | string | from `Inhalt`. |
| `source_row` | yes | int | provenance (NFR-DATA-01). |
| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). Flags include `unparsed_date`, `range_end_unparsed` (half-resolved RANGE, REQ-DATE-07), `unmatched_sender`, `unmatched_receiver`, `multi_sender`, `index_file_mismatch`, `duplicate_index`. |

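
How `date_iso`, `date_end`, `date_precision`, and `needs_review` fit together for an intra-month day range (REQ-DATE-07) can be sketched as below. The regex, the small month table, and the function name are assumptions that cover only the two shapes quoted in this spec (`7./8. Sept.1923` and `10./40.1.1917`), not the full grammar of the real parser.

```python
import re
from datetime import date
from typing import Dict, Optional

# Hypothetical, deliberately small month-name table (casefolded keys).
MONTHS = {
    "jan": 1, "feb": 2, "märz": 3, "apr": 4, "mai": 5, "juni": 6,
    "juli": 7, "aug": 8, "sept": 9, "okt": 10, "nov": 11, "dez": 12,
}

# "<start>./<end>. <month>[.] <yyyy>", where month is a number or a name.
_DAY_RANGE = re.compile(
    r"^\s*(\d{1,2})\./(\d{1,2})\.\s*(\d{1,2}|[^\W\d_]+)\.?\s*(\d{4})\s*$"
)


def parse_day_range(raw: str) -> Optional[Dict[str, str]]:
    """Parse an intra-month day range; None if raw is not such a range."""
    match = _DAY_RANGE.match(raw)
    if match is None:
        return None
    start_day, end_day, month_tok, year = match.groups()
    month = int(month_tok) if month_tok.isdigit() else MONTHS.get(month_tok.casefold())
    if month is None:
        return None
    try:
        start = date(int(year), month, int(start_day))
    except ValueError:
        return None  # the start itself is impossible: not a usable range
    row = {
        "date_iso": start.isoformat(),
        "date_end": "",
        "date_precision": "RANGE",
        "needs_review": "",
    }
    try:
        row["date_end"] = date(int(year), month, int(end_day)).isoformat()
    except ValueError:
        # Half-resolved range: keep the start, leave the end empty, flag it.
        row["needs_review"] = "range_end_unparsed"
    return row
```

The second branch is why a consumer has to treat `date_end` as optional even when `date_precision` is `RANGE`.
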
### 6.2 `canonical-persons.xlsx`

| Field | Required | Format | Notes |
| --- | --- | --- | --- |
| `person_id` | yes | slug | stable id (e.g. `de-gruyter-eugenie`); collisions suffixed. |
| `last_name` | yes | string | from `Familienname`. |
| `first_name` | no | string | primary given name. |
| `maiden_name` | no | string | from `geb als` — drives dedup. |
| `title` | no | string | e.g. honorifics if present. |
| `nickname` | no | string | from quoted `Bemerkung`/spouse field. |
| `birth_date` / `birth_date_raw` / `birth_place` | no | ISO / string / string | §4.3 rules. |
| `death_date` / `death_date_raw` / `death_place` | no | ISO / string / string | §4.3 rules. |
| `spouse` | no | person_id or name | from `verheiratet mit`. |
| `generation` | no | string | `G 1`..`G 4`. |
| `notes` | no | string | from `Bemerkung`. |
| `aliases` | no | `a\|b\|c` | every surface form that maps here. |
| `provisional` | yes | bool | true if created from a document string, not the register. |

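
One possible way to produce `person_id` slugs of the form `lastname-firstname` with a numeric collision suffix is sketched below. Several details are assumptions, not decisions recorded in this spec: the German transliteration (`ü` → `ue`, `ß` → `ss`), the rule that the first occurrence stays unsuffixed and later ones get `-1`, `-2`, and the function names.

```python
import re
import unicodedata
from typing import Dict

# Assumed transliteration for German umlauts; applied after casefolding.
_TRANSLIT = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def slugify(text: str) -> str:
    """Lower-case ASCII slug: umlauts transliterated, other accents stripped."""
    text = unicodedata.normalize("NFC", text).casefold()  # casefold turns ß into ss
    text = text.translate(_TRANSLIT)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def person_id(last_name: str, first_name: str, taken: Dict[str, int]) -> str:
    """Build lastname-firstname; suffix -1, -2, ... on collision.

    taken counts how often each base slug has been issued so far. Ids are
    stable across runs only if callers process people in a stable order.
    """
    base = "-".join(p for p in (slugify(last_name), slugify(first_name)) if p)
    seen = taken.get(base, 0)
    taken[base] = seen + 1
    return base if seen == 0 else f"{base}-{seen}"
```

Because the suffix depends on processing order, the determinism requirement (NFR-IDEM-01) implies iterating the register in a fixed order, for example by source row.
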
### 6.3 `canonical-persons-tree.json`

The de-duplicated genealogical tree (family members + their relationships) the importer
uses to seed the family graph. Each `persons[]` entry carries a `personId` that **joins
1:1 onto** `person_id` in `canonical-persons.xlsx`.

| Field | Required | Format | Notes |
| --- | --- | --- | --- |
| `personId` | yes | slug | The register's **verbatim** `person_id` (e.g. `cram-hans-1`), propagated — never re-slugified — so collision suffixes match `canonical-persons.xlsx` exactly. Every tree `personId` exists in the register; the register is the sole slug authority. |
| `firstName` / `lastName` / `maidenName` | first/last yes | string | name parts. |
| `birthYear` / `deathYear` | no | int or null | year only (tree granularity). |
| `birthPlace` / `deathPlace` | no | string or null | from the register. |
| `generation` | no | int or null | parsed from `G n`. |
| `notes` | no | string or null | leftover Bemerkung text after relationship extraction. |
| `familyMember` | yes | bool | always true for tree persons. |

A top-level `generated_at` is pinned to a fixed timestamp (`2020-01-01T00:00:00`) for
reproducibility (NFR-IDEM-01), not a wall-clock value. `relationships[]` carry `SPOUSE_OF`
and `PARENT_OF` edges keyed by `rowId`; `unresolved[]` lists relationship strings that did
not match a tree person.

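
A sketch of how the tree file could be serialized so that re-runs produce identical content is shown here. Only the pinned `generated_at` value and the three top-level collections come from this section; the function name, the sort keys, and the JSON formatting options are assumptions. Sorting every collection and every object key before writing is one way to avoid the set-iteration leakage that NFR-IDEM-01 rules out.

```python
import json
from typing import Dict, Iterable, List

GENERATED_AT = "2020-01-01T00:00:00"  # pinned, never a wall-clock value


def dump_tree(
    persons: Iterable[Dict[str, object]],
    relationships: Iterable[Dict[str, object]],
    unresolved: Iterable[str],
) -> str:
    """Serialize the tree with stable ordering so equal inputs give equal text."""
    doc = {
        "generated_at": GENERATED_AT,
        "persons": sorted(persons, key=lambda p: str(p["personId"])),
        # Edge fields are not fixed here, so order edges by their serialized form.
        "relationships": sorted(
            relationships, key=lambda r: json.dumps(r, sort_keys=True)
        ),
        "unresolved": sorted(unresolved),
    }
    # ensure_ascii=False keeps German diacritics readable (NFR-I18N-01).
    return json.dumps(doc, ensure_ascii=False, indent=2, sort_keys=True)
```
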

---

## 7. Prioritized Backlog (MoSCoW)

| ID | Item | MoSCoW | Effort | Depends on |
| --- | --- | --- | --- | --- |
| B1 | Project scaffolding + read both workbooks (`FR-INGEST`, header map `FR-MAP`) | Must | S | — |
| B2 | Row triage + blank/duplicate/empty reports (`FR-TRIAGE`) | Must | S | B1 |
| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (`FR-DATE`) | Must | L | B1 |
| B4 | Person register parser → canonical persons (`FR-PERS` US-PERS-01) | Must | M | B1 |
| B5 | Alias index + name resolution + multi-person split (`FR-DEDUP`, US-PERS-02) | Must | L | B4 |
| B6 | Overrides load + apply + idempotency (`FR-OVR`) | Must | S | B3,B5 |
| B7 | Canonical writers + provenance + review summary (`FR-OUT`, `FR-PROV`) | Must | M | B2,B3,B5 |
| B8 | Index↔Datei mismatch report (`REQ-OUT-02`) | Should | XS | B1 |
| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 |
| B10 | Comma-split `Inhalt` into extra tags | Could | XS | B7 |
| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | — | B7 |

---

## 8. Traceability — Findings → Requirements

| Finding | Severity | Addressed by |
| --- | --- | --- |
| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 |
| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 |
| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS |
| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 |
| IMP-05 name variants → dupes | major | C3, FR-DEDUP |
| IMP-06 blank-index dropped | major | US-TRIAGE-01 |
| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 |
| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 |
| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 |
| IMP-10 `x`-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) |
| IMP-11 sender not split / ` u ` sep | minor | REQ-PERS-01, US-PERS-02 AC4 |
| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, FR-MAP AC2/AC3 |

---

## 9. Open Questions / TBD Register

| ID | Question | Why it matters | Ref | Resolution |
| --- | --- | --- | --- | --- |
| OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) **computed per year from Easter — never a fixed month**; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02, REQ-DATE-07 | **Confirmed (updated #670):** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`, **and the resolved end day in `date_end`** for intra-month day ranges. A half-resolved range (start parsed, end impossible) keeps `date_end` empty and is flagged `range_end_unparsed`. |
| OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. |
| OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. |
| OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). |
| OQ-06 ✅ | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | **Confirmed:** conservative — report all fuzzy matches; no silent merge. |

*All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.*

---

## 10. Glossary & Worked Examples

**Precision** — how exactly a date is known (`DAY` … `UNKNOWN`). **Provisional person** — a
person created from a document name string with no register match. **Alias index** — map from
every known surface form of a name to a canonical `person_id`. **Override** — a
human-supplied correction applied deterministically on each run.

**Date examples → expected outcome:**

| `date_raw` | `date_iso` | `date_precision` |
| --- | --- | --- |
| `15.2.1888` | 1888-02-15 | DAY |
| `6.März 1888` | 1888-03-06 | DAY |
| `22.III.18` | 1918-03-22 | DAY |
| `13.5.09` | 1909-05-13 | DAY |
| `10.Oct.95` | 1895-10-10 | DAY |
| `17/6. 1916` | 1916-06-17 | DAY |
| `Mai 1895` | 1895-05-01 | MONTH |
| `Pfingsten 1922` | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) |
| `Herbst 1913` | 1913-10-01 | SEASON |
| `1905` | 1905-01-01 | YEAR |
| `8.1.1916 - 15.3.1916` | 1916-01-08 | RANGE |
| `17.Nov (?) 1887` | 1887-11-17 | APPROX |
| `?` | *(empty)* | UNKNOWN |

**Name examples → expected outcome:**

| raw cell | resolves to |
| --- | --- |
| `Eugenie Müller` (+ register `geb Müller`) | `de-gruyter-eugenie` (matched via maiden alias) |
| `Eugenie de Gruyter` | `de-gruyter-eugenie` |
| `Herbert u Clara` | `cram-herbert` + `cram-clara` (split, surname distributed) |
| `Hedi und Tutu (Gruber)` | `gruber-hedi` + `gruber-tutu` |
| `Ella Anita` | → `review/ambiguous-receivers.csv` (not auto-split) |
| `Hans Wittkopf` (not in register) | provisional `wittkopf-hans` |