
# Spec — Import Normalizer

> Authored in the voice of **"Elicit"**, requirements engineer (see
> `.claude/personas/req_engineer.md`). This is a requirements artifact: it states
> *what* the normalizer must do and *how we'll know it's done*, in problem/behaviour
> language. Technology choices already made during brainstorming (Python, openpyxl,
> overrides-and-rerun) are recorded as **constraints**, not re-litigated here.

- **Status:** Draft for review
- **Date:** 2026-05-25
- **Related:** [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) (issues `IMP-01..12`), [`README.md`](./README.md)
- **Scope boundary:** This spec covers the **offline normalizer** that turns the raw
  spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical
  contract into the Java `MassImportService` and the `Document`/`Person` model is **Phase 2**
  and gets its own spec. This spec only *defines the contract* Phase 2 must satisfy.

---

## 1. Project Brief

**Vision.** Turn the family's human-curated, free-form archive spreadsheets into a clean,
canonical dataset that imports deterministically — without hand-editing thousands of rows
and without losing the historical nuance of how things were originally written.

**Problem.** The real archive (`…aktuell…xlsx`, 7,943 rows) and the person register
(`Personendatei 2.xlsx`, 163 people) were authored for humans to read, not machines to
import. Dates are written as they appeared in each letter (≈90% unparseable by the current
importer), the column layout differs from what the importer expects, and the same person
appears under many names. Importing as-is produces garbage (see `IMP-01..12`).

**Goal (measurable).**
- G1 — After the automated pass, **≤ 5%** of dated rows remain `UNKNOWN`; after the
  overrides-iteration loop, **≤ 0.5%**.
- G2 — **100%** of source rows are represented in the canonical output or in a review file —
  *zero silent drops*.
- G3 — **100%** of original values (raw date string, raw name string, source row number)
  are preserved.
- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
  **content-deterministic** when re-run with unchanged inputs+overrides: identical canonical
  cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx
  byte-identity is not guaranteed because the zip container stores entry metadata.)

**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the `MassImportService` as the downstream consumer.

**Non-Goals (explicitly out of scope).**
- NG1 — Changing `MassImportService` or the DB schema (that is Phase 2).
- NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by `index`).
- NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
- NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the
  long tail stays as provisional persons.
- NG5 — OCR/transcription content (the new xlsx has no transcription column).

**Key assumptions.** (A1) Sheet `Familienarchiv` is the document source of truth.
(A2) Archive date range is **1873–1957** (drives the 2-digit-year century rule).
(A3) `index` is the stable document key and the basis for future PDF matching.
(A4) `Schlagwort` is a broad tag; `Inhalt` is a short summary/topic.

**Risks.** (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag
+ overrides. (R2) Name matching false-positives merge distinct people → mitigated by
conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with
layout drift → mitigated by header-name-based mapping, not fixed indices.

---

## 2. Personas

**Marcel — Data Steward.** Role: solo owner of Familienarchiv. Context: holds the complete
raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently,
not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts.
Frustrations: dates in ~20 formats; one ancestor under 4 name variants. **JTBD:** *"When I
have raw, human-curated archive spreadsheets, I want to transform them into a clean importable
dataset without losing how things were originally written, so I can load the archive and keep
correcting edge cases as they surface."*

**The Returning Agent.** Role: a future assistant session resuming the work. Goal: re-run the
pipeline deterministically and understand exactly what still needs human input. **JTBD:**
*"When I pick this up cold, I want one command and a clear residue report, so I can continue
without re-deriving context."*

---

## 3. Constraints & Decisions Already Made

These were settled during brainstorming and are fixed inputs to the requirements below.

| # | Decision | Rationale |
| --- | --- | --- |
| C1 | **New canonical layout** with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. |
| C2 | Dates stored as **parsed (nullable) + raw + precision**. | Historical archive; never lose the original; enable "ca. 1916". |
| C3 | **Include person resolution** (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. |
| C4 | **Overrides-file + re-run** loop for residue. | Deterministic, diffable, repeatable. |
| C5 | Implementation: **Python 3.12 + openpyxl**, standalone tool at `tools/import-normalizer/`. | Fast iteration; no Spring rebuild / coverage gate on transform code. |
| C6 | Century rule for archive **1873–1957**: 2-digit `00–57`→`19YY`, `73–99`→`18YY`, `58–72`→**flag**; 3-digit `DDD`→`1DDD`; never 20xx. | Stated by Marcel. Boundaries live in config. |
| C7 | `Schlagwort`→tag, `Inhalt`→summary. | Matches importer's existing semantics. |
| C8 | Non-register correspondents become **provisional persons**. | ~945 distinct sender strings vs 163 register people. |
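
The century rule in C6 can be sketched as a small pure function. This is a minimal illustration, not the tool's actual code: the function name `expand_year` and the two constants are hypothetical, and in the real tool the boundaries would live in `config.py` (NFR-MAINT-01).

```python
from __future__ import annotations

# Hypothetical names; the boundary values themselves come from C6.
CENTURY_1900_MAX = 57  # 2-digit 00..57 -> 19YY
CENTURY_1800_MIN = 73  # 2-digit 73..99 -> 18YY


def expand_year(digits: str) -> int | None:
    """Expand a year token per C6; None means 'flag for review, do not guess'."""
    if not (digits.isascii() and digits.isdigit()):
        return None
    value = int(digits)
    if len(digits) == 4:
        return value
    if len(digits) == 3:
        return 1000 + value  # DDD -> 1DDD
    if len(digits) == 2:
        if value <= CENTURY_1900_MAX:
            return 1900 + value
        if value >= CENTURY_1800_MIN:
            return 1800 + value
        return None  # 58..72 is ambiguous for an 1873-1957 archive
    return None


print(expand_year("18"), expand_year("95"), expand_year("65"), expand_year("916"))
```

Returning `None` instead of raising keeps the ambiguous band (`58–72`) on the same path as any other value that needs a review entry.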

---

## 4. Functional Requirements

Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules
use EARS. Traceability to findings is given in §8.

### 4.1 Ingest & layout (`FR-INGEST`, `FR-MAP`)

**US-MAP-01** — *As the data steward, I want each source column mapped to a named canonical
field regardless of its position, so a re-exported spreadsheet with shifted columns still
imports correctly.*
- AC1 — Given the `Familienarchiv` sheet, when the normalizer reads the header row, then it
  maps columns by **header name** (not fixed index) to the canonical fields.
- AC2 — Given a header the normalizer does not recognise, when it runs, then it records the
  unknown header in `review/summary.txt` and continues (does not crash).
- AC3 — Given a required source header is **absent**, when it runs, then it aborts with a
  clear message naming the missing header (fail loud, before producing partial output).

- **REQ-INGEST-01** — The normalizer shall read only the `Familienarchiv` sheet of the
  document workbook and the `Tabelle1` sheet of the person workbook.
- **REQ-MAP-01** — Header matching shall be case-insensitive and tolerant of internal
  multiple spaces (e.g. `"Datum  des Briefes"`).
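
One way to satisfy US-MAP-01 and REQ-MAP-01 is to normalize each header cell before looking it up. The sketch below is an assumption-laden illustration: `HEADER_TO_FIELD`, `REQUIRED_FIELDS`, and `map_headers` are hypothetical names, the map contains only source headers mentioned in this spec, and treating `index` as the sole required header is a placeholder choice.

```python
import re

# Hypothetical header map (normalized source header -> canonical field).
HEADER_TO_FIELD = {
    "index": "index",
    "datei": "file",
    "box": "box",
    "mappe": "folder",
    "datum des briefes": "date_raw",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
REQUIRED_FIELDS = {"index"}  # assumption for this sketch only


def normalize_header(text: str) -> str:
    """Collapse runs of whitespace, trim, and casefold."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def map_headers(header_row):
    """Return ({column index: canonical field}, [unknown headers]).

    Raises ValueError naming any missing required header (AC3).
    Unknown headers are collected, not fatal (AC2).
    """
    mapping, unknown = {}, []
    for col, cell in enumerate(header_row):
        if cell is None or not str(cell).strip():
            continue
        key = normalize_header(str(cell))
        if key in HEADER_TO_FIELD:
            mapping[col] = HEADER_TO_FIELD[key]
        else:
            unknown.append(str(cell))
    missing = REQUIRED_FIELDS - set(mapping.values())
    if missing:
        raise ValueError(f"missing required header(s): {sorted(missing)}")
    return mapping, unknown


print(map_headers(["Box", "  INDEX ", "Datum  des Briefes", "Foo"]))
```

Because the lookup is keyed by name, a re-exported sheet with shifted columns yields different column numbers but the same canonical fields.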

### 4.2 Row triage (`FR-TRIAGE`) — resolves IMP-06, IMP-07, IMP-08, IMP-10

**US-TRIAGE-01** — *As the data steward, I want rows that have data but no index surfaced
rather than dropped, so I never lose a letter silently.*
- AC1 — Given a row whose `index` is blank but which has any other non-empty cell, when the
  normalizer runs, then that row is written to `review/blank-index-rows.csv` with its source
  row number and is **not** emitted as a canonical document.
- AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not
  reported as an anomaly).

- **REQ-TRIAGE-01** — If two or more rows resolve to the same `index`, then the normalizer
  shall emit all of them to `review/duplicate-index.csv` and mark each canonical row
  `needs_review = duplicate_index` (it shall **not** silently drop any of them).
- **REQ-TRIAGE-02** — Where a row is identified as a section/banner row (blank index, text
  only in a name column), the normalizer shall classify it as such in the blank-index report.
- **REQ-TRIAGE-03** — Rows whose `index` ends in `x` (a transcription/back-side of the base
  letter, not yet independently mappable) shall be **skipped** — not emitted as a canonical
  document — and written to `review/skipped-x-suffix.csv` with their source row and base index
  (`index` minus the trailing `x`), so they can be linked in a later pass. (Resolves IMP-10.)
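
The triage rules above can be read as a classification over each row. The following is a rough sketch under stated assumptions: the function names and the category labels are hypothetical, the `H-0730`-style index values are illustrative only, and the `x` suffix is matched in lower case as written in REQ-TRIAGE-03.

```python
from collections import Counter


def triage(index, other_cells):
    """Classify one source row as 'empty', 'blank_index', 'skipped_x_suffix' or 'document'."""
    idx = "" if index is None else str(index).strip()
    has_data = any(c is not None and str(c).strip() for c in other_cells)
    if not idx:
        # Blank index: report it if anything else is filled in, otherwise just count it.
        return "blank_index" if has_data else "empty"
    if idx.endswith("x"):
        return "skipped_x_suffix"
    return "document"


def base_index(index: str) -> str:
    """Base index of an x-suffix row: the index minus its trailing 'x'."""
    return index.strip()[:-1]


def duplicate_indices(indices):
    """Indices that occur more than once, in sorted order for stable reports."""
    counts = Counter(indices)
    return sorted(i for i, n in counts.items() if n > 1)


print(triage("", ["Brief an Clara"]), triage("H-0730x", []), base_index("H-0730x"))
print(duplicate_indices(["H-0001", "H-0002", "H-0001"]))
```

Every row lands in exactly one category, which is what makes the "zero silent drops" goal (G2) checkable by summing category counts against rows read.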

### 4.3 Date normalization (`FR-DATE`) — resolves IMP-02, IMP-03

**US-DATE-01** — *As the data steward, I want every date interpreted as precisely as the
source allows, with the original always kept, so I can sort the archive and still see what the
letter actually said.*
- AC1 — Given a parseable date, when normalized, then `date_iso` holds the best-effort ISO
  date, `date_raw` holds the verbatim source string, and `date_precision` ∈
  `{DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}`.
- AC2 — Given an unparseable date, when normalized, then `date_iso` is empty,
  `date_precision = UNKNOWN`, `date_raw` is preserved, and the value appears in
  `review/unparsed-dates.csv`.
- AC3 — Given the same `date_raw` appears in `overrides/dates.csv`, when normalized, then the
  override's `(iso, precision)` wins over the automatic parse.

- **REQ-DATE-01** — The parser shall accept, at minimum, these forms (see §10 examples):
  Excel/ISO; `D.M.YYYY`/`D.M.YY`; `D/M. YY[YY]` (slash treated as dot); Roman-numeral months
  `I–XII`; German + English month names, full and abbreviated, with or without a separating
  space; `Month YYYY`; season/holiday + year; bare `YYYY`; and start-anchored ranges.
- **REQ-DATE-02** — Precision shall be assigned by what is known: full day → `DAY`; month+year
  → `MONTH` (day = 1); a **named feast/holiday + year** → resolved to its **actual calendar
  date for that year** → `DAY`; a **season + year** → representative mid-season month (day = 1)
  → `SEASON`; year only → `YEAR` (month = Jan, day = 1); a range → start date + `RANGE`; a
  value carrying an uncertainty marker (`?`, `um`, `ca`, `circa`) → `APPROX` with best-effort date.
- **REQ-DATE-03** — Two-digit and three-digit years shall be expanded per **C6**; a 2-digit
  year in `58–72` shall yield `UNKNOWN` + a review entry rather than a guess.
- **REQ-DATE-04** — Trailing editorial notes (e.g. `", 2. Brief"`) shall be stripped before
  parsing but never discarded: they remain verbatim in `date_raw` and shall not influence the
  parsed date.
- **REQ-DATE-05** — The parser shall be pure and side-effect-free so it can be unit-tested in
  isolation (see NFR-TEST-01).
- **REQ-DATE-06** — **Movable feasts are never mapped to a fixed month**; they shall be
  computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern =
  Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag =
  Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed
  feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25,
  Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul,
  Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py`
  (NFR-MAINT-01).
- **REQ-DATE-07** — **Intra-month day ranges carry an end day; half-resolved ranges are
  flagged.** For a day range like `7./8. Sept.1923`, `date_iso` holds the start day, the end
  day is resolved against the shared month/year into `date_end`, and `date_precision` =
  `RANGE`. If the **start** parses but the **end day is impossible** (e.g. `10./40.1.1917`),
  the row keeps the start and `RANGE` precision, leaves `date_end` **empty**, and is flagged
  `needs_review = range_end_unparsed` — the unparseable end is dropped honestly (surfaced for
  review), never silently invented or clamped. A `RANGE` row **may** therefore legitimately
  have an empty `date_end`; the importer must treat `date_end` as optional even on a `RANGE`.
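
The half-resolved range behaviour in REQ-DATE-07 can be illustrated with already-tokenized parts. This sketch assumes a hypothetical helper `resolve_day_range` that receives the start day, end day, month, and year as integers (tokenizing strings such as `7./8. Sept.1923` is out of scope here) and that the start is known to be valid. An end day earlier than the start is not addressed by the spec and is not handled.

```python
from datetime import date


def resolve_day_range(start_day: int, end_day: int, month: int, year: int) -> dict:
    """Resolve an intra-month day range against its shared month and year."""
    start = date(year, month, start_day)  # caller guarantees the start parsed
    try:
        end = date(year, month, end_day)
    except ValueError:
        end = None  # impossible end day: drop it, never clamp or invent
    flags = [] if end else ["range_end_unparsed"]
    return {
        "date_iso": start.isoformat(),
        "date_end": end.isoformat() if end else "",
        "date_precision": "RANGE",
        "needs_review": "|".join(flags),
    }


print(resolve_day_range(7, 8, 9, 1923))    # 7./8. Sept.1923
print(resolve_day_range(10, 40, 1, 1917))  # 10./40.1.1917
```

Relying on `datetime.date` to reject the impossible day avoids a hand-written days-per-month table and handles leap years for free.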

### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11

**US-PERS-01** — *As the data steward, I want the genealogical register turned into canonical
people with all their known facts, so documents can link to real persons.*
- AC1 — Given a register row, when parsed, then a canonical person is produced with
  `person_id`, name parts, `maiden_name`, birth/death (parsed + raw + place), spouse,
  generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
- AC2 — Given multi-value given names (`"Charlotte,Meta,Jacobi"`), when parsed, then the
  primary given name is the first; the remainder are retained as additional names/aliases.

**US-PERS-02** — *As the data steward, I want each sender/receiver string matched to a
canonical person where possible and never dropped otherwise, so the correspondence graph is
complete.*
- AC1 — Given a sender/receiver string, when resolved, then it maps to a register
  `person_id` via the alias index (exact → normalized/casefold → conservative fuzzy).
- AC2 — Given no confident match, when resolved, then a **provisional person** is created from
  the cleaned string, linked, and listed in `review/unmatched-names.csv` (occurrence count +
  example source rows).
- AC3 — Given the string appears in `overrides/names.csv`, when resolved, then it maps to the
  specified `person_id` (override wins).
- AC4 — Given a multi-person receiver cell (`"Eugenie u Walter de Gruyter"`, `"Herbert u
  Clara"`, `"…//…"`, `"Hedi und Tutu (Gruber)"`), when resolved, then it is split into
  individual people, each resolved independently; ambiguous space-joined pairs
  (`"Ella Anita"`) are emitted to `review/ambiguous-receivers.csv` rather than guessed.

- **REQ-DEDUP-01** — The alias index shall be derived from the register: canonical
  "First Last", maiden form (`geb als`), spouse-surname married form, nickname, and
  first-name-only **only when unambiguous** across the register.
- **REQ-DEDUP-02** — The normalizer shall not merge two distinct strings into one person on
  fuzzy similarity alone unless the similarity meets a configured threshold **and** the match
  is reported; every merge must be auditable.
- **REQ-PERS-01** — Sender cells shall be parsed for multi-person content using the same rules
  as receiver cells (today the importer parses only receivers — IMP-11).
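
The splitting rule in US-PERS-02 AC4 and REQ-PERS-01 might look roughly like the sketch below. It is illustrative only: `split_people` and both regular expressions are hypothetical, the separator set is limited to the forms quoted above (` u `, ` und `, `//`), and it assumes a single-token first name when borrowing a trailing surname. Context-dependent cases such as `"Herbert u Clara"` (surname supplied from elsewhere) and ambiguous pairs such as `"Ella Anita"` are deliberately left unresolved.

```python
import re

# Hypothetical patterns for this sketch.
SEPARATORS = re.compile(r"\s*//\s*|\s+(?:u|und)\.?\s+")
PAREN_SURNAME = re.compile(r"^(?P<body>.*?)\s*\((?P<surname>[^()]+)\)\s*$")


def split_people(cell: str) -> list[str]:
    """Split a sender/receiver cell into individual name strings."""
    cell = cell.strip()
    surname = None
    match = PAREN_SURNAME.match(cell)
    if match:
        # "Hedi und Tutu (Gruber)": the parenthesised surname applies to everyone.
        cell, surname = match.group("body"), match.group("surname").strip()
    parts = [p.strip() for p in SEPARATORS.split(cell) if p.strip()]
    if surname is None and len(parts) > 1 and " " in parts[-1]:
        # "Eugenie u Walter de Gruyter": borrow the surname from the last person.
        surname = parts[-1].split(" ", 1)[1]
    if surname:
        parts = [p if " " in p else f"{p} {surname}" for p in parts]
    return parts


for raw in ("Eugenie u Walter de Gruyter", "Hedi und Tutu (Gruber)", "Herbert u Clara", "Ella Anita"):
    print(raw, "->", split_people(raw))
```

A cell that comes back as a single two-token string is a candidate for `review/ambiguous-receivers.csv`; deciding that needs the register and is not attempted here.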

### 4.5 Overrides & idempotency (`FR-OVR`) — supports the iteration loop

- **REQ-OVR-01** — When the normalizer runs, it shall load `overrides/dates.csv` and
  `overrides/names.csv` if present and apply them; the absence of either file shall not be an
  error.
- **REQ-OVR-02** — While overrides are unchanged and inputs are unchanged, re-running shall
  produce **content-identical** canonical outputs and review files: identical cell matrices and
  review-file contents, not necessarily byte-identical xlsx (NFR-IDEM-01).
- **REQ-OVR-03** — Each override application shall be counted in `review/summary.txt` (how many
  dates/names were resolved by override vs automatically).
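
A possible shape for the date half of the overrides loop is sketched below. The CSV column names (`date_raw`, `iso`, `precision`) and the function names are assumptions made for illustration; the spec fixes only the file path and the "override wins" rule.

```python
import csv
from pathlib import Path


def load_date_overrides(path) -> dict:
    """Load overrides/dates.csv if it exists; a missing file is not an error."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as fh:
        # Assumed columns: date_raw, iso, precision.
        return {row["date_raw"]: (row["iso"], row["precision"]) for row in csv.DictReader(fh)}


def resolve_date(date_raw: str, overrides: dict, auto_parse):
    """Return (iso, precision, source); source is 'override' or 'auto'."""
    if date_raw in overrides:
        return (*overrides[date_raw], "override")
    return (*auto_parse(date_raw), "auto")


# Usage with an in-memory override and a stand-in parser that knows nothing.
overrides = {"Pfingsten 22": ("1922-06-04", "DAY")}
print(resolve_date("Pfingsten 22", overrides, lambda raw: ("", "UNKNOWN")))
print(resolve_date("?", overrides, lambda raw: ("", "UNKNOWN")))
```

Returning the source alongside the result makes the per-run counts required by REQ-OVR-03 a simple tally.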

### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12

- **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
  `out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
  **gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
  import host alongside the PDFs out-of-band.
- **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
  in the source sheet) so any value can be traced back to the original.
- **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
  flags (`duplicate_index`, `unparsed_date`, `unmatched_sender`, `unmatched_receiver`,
  `index_file_mismatch`, …) so the import and the UI can foreground uncertain data.
- **REQ-OUT-02** — Where the source `Datei` path disagrees with the index-derived filename
  (IMP-09), the normalizer shall record the discrepancy in `review/index-file-mismatch.csv`
  and flag the row; it shall **not** alter the `index` (the stable key).

---

## 5. Non-Functional Requirements

| ID | Category | Requirement (measurable) |
| --- | --- | --- |
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. |
| NFR-TEST-01 | Testability | `dates.py` and `persons.py` have pytest tests covering every format/alias category in §10 with real examples from the archive. |
| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in `config.py`, not inline in logic. |
| NFR-OBSERV-01 | Observability | `review/summary.txt` reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. |
| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. |
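
Since NFR-IDEM-01 asserts determinism on content rather than on xlsx bytes, a regression check could compare a digest of the cell matrix between two runs. The sketch below is one hedged way to do that: `content_digest` is a hypothetical helper, it works on any iterable of row tuples (for a workbook these could come from openpyxl's `ws.iter_rows(values_only=True)`), and it stringifies cells, so it would not distinguish the number `1` from the text `"1"`.

```python
import hashlib
import json


def content_digest(rows) -> str:
    """SHA-256 over the cell matrix, row by row, with None treated as empty."""
    digest = hashlib.sha256()
    for row in rows:
        cells = ["" if cell is None else str(cell) for cell in row]
        # JSON-encode each row so cell boundaries cannot blur together.
        digest.update(json.dumps(cells, ensure_ascii=False, separators=(",", ":")).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


run_a = [("index", "date_iso"), ("H-0730", "1888-02-15"), ("H-0731", None)]
run_b = [("index", "date_iso"), ("H-0730", "1888-02-15"), ("H-0731", "")]
print(content_digest(run_a) == content_digest(run_b))
```

Row order is part of the digest on purpose: unstable ordering (for example from set iteration) shows up as a mismatch.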

---

## 6. Data Dictionary (canonical contract)

This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a
DB schema.

### 6.1 `canonical-documents.xlsx`

| Field | Required | Format / values | Notes |
| --- | --- | --- | --- |
| `index` | yes | string | Stable key; basis for PDF matching. |
| `file` | no | string | verbatim `Datei` value (e.g. `H-0730.pdf`); carried through for the importer to link the scanned PDF. |
| `box` | no | string | from `Box`. |
| `folder` | no | string | from `Mappe`. |
| `sender_person_id` | no | person_id | resolved; empty if no sender. |
| `sender_name` | no | string | canonical display name (or cleaned raw if provisional). |
| `receiver_person_ids` | no | `id\|id\|…` | pipe-separated. |
| `receiver_names` | no | `name\|name\|…` | pipe-separated, aligned with ids. |
| `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. |
| `date_raw` | no | string | verbatim source date. |
| `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. |
| `date_end` | no | `YYYY-MM-DD` or empty | RANGE end day (e.g. `7./8. Sept.1923` → `date_iso` = start, `date_end` = end). Empty for every non-RANGE precision **and** for a half-resolved RANGE whose end did not parse (see REQ-DATE-07). |
| `location` | no | string | from `Ort`. |
| `tags` | no | `tag\|tag` | from `Schlagwort`. |
| `summary` | no | string | from `Inhalt`. |
| `source_row` | yes | int | 1-based row number in the source sheet, for provenance (REQ-PROV-01). |
| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). Flags include `unparsed_date`, `range_end_unparsed` (half-resolved RANGE, REQ-DATE-07), `unmatched_sender`, `unmatched_receiver`, `multi_sender`, `index_file_mismatch`, `duplicate_index`. |

### 6.2 `canonical-persons.xlsx`

| Field | Required | Format | Notes |
| --- | --- | --- | --- |
| `person_id` | yes | slug | stable id (e.g. `de-gruyter-eugenie`); collisions suffixed. |
| `last_name` | yes | string | from `Familienname`. |
| `first_name` | no | string | primary given name. |
| `maiden_name` | no | string | from `geb als` — drives dedup. |
| `title` | no | string | e.g. honorifics if present. |
| `nickname` | no | string | from quoted `Bemerkung`/spouse field. |
| `birth_date` / `birth_date_raw` / `birth_place` | no | ISO / string / string | §4.3 rules. |
| `death_date` / `death_date_raw` / `death_place` | no | ISO / string / string | §4.3 rules. |
| `spouse` | no | person_id or name | from `verheiratet mit`. |
| `generation` | no | string | `G 1`..`G 4`. |
| `notes` | no | string | from `Bemerkung`. |
| `aliases` | no | `a\|b\|c` | every surface form that maps here. |
| `provisional` | yes | bool | true if created from a document string, not the register. |
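
The `person_id` slug (`lastname-firstname`, numeric suffix on collision) could be generated along these lines. Several details here are assumptions, not decisions recorded in this spec: the umlaut transliteration (`ü` to `ue`), leaving the first occurrence unsuffixed and numbering later ones from `-1`, and the names `slugify` and `assign_person_ids`.

```python
import re
import unicodedata

# Assumed transliteration for German umlauts; casefold() already maps ß to ss.
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def slugify(text: str) -> str:
    """Lower-case ASCII slug with hyphens between word runs."""
    text = text.casefold().translate(_UMLAUTS)
    # Strip any remaining diacritics, then keep only [a-z0-9].
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def assign_person_ids(people):
    """people: iterable of (last_name, first_name) in a stable order."""
    seen: dict[str, int] = {}
    ids = []
    for last, first in people:
        base = "-".join(part for part in (slugify(last), slugify(first)) if part)
        n = seen.get(base, 0)
        ids.append(base if n == 0 else f"{base}-{n}")  # assumed suffix scheme
        seen[base] = n + 1
    return ids


print(assign_person_ids([("de Gruyter", "Eugenie"), ("Cram", "Hans"), ("Cram", "Hans")]))
```

Because suffixes depend on encounter order, the input must be iterated in a stable order (for example source row order) for the ids to survive re-runs.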

### 6.3 `canonical-persons-tree.json`

The de-duplicated genealogical tree (family members + their relationships) the importer
uses to seed the family graph. Each `persons[]` entry carries a `personId` that **joins
1:1 onto** `person_id` in `canonical-persons.xlsx`.

| Field | Required | Format | Notes |
| --- | --- | --- | --- |
| `personId` | yes | slug | The register's **verbatim** `person_id` (e.g. `cram-hans-1`), propagated — never re-slugified — so collision suffixes match `canonical-persons.xlsx` exactly. Every tree `personId` exists in the register; the register is the sole slug authority. |
| `firstName` / `lastName` / `maidenName` | first/last yes | string | name parts. |
| `birthYear` / `deathYear` | no | int or null | year only (tree granularity). |
| `birthPlace` / `deathPlace` | no | string or null | from the register. |
| `generation` | no | int or null | parsed from `G n`. |
| `notes` | no | string or null | leftover Bemerkung text after relationship extraction. |
| `familyMember` | yes | bool | always true for tree persons. |

A top-level `generated_at` is pinned to a fixed timestamp (`2020-01-01T00:00:00`) for
reproducibility (NFR-IDEM-01), not a wall-clock value. `relationships[]` carry `SPOUSE_OF`
and `PARENT_OF` edges keyed by `rowId`; `unresolved[]` lists relationship strings that did
not match a tree person.

---

## 7. Prioritized Backlog (MoSCoW)

| ID | Item | MoSCoW | Effort | Depends on |
| --- | --- | --- | --- | --- |
| B1 | Project scaffolding + read both workbooks (`FR-INGEST`, header map `FR-MAP`) | Must | S | — |
| B2 | Row triage + blank/duplicate/empty reports (`FR-TRIAGE`) | Must | S | B1 |
| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (`FR-DATE`) | Must | L | B1 |
| B4 | Person register parser → canonical persons (`FR-PERS` US-PERS-01) | Must | M | B1 |
| B5 | Alias index + name resolution + multi-person split (`FR-DEDUP`, US-PERS-02) | Must | L | B4 |
| B6 | Overrides load + apply + idempotency (`FR-OVR`) | Must | S | B3,B5 |
| B7 | Canonical writers + provenance + review summary (`FR-OUT`, `FR-PROV`) | Must | M | B2,B3,B5 |
| B8 | Index↔Datei mismatch report (`REQ-OUT-02`) | Should | XS | B1 |
| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 |
| B10 | Comma-split `Inhalt` into extra tags | Could | XS | B7 |
| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | — | B7 |

---

## 8. Traceability — Findings → Requirements

| Finding | Severity | Addressed by |
| --- | --- | --- |
| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 |
| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 |
| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS |
| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 |
| IMP-05 name variants → dupes | major | C3, FR-DEDUP |
| IMP-06 blank-index dropped | major | US-TRIAGE-01 |
| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 |
| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 |
| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 |
| IMP-10 `x`-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) |
| IMP-11 sender not split / ` u ` sep | minor | REQ-PERS-01, US-PERS-02 AC4 |
| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, US-MAP-01 AC2/AC3 |

---

## 9. Open Questions / TBD Register

| ID | Question | Why it matters | Ref | Resolution |
| --- | --- | --- | --- | --- |
| OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) **computed per year from Easter — never a fixed month**; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02, REQ-DATE-07 | **Confirmed (updated #670):** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`, **and the resolved end day in `date_end`** for intra-month day ranges. A half-resolved range (start parsed, end impossible) keeps `date_end` empty and is flagged `range_end_unparsed`. |
| OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. |
| OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. |
| OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). |
| OQ-06 ✅ | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | **Confirmed:** conservative — report all fuzzy matches; no silent merge. |

*All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.*

---

## 10. Glossary & Worked Examples

**Precision** — how exactly a date is known (`DAY` … `UNKNOWN`). **Provisional person** — a
person created from a document name string with no register match. **Alias index** — map from
every known surface form of a name to a canonical `person_id`. **Override** — a
human-supplied correction applied deterministically on each run.

**Date examples → expected outcome:**

| `date_raw` | `date_iso` | `date_precision` |
| --- | --- | --- |
| `15.2.1888` | 1888-02-15 | DAY |
| `6.März 1888` | 1888-03-06 | DAY |
| `22.III.18` | 1918-03-22 | DAY |
| `13.5.09` | 1909-05-13 | DAY |
| `10.Oct.95` | 1895-10-10 | DAY |
| `17/6. 1916` | 1916-06-17 | DAY |
| `Mai 1895` | 1895-05-01 | MONTH |
| `Pfingsten 1922` | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) |
| `Herbst 1913` | 1913-10-01 | SEASON |
| `1905` | 1905-01-01 | YEAR |
| `8.1.1916 - 15.3.1916` | 1916-01-08 | RANGE |
| `17.Nov (?) 1887` | 1887-11-17 | APPROX |
| `?` | *(empty)* | UNKNOWN |
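
The `Pfingsten 1922` row can be reproduced with a short computus sketch. It uses the anonymous Gregorian algorithm (often credited to Meeus/Jones/Butcher) as one way to meet REQ-DATE-06; the function and table names are hypothetical, and only the movable feasts listed in REQ-DATE-06 are included.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday, as listed in REQ-DATE-06.
MOVABLE_OFFSETS = {
    "Karfreitag": -2, "Ostern": 0, "Himmelfahrt": 39,
    "Pfingsten": 49, "Pfingstmontag": 50, "Fronleichnam": 60,
}


def movable_feast(name: str, year: int) -> date:
    return easter_sunday(year) + timedelta(days=MOVABLE_OFFSETS[name])


def advent_sunday(n: int, year: int) -> date:
    """n-th Advent (1..4): the (5 - n)-th Sunday strictly before 25 December."""
    dec25 = date(year, 12, 25)
    back = (dec25.weekday() + 1) % 7 or 7  # days back to the previous Sunday
    fourth = dec25 - timedelta(days=back)
    return fourth - timedelta(weeks=4 - n)


print(easter_sunday(1922), movable_feast("Pfingsten", 1922), advent_sunday(1, 1922))
```

For 1922 this gives Easter on 16 April and Pfingsten on 4 June, matching the table row above.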

**Name examples → expected outcome:**

| raw cell | resolves to |
| --- | --- |
| `Eugenie Müller` (+ register `geb Müller`) | `de-gruyter-eugenie` (matched via maiden alias) |
| `Eugenie de Gruyter` | `de-gruyter-eugenie` |
| `Herbert u Clara` | `cram-herbert` + `cram-clara` (split, surname distributed) |
| `Hedi und Tutu (Gruber)` | `gruber-hedi` + `gruber-tutu` |
| `Ella Anita` | → `review/ambiguous-receivers.csv` (not auto-split) |
| `Hans Wittkopf` (not in register) | provisional `wittkopf-hans` |
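
The first two resolution tiers (exact, then normalized/casefold) over a register-derived alias index can be sketched as follows. This is a simplified illustration: the register rows are made up from names used in this section, the function names are hypothetical, the spouse-surname form and the fuzzy tier are omitted, and the sketch drops **any** alias that maps to more than one person, which is stricter than REQ-DEDUP-01 requires.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Collapse whitespace and casefold so lookups ignore spacing and case."""
    return re.sub(r"\s+", " ", name).strip().casefold()


def build_alias_index(register) -> dict:
    """Map each unambiguous surface form to a person_id."""
    candidates = defaultdict(set)
    for person in register:
        first = person["first_name"]
        forms = [f"{first} {person['last_name']}", first]
        if person.get("maiden_name"):
            forms.append(f"{first} {person['maiden_name']}")
        if person.get("nickname"):
            forms.append(person["nickname"])
        for form in forms:
            candidates[normalize(form)].add(person["person_id"])
    # Sorted iteration keeps the index order stable across runs.
    return {alias: next(iter(ids)) for alias, ids in sorted(candidates.items()) if len(ids) == 1}


def resolve(name: str, index: dict):
    """Return the matching person_id, or None when a provisional person is needed."""
    return index.get(normalize(name))


# Illustrative rows only, not real register data.
register = [
    {"person_id": "de-gruyter-eugenie", "first_name": "Eugenie", "last_name": "de Gruyter", "maiden_name": "Müller"},
    {"person_id": "de-gruyter-walter", "first_name": "Walter", "last_name": "de Gruyter"},
    {"person_id": "cram-hans", "first_name": "Hans", "last_name": "Cram"},
    {"person_id": "cram-hans-1", "first_name": "Hans", "last_name": "Cram"},
]
index = build_alias_index(register)
for raw in ("Eugenie Müller", "eugenie  DE GRUYTER", "Hans", "Hans Wittkopf"):
    print(raw, "->", resolve(raw, index))
```

A `None` result is the signal to create a provisional person and add the string to `review/unmatched-names.csv`, as in the `Hans Wittkopf` row.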