diff --git a/docs/import-migration/02-normalization-spec.md b/docs/import-migration/02-normalization-spec.md index 08ccf1d2..b2829d23 100644 --- a/docs/import-migration/02-normalization-spec.md +++ b/docs/import-migration/02-normalization-spec.md @@ -36,7 +36,9 @@ appears under many names. Importing as-is produces garbage (see `IMP-01..12`). - G3 — **100%** of original values (raw date string, raw name string, source row number) are preserved. - G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is - **byte-identical** when re-run with unchanged inputs+overrides. + **content-deterministic** when re-run with unchanged inputs+overrides: identical canonical + cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx + byte-identity is not guaranteed because the zip container stores entry metadata.) **Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future agent re-running the pipeline; and the `MassImportService` as the downstream consumer. @@ -238,7 +240,7 @@ complete.* | ID | Category | Requirement (measurable) | | --- | --- | --- | | NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. | -| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. | +| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. | | NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. 
| | NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. | | NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. | diff --git a/docs/import-migration/03-normalizer-implementation-plan.md b/docs/import-migration/03-normalizer-implementation-plan.md new file mode 100644 index 00000000..f315596f --- /dev/null +++ b/docs/import-migration/03-normalizer-implementation-plan.md @@ -0,0 +1,2273 @@ +# Import Normalizer Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build an offline Python tool that turns the raw family-archive spreadsheets into a clean, canonical dataset (`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs, with a deterministic overrides-and-rerun loop. + +**Architecture:** A standalone Python package at `tools/import-normalizer/`. Pure, independently-testable units — date parsing (`dates.py`), person/register logic (`persons.py`), spreadsheet ingest (`ingest.py`), row mapping (`documents.py`) — are orchestrated by `normalize.py`. Source workbooks are read-only; all tunables live in `config.py`. Residue (unparseable dates, unmatched names) is reported to `review/*.csv` and corrected via version-controlled `overrides/*.csv` applied on each run. + +**Tech Stack:** Python 3.12, `openpyxl` (xlsx read/write), `pytest`. No third-party fuzzy library — `difflib` (stdlib) provides *suggestions only* (never auto-applied), per the conservative-matching requirement. + +**Spec:** [`02-normalization-spec.md`](./02-normalization-spec.md). Requirement IDs (`FR-*`, `REQ-*`, `NFR-*`) referenced per task. 
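
The "suggestions only" use of `difflib` named in the Tech Stack might look like the sketch below. `suggest_names` and the sample register are hypothetical names for illustration. `difflib.get_close_matches` is the stdlib call, and the 0.82 cutoff is assumed to mirror `FUZZY_SUGGEST_THRESHOLD` in `config.py`. The function only returns candidates for a review file and never rewrites the input.

```python
import difflib

FUZZY_SUGGEST_THRESHOLD = 0.82  # assumption: same value as config.FUZZY_SUGGEST_THRESHOLD


def suggest_names(raw: str, known: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` close matches for `raw`; the caller decides, nothing is applied."""
    # Compare case-insensitively, but report the register spelling.
    by_lower = {name.lower(): name for name in known}
    hits = difflib.get_close_matches(
        raw.lower(),
        sorted(by_lower),  # sorted so candidate order never depends on input order
        n=limit,
        cutoff=FUZZY_SUGGEST_THRESHOLD,
    )
    return [by_lower[h] for h in hits]


register = ["Dieckmann", "Gruber", "Wolff"]  # hypothetical sample data
print(suggest_names("Diekmann", register))
print(suggest_names("Schmidt", register))
```

Keeping the result as a plain list makes it easy to write into a review CSV column while the canonical output stays untouched until a human adds an override.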
+
+---
+
+## File Structure
+
+```
+tools/import-normalizer/
+├── config.py        # paths, header maps, century rule, season/feast tables, month tables, matching config
+├── dates.py         # Easter computus, feast/season resolution, year expansion, parse_date()
+├── persons.py       # slug, Person, parse_register(), split_receivers(), AliasIndex, ResolutionContext
+├── ingest.py        # read_sheet(), build_header_map()
+├── documents.py     # RawRow, extract_row(), triage helpers, CanonicalDocument, to_canonical()
+├── writers.py       # write_documents_xlsx(), write_persons_xlsx(), write_review_csv(), write_summary()
+├── overrides.py     # load_overrides()
+├── normalize.py     # main() orchestrator + CLI
+├── requirements.txt
+├── .gitignore       # .venv/ out/ review/ __pycache__/
+├── README.md
+├── overrides/
+│   ├── dates.csv    # seed header: raw,iso,precision
+│   └── names.csv    # seed header: raw,person_id
+└── tests/
+    ├── __init__.py
+    ├── test_config.py
+    ├── test_dates.py
+    ├── test_persons.py
+    ├── test_ingest.py
+    ├── test_documents.py
+    ├── test_writers.py
+    └── test_normalize.py
+```
+
+**Test command convention** (per the "never run the full suite" rule — run targeted files):
+`tools/import-normalizer/.venv/bin/python -m pytest tools/import-normalizer/tests/test_X.py -v`
+
+All `git` commands assume CWD = repo root and the current branch `docs/import-migration`.
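
NFR-IDEM-01 asserts determinism on content rather than on xlsx bytes. One way a test could compare two runs is to hash the canonical cell matrix instead of the file. The sketch below uses only the standard library. `matrix_digest` is a hypothetical helper, not one of the planned modules, and it assumes cell values are already plain strings, numbers, or `None`.

```python
import hashlib
import json


def matrix_digest(rows: list[list]) -> str:
    """SHA-256 over a compact JSON encoding of the cell matrix; row order is significant."""
    payload = json.dumps(rows, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical sample matrices standing in for two runs over unchanged inputs.
run_a = [["index", "date"], ["B1-001", "1888-02-15"], ["B1-002", None]]
run_b = [["index", "date"], ["B1-001", "1888-02-15"], ["B1-002", None]]
reordered = [run_a[0], run_a[2], run_a[1]]

assert matrix_digest(run_a) == matrix_digest(run_b)      # same content, same digest
assert matrix_digest(run_a) != matrix_digest(reordered)  # unstable row order is detected
```

Because row order feeds into the digest, a check like this would also catch the set-iteration leakage that the requirement rules out.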
+ +--- + +### Task 1: Project scaffold, venv, config constants + +**Files:** +- Create: `tools/import-normalizer/requirements.txt` +- Create: `tools/import-normalizer/.gitignore` +- Create: `tools/import-normalizer/config.py` +- Create: `tools/import-normalizer/tests/__init__.py` +- Create: `tools/import-normalizer/tests/test_config.py` + +- [ ] **Step 1: Create `requirements.txt`** (pinned — an openpyxl minor bump can change xlsx serialization and break determinism, NFR-IDEM-01) + +``` +openpyxl==3.1.5 +pytest==8.3.4 +``` + +- [ ] **Step 2: Create the tool-local `.gitignore`** + +``` +.venv/ +out/ +review/ +__pycache__/ +*.pyc +``` + +- [ ] **Step 2b: Harden the repo-root `.gitignore`** (the root file currently has no venv pattern — that is how `ocr-service/.venv` got committed; prevent the whole class). Append these lines to `/home/marcel/Desktop/familienarchiv/.gitignore` if not already present: + +``` +**/.venv/ +**/__pycache__/ +*.pyc +``` +(Cleaning up the *already-committed* `ocr-service/.venv` via `git rm -r --cached ocr-service/.venv` is a separate task — do NOT bundle it into this branch.) + +- [ ] **Step 3: Create `config.py`** + +```python +"""Tunables for the import normalizer. 
No logic here — only data tables.""" +from pathlib import Path + +# --- Paths --- +BASE_DIR = Path(__file__).resolve().parent +REPO_ROOT = BASE_DIR.parent.parent +IMPORT_DIR = REPO_ROOT / "import" + +DOCUMENT_WORKBOOK = IMPORT_DIR / "zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx" +DOCUMENT_SHEET = "Familienarchiv" +PERSON_WORKBOOK = IMPORT_DIR / "Personendatei 2.xlsx" +PERSON_SHEET = "Tabelle1" + +OUT_DIR = BASE_DIR / "out" +REVIEW_DIR = BASE_DIR / "review" +OVERRIDES_DIR = BASE_DIR / "overrides" + +# --- Header text (lowercased, whitespace-collapsed) -> canonical field --- +DOCUMENT_HEADER_MAP = { + "index": "index", + "datei": "file", + "box": "box", + "mappe": "folder", + "briefeschreiberin": "sender", + "empfängerin": "receivers", + "datum des briefes": "date", + "ort": "location", + "schlagwort": "tags", + "inhalt": "summary", +} +DOCUMENT_REQUIRED_FIELDS = {"index"} + +PERSON_HEADER_MAP = { + "generation": "generation", + "familienname": "last_name", + "vorname": "first_name", + "geb als": "maiden_name", + "geburtsdatum": "birth_date", + "geburtsort": "birth_place", + "todesdatum": "death_date", + "sterbeort": "death_place", + "verheiratet mit": "spouse", + "bemerkung": "notes", +} +PERSON_REQUIRED_FIELDS = {"last_name"} + +# --- Century rule (archive 1873–1957) --- +TWO_DIGIT_19XX_MAX = 57 # 00..57 -> 1900+yy +TWO_DIGIT_18XX_MIN = 73 # 73..99 -> 1800+yy ; 58..72 -> ambiguous -> UNKNOWN + +# --- Seasons -> representative month (day = 1) --- +SEASON_MONTHS = { + "frühling": 4, "fruehling": 4, "frühjahr": 4, "fruehjahr": 4, + "sommer": 7, "herbst": 10, "winter": 1, +} + +# --- Fixed feasts -> (month, day) --- +FIXED_FEASTS = { + "neujahr": (1, 1), + "heiligabend": (12, 24), "heiliger abend": (12, 24), + "weihnachten": (12, 25), "weihnacht": (12, 25), "1. 
weihnachtstag": (12, 25), + "silvester": (12, 31), "sylvester": (12, 31), +} + +# --- Movable feasts -> day offset from Easter Sunday --- +MOVABLE_FEASTS = { + "karfreitag": -2, + "ostern": 0, "ostersonntag": 0, "ostermontag": 1, + "himmelfahrt": 39, "christi himmelfahrt": 39, + "pfingsten": 49, "pfingstsonntag": 49, "pfingstmontag": 50, + "fronleichnam": 60, +} + +# --- Month names -> number (German + English, full + abbreviations) --- +MONTHS = { + "januar": 1, "jan": 1, "january": 1, + "februar": 2, "feb": 2, "febr": 2, "february": 2, + "märz": 3, "maerz": 3, "mär": 3, "mar": 3, "march": 3, + "april": 4, "apr": 4, + "mai": 5, "may": 5, + "juni": 6, "jun": 6, "june": 6, + "juli": 7, "jul": 7, "july": 7, + "august": 8, "aug": 8, + "september": 9, "sep": 9, "sept": 9, + "oktober": 10, "okt": 10, "oct": 10, "october": 10, + "november": 11, "nov": 11, + "dezember": 12, "dez": 12, "dec": 12, "december": 12, +} + +ROMAN_MONTHS = { + "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6, + "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12, +} + +# --- Person matching --- +KNOWN_LAST_NAMES = [ + "von der Heide", "von Massenbach", "von Geldern", "von Gelden", "von Staa", + "de Gruyter", "Dieckmann", "Gruber", "Müller", "Wolff", "Cram", +] +FUZZY_SUGGEST_THRESHOLD = 0.82 # difflib ratio; suggestions only, never auto-applied +``` + +- [ ] **Step 4: Create empty `tests/__init__.py`** (empty file). 
+ +- [ ] **Step 5: Write `tests/test_config.py`** + +```python +import config + +def test_century_boundaries(): + assert config.TWO_DIGIT_19XX_MAX == 57 + assert config.TWO_DIGIT_18XX_MIN == 73 + +def test_header_maps_cover_required_fields(): + assert "index" in config.DOCUMENT_HEADER_MAP.values() + assert "last_name" in config.PERSON_HEADER_MAP.values() + +def test_feast_tables_present(): + assert config.MOVABLE_FEASTS["pfingsten"] == 49 + assert config.SEASON_MONTHS["herbst"] == 10 +``` + +- [ ] **Step 6: Create the venv and install deps** + +Run: +```bash +cd tools/import-normalizer && python3 -m venv .venv && .venv/bin/pip install -r requirements.txt && cd - +``` +Expected: openpyxl + pytest install successfully. + +- [ ] **Step 7: Run the config test** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v && cd -` +Expected: 3 passed. (Tests import `config` directly, so pytest must run with CWD = the tool dir; `conftest.py` is unnecessary because the modules are flat in that dir.) 
+ +- [ ] **Step 8: Commit** + +```bash +git add .gitignore tools/import-normalizer/requirements.txt tools/import-normalizer/.gitignore tools/import-normalizer/config.py tools/import-normalizer/tests/__init__.py tools/import-normalizer/tests/test_config.py +git commit -m "feat(normalizer): scaffold tool + config tables" +``` + +--- + +### Task 2: Easter computus (`REQ-DATE-06`) + +**Files:** +- Create: `tools/import-normalizer/dates.py` +- Create: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Write the failing test** in `tests/test_dates.py` + +```python +import datetime +import dates + +def test_easter_known_years(): + # Anonymous Gregorian algorithm — verified against published tables + assert dates.easter(2024) == datetime.date(2024, 3, 31) + assert dates.easter(2000) == datetime.date(2000, 4, 23) + assert dates.easter(1922) == datetime.date(1922, 4, 16) + assert dates.easter(1888) == datetime.date(1888, 4, 1) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v && cd -` +Expected: FAIL with `ModuleNotFoundError: No module named 'dates'` or `AttributeError: module 'dates' has no attribute 'easter'`. 
+ +- [ ] **Step 3: Create `dates.py` with the computus** + +```python +"""Tolerant historical date parsing for the family archive.""" +import datetime + + +def easter(year: int) -> datetime.date: + """Easter Sunday (Gregorian) via the Anonymous Gregorian / Butcher algorithm.""" + a = year % 19 + b = year // 100 + c = year % 100 + d = b // 4 + e = b % 4 + f = (b + 8) // 25 + g = (b - f + 1) // 3 + h = (19 * a + b - d - g + 15) % 30 + i = c // 4 + k = c % 4 + l = (32 + 2 * e + 2 * i - h - k) % 7 + m = (a + 11 * h + 22 * l) // 451 + month = (h + l - 7 * m + 114) // 31 + day = ((h + l - 7 * m + 114) % 31) + 1 + return datetime.date(year, month, day) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): Easter computus" +``` + +--- + +### Task 3: Feast & season resolution (`REQ-DATE-02`, `REQ-DATE-06`) + +**Files:** +- Modify: `tools/import-normalizer/dates.py` +- Modify: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Add the failing test** to `tests/test_dates.py` + +```python +from dates import Precision + +def test_resolve_feast_movable(): + assert dates.resolve_feast_or_season("Pfingsten", 1922) == ("1922-06-04", Precision.DAY) + assert dates.resolve_feast_or_season("Ostern", 2024) == ("2024-03-31", Precision.DAY) + assert dates.resolve_feast_or_season("Pfingstmontag", 1922) == ("1922-06-05", Precision.DAY) + +def test_resolve_feast_fixed(): + assert dates.resolve_feast_or_season("Weihnachten", 1900) == ("1900-12-25", Precision.DAY) + assert dates.resolve_feast_or_season("Neujahr", 1910) == ("1910-01-01", Precision.DAY) + +def test_resolve_season(): + assert dates.resolve_feast_or_season("Herbst", 1913) == ("1913-10-01", Precision.SEASON) + assert 
dates.resolve_feast_or_season("Sommer", 1910) == ("1910-07-01", Precision.SEASON) + +def test_resolve_unknown_token_returns_none(): + assert dates.resolve_feast_or_season("Freitag", 1919) is None +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "feast or season" -v && cd -` +Expected: FAIL — `Precision` and `resolve_feast_or_season` not defined. + +- [ ] **Step 3: Implement** — add to `dates.py` (top imports + new code) + +```python +from enum import StrEnum +import config + + +class Precision(StrEnum): + DAY = "DAY" + MONTH = "MONTH" + SEASON = "SEASON" + YEAR = "YEAR" + RANGE = "RANGE" + APPROX = "APPROX" + UNKNOWN = "UNKNOWN" + + +def _advent_sunday(year: int, n: int) -> datetime.date: + """n-th Advent (1..4). 4th Advent = last Sunday on/before Dec 24.""" + dec24 = datetime.date(year, 12, 24) + back_to_sunday = (dec24.weekday() - 6) % 7 # Mon=0..Sun=6 + fourth = dec24 - datetime.timedelta(days=back_to_sunday) + return fourth - datetime.timedelta(days=(4 - n) * 7) + + +def resolve_feast_or_season(token: str, year: int): + """Return (iso, Precision) for a known feast/season token, else None.""" + key = " ".join(token.lower().split()).strip(" .") + if key in config.MOVABLE_FEASTS: + d = easter(year) + datetime.timedelta(days=config.MOVABLE_FEASTS[key]) + return d.isoformat(), Precision.DAY + if key in config.FIXED_FEASTS: + month, day = config.FIXED_FEASTS[key] + return datetime.date(year, month, day).isoformat(), Precision.DAY + advent = {"1. advent": 1, "2. advent": 2, "3. advent": 3, "4. 
advent": 4, "advent": 1}
+    if key in advent:
+        return _advent_sunday(year, advent[key]).isoformat(), Precision.DAY
+    if key in config.SEASON_MONTHS:
+        return datetime.date(year, config.SEASON_MONTHS[key], 1).isoformat(), Precision.SEASON
+    return None
+```
+
+- [ ] **Step 4: Run to verify it passes**
+
+Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "resolve" -v && cd -`
+Expected: PASS (all 4; `-k "resolve"` also selects `test_resolve_unknown_token_returns_none`, which a `"feast or season"` filter would deselect). (Pfingstmontag 1922 = Easter Apr 16 + 50 = June 5.)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
+git commit -m "feat(normalizer): feast + season resolution"
+```
+
+---
+
+### Task 4: Year expansion / century rule (`REQ-DATE-03`)
+
+**Files:**
+- Modify: `tools/import-normalizer/dates.py`
+- Modify: `tools/import-normalizer/tests/test_dates.py`
+
+- [ ] **Step 1: Add the failing test**
+
+```python
+def test_expand_year():
+    assert dates.expand_year("1888") == 1888
+    assert dates.expand_year("889") == 1889     # 3-digit -> 1DDD
+    assert dates.expand_year("923") == 1923
+    assert dates.expand_year("08") == 1908      # 00..57 -> 19xx
+    assert dates.expand_year("17") == 1917
+    assert dates.expand_year("57") == 1957
+    assert dates.expand_year("73") == 1873      # 73..99 -> 18xx
+    assert dates.expand_year("99") == 1899
+    assert dates.expand_year("65") is None      # 58..72 ambiguous
+    assert dates.expand_year("x") is None
+```
+
+- [ ] **Step 2: Run to verify it fails**
+
+Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v && cd -`
+Expected: FAIL — `expand_year` not defined.
+
+- [ ] **Step 3: Implement** — add to `dates.py`
+
+```python
+def expand_year(token: str):
+    """Expand a 2/3/4-digit year string per the 1873–1957 century rule.
None if ambiguous.""" + token = token.strip() + if not token.isdigit(): + return None + n, v = len(token), int(token) + if n == 4: + return v + if n == 3: + return 1000 + v + if n == 2: + if v <= config.TWO_DIGIT_19XX_MAX: + return 1900 + v + if v >= config.TWO_DIGIT_18XX_MIN: + return 1800 + v + return None + return None +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): year expansion century rule" +``` + +--- + +### Task 5: `parse_date` dispatch + ISO + numeric forms (`FR-DATE`, `REQ-DATE-01/04/05`) + +**Files:** +- Modify: `tools/import-normalizer/dates.py` +- Modify: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Add failing tests** + +```python +def test_parse_iso_and_empty(): + assert dates.parse_date("1910-04-23") == dates.ParsedDate("1910-04-23", Precision.DAY, "1910-04-23") + assert dates.parse_date("") == dates.ParsedDate(None, Precision.UNKNOWN, "") + assert dates.parse_date("?") == dates.ParsedDate(None, Precision.UNKNOWN, "?") + +def test_parse_numeric_forms(): + assert dates.parse_date("15.2.1888").iso == "1888-02-15" + assert dates.parse_date("13.5.09").iso == "1909-05-13" + assert dates.parse_date("17/6. 1916").iso == "1916-06-17" + assert dates.parse_date("11.10.08").iso == "1908-10-11" + assert dates.parse_date("30.1.889").iso == "1889-01-30" + assert dates.parse_date("15.2.1888").precision == Precision.DAY + +def test_parse_numeric_unparseable(): + assert dates.parse_date("8.9.").precision == Precision.UNKNOWN # no year + assert dates.parse_date("13.5.65").precision == Precision.UNKNOWN # ambiguous 2-digit year + +def test_parse_approx_marker_upgrades_precision(): + r = dates.parse_date("17.Nov (?) 
1887") # month-name handled in a later task; here just the marker path + # after the marker is detected, a parsed date becomes APPROX (verified fully in Task 8) + assert r.raw == "17.Nov (?) 1887" +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v && cd -` +Expected: FAIL — `ParsedDate` / `parse_date` not defined. + +- [ ] **Step 3: Implement** — add to `dates.py` + +```python +import re +from dataclasses import dataclass + + +@dataclass(frozen=True) +class ParsedDate: + iso: str | None + precision: Precision + raw: str + + +_LEADING_MARKERS = re.compile( + r"^(um|ca\.?|circa|etwa|wohl|vermutlich|nach|vor|anfang|mitte|ende)\s+", re.I) + + +def _preprocess(raw: str): + """Return (cleaned_string, approx_flag).""" + s = (raw or "").strip() + if not s: + return "", False + low = s.lower() + approx = ("?" in s) or any( + m in low for m in ("um ", "ca.", "ca ", "circa", "etwa", "wohl", "vermutlich")) + s = re.sub(r"\(\s*\?\s*\)", " ", s) # remove "(?)" + s = s.replace("?", " ") + s = re.sub(r",.*$", "", s) # drop trailing editorial note (", 2. Brief") + s = _LEADING_MARKERS.sub("", s) + s = re.sub(r"\s+", " ", s).strip(" .,") + return s, approx + + +_NUM_RE = re.compile(r"(\d{1,2})[./](\d{1,2})\.?\s*(\d{2,4})") + + +def _match_iso(s): + if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s): + try: + datetime.date.fromisoformat(s) + return s, Precision.DAY + except ValueError: + return None + return None + + +def _match_numeric(s): + m = _NUM_RE.fullmatch(s) + if not m: + return None + day, month = int(m.group(1)), int(m.group(2)) + year = expand_year(m.group(3)) + if year is None or not (1 <= month <= 12): + return None + try: + return datetime.date(year, month, day).isoformat(), Precision.DAY + except ValueError: + return None + + +# Matchers are tried in order. Later tasks append to this list. 
+_MATCHERS = [_match_iso, _match_numeric] + + +def parse_date(raw: str, date_overrides: dict | None = None) -> ParsedDate: + if date_overrides: + key = (raw or "").strip() + if key in date_overrides: + iso, prec = date_overrides[key] + return ParsedDate(iso or None, Precision(prec), raw) + cleaned, approx = _preprocess(raw) + if not cleaned: + return ParsedDate(None, Precision.UNKNOWN, raw) + for matcher in _MATCHERS: + result = matcher(cleaned) + if result: + iso, precision = result + if approx: + precision = Precision.APPROX + return ParsedDate(iso, precision, raw) + return ParsedDate(None, Precision.UNKNOWN, raw) +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): parse_date dispatch + iso/numeric matchers" +``` + +--- + +### Task 6: Roman-numeral month matcher + +**Files:** +- Modify: `tools/import-normalizer/dates.py` +- Modify: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Add failing test** + +```python +def test_parse_roman_months(): + assert dates.parse_date("22.III.18").iso == "1918-03-22" + assert dates.parse_date("19.XII.1954").iso == "1954-12-19" + assert dates.parse_date("1.III.27").iso == "1927-03-01" + assert dates.parse_date("22.III.18").precision == Precision.DAY +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v && cd -` +Expected: FAIL — Roman dates currently fall through to UNKNOWN. 
+ +- [ ] **Step 3: Implement** — add to `dates.py` and register the matcher + +```python +_ROMAN_RE = re.compile(r"(\d{1,2})\.\s*([IVXLC]+)\.?\s*(\d{2,4})", re.I) + + +def _match_roman(s): + m = _ROMAN_RE.fullmatch(s) + if not m: + return None + day = int(m.group(1)) + month = config.ROMAN_MONTHS.get(m.group(2).lower()) + year = expand_year(m.group(3)) + if not month or year is None: + return None + try: + return datetime.date(year, month, day).isoformat(), Precision.DAY + except ValueError: + return None +``` + +Then change the matcher list line to: +```python +_MATCHERS = [_match_iso, _match_numeric, _match_roman] +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): roman-numeral month matcher" +``` + +--- + +### Task 7: Month-name matchers (day-first + English month-first) + +**Files:** +- Modify: `tools/import-normalizer/dates.py` +- Modify: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Add failing tests** + +```python +def test_parse_monthname_day_first(): + assert dates.parse_date("6.März 1888").iso == "1888-03-06" + assert dates.parse_date("29.Sept.1891").iso == "1891-09-29" + assert dates.parse_date("10.Oct.95").iso == "1895-10-10" + assert dates.parse_date("9.December1889").iso == "1889-12-09" + assert dates.parse_date("18.Dez.1916").iso == "1916-12-18" + assert dates.parse_date("4Dezember 1936").iso == "1936-12-04" + assert dates.parse_date("25 August 1968").iso == "1968-08-25" + +def test_parse_monthname_english_month_first(): + assert dates.parse_date("April 12. 1922").iso == "1922-04-12" + assert dates.parse_date("Oct.5. 
1916").iso == "1916-10-05" +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k monthname -v && cd -` +Expected: FAIL. + +- [ ] **Step 3: Implement** — add to `dates.py`. `_match_monthname_a` is day-first; `_match_monthname_b` is English month-first. + +```python +_MONTH_A_RE = re.compile(r"(\d{1,2})[.\s]*([A-Za-zÄÖÜäöü]+)\.?\s*(\d{2,4})") +_MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\.?\s*(\d{2,4})") + + +def _lookup_month(token: str): + return config.MONTHS.get(token.lower().strip(" .")) + + +def _build_day_month_year(day, month, year): + if not month or year is None or not (1 <= month <= 12): + return None + try: + return datetime.date(year, month, day).isoformat(), Precision.DAY + except ValueError: + return None + + +def _match_monthname_a(s): + m = _MONTH_A_RE.fullmatch(s) + if not m: + return None + return _build_day_month_year(int(m.group(1)), _lookup_month(m.group(2)), expand_year(m.group(3))) + + +def _match_monthname_b(s): + m = _MONTH_B_RE.fullmatch(s) + if not m: + return None + return _build_day_month_year(int(m.group(2)), _lookup_month(m.group(1)), expand_year(m.group(3))) +``` + +Then update the matcher list (order matters — `_match_monthname_a` is day-first and safe to place before the month/year matcher; `_match_monthname_b` goes *after* the month/year matcher added in Task 8, so for now append only `_a`): +```python +_MATCHERS = [_match_iso, _match_numeric, _match_roman, _match_monthname_a] +``` + +- [ ] **Step 4: Run — expect `_a` cases to pass, `_b` (English) still failing** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_day_first -v && cd -` +Expected: PASS. 
+ +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_english_month_first -v && cd -` +Expected: FAIL (`_match_monthname_b` not yet registered — it is wired in Task 8 to sit after the month/year matcher so it doesn't shadow `Mai 1895`). + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): day-first month-name matcher" +``` + +--- + +### Task 8: Month/year, feast/season, year-only, range matchers + final ordering + overrides + +**Files:** +- Modify: `tools/import-normalizer/dates.py` +- Modify: `tools/import-normalizer/tests/test_dates.py` + +- [ ] **Step 1: Add failing tests** + +```python +def test_parse_month_year_year_only(): + assert dates.parse_date("Mai 1895") == dates.ParsedDate("1895-05-01", Precision.MONTH, "Mai 1895") + assert dates.parse_date("October 1903").iso == "1903-10-01" + assert dates.parse_date("1905") == dates.ParsedDate("1905-01-01", Precision.YEAR, "1905") + +def test_parse_feast_and_season_via_parse_date(): + assert dates.parse_date("Pfingsten 1922") == dates.ParsedDate("1922-06-04", Precision.DAY, "Pfingsten 1922") + assert dates.parse_date("Herbst 1913") == dates.ParsedDate("1913-10-01", Precision.SEASON, "Herbst 1913") + assert dates.parse_date("Pfingstsonntag 1915").precision == Precision.DAY + +def test_parse_ranges(): + assert dates.parse_date("8.1.1916 - 15.3.1916") == dates.ParsedDate("1916-01-08", Precision.RANGE, "8.1.1916 - 15.3.1916") + assert dates.parse_date("1881/82") == dates.ParsedDate("1881-01-01", Precision.RANGE, "1881/82") + assert dates.parse_date("1945/46?").iso == "1945-01-01" # '?' stripped -> RANGE, then APPROX + assert dates.parse_date("1945/46?").precision == Precision.APPROX + +def test_parse_approx_full(): + r = dates.parse_date("17.Nov (?) 
1887") + assert r.iso == "1887-11-17" + assert r.precision == Precision.APPROX + +def test_parse_english_month_first_now_works(): + assert dates.parse_date("April 12. 1922").iso == "1922-04-12" + assert dates.parse_date("Mai 1895").iso == "1895-05-01" # not shadowed by month-first matcher + +def test_parse_unparseable_examples(): + assert dates.parse_date("Freitag 1919").precision == Precision.UNKNOWN + +def test_parse_invalid_calendar_date_is_unknown(): + # try/except ValueError in the matchers must route impossible dates to UNKNOWN (-> review), + # never silently clamp. This is the most likely real-data bug class at 7,600 rows. + assert dates.parse_date("30.2.1888").precision == Precision.UNKNOWN + assert dates.parse_date("31.4.1916").precision == Precision.UNKNOWN + +def test_parse_intra_month_day_range(): + # "7./8. Sept.1923" -> start day, RANGE. Must NOT be confused with slash-date "17/6. 1916". + assert dates.parse_date("7./8. Sept.1923") == dates.ParsedDate("1923-09-07", Precision.RANGE, "7./8. Sept.1923") + assert dates.parse_date("17/6. 1916") == dates.ParsedDate("1916-06-17", Precision.DAY, "17/6. 1916") + +def test_parse_trailing_note_stripped_but_raw_preserved(): + r = dates.parse_date("17.Nov 1887, 2. Brief") # REQ-DATE-04 + assert r.iso == "1887-11-17" + assert "2. Brief" in r.raw # original string preserved verbatim +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "month_year or feast_and_season or ranges or approx_full or english_month_first_now or unparseable_examples" -v && cd -` +Expected: FAIL. + +- [ ] **Step 3: Implement** — add matchers to `dates.py` + +```python +_MONTH_YEAR_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s+(\d{2,4})") +_TOKEN_YEAR_RE = re.compile(r"(.+?)\.?\s+(\d{2,4})") +_YEAR_ONLY_RE = re.compile(r"\d{4}") +_RANGE_YY_RE = re.compile(r"(\d{4})\s*/\s*\d{2}") +_RANGE_HYPHEN_RE = re.compile(r"(.*\d)\s*[-–]\s*\d.*") +# Intra-month day range, e.g. 
"7./8. Sept.1923" — require a dot before the slash so it +# does NOT swallow slash-as-dot single dates like "17/6. 1916" (which has no dot before "/"). +_RANGE_DAY_RE = re.compile(r"(\d{1,2})\./(\d{1,2})\.\s*(.+)") + + +def _match_month_year(s): + m = _MONTH_YEAR_RE.fullmatch(s) + if not m: + return None + month = _lookup_month(m.group(1)) + year = expand_year(m.group(2)) + if not month or year is None: + return None + return datetime.date(year, month, 1).isoformat(), Precision.MONTH + + +def _match_feast_season(s): + m = _TOKEN_YEAR_RE.fullmatch(s) + if not m: + return None + year = expand_year(m.group(2)) + if year is None: + return None + return resolve_feast_or_season(m.group(1), year) + + +def _match_year_only(s): + if _YEAR_ONLY_RE.fullmatch(s): + return datetime.date(int(s), 1, 1).isoformat(), Precision.YEAR + return None + + +def _match_range(s): + m = _RANGE_YY_RE.fullmatch(s) + if m: + return datetime.date(int(m.group(1)), 1, 1).isoformat(), Precision.RANGE + m = _RANGE_DAY_RE.fullmatch(s) + if m: + first = f"{m.group(1)}.{m.group(3)}" # "7." + "Sept.1923" -> "7.Sept.1923" + for matcher in (_match_numeric, _match_monthname_a): + r = matcher(first) + if r: + return r[0], Precision.RANGE + m = _RANGE_HYPHEN_RE.fullmatch(s) + if m: + start = m.group(1).strip() + for matcher in (_match_numeric, _match_roman, _match_monthname_a, _match_year_only): + r = matcher(start) + if r: + return r[0], Precision.RANGE + return None +``` + +Then replace the matcher list with the final ordering: +```python +_MATCHERS = [ + _match_iso, + _match_range, + _match_numeric, + _match_roman, + _match_monthname_a, + _match_month_year, + _match_monthname_b, + _match_feast_season, + _match_year_only, +] +``` + +- [ ] **Step 4: Run the full date test file** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v && cd -` +Expected: PASS (all tests, including the English month-first test from Task 7). 
+ +- [ ] **Step 5: Add an overrides test, then commit** + +Append to `tests/test_dates.py`: +```python +def test_parse_date_override_wins(): + ovr = {"13.5.65": ("1965-05-13", "DAY")} + r = dates.parse_date("13.5.65", ovr) # ambiguous without override + assert r == dates.ParsedDate("1965-05-13", Precision.DAY, "13.5.65") +``` +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v && cd -` +Expected: PASS. + +```bash +git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py +git commit -m "feat(normalizer): month/year, feast/season, range matchers + overrides" +``` + +--- + +### Task 9: Person register parsing (`FR-PERS`, US-PERS-01) + +**Files:** +- Create: `tools/import-normalizer/persons.py` +- Create: `tools/import-normalizer/tests/test_persons.py` + +- [ ] **Step 1: Write the failing test** in `tests/test_persons.py` + +```python +import persons + +def test_slugify(): + assert persons.slugify("de Gruyter", "Eugenie") == "de-gruyter-eugenie" + assert persons.slugify("Müller", "Karl Erhard") == "mueller-karl-erhard" + +def test_parse_register_basic(): + rows = [ + {"generation": "G 1", "last_name": "Blomquist", "first_name": "Charlotte,Meta,Jacobi", + "maiden_name": "Ruge", "birth_date": "30.8.1862", "birth_place": "Schülperneusiel", + "death_date": "1934-07-23", "death_place": "Göteborg", "spouse": '"Tante Lolly"', + "notes": "Schwester v Marie Cram"}, + {"generation": "G 2", "last_name": "Bohrmann", "first_name": "Else", + "maiden_name": "Cram", "birth_date": "28.11.1888", "spouse": "Ludwig Bohrmann", + "notes": "Schwester v Herbert"}, + ] + people = persons.parse_register(rows) + p = people[0] + assert p.person_id == "blomquist-charlotte" + assert p.first_name == "Charlotte" + assert p.maiden_name == "Ruge" + assert p.birth_date == "1862-08-30" + assert p.nickname == "Tante Lolly" # quoted spouse field is a nickname, not a spouse + assert p.spouse == "" + assert "Meta" in p.extra_given_names and 
"Jacobi" in p.extra_given_names + p2 = people[1] + assert p2.maiden_name == "Cram" + assert p2.spouse == "Ludwig Bohrmann" + assert p2.provisional is False +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -` +Expected: FAIL — `persons` module / symbols not defined. + +- [ ] **Step 3: Implement `persons.py`** + +```python +"""Person register parsing, name splitting, alias resolution.""" +import re +import unicodedata +from dataclasses import dataclass, field + +import config +import dates + +_DIACRITIC_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", + "Ä": "ae", "Ö": "oe", "Ü": "ue"}) + + +def _strip_accents(s: str) -> str: + s = s.translate(_DIACRITIC_MAP) + s = unicodedata.normalize("NFKD", s) + return "".join(c for c in s if not unicodedata.combining(c)) + + +def slugify(last: str, first: str) -> str: + raw = f"{last} {first}".strip() + raw = _strip_accents(raw).lower() + raw = re.sub(r"[^a-z0-9]+", "-", raw).strip("-") + return raw or "unknown" + + +@dataclass +class Person: + person_id: str + last_name: str = "" + first_name: str = "" + maiden_name: str = "" + title: str = "" + nickname: str = "" + extra_given_names: list = field(default_factory=list) + birth_date: str | None = None + birth_date_raw: str = "" + birth_place: str = "" + death_date: str | None = None + death_date_raw: str = "" + death_place: str = "" + spouse: str = "" + generation: str = "" + notes: str = "" + aliases: list = field(default_factory=list) + provisional: bool = False + + +_QUOTED_RE = re.compile(r'^[“"\']\s*(.+?)\s*[”"\']$') + + +def parse_register(rows: list[dict]) -> list[Person]: + people = [] + for r in rows: + last = (r.get("last_name") or "").strip() + if not last: + continue + given_raw = (r.get("first_name") or "").strip() + givens = [g.strip() for g in given_raw.split(",") if g.strip()] + first = givens[0] if givens else "" + extra = givens[1:] + + spouse_raw = 
(r.get("spouse") or "").strip() + nickname = "" + m = _QUOTED_RE.match(spouse_raw) + if m: + nickname = m.group(1) + spouse_raw = "" + + birth = dates.parse_date(r.get("birth_date") or "") + death = dates.parse_date(r.get("death_date") or "") + people.append(Person( + person_id=slugify(last, first), + last_name=last, first_name=first, maiden_name=(r.get("maiden_name") or "").strip(), + nickname=nickname, extra_given_names=extra, + birth_date=birth.iso, birth_date_raw=(r.get("birth_date") or "").strip(), birth_place=(r.get("birth_place") or "").strip(), + death_date=death.iso, death_date_raw=(r.get("death_date") or "").strip(), death_place=(r.get("death_place") or "").strip(), + spouse=spouse_raw, generation=(r.get("generation") or "").strip(), + notes=(r.get("notes") or "").strip(), provisional=False, + )) + # De-duplicate colliding ids with numeric suffix + seen = {} + for p in people: + if p.person_id in seen: + seen[p.person_id] += 1 + p.person_id = f"{p.person_id}-{seen[p.person_id]}" + else: + seen[p.person_id] = 1 + return people +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -` +Expected: PASS. 
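
The id de-duplication at the end of `parse_register` can be looked at on its own. This is a minimal sketch of the same counting scheme over plain strings; `dedupe_ids` is a hypothetical name, and the real code mutates `Person.person_id` in place rather than returning a list.

```python
def dedupe_ids(ids: list[str]) -> list[str]:
    """Suffix repeated ids with -2, -3, ... in input order (hypothetical helper)."""
    seen: dict[str, int] = {}
    out: list[str] = []
    for pid in ids:
        if pid in seen:
            # Second occurrence becomes "<id>-2", third "<id>-3", and so on.
            seen[pid] += 1
            out.append(f"{pid}-{seen[pid]}")
        else:
            seen[pid] = 1
            out.append(pid)
    return out
```

Because the suffix depends only on input order, the result is stable as long as the register rows are read in a stable order. One limitation of this scheme is that it does not check whether a generated `<id>-2` already occurs as a natural slug elsewhere in the register.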
+ +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py +git commit -m "feat(normalizer): person register parsing" +``` + +--- + +### Task 10: Receiver splitting (`REQ-PERS-01`, US-PERS-02 AC4) + +**Files:** +- Modify: `tools/import-normalizer/persons.py` +- Modify: `tools/import-normalizer/tests/test_persons.py` + +- [ ] **Step 1: Add failing tests** (ported from the Java `PersonNameParser` contract) + +```python +def test_split_receivers(): + assert persons.split_receivers("Eugenie Müller") == ["Eugenie Müller"] + assert persons.split_receivers("Walter und Eugenie de Gruyter") == ["Walter de Gruyter", "Eugenie de Gruyter"] + assert persons.split_receivers("Hedi und Tutu (Gruber)") == ["Hedi Gruber", "Tutu Gruber"] + assert persons.split_receivers("Clara u Familie") == ["Clara"] + assert persons.split_receivers("Eugenie de Gruyter geb. Müller") == ["Eugenie de Gruyter"] + assert persons.split_receivers("Herbert u Clara") == ["Herbert", "Clara"] + assert persons.split_receivers("") == [] + +def test_find_known_last_name(): + assert persons.find_known_last_name("Eugenie de Gruyter") == "de Gruyter" + assert persons.find_known_last_name("Clara") is None +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd -` +Expected: FAIL. + +- [ ] **Step 3: Implement** — add to `persons.py` + +```python +_GEB_RE = re.compile(r",?\s*geb\.?\s+.+$", re.I) +_PAREN_RE = re.compile(r"\(([^)]+)\)\s*$") +_MULTI_RE = re.compile(r"\s+(?:und|u)\s+", re.I) + + +def find_known_last_name(segment: str): + seg = segment.strip() + for ln in config.KNOWN_LAST_NAMES: # config lists longest-first + if seg == ln or seg.endswith(" " + ln): + return ln + return None + + +def split_receivers(raw: str) -> list[str]: + if not raw or not raw.strip(): + return [] + # 0. 
split on "//" + if "//" in raw: + out = [] + for seg in raw.split("//"): + out.extend(split_receivers(seg)) + return out + cleaned = _GEB_RE.sub("", raw).strip() + if not _MULTI_RE.search(cleaned): + return [cleaned] + shared_last = None + pm = _PAREN_RE.search(cleaned) + if pm: + shared_last = pm.group(1).strip() + cleaned = cleaned[:pm.start()].strip() + parts = [p.strip() for p in _MULTI_RE.split(cleaned)] + parts = [p for p in parts if p and p.lower() != "familie"] + if not parts: + return [] + if len(parts) == 1: + return [parts[0]] + if shared_last: + return [p if " " in p else f"{p} {shared_last}" for p in parts] + last_seg = parts[-1] + detected = find_known_last_name(last_seg) + if detected: + result = [] + for p in parts[:-1]: + if " " not in p and find_known_last_name(p) is None: + result.append(f"{p} {detected}") + else: + result.append(p) + result.append(last_seg) + return result + return parts +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd -` +Expected: PASS. 
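
The last-name propagation for strings like "Walter und Eugenie de Gruyter" is the least obvious branch of `split_receivers`. The following is a reduced sketch of just that branch. `KNOWN_LAST_NAMES` here is a hypothetical stand-in for `config.KNOWN_LAST_NAMES`, and `propagate_last_name` is an illustrative name; the `geb.`, parenthesis, `//` and `Familie` handling from Step 3 is deliberately left out.

```python
import re

# Hypothetical stand-in for config.KNOWN_LAST_NAMES (longest entries first).
KNOWN_LAST_NAMES = ["de Gruyter", "Müller", "Cram"]
MULTI_RE = re.compile(r"\s+(?:und|u)\s+", re.I)


def known_last_name(segment: str):
    """Return the known last name that the segment ends with, or None."""
    seg = segment.strip()
    for ln in KNOWN_LAST_NAMES:
        if seg == ln or seg.endswith(" " + ln):
            return ln
    return None


def propagate_last_name(raw: str) -> list[str]:
    """Split on 'und'/'u' and copy the final part's last name onto bare first names."""
    parts = [p.strip() for p in MULTI_RE.split(raw.strip()) if p.strip()]
    if len(parts) < 2:
        return parts
    shared = known_last_name(parts[-1])
    if shared is None:
        # No recognised last name: keep the parts untouched rather than guess.
        return parts
    return [p if " " in p else f"{p} {shared}" for p in parts[:-1]] + [parts[-1]]
```

When no known last name is found, the parts are returned unchanged, which matches the conservative-matching stance of the plan: nothing is inferred without evidence.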
+ +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py +git commit -m "feat(normalizer): receiver splitting" +``` + +--- + +### Task 11: Alias index (`FR-DEDUP`, REQ-DEDUP-01/02) + +**Files:** +- Modify: `tools/import-normalizer/persons.py` +- Modify: `tools/import-normalizer/tests/test_persons.py` + +- [ ] **Step 1: Add failing tests** (also add `import config` next to `import persons` at the top of the test file) + +```python +import config # the suggestion test below reads config.FUZZY_SUGGEST_THRESHOLD + +def test_alias_index_resolves_maiden_and_married(): + people = persons.parse_register([ + {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"}, + {"last_name": "Cram", "first_name": "Clara"}, + ]) + idx = persons.AliasIndex(people) + eugenie = people[0].person_id + assert idx.resolve("Eugenie de Gruyter") == eugenie # canonical + assert idx.resolve("Eugenie Müller") == eugenie # maiden alias + assert idx.resolve("eugenie müller") == eugenie # normalized + assert idx.resolve("Nobody Unknown") is None + +def test_alias_index_suggestion(): + people = persons.parse_register([{"last_name": "Wittkopf", "first_name": "Hans"}]) + idx = persons.AliasIndex(people) + sid, score = idx.suggest("Hans Wittkop") # typo + assert sid == people[0].person_id and score >= config.FUZZY_SUGGEST_THRESHOLD +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k alias -v && cd -` +Expected: FAIL — `AliasIndex` not defined.
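
The suggestion test depends on how `difflib` scores near-misses, so it may be worth seeing the two stdlib calls in isolation. This is a minimal sketch that assumes a threshold of 0.8; `FUZZY_SUGGEST_THRESHOLD` below is a hypothetical value (the real one lives in `config.py`), and `suggest_key` is an illustrative name that returns the matched key rather than a person id.

```python
import difflib

# Hypothetical value for illustration; the plan reads this from config.py.
FUZZY_SUGGEST_THRESHOLD = 0.8


def suggest_key(name: str, keys: list[str], cutoff: float = FUZZY_SUGGEST_THRESHOLD):
    """Return (best_key, score) for the closest key at or above cutoff, else (None, 0.0)."""
    match = difflib.get_close_matches(name, keys, n=1, cutoff=cutoff)
    if not match:
        return None, 0.0
    # get_close_matches returns only the strings, so the ratio is recomputed here.
    score = difflib.SequenceMatcher(None, name, match[0]).ratio()
    return match[0], score
```

A suggestion produced this way is only ever written to a review file; per the conservative-matching requirement it is never applied automatically.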
+ +- [ ] **Step 3: Implement** — add to `persons.py` + +```python +import difflib + + +def _norm(name: str) -> str: + return re.sub(r"\s+", " ", _strip_accents(name).lower().replace(".", " ")).strip() + + +class AliasIndex: + def __init__(self, people: list[Person]): + self._by_alias: dict[str, str] = {} + self._display: dict[str, str] = {} + self.known_ids: set[str] = {p.person_id for p in people} + first_name_ids: dict[str, list] = {} + for p in people: + self._display[p.person_id] = f"{p.first_name} {p.last_name}".strip() + # Ordered, de-duplicated forms (NOT a set) so alias order is deterministic — NFR-IDEM-01. + forms = [f"{p.first_name} {p.last_name}".strip()] + if p.maiden_name: + forms.append(f"{p.first_name} {p.maiden_name}".strip()) + for extra in p.extra_given_names: + forms.append(f"{extra} {p.last_name}".strip()) + if p.nickname: + forms.append(p.nickname) + seen = set() + for form in forms: + if form in seen: + continue + seen.add(form) + key = _norm(form) + if key and key not in self._by_alias: + self._by_alias[key] = p.person_id + p.aliases.append(form) + if p.first_name: + ids = first_name_ids.setdefault(_norm(p.first_name), []) + if p.person_id not in ids: + ids.append(p.person_id) + # first-name-only alias, only when unambiguous + for fname, ids in first_name_ids.items(): + if len(ids) == 1 and fname not in self._by_alias: + self._by_alias[fname] = ids[0] + + def resolve(self, name: str): + return self._by_alias.get(_norm(name)) + + def display(self, person_id: str) -> str: + return self._display.get(person_id, "") + + def suggest(self, name: str): + keys = list(self._by_alias.keys()) + match = difflib.get_close_matches(_norm(name), keys, n=1, cutoff=config.FUZZY_SUGGEST_THRESHOLD) + if not match: + return None, 0.0 + score = difflib.SequenceMatcher(None, _norm(name), match[0]).ratio() + return self._by_alias[match[0]], score +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest 
tests/test_persons.py -k alias -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py +git commit -m "feat(normalizer): alias index with maiden/married/nickname resolution" +``` + +--- + +### Task 12: Spreadsheet ingest (`FR-INGEST`, `FR-MAP`, REQ-INGEST-01, REQ-MAP-01) + +**Files:** +- Create: `tools/import-normalizer/ingest.py` +- Create: `tools/import-normalizer/tests/test_ingest.py` + +- [ ] **Step 1: Write failing tests** (build a tiny workbook on disk with openpyxl) + +```python +import datetime +import openpyxl +import pytest +import ingest + +def _make_workbook(tmp_path, sheet_name, rows): + wb = openpyxl.Workbook() + ws = wb.active + ws.title = sheet_name + for r in rows: + ws.append(r) + path = tmp_path / "wb.xlsx" + wb.save(path) + return path + +def test_read_sheet_converts_cells(tmp_path): + path = _make_workbook(tmp_path, "S", [ + ["Index", "Datum"], + ["W-0001", datetime.datetime(1888, 2, 15)], + ["W-0002", 1], + ]) + rows = ingest.read_sheet(path, "S") + assert rows[0] == ["Index", "Datum"] + assert rows[1] == ["W-0001", "1888-02-15"] # Excel date -> ISO string + assert rows[2] == ["W-0002", "1"] # integer -> plain string + +def test_build_header_map_collapses_whitespace_and_case(): + header = ["Index", "Datum des Briefes", "EmpfängerIn", "Mystery"] + field_map = {"index": "index", "datum des briefes": "date", "empfängerin": "receivers"} + fields, unknown = ingest.build_header_map(header, field_map, required={"index"}) + assert fields == {"index": 0, "date": 1, "receivers": 2} + assert unknown == ["Mystery"] + +def test_build_header_map_missing_required_raises(): + with pytest.raises(ValueError, match="index"): + ingest.build_header_map(["Box", "Ort"], {"box": "box", "ort": "location"}, required={"index"}) +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd 
-` +Expected: FAIL — `ingest` not defined. + +- [ ] **Step 3: Implement `ingest.py`** + +```python +"""Read .xlsx sheets into neutral list[list[str]] and map headers to fields.""" +import datetime +from pathlib import Path +import openpyxl + + +def _cell_to_str(value) -> str: + if value is None: + return "" + if isinstance(value, datetime.datetime): + return value.date().isoformat() + if isinstance(value, datetime.date): + return value.isoformat() + if isinstance(value, float) and value.is_integer(): + return str(int(value)) + if isinstance(value, int): + return str(value) + return str(value).strip() + + +def read_sheet(path: Path, sheet_name: str) -> list[list[str]]: + wb = openpyxl.load_workbook(path, read_only=True, data_only=True) + if sheet_name not in wb.sheetnames: + raise ValueError(f"Sheet '{sheet_name}' not found in {path.name}; sheets: {wb.sheetnames}") + ws = wb[sheet_name] + rows = [[_cell_to_str(v) for v in row] for row in ws.iter_rows(values_only=True)] + wb.close() + return rows + + +def _norm_header(text: str) -> str: + return " ".join(text.lower().split()) + + +def build_header_map(header_row: list[str], field_map: dict[str, str], required: set[str]): + """Return (field->col_index, unknown_headers). Raise ValueError if a required field is missing.""" + fields: dict[str, int] = {} + unknown: list[str] = [] + for idx, raw in enumerate(header_row): + key = _norm_header(raw) + if key in field_map: + fields[field_map[key]] = idx + elif raw.strip(): + unknown.append(raw) + missing = required - set(fields) + if missing: + raise ValueError(f"Required header(s) missing: {sorted(missing)} (found headers: {header_row})") + return fields, unknown +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd -` +Expected: PASS. 
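
Header matching hinges on the normalisation step, which collapses runs of whitespace (including line breaks inside a header cell) and ignores case. The sketch below reproduces that step with deliberately messy headers. `norm_header` and `map_headers` are illustrative names, the sample headers are made up, and the required-field check from `build_header_map` is omitted.

```python
def norm_header(text: str) -> str:
    """Lower-case and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def map_headers(header_row: list[str], field_map: dict[str, str]):
    """Return (field -> column index, unknown headers); blank header cells are ignored."""
    fields: dict[str, int] = {}
    unknown: list[str] = []
    for idx, raw in enumerate(header_row):
        key = norm_header(raw)
        if key in field_map:
            fields[field_map[key]] = idx
        elif raw.strip():
            unknown.append(raw)
    return fields, unknown


# Made-up headers with stray spaces, an embedded newline, upper case and a blank cell.
header = ["  Index ", "Datum  des\nBriefes", "EMPFÄNGERIN", "", "Mystery"]
field_map = {"index": "index", "datum des briefes": "date", "empfängerin": "receivers"}
fields, unknown = map_headers(header, field_map)
```

Unknown headers are collected rather than rejected, so a new column in the source workbook can be surfaced for review without failing the run.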
+ +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/ingest.py tools/import-normalizer/tests/test_ingest.py +git commit -m "feat(normalizer): xlsx ingest + header mapping" +``` + +--- + +### Task 13: Row extraction, triage & CanonicalDocument (`FR-TRIAGE`, REQ-TRIAGE-01/02/03, `FR-PROV`) + +**Files:** +- Create: `tools/import-normalizer/documents.py` +- Create: `tools/import-normalizer/tests/test_documents.py` + +- [ ] **Step 1: Write failing tests** + +```python +import documents +from documents import Triage + +def test_extract_row(): + header = {"index": 0, "file": 1, "box": 2, "folder": 3, "sender": 4, + "receivers": 5, "date": 6, "location": 7, "tags": 8, "summary": 9} + cells = ["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter", + "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"] + raw = documents.extract_row(cells, header, source_row=3) + assert raw.index == "W-0001" + assert raw.sender == "Walter de Gruyter" + assert raw.date == "15.2.1888" + assert raw.source_row == 3 + +def test_triage(): + assert documents.triage(["", "", ""]) == Triage.EMPTY + assert documents.triage(["", "", "Walter"]) == Triage.BLANK_INDEX # data but no index + assert documents.triage(["W-0001x", "x"]) == Triage.X_SUFFIX + assert documents.triage(["W-0001", "x"]) == Triage.OK + +def test_classify_blank_index(): + header = {"sender": 4, "receivers": 5} + banner = ["", "", "", "", "Brautbriefe von Walter an Eugenie", ""] + data = ["", "", "V", "1", "", "Eugenie"] + assert documents.classify_blank_index(banner, header) == "section_banner" + assert documents.classify_blank_index(data, header) == "data_no_index" + +def test_index_file_mismatch(): + assert documents.index_file_mismatch("W-0010x", r"..\__scan\W-0011x.pdf") is True + assert documents.index_file_mismatch("W-0001", r"..\__scan\W-0001.pdf") is False + assert documents.index_file_mismatch("W-0001", "") is False +``` + +Note `triage` takes the raw `cells` list and uses 
column 0 as the index (matching `extract_row`'s header where `index` is col 0 in these tests). + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -` +Expected: FAIL — `documents` not defined. + +- [ ] **Step 3: Implement `documents.py`** (extraction + triage + dataclasses; resolution added in Task 14) + +```python +"""Document row extraction, triage, and the canonical document record.""" +from dataclasses import dataclass, field +from enum import Enum, auto + + +class Triage(Enum): + OK = auto() + EMPTY = auto() + BLANK_INDEX = auto() + X_SUFFIX = auto() + + +@dataclass +class RawRow: + source_row: int + index: str = "" + file: str = "" + box: str = "" + folder: str = "" + sender: str = "" + receivers: str = "" + date: str = "" + location: str = "" + tags: str = "" + summary: str = "" + + +@dataclass +class CanonicalDocument: + index: str + box: str = "" + folder: str = "" + sender_person_id: str = "" + sender_name: str = "" + receiver_person_ids: list = field(default_factory=list) + receiver_names: list = field(default_factory=list) + date_iso: str = "" + date_raw: str = "" + date_precision: str = "" + location: str = "" + tags: list = field(default_factory=list) + summary: str = "" + source_row: int = 0 + needs_review: list = field(default_factory=list) + + +_FIELDS = ["index", "file", "box", "folder", "sender", "receivers", "date", "location", "tags", "summary"] + + +def extract_row(cells: list[str], header: dict[str, int], source_row: int) -> RawRow: + def get(field_name): + idx = header.get(field_name) + if idx is None or idx >= len(cells): + return "" + return (cells[idx] or "").strip() + return RawRow(source_row=source_row, **{f: get(f) for f in _FIELDS}) + + +def triage(cells: list[str], index_col: int = 0) -> Triage: + nonempty = [c for c in cells if c and str(c).strip()] + if not nonempty: + return Triage.EMPTY + index = (cells[index_col] or "").strip() if 
index_col < len(cells) else "" + if not index: + return Triage.BLANK_INDEX + if index.endswith("x"): + return Triage.X_SUFFIX + return Triage.OK + + +def classify_blank_index(cells: list[str], header: dict[str, int]) -> str: + """REQ-TRIAGE-02: 'section_banner' if only name columns are populated, else 'data_no_index'.""" + name_cols = {header.get("sender"), header.get("receivers")} - {None} + populated = {i for i, c in enumerate(cells) if c and str(c).strip()} + if populated and populated <= name_cols: + return "section_banner" + return "data_no_index" + + +def index_file_mismatch(index: str, file_path: str) -> bool: + if not file_path.strip(): + return False + basename = file_path.replace("\\", "/").rsplit("/", 1)[-1] + stem = basename.rsplit(".", 1)[0] + return stem != index +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py +git commit -m "feat(normalizer): row extraction, triage, canonical record" +``` + +--- + +### Task 14: Resolution context + to_canonical (`FR-PERS`, `FR-DATE` integration, REQ-PROV-02) + +**Files:** +- Modify: `tools/import-normalizer/persons.py` +- Modify: `tools/import-normalizer/documents.py` +- Modify: `tools/import-normalizer/tests/test_documents.py` + +- [ ] **Step 1: Add failing tests** to `tests/test_documents.py` + +```python +import persons +import documents + +def _ctx(): + people = persons.parse_register([ + {"last_name": "de Gruyter", "first_name": "Walter"}, + {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"}, + ]) + return persons.ResolutionContext(persons.AliasIndex(people), name_overrides={}) + +def test_to_canonical_resolves_and_flags(): + ctx = _ctx() + raw = documents.RawRow(source_row=3, index="W-0001", box="V", folder="1", + sender="Walter de 
Gruyter", receivers="Eugenie Müller", + date="15.2.1888", location="Rotterdam", tags="Brautbriefe", + summary="Geschäftsreise", file=r"..\__scan\W-0001.pdf") + doc = documents.to_canonical(raw, ctx, date_overrides={}) + assert doc.sender_person_id == "de-gruyter-walter" + assert doc.receiver_person_ids == ["de-gruyter-eugenie"] # matched via maiden alias + assert doc.date_iso == "1888-02-15" and doc.date_precision == "DAY" + assert doc.tags == ["Brautbriefe"] + assert doc.needs_review == [] + +def test_to_canonical_unmatched_and_unparsed(): + ctx = _ctx() + raw = documents.RawRow(source_row=9, index="C-0001", + sender="Hans Wittkopf", receivers="", date="Freitag 1919") + doc = documents.to_canonical(raw, ctx, date_overrides={}) + assert doc.sender_person_id == "wittkopf-hans" # provisional + assert "unmatched_sender" in doc.needs_review + assert "unparsed_date" in doc.needs_review + assert ctx.unmatched["Hans Wittkopf"] == [9] + assert any(p.provisional for p in ctx.provisional.values()) + +def test_to_canonical_splits_multi_sender(): + # REQ-PERS-01 / IMP-11: a multi-person sender is parsed, primary kept, flagged. + ctx = _ctx() + raw = documents.RawRow(source_row=5, index="C-0100", sender="Walter und Eugenie de Gruyter", receivers="") + doc = documents.to_canonical(raw, ctx, date_overrides={}) + assert doc.sender_person_id == "de-gruyter-walter" # first part is primary + assert "multi_sender" in doc.needs_review + +def test_provisional_id_never_collides_with_register(): + # A provisional built from an unmatched string must not steal a register person_id. 
+ people = persons.parse_register([{"last_name": "Cram", "first_name": "Clara"}]) + ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={}) + # Force a provisional whose natural slug equals the register id: the stray comma keeps the + # alias index from resolving "Clara, Cram", yet slugify() still reduces it to "cram-clara". + pid, _, matched = ctx.resolve_one("Clara, Cram", source_row=1) + assert matched is False + assert pid == "cram-clara-2" # suffixed away from the register id + +def test_ambiguous_space_pair_flagged_not_split(): + # US-PERS-02 AC4: "Ella Anita" is kept as one provisional + flagged, never guessed into two. + ctx = _ctx() + raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita") + doc = documents.to_canonical(raw, ctx, date_overrides={}) + assert len(doc.receiver_person_ids) == 1 # not split + assert any(part == "Ella Anita" for _, part, _ in ctx.ambiguous) +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -k "to_canonical or provisional_id or ambiguous_space" -v && cd -` +Expected: FAIL — `ResolutionContext` / `to_canonical` not defined.
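
The guarantee behind `test_provisional_id_never_collides_with_register` comes down to a small suffix loop over the set of ids already in use. This is a standalone sketch of that loop; `unique_id` is written as a free function here for illustration, whereas the plan implements it as a method that builds the `used` set from the register ids and the provisional ids collected so far.

```python
def unique_id(base: str, used: set[str]) -> str:
    """Return base, or base-2, base-3, ... so the result is not in used."""
    pid, n = base, 1
    while pid in used:
        # Keep counting until a free id is found; the first suffix tried is -2.
        n += 1
        pid = f"{base}-{n}"
    return pid
```

Unlike a plain counter, the loop re-checks every candidate against `used`, so it also steps over a suffixed id that is already taken.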
+ +- [ ] **Step 3a: Implement `ResolutionContext`** — add to `persons.py` + +```python +class ResolutionContext: + """Resolves raw name strings to person ids; accumulates provisional persons and review data.""" + def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str]): + self.index = alias_index + self.name_overrides = name_overrides + self.provisional: dict[str, Person] = {} + self.unmatched: dict[str, list] = {} + self.ambiguous: list[tuple] = [] + self._raw_to_pid: dict[str, str] = {} + self.override_hits = 0 + + def _unique_id(self, base: str) -> str: + """A provisional id must never collide with a register id or another provisional.""" + used = self.index.known_ids | set(self.provisional) + pid, n = base, 1 + while pid in used: + n += 1 + pid = f"{base}-{n}" + return pid + + def resolve_one(self, raw_name: str, source_row: int): + """Return (person_id, display_name, matched: bool). '' name -> ('', '', True).""" + name = (raw_name or "").strip() + if not name: + return "", "", True + if name in self.name_overrides: + self.override_hits += 1 + pid = self.name_overrides[name] + return pid, self.index.display(pid) or name, True + pid = self.index.resolve(name) + if pid: + return pid, self.index.display(pid) or name, True + # provisional person (unmatched) — never reuse a register id + self.unmatched.setdefault(name, []).append(source_row) + if name in self._raw_to_pid: + return self._raw_to_pid[name], name, False + last, first = _last_first(name) + pid = self._unique_id(slugify(last, first)) + self.provisional[pid] = Person(person_id=pid, last_name=last, first_name=first, provisional=True) + self._raw_to_pid[name] = pid + return pid, name, False + + def resolve_sender(self, raw: str, source_row: int): + """Senders are split like receivers (REQ-PERS-01). 
Primary = first part; multi flagged.""" + parts = split_receivers(raw) + if not parts: + return "", "", True, False + pid, name, matched = self.resolve_one(parts[0], source_row) + for extra in parts[1:]: + self.resolve_one(extra, source_row) # register the others as persons too + return pid, name, matched, len(parts) > 1 + + def resolve_receivers(self, raw: str, source_row: int): + results = [] + for part in split_receivers(raw): + pid, name, matched = self.resolve_one(part, source_row) + if not matched and " " in part and find_known_last_name(part) is None and len(part.split()) == 2: + self.ambiguous.append((raw, part, source_row)) + results.append((pid, name, matched)) + return results + + +def _last_first(name: str): + """Best-effort split of a free name string into (last, first) for slug/provisional building.""" + name = name.strip() + ln = find_known_last_name(name) + if ln: + first = name[: -len(ln)].strip() + return ln, first + tokens = name.split() + if len(tokens) >= 2: + return tokens[-1], " ".join(tokens[:-1]) + return name, "" +``` + +- [ ] **Step 3b: Implement `to_canonical`** — add to `documents.py` + +```python +import dates as _dates + + +def to_canonical(raw, ctx, date_overrides: dict) -> CanonicalDocument: + pd = _dates.parse_date(raw.date, date_overrides) + flags = [] + + sender_id, sender_name, sender_matched, sender_multi = ctx.resolve_sender(raw.sender, raw.source_row) + if raw.sender.strip() and not sender_matched: + flags.append("unmatched_sender") + if sender_multi: + flags.append("multi_sender") + + receivers = ctx.resolve_receivers(raw.receivers, raw.source_row) + if any(not matched for _, _, matched in receivers): + flags.append("unmatched_receiver") + + if raw.date.strip() and pd.precision == _dates.Precision.UNKNOWN: + flags.append("unparsed_date") + if index_file_mismatch(raw.index, raw.file): + flags.append("index_file_mismatch") + + return CanonicalDocument( + index=raw.index, box=raw.box, folder=raw.folder, + 
sender_person_id=sender_id, sender_name=sender_name, + receiver_person_ids=[r[0] for r in receivers], + receiver_names=[r[1] for r in receivers], + date_iso=pd.iso or "", date_raw=raw.date, date_precision=str(pd.precision), + location=raw.location, tags=[raw.tags] if raw.tags else [], summary=raw.summary, + source_row=raw.source_row, needs_review=flags, + ) +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/persons.py tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py +git commit -m "feat(normalizer): person resolution context + to_canonical" +``` + +--- + +### Task 15: Overrides loader + writers (`FR-OVR`, `FR-OUT`, NFR-OBSERV-01) + +**Files:** +- Create: `tools/import-normalizer/overrides.py` +- Create: `tools/import-normalizer/writers.py` +- Create: `tools/import-normalizer/tests/test_writers.py` + +- [ ] **Step 1: Write failing tests** + +```python +import csv +import openpyxl +import overrides +import writers +import documents + +def test_load_overrides_missing_files(tmp_path): + d, n = overrides.load_overrides(tmp_path / "dates.csv", tmp_path / "names.csv") + assert d == {} and n == {} + +def test_load_overrides_parsed(tmp_path): + dp = tmp_path / "dates.csv" + dp.write_text("raw,iso,precision\n13.5.65,1965-05-13,DAY\n", encoding="utf-8") + np = tmp_path / "names.csv" + np.write_text("raw,person_id\nEugenie Müller,de-gruyter-eugenie\n", encoding="utf-8") + d, n = overrides.load_overrides(dp, np) + assert d["13.5.65"] == ("1965-05-13", "DAY") + assert n["Eugenie Müller"] == "de-gruyter-eugenie" + +def test_write_documents_xlsx_joins_lists(tmp_path): + doc = documents.CanonicalDocument( + index="W-0001", receiver_person_ids=["a", "b"], receiver_names=["A", "B"], + tags=["Brautbriefe"], date_precision="DAY", needs_review=["unparsed_date"]) + 
out = tmp_path / "docs.xlsx" + writers.write_documents_xlsx([doc], out) + wb = openpyxl.load_workbook(out) + ws = wb.active + header = [c.value for c in ws[1]] + assert "receiver_person_ids" in header and "needs_review" in header + row = {h: c.value for h, c in zip(header, ws[2])} + assert row["receiver_person_ids"] == "a|b" + assert row["needs_review"] == "unparsed_date" + +def test_write_review_csv(tmp_path): + out = tmp_path / "r.csv" + writers.write_review_csv(out, ["raw", "count"], [["?", 3], ["x", 1]]) + rows = list(csv.reader(out.open(encoding="utf-8"))) + assert rows[0] == ["raw", "count"] + assert rows[1] == ["?", "3"] + +def test_write_review_csv_defangs_formula_injection(tmp_path): + out = tmp_path / "r.csv" + writers.write_review_csv(out, ["raw", "count"], [["=cmd|'/C calc'!A0", 1], ["-2+3", 2]]) + rows = list(csv.reader(out.open(encoding="utf-8"))) + assert rows[1][0].startswith("'=") # leading '=' neutralised + assert rows[2][0].startswith("'-") + +def test_write_summary_sections(tmp_path): + out = tmp_path / "s.txt" + writers.write_summary(out, {"# INPUTS": "", "rows": 10, "# DATES": "", "unknown_date_rate": "3.2%"}) + text = out.read_text(encoding="utf-8") + assert "INPUTS:" in text and "DATES:" in text and " rows: 10" in text +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd -` +Expected: FAIL — modules not defined. + +- [ ] **Step 3a: Implement `overrides.py`** + +```python +"""Load human-supplied corrections. 
Missing files are not an error.""" +import csv +from pathlib import Path + + +def load_overrides(dates_path: Path, names_path: Path): + date_overrides: dict[str, tuple[str, str]] = {} + name_overrides: dict[str, str] = {} + if Path(dates_path).exists(): + with open(dates_path, encoding="utf-8", newline="") as f: + for row in csv.DictReader(f): + raw = (row.get("raw") or "").strip() + if raw: + date_overrides[raw] = ((row.get("iso") or "").strip(), (row.get("precision") or "UNKNOWN").strip()) + if Path(names_path).exists(): + with open(names_path, encoding="utf-8", newline="") as f: + for row in csv.DictReader(f): + raw = (row.get("raw") or "").strip() + if raw: + name_overrides[raw] = (row.get("person_id") or "").strip() + return date_overrides, name_overrides +``` + +- [ ] **Step 3b: Implement `writers.py`** + +```python +"""Write canonical .xlsx outputs and review .csv files.""" +import csv +import datetime +from pathlib import Path +import openpyxl + +_PIPE = "|" +# Pinned workbook metadata so reruns are content-deterministic (NFR-IDEM-01); openpyxl +# otherwise stamps docProps with the current time on every save. 
+_FIXED_TS = datetime.datetime(2020, 1, 1, 0, 0, 0) + + +def _join(value): + if isinstance(value, list): + return _PIPE.join(str(v) for v in value) + return "" if value is None else str(value) + + +def _csv_safe(value): + """Neutralise spreadsheet formula injection (CWE-1236) in human-opened review CSVs.""" + s = "" if value is None else str(value) + return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r") else s + + +DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name", + "receiver_person_ids", "receiver_names", "date_iso", "date_raw", + "date_precision", "location", "tags", "summary", "source_row", "needs_review"] + +PERSON_COLUMNS = ["person_id", "last_name", "first_name", "maiden_name", "title", "nickname", + "birth_date", "birth_date_raw", "birth_place", "death_date", "death_date_raw", + "death_place", "spouse", "generation", "notes", "aliases", "provisional"] + + +def _write_xlsx(records, columns, path: Path): + wb = openpyxl.Workbook() + ws = wb.active + ws.append(columns) + for rec in records: + ws.append([_join(getattr(rec, col)) for col in columns]) + wb.properties.created = _FIXED_TS + wb.properties.modified = _FIXED_TS + Path(path).parent.mkdir(parents=True, exist_ok=True) + wb.save(path) + + +def write_documents_xlsx(docs, path: Path): + _write_xlsx(docs, DOC_COLUMNS, path) + + +def write_persons_xlsx(people, path: Path): + _write_xlsx(people, PERSON_COLUMNS, path) + + +def write_review_csv(path: Path, header: list[str], rows: list[list]): + Path(path).parent.mkdir(parents=True, exist_ok=True) + with open(path, "w", encoding="utf-8", newline="") as f: + w = csv.writer(f) + w.writerow(header) + for row in rows: + w.writerow([_csv_safe(c) for c in row]) + + +def write_summary(path: Path, stats: dict): + """Render a grouped, scannable summary. 
Keys beginning with '#' are section headers.""" + Path(path).parent.mkdir(parents=True, exist_ok=True) + lines = [] + for k, v in stats.items(): + if k.startswith("#"): + lines.append("") + lines.append(k[1:].strip() + ":") + else: + lines.append(f" {k}: {v}") + Path(path).write_text("\n".join(lines).strip() + "\n", encoding="utf-8") +``` + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd -` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/overrides.py tools/import-normalizer/writers.py tools/import-normalizer/tests/test_writers.py +git commit -m "feat(normalizer): overrides loader + xlsx/csv writers" +``` + +--- + +### Task 16: Orchestrator `normalize.py` + integration test (`FR-OUT`, `FR-TRIAGE`, REQ-TRIAGE-01/03, NFR-IDEM-01) + +**Files:** +- Create: `tools/import-normalizer/normalize.py` +- Create: `tools/import-normalizer/tests/test_normalize.py` + +- [ ] **Step 1: Write the failing integration test** (tiny in-memory fixtures, not the real 7,900-row file) + +```python +import openpyxl +import normalize + +def _doc_wb(tmp_path): + wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Familienarchiv" + ws.append(["Index", "Datei", "Box", "Mappe", "BriefeschreiberIn", "EmpfängerIn", + "Datum des Briefes", "Ort", "Schlagwort", "Inhalt"]) + ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter", + "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"]) + ws.append(["W-0001x", r"..\__scan\W-0001x.pdf", "", "", "Walter de Gruyter", "Eugenie Müller", "", "", "", ""]) + ws.append(["", "", "", "", "Section banner row", "", "", "", "", ""]) + ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""]) + ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter", + "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "dup"]) + p = tmp_path / "docs.xlsx"; 
wb.save(p); return p + +def _person_wb(tmp_path): + wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Tabelle1" + ws.append(["Generation", "Familienname", "Vorname", "geb als", "Geburtsdatum", + "Geburtsort", "Todesdatum", "Sterbeort", "verheiratet mit", "Bemerkung"]) + ws.append(["G 1", "de Gruyter", "Walter", "", "", "", "", "", "", ""]) + ws.append(["G 1", "de Gruyter", "Eugenie", "Müller", "", "", "", "", "", ""]) + p = tmp_path / "persons.xlsx"; wb.save(p); return p + +def test_run_end_to_end(tmp_path): + out_dir = tmp_path / "out"; review_dir = tmp_path / "review" + stats = normalize.run( + document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv", + person_workbook=_person_wb(tmp_path), person_sheet="Tabelle1", + out_dir=out_dir, review_dir=review_dir, + date_overrides={}, name_overrides={}) + assert (out_dir / "canonical-documents.xlsx").exists() + assert (out_dir / "canonical-persons.xlsx").exists() + assert stats["documents_emitted"] == 3 # W-0001, C-0001, W-0001 (dup) — x and blank excluded + assert stats["skipped_x_suffix"] == 1 + assert stats["blank_index_rows"] == 1 + assert stats["duplicate_index_rows"] == 2 + assert (review_dir / "skipped-x-suffix.csv").exists() + assert (review_dir / "unparsed-dates.csv").exists() + # C-0001's "Freitag 1919" is unparseable -> must appear in the review file (NFR-DATA-01) + assert "Freitag 1919" in (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8") + + # determinism (NFR-IDEM-01): a second run yields identical canonical content + review files + def _matrix(p): + wb = openpyxl.load_workbook(p) + return [[c.value for c in row] for row in wb.active.iter_rows()] + docs1 = _matrix(out_dir / "canonical-documents.xlsx") + persons1 = _matrix(out_dir / "canonical-persons.xlsx") + unparsed1 = (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8") + normalize.run(document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv", + person_workbook=_person_wb(tmp_path), 
person_sheet="Tabelle1", + out_dir=out_dir, review_dir=review_dir, date_overrides={}, name_overrides={}) + assert _matrix(out_dir / "canonical-documents.xlsx") == docs1 + assert _matrix(out_dir / "canonical-persons.xlsx") == persons1 + assert (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8") == unparsed1 + assert len(docs1) == 4 # header + 3 docs +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v && cd -` +Expected: FAIL — `normalize` not defined. + +- [ ] **Step 3: Implement `normalize.py`** + +```python +"""Orchestrator: read raw workbooks -> canonical outputs + review reports.""" +import argparse +from collections import Counter +from pathlib import Path + +import config +import ingest +import persons +import documents +import overrides as overrides_mod +import writers + + +def run(*, document_workbook, document_sheet, person_workbook, person_sheet, + out_dir, review_dir, date_overrides, name_overrides) -> dict: + out_dir, review_dir = Path(out_dir), Path(review_dir) + + # --- persons --- + person_rows = ingest.read_sheet(person_workbook, person_sheet) + p_fields, _ = ingest.build_header_map(person_rows[0], config.PERSON_HEADER_MAP, config.PERSON_REQUIRED_FIELDS) + person_dicts = [{f: (row[i] if i < len(row) else "") for f, i in p_fields.items()} for row in person_rows[1:]] + register = persons.parse_register(person_dicts) + alias_index = persons.AliasIndex(register) + ctx = persons.ResolutionContext(alias_index, name_overrides) + + # --- documents --- + doc_rows = ingest.read_sheet(document_workbook, document_sheet) + d_fields, unknown_headers = ingest.build_header_map(doc_rows[0], config.DOCUMENT_HEADER_MAP, config.DOCUMENT_REQUIRED_FIELDS) + index_col = d_fields["index"] + + canon_docs, blank_index, skipped_x, mismatches = [], [], [], [] + unparsed_by_raw: dict[str, list] = {} + dates_by_override = 0 + empty_count = 0 + seen_index = Counter() + + for 
source_row, cells in enumerate(doc_rows[1:], start=2): + t = documents.triage(cells, index_col) + if t is documents.Triage.EMPTY: + empty_count += 1 + continue + if t is documents.Triage.BLANK_INDEX: + blank_index.append([source_row, documents.classify_blank_index(cells, d_fields), + " | ".join(c for c in cells if c)]) + continue + if t is documents.Triage.X_SUFFIX: + idx = (cells[index_col] or "").strip() + skipped_x.append([source_row, idx, idx[:-1]]) + continue + raw = documents.extract_row(cells, d_fields, source_row) + seen_index[raw.index] += 1 + if raw.date.strip() and raw.date.strip() in date_overrides: + dates_by_override += 1 + doc = documents.to_canonical(raw, ctx, date_overrides) + if "unparsed_date" in doc.needs_review: + unparsed_by_raw.setdefault(raw.date, []).append(source_row) + if "index_file_mismatch" in doc.needs_review: + mismatches.append([source_row, raw.index, raw.file]) + canon_docs.append(doc) + + # REQ-TRIAGE-01: flag EVERY occurrence of a duplicated index and report all of them. + dup_indexes = {idx for idx, n in seen_index.items() if n > 1} + duplicates = [] + for doc in canon_docs: + if doc.index in dup_indexes: + if "duplicate_index" not in doc.needs_review: + doc.needs_review.append("duplicate_index") + duplicates.append([doc.source_row, doc.index]) + + all_people = register + list(ctx.provisional.values()) + + # --- write canonical outputs --- + writers.write_documents_xlsx(canon_docs, out_dir / "canonical-documents.xlsx") + writers.write_persons_xlsx(all_people, out_dir / "canonical-persons.xlsx") + + # --- review files --- + # unparsed dates: most-frequent first, with example source rows + blank override cells so a + # corrected row can be pasted straight into overrides/dates.csv (same raw,iso,precision shape). 
+ unparsed_rows = sorted( + ([raw, len(rows), " ".join(map(str, rows[:5])), "", ""] for raw, rows in unparsed_by_raw.items()), + key=lambda r: (-r[1], r[0])) + writers.write_review_csv(review_dir / "unparsed-dates.csv", + ["raw", "count", "example_rows", "suggested_iso", "suggested_precision"], unparsed_rows) + + unmatched_rows = [] + for name, rows in sorted(ctx.unmatched.items()): + sid, score = alias_index.suggest(name) + unmatched_rows.append([name, len(rows), " ".join(map(str, rows[:5])), + sid or "", f"{score:.2f}" if sid else ""]) + writers.write_review_csv(review_dir / "unmatched-names.csv", + ["raw", "count", "example_rows", "suggested_id", "suggested_score"], unmatched_rows) + + writers.write_review_csv(review_dir / "duplicate-index.csv", ["source_row", "index"], duplicates) + writers.write_review_csv(review_dir / "blank-index-rows.csv", ["source_row", "kind", "content"], blank_index) + writers.write_review_csv(review_dir / "skipped-x-suffix.csv", ["source_row", "index", "base_index"], skipped_x) + writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous) + writers.write_review_csv(review_dir / "index-file-mismatch.csv", ["source_row", "index", "file"], mismatches) + + dated = sum(1 for d in canon_docs if d.date_raw.strip()) + unknown = sum(1 for d in canon_docs if d.date_raw.strip() and d.date_precision == "UNKNOWN") + unknown_rate = f"{(100 * unknown / dated):.1f}%" if dated else "0.0%" + + stats = { + "# INPUTS": "", + "document_rows_read": len(doc_rows) - 1, + "register_persons": len(register), + "unknown_headers": ", ".join(unknown_headers) or "(none)", + "# OUTPUTS": "", + "documents_emitted": len(canon_docs), + "provisional_persons": len(ctx.provisional), + "# DATES": "", + "dated_rows": dated, + "unparsed_dates": unknown, + "unknown_date_rate": f"{unknown_rate} (target <=5%)", + "distinct_unparsed_formats": len(unparsed_by_raw), + "# NAMES": "", + "unmatched_name_strings": len(ctx.unmatched), + 
"ambiguous_receivers": len(ctx.ambiguous), + "# ANOMALIES": "", + "empty_rows": empty_count, + "blank_index_rows": len(blank_index), + "skipped_x_suffix": len(skipped_x), + "duplicate_index_rows": len(duplicates), + "index_file_mismatches": len(mismatches), + "# OVERRIDES": "", + "date_overrides_loaded": len(date_overrides), + "name_overrides_loaded": len(name_overrides), + "dates_resolved_by_override": dates_by_override, + "names_resolved_by_override": ctx.override_hits, + } + writers.write_summary(review_dir / "summary.txt", stats) + return stats + + +def main(): + parser = argparse.ArgumentParser(description="Normalize the family archive spreadsheets.") + parser.parse_args() + date_overrides, name_overrides = overrides_mod.load_overrides( + config.OVERRIDES_DIR / "dates.csv", config.OVERRIDES_DIR / "names.csv") + stats = run( + document_workbook=config.DOCUMENT_WORKBOOK, document_sheet=config.DOCUMENT_SHEET, + person_workbook=config.PERSON_WORKBOOK, person_sheet=config.PERSON_SHEET, + out_dir=config.OUT_DIR, review_dir=config.REVIEW_DIR, + date_overrides=date_overrides, name_overrides=name_overrides) + print("Normalization complete:") + for k, v in stats.items(): + print(f" {k}: {v}") + + +if __name__ == "__main__": + main() +``` + +> **Note for the implementer:** duplicate-index handling is a single second pass over `canon_docs` (`for doc in canon_docs: if doc.index in dup_indexes`) — this flags AND reports *every* colliding occurrence including the first (REQ-TRIAGE-01), not just repeats. Do not reintroduce a per-row append in the main loop. + +- [ ] **Step 4: Run to verify it passes** + +Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v && cd -` +Expected: PASS. 
+ +- [ ] **Step 5: Commit** + +```bash +git add tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_normalize.py +git commit -m "feat(normalizer): orchestrator + end-to-end integration test" +``` + +--- + +### Task 17: README, seed overrides, and a real dry-run + +**Files:** +- Create: `tools/import-normalizer/README.md` +- Create: `tools/import-normalizer/overrides/dates.csv` +- Create: `tools/import-normalizer/overrides/names.csv` + +- [ ] **Step 1: Seed the overrides files** (header-only) + +`overrides/dates.csv`: +``` +raw,iso,precision +``` +`overrides/names.csv`: +``` +raw,person_id +``` + +- [ ] **Step 2: Write `README.md`** + +````markdown +# Import Normalizer + +Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical +dataset (`out/`) plus review reports (`review/`). See the spec: +`../../docs/import-migration/02-normalization-spec.md`. + +## Setup +Requires **Python 3.12** (uses `StrEnum`). +```bash +python3 -m venv .venv && .venv/bin/pip install -r requirements.txt +``` + +## Run +```bash +.venv/bin/python normalize.py +``` +Outputs: +- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx` +- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate) + +## Iteration loop +1. **Run.** Read `review/summary.txt` for the health snapshot. +2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat. + +| Review file | What to do | +| --- | --- | +| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). | +| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). 
| +| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. | +| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename; reconcile when the PDFs arrive. | +| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. | + +**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`. + +## Tests +```bash +.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once) +``` +```` + +- [ ] **Step 3: Run the whole test suite file-by-file to confirm green** + +Run each individually (per the "no full-suite" rule): +```bash +cd tools/import-normalizer +for t in config dates persons ingest documents writers normalize; do .venv/bin/python -m pytest tests/test_$t.py -q || break; done +cd - +``` +Expected: every file reports all passed. + +- [ ] **Step 4: Real dry-run against the actual import data (manual verification, not a test)** + +Run: `cd tools/import-normalizer && .venv/bin/python normalize.py && cd -` +Expected: prints stats. Then inspect: +- `review/summary.txt`: sanity-check counts (≈7,600 documents emitted, register_persons ≈163). +- `review/unparsed-dates.csv`: confirm the `UNKNOWN` rate is in the low single-digit % of dated rows (NFR-ACCUR-01 target ≤5% before overrides). If higher, note the dominant unhandled formats for a follow-up parser tweak. +- Spot-check `out/canonical-documents.xlsx`: open the first ~20 rows; verify `date_iso`/`date_precision`, `sender_person_id`, and `receiver_person_ids` look right (e.g. `Eugenie Müller` → `de-gruyter-eugenie`). + +Record the run's `summary.txt` figures in `../../docs/import-migration/WORKLOG.md`.
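The NFR-ACCUR-01 check in Step 4 can also be reproduced by hand. A rough sketch, assuming the rows are available as `(date_raw, date_precision)` pairs: `unknown_date_rate` is a hypothetical helper that mirrors the `dated`/`unknown` arithmetic in `run()`, and the sample rows (including the `"DAY"` precision label) are made up for illustration.

```python
def unknown_date_rate(rows: list[tuple[str, str]]) -> float:
    """Percent of dated rows whose precision is UNKNOWN (0.0 when nothing is dated)."""
    # Only rows with a non-blank raw date count towards the denominator.
    dated = [precision for raw, precision in rows if raw.strip()]
    if not dated:
        return 0.0
    unknown = sum(1 for precision in dated if precision == "UNKNOWN")
    return 100 * unknown / len(dated)


# Illustrative data: 19 parsed dates, 1 unparseable date, 1 row with no date at all.
rows = [("15.2.1888", "DAY")] * 19 + [("Freitag 1919", "UNKNOWN"), ("", "UNKNOWN")]
rate = unknown_date_rate(rows)
print(f"{rate:.1f}%")  # 5.0%
print(rate <= 5.0)     # True: within the pre-override target
```

Rows without any raw date are excluded from the denominator, so an archive with many undated letters does not dilute the rate.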
+ +- [ ] **Step 5: Commit** (commit only source + seeds; `out/` and `review/` are gitignored) + +```bash +git add tools/import-normalizer/README.md tools/import-normalizer/overrides/dates.csv tools/import-normalizer/overrides/names.csv +git commit -m "docs(normalizer): README + seed overrides" +``` + +--- + +## Self-Review + +**Spec coverage check:** +- `FR-INGEST`/`FR-MAP` → Task 12 (header-name mapping, missing-required raises, unknown headers reported). ✓ +- `FR-TRIAGE` (REQ-TRIAGE-01/02/03) → Task 13 (triage by index-col, `classify_blank_index` banner detection) + Task 16 (single-pass duplicate flagging of *all* occurrences, blank-index report with `kind`, x-suffix skip+log). ✓ +- `FR-DATE` (REQ-DATE-01..06) → Tasks 2–8 (computus, feast/season, century rule, all matchers, overrides). ✓ +- `FR-PERS`/US-PERS-01 → Task 9; `REQ-PERS-01`/receiver split/AC4 ambiguous → Tasks 10, 14. ✓ +- `FR-DEDUP` (REQ-DEDUP-01/02) → Task 11 (maiden/married/nickname aliases, conservative; fuzzy = suggestion only). ✓ +- `FR-OVR` (REQ-OVR-01/02/03) → Task 15 (loader, missing-file tolerant) + Task 16 (applied + counted: `dates_resolved_by_override` / `names_resolved_by_override`) + Task 16 content-determinism assertion (two-run cell-matrix + review-file equality). ✓ +- `FR-OUT`/`FR-PROV` (REQ-OUT-01/02, REQ-PROV-01/02) → Tasks 13 (source_row, needs_review), 15 (writers), 16 (mismatch report). ✓ +- NFRs: DATA-01 (every row → output or review) covered by triage routing; OBSERV-01 → summary.txt; I18N-01 → utf-8 everywhere + diacritic map; TEST-01 → per-module tests; MAINT-01 → config tables. ✓ +- Data dictionary §6 → `DOC_COLUMNS`/`PERSON_COLUMNS` in Task 15 match the spec field list. ✓ + +**Placeholder scan:** No TBD/TODO; every code step shows complete code. The dead `pass` line formerly in Task 16 has been removed, so no placeholder code remains.
✓ + +**Type consistency:** `ParsedDate(iso, precision, raw)`, `Precision` (StrEnum → `str()` yields the value), `Person`, `RawRow`, `CanonicalDocument`, `AliasIndex.resolve/display/suggest`, `ResolutionContext.resolve_one/resolve_receivers`, `to_canonical(raw, ctx, date_overrides)`, `run(**kwargs)` — names line up across tasks. ✓ + +**Known follow-ups (out of scope for this plan):** Phase-2 importer wiring (`B11`); comma-splitting `Inhalt` into extra tags (`B10`, Could). These are intentionally deferred. + +--- + +## Review feedback incorporated (2026-05-25) + +Six personas reviewed this plan inline; the following changes were applied (see the session summary for detail): + +- **Idempotency redefined (architect/tester/req-eng):** spec G4/NFR-IDEM-01 changed from "byte-identical" to **content-deterministic**; Task 15 pins workbook `created`/`modified`; Task 11 builds aliases via ordered lists (no set-iteration leakage); Task 16 test now compares two runs' cell matrices + review files. +- **Duplicate-index bug fixed (developer/architect):** Task 16 now flags and reports *every* occurrence of a duplicated index in one pass; the dead `pass` line was removed; the test stat (`==2`) is correct. +- **Provisional id collision guarded (architect):** Task 14 `ResolutionContext._unique_id` suffixes provisional ids so they never overwrite a register `person_id`. +- **Date gaps closed (tester):** added invalid-calendar-date → UNKNOWN test, intra-month day-range matcher (`7./8. Sept.1923` → RANGE) + test, and a trailing-note-preservation test. +- **Multi-person sender (tester/req-eng, REQ-PERS-01):** Task 14 `resolve_sender` splits the sender, keeps the primary, flags `multi_sender`. +- **CSV injection defanged (security):** Task 15 `write_review_csv` neutralises leading `= + - @` etc. in human-opened CSVs (+ test). 
+- **REQ-TRIAGE-02 / REQ-OVR-03 realized (req-eng):** banner-vs-data classification in `blank-index-rows.csv`; override-application counts + an `unknown_date_rate` headline in `summary.txt`. +- **Ergonomics (UX):** `unparsed-dates.csv` now carries `example_rows` + blank `suggested_iso/precision` (paste-ready); `unmatched-names.csv` suggestion blanks-out on no-match and rounds the score; grouped `summary.txt`; README documents every review file + where to source `person_id`. +- **Repo hygiene (devops):** pinned `openpyxl==3.1.5` / `pytest==8.3.4`; hardened the **root** `.gitignore` against the committed-`.venv` class of mistake; documented the Python 3.12 requirement. diff --git a/docs/import-migration/WORKLOG.md b/docs/import-migration/WORKLOG.md index ef7b2e38..2f82baf5 100644 --- a/docs/import-migration/WORKLOG.md +++ b/docs/import-migration/WORKLOG.md @@ -4,6 +4,30 @@ Running log of each working session. **Resume here.** Newest entry on top. --- +## 2026-05-25 (session 3) — Implementation plan + persona review + +**Did:** +- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17 + bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date + parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers. +- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops; + ui-expert too) via parallel agents. Acted on all material findings. + +**Key fixes from review (see plan §"Review feedback incorporated"):** +- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01); + pinned workbook timestamps + deterministic alias ordering + a real two-run equality test. +- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence. +- Provisional `person_id` could overwrite a register id → now suffixed. +- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. 
Sept.1923`). +- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files; + pinned deps + hardened root `.gitignore`. + +**Next:** +- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser + (Task 3/8 + Easter computus) is the meatiest piece. + +--- + ## 2026-05-25 (session 2) — Strategy + normalizer spec **Did:**