familienarchiv/docs/import-migration/03-normalizer-implementation-plan.md
Marcel 6f7aa643c9 docs(import): add normalizer implementation plan + apply persona review
17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:55:50 +02:00


Import Normalizer Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build an offline Python tool that turns the raw family-archive spreadsheets into a clean, canonical dataset (canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs, with a deterministic overrides-and-rerun loop.

Architecture: A standalone Python package at tools/import-normalizer/. Pure, independently-testable units — date parsing (dates.py), person/register logic (persons.py), spreadsheet ingest (ingest.py), row mapping (documents.py) — are orchestrated by normalize.py. Source workbooks are read-only; all tunables live in config.py. Residue (unparseable dates, unmatched names) is reported to review/*.csv and corrected via version-controlled overrides/*.csv applied on each run.

Tech Stack: Python 3.12, openpyxl (xlsx read/write), pytest. No third-party fuzzy library — difflib (stdlib) provides suggestions only (never auto-applied), per the conservative-matching requirement.
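
A minimal sketch of the suggestion-only matching described above. The helper name suggest_names and the sample name list are assumptions made for illustration; the 0.82 cutoff mirrors FUZZY_SUGGEST_THRESHOLD in config.py. The output is meant to feed a review column and is never applied automatically.

```python
import difflib

# Hypothetical sample of canonical names; the real list would come from the person register.
KNOWN_NAMES = ["Dieckmann", "Gruber", "Müller", "Wolff", "Cram"]


def suggest_names(raw: str, known: list[str], cutoff: float = 0.82, limit: int = 3) -> list[str]:
    """Return close candidates for a human reviewer; nothing is auto-applied."""
    return difflib.get_close_matches(raw, known, n=limit, cutoff=cutoff)


print(suggest_names("Dieckman", KNOWN_NAMES))  # close spelling: a candidate is offered
print(suggest_names("Schmidt", KNOWN_NAMES))   # nothing close enough: empty list
```

Keeping the cutoff high means a reviewer sees few, plausible candidates rather than a long list of weak ones.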

Spec: 02-normalization-spec.md. Requirement IDs (FR-*, REQ-*, NFR-*) referenced per task.


File Structure

tools/import-normalizer/
├── config.py        # paths, header maps, century rule, season/feast tables, month tables, matching config
├── dates.py         # Easter computus, feast/season resolution, year expansion, parse_date()
├── persons.py       # slug, Person, parse_register(), split_receivers(), AliasIndex, ResolutionContext
├── ingest.py        # read_sheet(), build_header_map()
├── documents.py     # RawRow, extract_row(), triage helpers, CanonicalDocument, to_canonical()
├── writers.py       # write_documents_xlsx(), write_persons_xlsx(), write_review_csv(), write_summary()
├── overrides.py     # load_overrides()
├── normalize.py     # main() orchestrator + CLI
├── requirements.txt
├── .gitignore       # .venv/ out/ review/ __pycache__/
├── README.md
├── overrides/
│   ├── dates.csv    # seed header: raw,iso,precision
│   └── names.csv    # seed header: raw,person_id
└── tests/
    ├── __init__.py
    ├── test_dates.py
    ├── test_persons.py
    ├── test_ingest.py
    ├── test_documents.py
    ├── test_writers.py
    └── test_normalize.py

Test command convention (per the "never run the full suite" rule, run targeted files only): tools/import-normalizer/.venv/bin/python -m pytest tools/import-normalizer/tests/test_X.py -v (the per-task steps below run the same targeted invocation from inside the tool directory)

All git commands assume CWD = repo root and the current branch docs/import-migration.


Task 1: Project scaffold, venv, config constants

Files:

  • Create: tools/import-normalizer/requirements.txt

  • Create: tools/import-normalizer/.gitignore

  • Create: tools/import-normalizer/config.py

  • Create: tools/import-normalizer/tests/__init__.py

  • Create: tools/import-normalizer/tests/test_config.py

  • Step 1: Create requirements.txt (pinned — an openpyxl minor bump can change xlsx serialization and break determinism, NFR-IDEM-01)

openpyxl==3.1.5
pytest==8.3.4
  • Step 2: Create the tool-local .gitignore
.venv/
out/
review/
__pycache__/
*.pyc
  • Step 2b: Harden the repo-root .gitignore (the root file currently has no venv pattern — that is how ocr-service/.venv got committed; prevent the whole class). Append these lines to /home/marcel/Desktop/familienarchiv/.gitignore if not already present:
**/.venv/
**/__pycache__/
*.pyc

(Cleaning up the already-committed ocr-service/.venv via git rm -r --cached ocr-service/.venv is a separate task — do NOT bundle it into this branch.)

  • Step 3: Create config.py
"""Tunables for the import normalizer. No logic here — only data tables."""
from pathlib import Path

# --- Paths ---
BASE_DIR = Path(__file__).resolve().parent
REPO_ROOT = BASE_DIR.parent.parent
IMPORT_DIR = REPO_ROOT / "import"

DOCUMENT_WORKBOOK = IMPORT_DIR / "zzfamilienarchiv aktuell  2 - Kopie 2025-07-05.xlsx"
DOCUMENT_SHEET = "Familienarchiv"
PERSON_WORKBOOK = IMPORT_DIR / "Personendatei 2.xlsx"
PERSON_SHEET = "Tabelle1"

OUT_DIR = BASE_DIR / "out"
REVIEW_DIR = BASE_DIR / "review"
OVERRIDES_DIR = BASE_DIR / "overrides"

# --- Header text (lowercased, whitespace-collapsed) -> canonical field ---
DOCUMENT_HEADER_MAP = {
    "index": "index",
    "datei": "file",
    "box": "box",
    "mappe": "folder",
    "briefeschreiberin": "sender",
    "empfängerin": "receivers",
    "datum des briefes": "date",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
DOCUMENT_REQUIRED_FIELDS = {"index"}

PERSON_HEADER_MAP = {
    "generation": "generation",
    "familienname": "last_name",
    "vorname": "first_name",
    "geb als": "maiden_name",
    "geburtsdatum": "birth_date",
    "geburtsort": "birth_place",
    "todesdatum": "death_date",
    "sterbeort": "death_place",
    "verheiratet mit": "spouse",
    "bemerkung": "notes",
}
PERSON_REQUIRED_FIELDS = {"last_name"}

# --- Century rule (archive 1873–1957) ---
TWO_DIGIT_19XX_MAX = 57   # 00..57 -> 1900+yy
TWO_DIGIT_18XX_MIN = 73   # 73..99 -> 1800+yy ; 58..72 -> ambiguous -> UNKNOWN

# --- Seasons -> representative month (day = 1) ---
SEASON_MONTHS = {
    "frühling": 4, "fruehling": 4, "frühjahr": 4, "fruehjahr": 4,
    "sommer": 7, "herbst": 10, "winter": 1,
}

# --- Fixed feasts -> (month, day) ---
FIXED_FEASTS = {
    "neujahr": (1, 1),
    "heiligabend": (12, 24), "heiliger abend": (12, 24),
    "weihnachten": (12, 25), "weihnacht": (12, 25), "1. weihnachtstag": (12, 25),
    "silvester": (12, 31), "sylvester": (12, 31),
}

# --- Movable feasts -> day offset from Easter Sunday ---
MOVABLE_FEASTS = {
    "karfreitag": -2,
    "ostern": 0, "ostersonntag": 0, "ostermontag": 1,
    "himmelfahrt": 39, "christi himmelfahrt": 39,
    "pfingsten": 49, "pfingstsonntag": 49, "pfingstmontag": 50,
    "fronleichnam": 60,
}

# --- Month names -> number (German + English, full + abbreviations) ---
MONTHS = {
    "januar": 1, "jan": 1, "january": 1,
    "februar": 2, "feb": 2, "febr": 2, "february": 2,
    "märz": 3, "maerz": 3, "mär": 3, "mar": 3, "march": 3,
    "april": 4, "apr": 4,
    "mai": 5, "may": 5,
    "juni": 6, "jun": 6, "june": 6,
    "juli": 7, "jul": 7, "july": 7,
    "august": 8, "aug": 8,
    "september": 9, "sep": 9, "sept": 9,
    "oktober": 10, "okt": 10, "oct": 10, "october": 10,
    "november": 11, "nov": 11,
    "dezember": 12, "dez": 12, "dec": 12, "december": 12,
}

ROMAN_MONTHS = {
    "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6,
    "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12,
}

# --- Person matching ---
KNOWN_LAST_NAMES = [
    "von der Heide", "von Massenbach", "von Geldern", "von Gelden", "von Staa",
    "de Gruyter", "Dieckmann", "Gruber", "Müller", "Wolff", "Cram",
]
FUZZY_SUGGEST_THRESHOLD = 0.82  # difflib ratio; suggestions only, never auto-applied
  • Step 4: Create tests/__init__.py as an empty file.

  • Step 5: Write tests/test_config.py

import config

def test_century_boundaries():
    assert config.TWO_DIGIT_19XX_MAX == 57
    assert config.TWO_DIGIT_18XX_MIN == 73

def test_header_maps_cover_required_fields():
    assert "index" in config.DOCUMENT_HEADER_MAP.values()
    assert "last_name" in config.PERSON_HEADER_MAP.values()

def test_feast_tables_present():
    assert config.MOVABLE_FEASTS["pfingsten"] == 49
    assert config.SEASON_MONTHS["herbst"] == 10
  • Step 6: Create the venv and install deps

Run:

cd tools/import-normalizer && python3 -m venv .venv && .venv/bin/pip install -r requirements.txt && cd -

Expected: openpyxl + pytest install successfully.

  • Step 7: Run the config test

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v)

Expected: 3 passed. (Tests import config directly, so pytest runs with CWD = the tool dir; conftest.py is unnecessary because the modules are flat in that dir.)

  • Step 8: Commit
git add .gitignore tools/import-normalizer/requirements.txt tools/import-normalizer/.gitignore tools/import-normalizer/config.py tools/import-normalizer/tests/__init__.py tools/import-normalizer/tests/test_config.py
git commit -m "feat(normalizer): scaffold tool + config tables"
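
The header maps in config.py are keyed on lowercased, whitespace-collapsed header text. A minimal sketch of how a header row could be matched against them follows. normalize_header and map_headers are hypothetical helpers invented for this illustration (the plan's real ingest functions are read_sheet() and build_header_map() in ingest.py), and only a subset of DOCUMENT_HEADER_MAP is repeated so the block runs on its own.

```python
# Subset of DOCUMENT_HEADER_MAP from config.py, repeated so the sketch is self-contained.
DOCUMENT_HEADER_MAP = {
    "index": "index",
    "datum des briefes": "date",
    "empfängerin": "receivers",
}


def normalize_header(cell) -> str:
    """Lowercase the cell text and collapse runs of whitespace (hypothetical helper)."""
    return " ".join(str(cell or "").lower().split())


def map_headers(header_row: list) -> dict[str, int]:
    """Map canonical field name -> column position for the headers we recognise."""
    mapping: dict[str, int] = {}
    for col, cell in enumerate(header_row):
        field = DOCUMENT_HEADER_MAP.get(normalize_header(cell))
        if field is not None and field not in mapping:
            mapping[field] = col  # first occurrence wins; unknown headers are ignored
    return mapping


print(map_headers(["Index", "  Datum  des\nBriefes ", None, "Empfängerin"]))
```

Normalising before the lookup keeps the map itself small: stray spaces, line breaks inside a header cell, and capitalisation differences do not need their own entries.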

Task 2: Easter computus (REQ-DATE-06)

Files:

  • Create: tools/import-normalizer/dates.py

  • Create: tools/import-normalizer/tests/test_dates.py

  • Step 1: Write the failing test in tests/test_dates.py

import datetime
import dates

def test_easter_known_years():
    # Anonymous Gregorian algorithm — verified against published tables
    assert dates.easter(2024) == datetime.date(2024, 3, 31)
    assert dates.easter(2000) == datetime.date(2000, 4, 23)
    assert dates.easter(1922) == datetime.date(1922, 4, 16)
    assert dates.easter(1888) == datetime.date(1888, 4, 1)
  • Step 2: Run test to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v)

Expected: FAIL with ModuleNotFoundError: No module named 'dates' or AttributeError: module 'dates' has no attribute 'easter'. (The subshell keeps the caller at the repo root; a trailing && cd - would be skipped when pytest fails.)

  • Step 3: Create dates.py with the computus
"""Tolerant historical date parsing for the family archive."""
import datetime


def easter(year: int) -> datetime.date:
    """Easter Sunday (Gregorian) via the Anonymous Gregorian / Butcher algorithm."""
    a = year % 19
    b = year // 100
    c = year % 100
    d = b // 4
    e = b % 4
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i = c // 4
    k = c % 4
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month = (h + l - 7 * m + 114) // 31
    day = ((h + l - 7 * m + 114) % 31) + 1
    return datetime.date(year, month, day)
  • Step 4: Run test to verify it passes

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v)

Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): Easter computus"
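
Beyond the four spot checks, a property check over the archive's year range can catch transcription slips in the computus. The sketch below is an optional extra and not part of the plan's test file. It repeats the algorithm in compact form so it runs standalone, and check_easter_range is a hypothetical helper name. It relies on two facts about Gregorian Easter: it always falls on a Sunday, and always between 22 March and 25 April.

```python
import datetime


def easter(year: int) -> datetime.date:
    """Compact copy of the computus from dates.py, repeated so the sketch is self-contained."""
    a, b, c = year % 19, year // 100, year % 100
    d, e = b // 4, b % 4
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = c // 4, c % 4
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    n = h + l - 7 * m + 114
    return datetime.date(year, n // 31, n % 31 + 1)


def check_easter_range(first: int = 1873, last: int = 1957) -> None:
    """Assert the two invariants for every year in the archive range."""
    for year in range(first, last + 1):
        day = easter(year)
        assert day.weekday() == 6, f"{year}: {day} is not a Sunday"
        earliest = datetime.date(year, 3, 22)
        latest = datetime.date(year, 4, 25)
        assert earliest <= day <= latest, f"{year}: {day} is outside 22 March..25 April"


check_easter_range()
```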

Task 3: Feast & season resolution (REQ-DATE-02, REQ-DATE-06)

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add the failing test to tests/test_dates.py

from dates import Precision

def test_resolve_feast_movable():
    assert dates.resolve_feast_or_season("Pfingsten", 1922) == ("1922-06-04", Precision.DAY)
    assert dates.resolve_feast_or_season("Ostern", 2024) == ("2024-03-31", Precision.DAY)
    assert dates.resolve_feast_or_season("Pfingstmontag", 1922) == ("1922-06-05", Precision.DAY)

def test_resolve_feast_fixed():
    assert dates.resolve_feast_or_season("Weihnachten", 1900) == ("1900-12-25", Precision.DAY)
    assert dates.resolve_feast_or_season("Neujahr", 1910) == ("1910-01-01", Precision.DAY)

def test_resolve_season():
    assert dates.resolve_feast_or_season("Herbst", 1913) == ("1913-10-01", Precision.SEASON)
    assert dates.resolve_feast_or_season("Sommer", 1910) == ("1910-07-01", Precision.SEASON)

def test_resolve_unknown_token_returns_none():
    assert dates.resolve_feast_or_season("Freitag", 1919) is None
  • Step 2: Run to verify it fails

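Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "resolve" -v)

Expected: FAIL (Precision and resolve_feast_or_season are not defined yet). The -k "resolve" filter selects all four new tests, including test_resolve_unknown_token_returns_none, whose name contains neither "feast" nor "season".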

  • Step 3: Implement — add to dates.py (top imports + new code)
from enum import StrEnum
import config


class Precision(StrEnum):
    DAY = "DAY"
    MONTH = "MONTH"
    SEASON = "SEASON"
    YEAR = "YEAR"
    RANGE = "RANGE"
    APPROX = "APPROX"
    UNKNOWN = "UNKNOWN"


def _advent_sunday(year: int, n: int) -> datetime.date:
    """n-th Advent (1..4). 4th Advent = last Sunday on/before Dec 24."""
    dec24 = datetime.date(year, 12, 24)
    back_to_sunday = (dec24.weekday() - 6) % 7  # Mon=0..Sun=6
    fourth = dec24 - datetime.timedelta(days=back_to_sunday)
    return fourth - datetime.timedelta(days=(4 - n) * 7)


def resolve_feast_or_season(token: str, year: int):
    """Return (iso, Precision) for a known feast/season token, else None."""
    key = " ".join(token.lower().split()).strip(" .")
    if key in config.MOVABLE_FEASTS:
        d = easter(year) + datetime.timedelta(days=config.MOVABLE_FEASTS[key])
        return d.isoformat(), Precision.DAY
    if key in config.FIXED_FEASTS:
        month, day = config.FIXED_FEASTS[key]
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    advent = {"1. advent": 1, "2. advent": 2, "3. advent": 3, "4. advent": 4, "advent": 1}
    if key in advent:
        return _advent_sunday(year, advent[key]).isoformat(), Precision.DAY
    if key in config.SEASON_MONTHS:
        return datetime.date(year, config.SEASON_MONTHS[key], 1).isoformat(), Precision.SEASON
    return None
  • Step 4: Run to verify it passes

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "resolve" -v)

Expected: PASS (all 4; the -k "resolve" filter also selects test_resolve_unknown_token_returns_none). (Pfingstmontag 1922 = Easter Apr 16 + 50 = June 5.)

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): feast + season resolution"
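
The Task 3 tests cover movable feasts, fixed feasts and seasons but not the Advent branch. A minimal sketch of a check for it follows, assuming the same weekday arithmetic as _advent_sunday. The function is copied under the public name advent_sunday purely so the block runs on its own. The reference dates were derived by hand: 24 December 1922 fell on a Sunday (252 days, exactly 36 weeks, after Easter Sunday, 16 April 1922), and 24 December 2024 was a Tuesday.

```python
import datetime


def advent_sunday(year: int, n: int) -> datetime.date:
    """n-th Advent (1..4); the 4th Advent is the last Sunday on or before 24 December."""
    dec24 = datetime.date(year, 12, 24)
    back_to_sunday = (dec24.weekday() - 6) % 7  # weekday(): Mon=0 .. Sun=6
    fourth = dec24 - datetime.timedelta(days=back_to_sunday)
    return fourth - datetime.timedelta(days=(4 - n) * 7)


# 1922: 24 December is itself a Sunday, so it is the 4th Advent.
assert advent_sunday(1922, 4) == datetime.date(1922, 12, 24)
assert advent_sunday(1922, 1) == datetime.date(1922, 12, 3)

# 2024: 24 December is a Tuesday, so the 4th Advent is two days earlier.
assert advent_sunday(2024, 4) == datetime.date(2024, 12, 22)
assert advent_sunday(2024, 1) == datetime.date(2024, 12, 1)

# Every Advent Sunday in the archive range must actually be a Sunday.
for year in range(1873, 1958):
    for n in (1, 2, 3, 4):
        assert advent_sunday(year, n).weekday() == 6
```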

Task 4: Year expansion / century rule (REQ-DATE-03)

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add the failing test

def test_expand_year():
    assert dates.expand_year("1888") == 1888
    assert dates.expand_year("889") == 1889      # 3-digit -> 1DDD
    assert dates.expand_year("923") == 1923
    assert dates.expand_year("08") == 1908       # 00..57 -> 19xx
    assert dates.expand_year("17") == 1917
    assert dates.expand_year("57") == 1957
    assert dates.expand_year("73") == 1873       # 73..99 -> 18xx
    assert dates.expand_year("99") == 1899
    assert dates.expand_year("65") is None       # 58..72 ambiguous
    assert dates.expand_year("x") is None
  • Step 2: Run to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v)

Expected: FAIL (expand_year is not defined yet).

  • Step 3: Implement — add to dates.py
def expand_year(token: str):
    """Expand a 2/3/4-digit year string per the 1873–1957 century rule. None if ambiguous."""
    token = token.strip()
    if not token.isdigit():
        return None
    n, v = len(token), int(token)
    if n == 4:
        return v
    if n == 3:
        return 1000 + v
    if n == 2:
        if v <= config.TWO_DIGIT_19XX_MAX:
            return 1900 + v
        if v >= config.TWO_DIGIT_18XX_MIN:
            return 1800 + v
        return None
    return None
  • Step 4: Run to verify it passes

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v)

Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): year expansion century rule"
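
To make the size of the ambiguous band concrete, the sketch below enumerates all hundred two-digit values under the two config thresholds. classify_two_digit is a hypothetical helper used only for this illustration; it mirrors the two-digit branch of expand_year without computing the year.

```python
from collections import Counter

# Thresholds copied from config.py.
TWO_DIGIT_19XX_MAX = 57
TWO_DIGIT_18XX_MIN = 73


def classify_two_digit(value: int) -> str:
    """Bucket a two-digit year the same way the two-digit branch of expand_year does."""
    if value <= TWO_DIGIT_19XX_MAX:
        return "19xx"
    if value >= TWO_DIGIT_18XX_MIN:
        return "18xx"
    return "ambiguous"


counts = Counter(classify_two_digit(v) for v in range(100))
print(counts["19xx"], counts["18xx"], counts["ambiguous"])  # 58 27 15
```

Fifteen of the hundred possible values (58 to 72) therefore resolve to None and land in review instead of being guessed.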

Task 5: parse_date dispatch + ISO + numeric forms (FR-DATE, REQ-DATE-01/04/05)

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add failing tests

def test_parse_iso_and_empty():
    assert dates.parse_date("1910-04-23") == dates.ParsedDate("1910-04-23", Precision.DAY, "1910-04-23")
    assert dates.parse_date("") == dates.ParsedDate(None, Precision.UNKNOWN, "")
    assert dates.parse_date("?") == dates.ParsedDate(None, Precision.UNKNOWN, "?")

def test_parse_numeric_forms():
    assert dates.parse_date("15.2.1888").iso == "1888-02-15"
    assert dates.parse_date("13.5.09").iso == "1909-05-13"
    assert dates.parse_date("17/6. 1916").iso == "1916-06-17"
    assert dates.parse_date("11.10.08").iso == "1908-10-11"
    assert dates.parse_date("30.1.889").iso == "1889-01-30"
    assert dates.parse_date("15.2.1888").precision == Precision.DAY

def test_parse_numeric_unparseable():
    assert dates.parse_date("8.9.").precision == Precision.UNKNOWN     # no year
    assert dates.parse_date("13.5.65").precision == Precision.UNKNOWN  # ambiguous 2-digit year

def test_parse_approx_marker_upgrades_precision():
    r = dates.parse_date("17.Nov (?) 1887")  # month-name handled in a later task; here just the marker path
    # after the marker is detected, a parsed date becomes APPROX (verified fully in Task 8)
    assert r.raw == "17.Nov (?) 1887"
  • Step 2: Run to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v)

Expected: FAIL (ParsedDate / parse_date are not defined yet).

  • Step 3: Implement — add to dates.py
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedDate:
    iso: str | None
    precision: Precision
    raw: str


_LEADING_MARKERS = re.compile(
    r"^(um|ca\.?|circa|etwa|wohl|vermutlich|nach|vor|anfang|mitte|ende)\s+", re.I)


def _preprocess(raw: str):
    """Return (cleaned_string, approx_flag)."""
    s = (raw or "").strip()
    if not s:
        return "", False
    low = s.lower()
    approx = ("?" in s) or any(
        m in low for m in ("um ", "ca.", "ca ", "circa", "etwa", "wohl", "vermutlich"))
    s = re.sub(r"\(\s*\?\s*\)", " ", s)   # remove "(?)"
    s = s.replace("?", " ")
    s = re.sub(r",.*$", "", s)            # drop trailing editorial note (", 2. Brief")
    s = _LEADING_MARKERS.sub("", s)
    s = re.sub(r"\s+", " ", s).strip(" .,")
    return s, approx


_NUM_RE = re.compile(r"(\d{1,2})[./](\d{1,2})\.?\s*(\d{2,4})")


def _match_iso(s):
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        try:
            datetime.date.fromisoformat(s)
            return s, Precision.DAY
        except ValueError:
            return None
    return None


def _match_numeric(s):
    m = _NUM_RE.fullmatch(s)
    if not m:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    year = expand_year(m.group(3))
    if year is None or not (1 <= month <= 12):
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None


# Matchers are tried in order. Later tasks append to this list.
_MATCHERS = [_match_iso, _match_numeric]


def parse_date(raw: str, date_overrides: dict | None = None) -> ParsedDate:
    if date_overrides:
        key = (raw or "").strip()
        if key in date_overrides:
            iso, prec = date_overrides[key]
            return ParsedDate(iso or None, Precision(prec), raw)
    cleaned, approx = _preprocess(raw)
    if not cleaned:
        return ParsedDate(None, Precision.UNKNOWN, raw)
    for matcher in _MATCHERS:
        result = matcher(cleaned)
        if result:
            iso, precision = result
            if approx:
                precision = Precision.APPROX
            return ParsedDate(iso, precision, raw)
    return ParsedDate(None, Precision.UNKNOWN, raw)
  • Step 4: Run to verify it passes

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v)

Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): parse_date dispatch + iso/numeric matchers"

Task 6: Roman-numeral month matcher

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add failing test

def test_parse_roman_months():
    assert dates.parse_date("22.III.18").iso == "1918-03-22"
    assert dates.parse_date("19.XII.1954").iso == "1954-12-19"
    assert dates.parse_date("1.III.27").iso == "1927-03-01"
    assert dates.parse_date("22.III.18").precision == Precision.DAY
  • Step 2: Run to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v)

Expected: FAIL (Roman dates currently fall through to UNKNOWN).

  • Step 3: Implement — add to dates.py and register the matcher
_ROMAN_RE = re.compile(r"(\d{1,2})\.\s*([IVXLC]+)\.?\s*(\d{2,4})", re.I)


def _match_roman(s):
    m = _ROMAN_RE.fullmatch(s)
    if not m:
        return None
    day = int(m.group(1))
    month = config.ROMAN_MONTHS.get(m.group(2).lower())
    year = expand_year(m.group(3))
    if not month or year is None:
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None

Then change the matcher list line to:

_MATCHERS = [_match_iso, _match_numeric, _match_roman]
  • Step 4: Run to verify it passes

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v)

Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): roman-numeral month matcher"

Task 7: Month-name matchers (day-first + English month-first)

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add failing tests

def test_parse_monthname_day_first():
    assert dates.parse_date("6.März 1888").iso == "1888-03-06"
    assert dates.parse_date("29.Sept.1891").iso == "1891-09-29"
    assert dates.parse_date("10.Oct.95").iso == "1895-10-10"
    assert dates.parse_date("9.December1889").iso == "1889-12-09"
    assert dates.parse_date("18.Dez.1916").iso == "1916-12-18"
    assert dates.parse_date("4Dezember 1936").iso == "1936-12-04"
    assert dates.parse_date("25 August 1968").iso == "1968-08-25"

def test_parse_monthname_english_month_first():
    assert dates.parse_date("April 12. 1922").iso == "1922-04-12"
    assert dates.parse_date("Oct.5. 1916").iso == "1916-10-05"
  • Step 2: Run to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k monthname -v)

Expected: FAIL.

  • Step 3: Implement — add to dates.py. _match_monthname_a is day-first; _match_monthname_b is English month-first.
_MONTH_A_RE = re.compile(r"(\d{1,2})[.\s]*([A-Za-zÄÖÜäöü]+)\.?\s*(\d{2,4})")
_MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\.?\s*(\d{2,4})")


def _lookup_month(token: str):
    return config.MONTHS.get(token.lower().strip(" ."))


def _build_day_month_year(day, month, year):
    if not month or year is None or not (1 <= month <= 12):
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None


def _match_monthname_a(s):
    m = _MONTH_A_RE.fullmatch(s)
    if not m:
        return None
    return _build_day_month_year(int(m.group(1)), _lookup_month(m.group(2)), expand_year(m.group(3)))


def _match_monthname_b(s):
    m = _MONTH_B_RE.fullmatch(s)
    if not m:
        return None
    return _build_day_month_year(int(m.group(2)), _lookup_month(m.group(1)), expand_year(m.group(3)))

Then update the matcher list (order matters — _match_monthname_a is day-first and safe to place before the month/year matcher; _match_monthname_b goes after the month/year matcher added in Task 8, so for now append only _a):

_MATCHERS = [_match_iso, _match_numeric, _match_roman, _match_monthname_a]
  • Step 4: Run — expect _a cases to pass, _b (English) still failing

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_day_first -v)

Expected: PASS.

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_english_month_first -v)

Expected: FAIL (_match_monthname_b is not yet registered; it is wired in Task 8 to sit after the month/year matcher so it doesn't shadow Mai 1895).

  • Step 5: Commit
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): day-first month-name matcher"

Task 8: Month/year, feast/season, year-only, range matchers + final ordering + overrides

Files:

  • Modify: tools/import-normalizer/dates.py

  • Modify: tools/import-normalizer/tests/test_dates.py

  • Step 1: Add failing tests

def test_parse_month_year_year_only():
    assert dates.parse_date("Mai 1895") == dates.ParsedDate("1895-05-01", Precision.MONTH, "Mai 1895")
    assert dates.parse_date("October 1903").iso == "1903-10-01"
    assert dates.parse_date("1905") == dates.ParsedDate("1905-01-01", Precision.YEAR, "1905")

def test_parse_feast_and_season_via_parse_date():
    assert dates.parse_date("Pfingsten 1922") == dates.ParsedDate("1922-06-04", Precision.DAY, "Pfingsten 1922")
    assert dates.parse_date("Herbst 1913") == dates.ParsedDate("1913-10-01", Precision.SEASON, "Herbst 1913")
    assert dates.parse_date("Pfingstsonntag 1915").precision == Precision.DAY

def test_parse_ranges():
    assert dates.parse_date("8.1.1916 - 15.3.1916") == dates.ParsedDate("1916-01-08", Precision.RANGE, "8.1.1916 - 15.3.1916")
    assert dates.parse_date("1881/82") == dates.ParsedDate("1881-01-01", Precision.RANGE, "1881/82")
    assert dates.parse_date("1945/46?").iso == "1945-01-01"  # '?' stripped -> RANGE, then APPROX
    assert dates.parse_date("1945/46?").precision == Precision.APPROX

def test_parse_approx_full():
    r = dates.parse_date("17.Nov (?) 1887")
    assert r.iso == "1887-11-17"
    assert r.precision == Precision.APPROX

def test_parse_english_month_first_now_works():
    assert dates.parse_date("April 12. 1922").iso == "1922-04-12"
    assert dates.parse_date("Mai 1895").iso == "1895-05-01"  # not shadowed by month-first matcher

def test_parse_unparseable_examples():
    assert dates.parse_date("Freitag 1919").precision == Precision.UNKNOWN

def test_parse_invalid_calendar_date_is_unknown():
    # try/except ValueError in the matchers must route impossible dates to UNKNOWN (-> review),
    # never silently clamp. This is the most likely real-data bug class at 7,600 rows.
    assert dates.parse_date("30.2.1888").precision == Precision.UNKNOWN
    assert dates.parse_date("31.4.1916").precision == Precision.UNKNOWN

def test_parse_intra_month_day_range():
    # "7./8. Sept.1923" -> start day, RANGE. Must NOT be confused with slash-date "17/6. 1916".
    assert dates.parse_date("7./8. Sept.1923") == dates.ParsedDate("1923-09-07", Precision.RANGE, "7./8. Sept.1923")
    assert dates.parse_date("17/6. 1916") == dates.ParsedDate("1916-06-17", Precision.DAY, "17/6. 1916")

def test_parse_trailing_note_stripped_but_raw_preserved():
    r = dates.parse_date("17.Nov 1887, 2. Brief")  # REQ-DATE-04
    assert r.iso == "1887-11-17"
    assert "2. Brief" in r.raw   # original string preserved verbatim
  • Step 2: Run to verify it fails

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "month_year or feast_and_season or ranges or approx_full or english_month_first_now or unparseable_examples" -v)

Expected: FAIL.

  • Step 3: Implement — add matchers to dates.py
_MONTH_YEAR_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s+(\d{2,4})")
_TOKEN_YEAR_RE = re.compile(r"(.+?)\.?\s+(\d{2,4})")
_YEAR_ONLY_RE = re.compile(r"\d{4}")
_RANGE_YY_RE = re.compile(r"(\d{4})\s*/\s*\d{2}")
_RANGE_HYPHEN_RE = re.compile(r"(.*\d)\s*[-–]\s*\d.*")  # separator: hyphen or en dash
# Intra-month day range, e.g. "7./8. Sept.1923" — require a dot before the slash so it
# does NOT swallow slash-as-dot single dates like "17/6. 1916" (which has no dot before "/").
_RANGE_DAY_RE = re.compile(r"(\d{1,2})\./(\d{1,2})\.\s*(.+)")


def _match_month_year(s):
    m = _MONTH_YEAR_RE.fullmatch(s)
    if not m:
        return None
    month = _lookup_month(m.group(1))
    year = expand_year(m.group(2))
    if not month or year is None:
        return None
    return datetime.date(year, month, 1).isoformat(), Precision.MONTH


def _match_feast_season(s):
    m = _TOKEN_YEAR_RE.fullmatch(s)
    if not m:
        return None
    year = expand_year(m.group(2))
    if year is None:
        return None
    return resolve_feast_or_season(m.group(1), year)


def _match_year_only(s):
    if _YEAR_ONLY_RE.fullmatch(s):
        return datetime.date(int(s), 1, 1).isoformat(), Precision.YEAR
    return None


def _match_range(s):
    m = _RANGE_YY_RE.fullmatch(s)
    if m:
        return datetime.date(int(m.group(1)), 1, 1).isoformat(), Precision.RANGE
    m = _RANGE_DAY_RE.fullmatch(s)
    if m:
        first = f"{m.group(1)}.{m.group(3)}"  # "7." + "Sept.1923" -> "7.Sept.1923"
        for matcher in (_match_numeric, _match_monthname_a):
            r = matcher(first)
            if r:
                return r[0], Precision.RANGE
    m = _RANGE_HYPHEN_RE.fullmatch(s)
    if m:
        start = m.group(1).strip()
        for matcher in (_match_numeric, _match_roman, _match_monthname_a, _match_year_only):
            r = matcher(start)
            if r:
                return r[0], Precision.RANGE
    return None

Then replace the matcher list with the final ordering:

_MATCHERS = [
    _match_iso,
    _match_range,
    _match_numeric,
    _match_roman,
    _match_monthname_a,
    _match_month_year,
    _match_monthname_b,
    _match_feast_season,
    _match_year_only,
]
  • Step 4: Run the full date test file

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v)

Expected: PASS (all tests, including the English month-first test from Task 7).

  • Step 5: Add an overrides test, then commit

Append to tests/test_dates.py:

def test_parse_date_override_wins():
    ovr = {"13.5.65": ("1965-05-13", "DAY")}
    r = dates.parse_date("13.5.65", ovr)  # ambiguous without override
    assert r == dates.ParsedDate("1965-05-13", Precision.DAY, "13.5.65")

Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v)

Expected: PASS.

git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): month/year, feast/season, range matchers + overrides"
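
parse_date expects date_overrides as a dict of stripped raw string -> (iso, precision). A minimal sketch of how overrides/dates.csv (seed header raw,iso,precision) could be turned into that shape follows. load_date_overrides is a hypothetical name for illustration; the plan's actual loader is load_overrides() in overrides.py, and its details may differ. The sample rows are invented.

```python
import csv
import io


def load_date_overrides(lines) -> dict[str, tuple[str, str]]:
    """Hypothetical loader: rows of raw,iso,precision -> {raw: (iso, precision)}."""
    overrides: dict[str, tuple[str, str]] = {}
    for row in csv.DictReader(lines):
        raw = (row.get("raw") or "").strip()  # parse_date looks up the stripped raw string
        if not raw:
            continue  # skip blank rows
        iso = (row.get("iso") or "").strip()
        precision = (row.get("precision") or "").strip()
        overrides[raw] = (iso, precision)
    return overrides


# Invented sample content standing in for overrides/dates.csv.
SAMPLE = io.StringIO(
    "raw,iso,precision\n"
    "13.5.65,1965-05-13,DAY\n"
    "8.9.,,UNKNOWN\n"
)

date_overrides = load_date_overrides(SAMPLE)
print(date_overrides)
```

An empty iso cell stays an empty string here; parse_date turns it into None via iso or None, so a reviewer can pin a value to UNKNOWN without inventing a date.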

Task 9: Person register parsing (FR-PERS, US-PERS-01)

Files:

  • Create: tools/import-normalizer/persons.py

  • Create: tools/import-normalizer/tests/test_persons.py

  • Step 1: Write the failing test in tests/test_persons.py

import persons

def test_slugify():
    assert persons.slugify("de Gruyter", "Eugenie") == "de-gruyter-eugenie"
    assert persons.slugify("Müller", "Karl Erhard") == "mueller-karl-erhard"

def test_parse_register_basic():
    rows = [
        {"generation": "G 1", "last_name": "Blomquist", "first_name": "Charlotte,Meta,Jacobi",
         "maiden_name": "Ruge", "birth_date": "30.8.1862", "birth_place": "Schülperneusiel",
         "death_date": "1934-07-23", "death_place": "Göteborg", "spouse": '"Tante Lolly"',
         "notes": "Schwester v Marie Cram"},
        {"generation": "G 2", "last_name": "Bohrmann", "first_name": "Else",
         "maiden_name": "Cram", "birth_date": "28.11.1888", "spouse": "Ludwig Bohrmann",
         "notes": "Schwester v Herbert"},
    ]
    people = persons.parse_register(rows)
    p = people[0]
    assert p.person_id == "blomquist-charlotte"
    assert p.first_name == "Charlotte"
    assert p.maiden_name == "Ruge"
    assert p.birth_date == "1862-08-30"
    assert p.nickname == "Tante Lolly"     # quoted spouse field is a nickname, not a spouse
    assert p.spouse == ""
    assert "Meta" in p.extra_given_names and "Jacobi" in p.extra_given_names
    p2 = people[1]
    assert p2.maiden_name == "Cram"
    assert p2.spouse == "Ludwig Bohrmann"
    assert p2.provisional is False
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd - Expected: FAIL — persons module / symbols not defined.

  • Step 3: Implement persons.py
"""Person register parsing, name splitting, alias resolution."""
import re
import unicodedata
from dataclasses import dataclass, field

import config
import dates

_DIACRITIC_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                                "Ä": "ae", "Ö": "oe", "Ü": "ue"})


def _strip_accents(s: str) -> str:
    s = s.translate(_DIACRITIC_MAP)
    s = unicodedata.normalize("NFKD", s)
    return "".join(c for c in s if not unicodedata.combining(c))


def slugify(last: str, first: str) -> str:
    raw = f"{last} {first}".strip()
    raw = _strip_accents(raw).lower()
    raw = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return raw or "unknown"


@dataclass
class Person:
    person_id: str
    last_name: str = ""
    first_name: str = ""
    maiden_name: str = ""
    title: str = ""
    nickname: str = ""
    extra_given_names: list = field(default_factory=list)
    birth_date: str | None = None
    birth_date_raw: str = ""
    birth_place: str = ""
    death_date: str | None = None
    death_date_raw: str = ""
    death_place: str = ""
    spouse: str = ""
    generation: str = ""
    notes: str = ""
    aliases: list = field(default_factory=list)
    provisional: bool = False


# Accept straight quotes, English curly quotes (“…”) and German quotes („…“).
_QUOTED_RE = re.compile(r'^[„“"\']\s*(.+?)\s*[“”"\']$')


def parse_register(rows: list[dict]) -> list[Person]:
    people = []
    for r in rows:
        last = (r.get("last_name") or "").strip()
        if not last:
            continue
        given_raw = (r.get("first_name") or "").strip()
        givens = [g.strip() for g in given_raw.split(",") if g.strip()]
        first = givens[0] if givens else ""
        extra = givens[1:]

        spouse_raw = (r.get("spouse") or "").strip()
        nickname = ""
        m = _QUOTED_RE.match(spouse_raw)
        if m:
            nickname = m.group(1)
            spouse_raw = ""

        birth = dates.parse_date(r.get("birth_date") or "")
        death = dates.parse_date(r.get("death_date") or "")
        people.append(Person(
            person_id=slugify(last, first),
            last_name=last, first_name=first, maiden_name=(r.get("maiden_name") or "").strip(),
            nickname=nickname, extra_given_names=extra,
            birth_date=birth.iso, birth_date_raw=(r.get("birth_date") or "").strip(), birth_place=(r.get("birth_place") or "").strip(),
            death_date=death.iso, death_date_raw=(r.get("death_date") or "").strip(), death_place=(r.get("death_place") or "").strip(),
            spouse=spouse_raw, generation=(r.get("generation") or "").strip(),
            notes=(r.get("notes") or "").strip(), provisional=False,
        ))
    # De-duplicate colliding ids with numeric suffix
    seen = {}
    for p in people:
        if p.person_id in seen:
            seen[p.person_id] += 1
            p.person_id = f"{p.person_id}-{seen[p.person_id]}"
        else:
            seen[p.person_id] = 1
    return people
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): person register parsing"
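
The slug rule in this task depends on the order of two steps: German letters are transliterated first, and only then are remaining accents stripped via NFKD. The standalone sketch below shows why that order matters. `strip_marks` and `transliterate` are hypothetical names, and the leading NFC normalisation is an addition of this sketch (it covers input that arrives in decomposed form), not something the plan specifies.

```python
import unicodedata

# German special letters, lower-case only: the sketch lower-cases before translating.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_marks(s: str) -> str:
    """Decompose and drop combining marks: 'é' -> 'e', but also 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate(s: str) -> str:
    """German transliteration first, generic accent stripping second."""
    # NFC recombines 'u' + U+0308 into 'ü' so the translate table can see it.
    composed = unicodedata.normalize("NFC", s.lower())
    return strip_marks(composed.translate(GERMAN))


print(strip_marks("müller"))          # muller  (umlaut lost)
print(transliterate("Müller"))        # mueller
print(transliterate("René"))          # rene
print(transliterate("Mu\u0308ller"))  # mueller (decomposed input)
```

Without the translate step, "Müller" and "Muller" would share a slug, which is exactly the kind of silent merge the conservative-matching requirement tries to avoid.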

Task 10: Receiver splitting (REQ-PERS-01, US-PERS-02 AC4)

Files:

  • Modify: tools/import-normalizer/persons.py

  • Modify: tools/import-normalizer/tests/test_persons.py

  • Step 1: Add failing tests (ported from the Java PersonNameParser contract)

def test_split_receivers():
    assert persons.split_receivers("Eugenie Müller") == ["Eugenie Müller"]
    assert persons.split_receivers("Walter und Eugenie de Gruyter") == ["Walter de Gruyter", "Eugenie de Gruyter"]
    assert persons.split_receivers("Hedi und Tutu (Gruber)") == ["Hedi Gruber", "Tutu Gruber"]
    assert persons.split_receivers("Clara u Familie") == ["Clara"]
    assert persons.split_receivers("Eugenie de Gruyter geb. Müller") == ["Eugenie de Gruyter"]
    assert persons.split_receivers("Herbert u Clara") == ["Herbert", "Clara"]
    assert persons.split_receivers("") == []

def test_find_known_last_name():
    assert persons.find_known_last_name("Eugenie de Gruyter") == "de Gruyter"
    assert persons.find_known_last_name("Clara") is None
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd - Expected: FAIL.

  • Step 3: Implement — add to persons.py
_GEB_RE = re.compile(r",?\s*geb\.?\s+.+$", re.I)
_PAREN_RE = re.compile(r"\(([^)]+)\)\s*$")
_MULTI_RE = re.compile(r"\s+(?:und|u)\s+", re.I)


def find_known_last_name(segment: str):
    seg = segment.strip()
    for ln in config.KNOWN_LAST_NAMES:  # config lists longest-first
        if seg == ln or seg.endswith(" " + ln):
            return ln
    return None


def split_receivers(raw: str) -> list[str]:
    if not raw or not raw.strip():
        return []
    # 0. split on "//"
    if "//" in raw:
        out = []
        for seg in raw.split("//"):
            out.extend(split_receivers(seg))
        return out
    cleaned = _GEB_RE.sub("", raw).strip()
    if not _MULTI_RE.search(cleaned):
        return [cleaned]
    shared_last = None
    pm = _PAREN_RE.search(cleaned)
    if pm:
        shared_last = pm.group(1).strip()
        cleaned = cleaned[:pm.start()].strip()
    parts = [p.strip() for p in _MULTI_RE.split(cleaned)]
    parts = [p for p in parts if p and p.lower() != "familie"]
    if not parts:
        return []
    if len(parts) == 1:
        return [parts[0]]
    if shared_last:
        return [p if " " in p else f"{p} {shared_last}" for p in parts]
    last_seg = parts[-1]
    detected = find_known_last_name(last_seg)
    if detected:
        result = []
        for p in parts[:-1]:
            if " " not in p and find_known_last_name(p) is None:
                result.append(f"{p} {detected}")
            else:
                result.append(p)
        result.append(last_seg)
        return result
    return parts
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): receiver splitting"
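
The two regular expressions carry most of the splitting behaviour, so it can help to probe them in isolation before trusting them on real rows. The sketch below copies the `geb.` and `und`/`u` patterns from the step above into a hypothetical `probe` helper. It is a way to inspect edge cases, not a replacement for `split_receivers`.

```python
import re

# Copied from the implementation step above.
GEB_RE = re.compile(r",?\s*geb\.?\s+.+$", re.I)
MULTI_RE = re.compile(r"\s+(?:und|u)\s+", re.I)


def probe(raw: str) -> list:
    """Strip a 'geb. <maiden name>' tail, then split on 'und' / 'u'."""
    cleaned = GEB_RE.sub("", raw).strip()
    return MULTI_RE.split(cleaned)


print(probe("Herbert u Clara"))                 # ['Herbert', 'Clara']
print(probe("Ruth Gudrun"))                     # ['Ruth Gudrun']
print(probe("Eugenie de Gruyter geb. Müller"))  # ['Eugenie de Gruyter']
print(probe("Herbert u. Clara"))                # ['Herbert u. Clara']
print(probe("Anna geb. Ruge und Karl"))         # ['Anna']
```

The last two lines are worth noting when reviewing residue: an abbreviated "u." with a period is not treated as a separator, and because the `geb.` pattern runs to the end of the string, any person named after a maiden name is dropped along with it. Whether the source data contains such rows is not something this plan states.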

Task 11: Alias index (FR-DEDUP, REQ-DEDUP-01/02)

Files:

  • Modify: tools/import-normalizer/persons.py

  • Modify: tools/import-normalizer/tests/test_persons.py

  • Step 1: Add failing tests

def test_alias_index_resolves_maiden_and_married():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"},
        {"last_name": "Cram", "first_name": "Clara"},
    ])
    idx = persons.AliasIndex(people)
    eugenie = people[0].person_id
    assert idx.resolve("Eugenie de Gruyter") == eugenie   # canonical
    assert idx.resolve("Eugenie Müller") == eugenie        # maiden alias
    assert idx.resolve("eugenie  müller") == eugenie        # normalized
    assert idx.resolve("Nobody Unknown") is None

def test_alias_index_suggestion():
    people = persons.parse_register([{"last_name": "Wittkopf", "first_name": "Hans"}])
    idx = persons.AliasIndex(people)
    sid, score = idx.suggest("Hans Wittkop")  # typo
    assert sid == people[0].person_id and score >= config.FUZZY_SUGGEST_THRESHOLD
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k alias -v && cd - Expected: FAIL — AliasIndex not defined.

  • Step 3: Implement — add to persons.py
import difflib  # add alongside the other imports at the top of persons.py


def _norm(name: str) -> str:
    return re.sub(r"\s+", " ", _strip_accents(name).lower().replace(".", " ")).strip()


class AliasIndex:
    def __init__(self, people: list[Person]):
        self._by_alias: dict[str, str] = {}
        self._display: dict[str, str] = {}
        self.known_ids: set[str] = {p.person_id for p in people}
        first_name_ids: dict[str, list] = {}
        for p in people:
            self._display[p.person_id] = f"{p.first_name} {p.last_name}".strip()
            # Ordered, de-duplicated forms (NOT a set) so alias order is deterministic — NFR-IDEM-01.
            forms = [f"{p.first_name} {p.last_name}".strip()]
            if p.maiden_name:
                forms.append(f"{p.first_name} {p.maiden_name}".strip())
            for extra in p.extra_given_names:
                forms.append(f"{extra} {p.last_name}".strip())
            if p.nickname:
                forms.append(p.nickname)
            seen = set()
            for form in forms:
                if form in seen:
                    continue
                seen.add(form)
                key = _norm(form)
                if key and key not in self._by_alias:
                    self._by_alias[key] = p.person_id
                    p.aliases.append(form)
            if p.first_name:
                ids = first_name_ids.setdefault(_norm(p.first_name), [])
                if p.person_id not in ids:
                    ids.append(p.person_id)
        # first-name-only alias, only when unambiguous
        for fname, ids in first_name_ids.items():
            if len(ids) == 1 and fname not in self._by_alias:
                self._by_alias[fname] = ids[0]

    def resolve(self, name: str):
        return self._by_alias.get(_norm(name))

    def display(self, person_id: str) -> str:
        return self._display.get(person_id, "")

    def suggest(self, name: str):
        keys = list(self._by_alias.keys())
        match = difflib.get_close_matches(_norm(name), keys, n=1, cutoff=config.FUZZY_SUGGEST_THRESHOLD)
        if not match:
            return None, 0.0
        score = difflib.SequenceMatcher(None, _norm(name), match[0]).ratio()
        return self._by_alias[match[0]], score
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k alias -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): alias index with maiden/married/nickname resolution"
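
Suggestions come from `difflib` and are only ever reported, never applied. The standalone sketch below shows how the score in the typo test arises. The cutoff of 0.85 is a hypothetical stand-in for `config.FUZZY_SUGGEST_THRESHOLD`, whose real value is defined elsewhere in the plan.

```python
import difflib

CUTOFF = 0.85  # hypothetical stand-in for config.FUZZY_SUGGEST_THRESHOLD


def suggest(name: str, keys: list, cutoff: float = CUTOFF):
    """Return (best_key, score), or (None, 0.0) when nothing clears the cutoff."""
    match = difflib.get_close_matches(name, keys, n=1, cutoff=cutoff)
    if not match:
        return None, 0.0
    score = difflib.SequenceMatcher(None, name, match[0]).ratio()
    return match[0], score


keys = ["hans wittkopf", "clara cram"]
typo = suggest("hans wittkop", keys)  # one missing letter
bare = suggest("hans", keys)          # first name only: too dissimilar

print(typo)  # ('hans wittkopf', 0.96)
print(bare)  # (None, 0.0)
```

The ratio is `2 * M / T`, where `M` is the number of matching characters and `T` the combined length: 12 matching characters over 12 + 13 gives 0.96. A bare first name scores far below any sensible cutoff, which is why first-name-only resolution is handled by an explicit alias rather than by fuzzy matching.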

Task 12: Spreadsheet ingest (FR-INGEST, FR-MAP, REQ-INGEST-01, REQ-MAP-01)

Files:

  • Create: tools/import-normalizer/ingest.py

  • Create: tools/import-normalizer/tests/test_ingest.py

  • Step 1: Write failing tests (build a tiny workbook on disk with openpyxl)

import datetime
import openpyxl
import pytest
import ingest

def _make_workbook(tmp_path, sheet_name, rows):
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = sheet_name
    for r in rows:
        ws.append(r)
    path = tmp_path / "wb.xlsx"
    wb.save(path)
    return path

def test_read_sheet_converts_cells(tmp_path):
    path = _make_workbook(tmp_path, "S", [
        ["Index", "Datum"],
        ["W-0001", datetime.datetime(1888, 2, 15)],
        ["W-0002", 1],
    ])
    rows = ingest.read_sheet(path, "S")
    assert rows[0] == ["Index", "Datum"]
    assert rows[1] == ["W-0001", "1888-02-15"]   # Excel date -> ISO string
    assert rows[2] == ["W-0002", "1"]            # integer -> plain string

def test_build_header_map_collapses_whitespace_and_case():
    header = ["Index", "Datum  des Briefes", "EmpfängerIn", "Mystery"]
    field_map = {"index": "index", "datum des briefes": "date", "empfängerin": "receivers"}
    fields, unknown = ingest.build_header_map(header, field_map, required={"index"})
    assert fields == {"index": 0, "date": 1, "receivers": 2}
    assert unknown == ["Mystery"]

def test_build_header_map_missing_required_raises():
    with pytest.raises(ValueError, match="index"):
        ingest.build_header_map(["Box", "Ort"], {"box": "box", "ort": "location"}, required={"index"})
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd - Expected: FAIL — ingest not defined.

  • Step 3: Implement ingest.py
"""Read .xlsx sheets into neutral list[list[str]] and map headers to fields."""
import datetime
from pathlib import Path
import openpyxl


def _cell_to_str(value) -> str:
    if value is None:
        return ""
    if isinstance(value, datetime.datetime):
        return value.date().isoformat()
    if isinstance(value, datetime.date):
        return value.isoformat()
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    if isinstance(value, int):
        return str(value)
    return str(value).strip()


def read_sheet(path: Path, sheet_name: str) -> list[list[str]]:
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        if sheet_name not in wb.sheetnames:
            raise ValueError(f"Sheet '{sheet_name}' not found in {path.name}; sheets: {wb.sheetnames}")
        ws = wb[sheet_name]
        # Materialise all rows before closing: read-only mode streams from the open file.
        return [[_cell_to_str(v) for v in row] for row in ws.iter_rows(values_only=True)]
    finally:
        wb.close()  # also runs when the sheet is missing


def _norm_header(text: str) -> str:
    return " ".join(text.lower().split())


def build_header_map(header_row: list[str], field_map: dict[str, str], required: set[str]):
    """Return (field->col_index, unknown_headers). Raise ValueError if a required field is missing."""
    fields: dict[str, int] = {}
    unknown: list[str] = []
    for idx, raw in enumerate(header_row):
        key = _norm_header(raw)
        if key in field_map:
            fields[field_map[key]] = idx
        elif raw.strip():
            unknown.append(raw)
    missing = required - set(fields)
    if missing:
        raise ValueError(f"Required header(s) missing: {sorted(missing)} (found headers: {header_row})")
    return fields, unknown
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/ingest.py tools/import-normalizer/tests/test_ingest.py
git commit -m "feat(normalizer): xlsx ingest + header mapping"
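
Header matching hinges on the normalisation step, so a quick standalone check of what it collapses can be useful. The sketch below repeats the whitespace-and-case rule from `_norm_header` and adds a hypothetical `duplicate_fields` helper, which is not part of the plan, to surface the case where two columns normalise to the same field and the later one would otherwise win silently.

```python
def norm_header(text: str) -> str:
    """Lower-case and collapse any whitespace run (spaces, NBSP, newlines) to one space."""
    return " ".join(text.lower().split())


def duplicate_fields(header_row: list, field_map: dict) -> dict:
    """Hypothetical helper: fields that more than one column maps to."""
    cols = {}
    for idx, raw in enumerate(header_row):
        field = field_map.get(norm_header(raw))
        if field is not None:
            cols.setdefault(field, []).append(idx)
    return {f: c for f, c in cols.items() if len(c) > 1}


print(norm_header("Datum\xa0des\nBriefes"))  # datum des briefes

field_map = {"index": "index", "datum": "date"}
print(duplicate_fields(["Index", "Datum", "datum "], field_map))  # {'date': [1, 2]}
```

`str.split()` without arguments splits on all Unicode whitespace, so non-breaking spaces and in-cell line breaks pasted from Excel normalise the same way as ordinary spaces.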

Task 13: Row extraction, triage & CanonicalDocument (FR-TRIAGE, REQ-TRIAGE-01/02/03, FR-PROV)

Files:

  • Create: tools/import-normalizer/documents.py

  • Create: tools/import-normalizer/tests/test_documents.py

  • Step 1: Write failing tests

import documents
from documents import Triage

def test_extract_row():
    header = {"index": 0, "file": 1, "box": 2, "folder": 3, "sender": 4,
              "receivers": 5, "date": 6, "location": 7, "tags": 8, "summary": 9}
    cells = ["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
             "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"]
    raw = documents.extract_row(cells, header, source_row=3)
    assert raw.index == "W-0001"
    assert raw.sender == "Walter de Gruyter"
    assert raw.date == "15.2.1888"
    assert raw.source_row == 3

def test_triage():
    assert documents.triage(["", "", ""]) == Triage.EMPTY
    assert documents.triage(["", "", "Walter"]) == Triage.BLANK_INDEX  # data but no index
    assert documents.triage(["W-0001x", "x"]) == Triage.X_SUFFIX
    assert documents.triage(["W-0001", "x"]) == Triage.OK

def test_classify_blank_index():
    header = {"sender": 4, "receivers": 5}
    banner = ["", "", "", "", "Brautbriefe von Walter an Eugenie", ""]
    data = ["", "", "V", "1", "", "Eugenie"]
    assert documents.classify_blank_index(banner, header) == "section_banner"
    assert documents.classify_blank_index(data, header) == "data_no_index"

def test_index_file_mismatch():
    assert documents.index_file_mismatch("W-0010x", r"..\__scan\W-0011x.pdf") is True
    assert documents.index_file_mismatch("W-0001", r"..\__scan\W-0001.pdf") is False
    assert documents.index_file_mismatch("W-0001", "") is False

Note: triage takes the raw cells list and reads the index from column 0 by default (index_col=0), which matches the header used for extract_row in these tests, where index is column 0.

  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd - Expected: FAIL — documents not defined.

  • Step 3: Implement documents.py (extraction + triage + dataclasses; resolution added in Task 14)
"""Document row extraction, triage, and the canonical document record."""
from dataclasses import dataclass, field
from enum import Enum, auto


class Triage(Enum):
    OK = auto()
    EMPTY = auto()
    BLANK_INDEX = auto()
    X_SUFFIX = auto()


@dataclass
class RawRow:
    source_row: int
    index: str = ""
    file: str = ""
    box: str = ""
    folder: str = ""
    sender: str = ""
    receivers: str = ""
    date: str = ""
    location: str = ""
    tags: str = ""
    summary: str = ""


@dataclass
class CanonicalDocument:
    index: str
    box: str = ""
    folder: str = ""
    sender_person_id: str = ""
    sender_name: str = ""
    receiver_person_ids: list = field(default_factory=list)
    receiver_names: list = field(default_factory=list)
    date_iso: str = ""
    date_raw: str = ""
    date_precision: str = ""
    location: str = ""
    tags: list = field(default_factory=list)
    summary: str = ""
    source_row: int = 0
    needs_review: list = field(default_factory=list)


_FIELDS = ["index", "file", "box", "folder", "sender", "receivers", "date", "location", "tags", "summary"]


def extract_row(cells: list[str], header: dict[str, int], source_row: int) -> RawRow:
    def get(field_name):
        idx = header.get(field_name)
        if idx is None or idx >= len(cells):
            return ""
        return (cells[idx] or "").strip()
    return RawRow(source_row=source_row, **{f: get(f) for f in _FIELDS})


def triage(cells: list[str], index_col: int = 0) -> Triage:
    nonempty = [c for c in cells if c and str(c).strip()]
    if not nonempty:
        return Triage.EMPTY
    index = (cells[index_col] or "").strip() if index_col < len(cells) else ""
    if not index:
        return Triage.BLANK_INDEX
    if index.endswith("x"):
        return Triage.X_SUFFIX
    return Triage.OK


def classify_blank_index(cells: list[str], header: dict[str, int]) -> str:
    """REQ-TRIAGE-02: 'section_banner' if only name columns are populated, else 'data_no_index'."""
    name_cols = {header.get("sender"), header.get("receivers")} - {None}
    populated = {i for i, c in enumerate(cells) if c and str(c).strip()}
    if populated and populated <= name_cols:
        return "section_banner"
    return "data_no_index"


def index_file_mismatch(index: str, file_path: str) -> bool:
    if not file_path.strip():
        return False
    basename = file_path.replace("\\", "/").rsplit("/", 1)[-1]
    stem = basename.rsplit(".", 1)[0]
    return stem != index
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py
git commit -m "feat(normalizer): row extraction, triage, canonical record"
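
The file column holds Windows-style relative paths, which is why the mismatch check replaces backslashes by hand instead of using `os.path.basename` (on POSIX that would return the whole string). As an alternative sketch, not the plan's implementation, `pathlib.PureWindowsPath` from the standard library parses both separators on any platform. `scan_stem` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def scan_stem(file_path: str) -> str:
    """File name without directory or extension; accepts '\\' and '/' separators."""
    return PureWindowsPath(file_path.strip()).stem


def index_file_mismatch(index: str, file_path: str) -> bool:
    """True when a file is given and its stem differs from the index."""
    if not file_path.strip():
        return False
    return scan_stem(file_path) != index


print(scan_stem(r"..\__scan\W-0011x.pdf"))                       # W-0011x
print(index_file_mismatch("W-0010x", r"..\__scan\W-0011x.pdf"))  # True
print(index_file_mismatch("W-0001", "scans/W-0001.pdf"))         # False
```

Both approaches agree on the three cases in the test above. The `pathlib` variant mainly saves the manual separator handling.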

Task 14: Resolution context + to_canonical (FR-PERS, FR-DATE integration, REQ-PROV-02)

Files:

  • Modify: tools/import-normalizer/persons.py

  • Modify: tools/import-normalizer/documents.py

  • Modify: tools/import-normalizer/tests/test_documents.py

  • Step 1: Add failing tests to tests/test_documents.py

import persons
import documents

def _ctx():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Walter"},
        {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"},
    ])
    return persons.ResolutionContext(persons.AliasIndex(people), name_overrides={})

def test_to_canonical_resolves_and_flags():
    ctx = _ctx()
    raw = documents.RawRow(source_row=3, index="W-0001", box="V", folder="1",
                           sender="Walter de Gruyter", receivers="Eugenie Müller",
                           date="15.2.1888", location="Rotterdam", tags="Brautbriefe",
                           summary="Geschäftsreise", file=r"..\__scan\W-0001.pdf")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "de-gruyter-walter"
    assert doc.receiver_person_ids == ["de-gruyter-eugenie"]   # matched via maiden alias
    assert doc.date_iso == "1888-02-15" and doc.date_precision == "DAY"
    assert doc.tags == ["Brautbriefe"]
    assert doc.needs_review == []

def test_to_canonical_unmatched_and_unparsed():
    ctx = _ctx()
    raw = documents.RawRow(source_row=9, index="C-0001",
                           sender="Hans Wittkopf", receivers="", date="Freitag 1919")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "wittkopf-hans"            # provisional
    assert "unmatched_sender" in doc.needs_review
    assert "unparsed_date" in doc.needs_review
    assert ctx.unmatched["Hans Wittkopf"] == [9]
    assert any(p.provisional for p in ctx.provisional.values())

def test_to_canonical_splits_multi_sender():
    # REQ-PERS-01 / IMP-11: a multi-person sender is parsed, primary kept, flagged.
    ctx = _ctx()
    raw = documents.RawRow(source_row=5, index="C-0100", sender="Walter und Eugenie de Gruyter", receivers="")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "de-gruyter-walter"   # first part is primary
    assert "multi_sender" in doc.needs_review

def test_provisional_id_never_collides_with_register():
    # A provisional built from an unmatched string must not steal a register person_id.
    people = persons.parse_register([{"last_name": "Cram", "first_name": "Clara"}])
    ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={})
    # "Clara (Cram)" is not an alias key, but _last_first() yields ("(Cram)", "Clara"),
    # which slugifies to the register id "cram-clara", so the guard must add a suffix.
    pid, _, matched = ctx.resolve_one("Clara (Cram)", source_row=1)
    assert matched is False
    assert pid == "cram-clara-2"  # suffixed away from the register id
    assert ctx.provisional[pid].provisional is True

def test_ambiguous_space_pair_flagged_not_split():
    # US-PERS-02 AC4: "Ella Anita" is kept as one provisional + flagged, never guessed into two.
    ctx = _ctx()
    raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert len(doc.receiver_person_ids) == 1          # not split
    assert any(part == "Ella Anita" for _, part, _ in ctx.ambiguous)
  • Step 2: Run to verify it fails

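Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -k "to_canonical or provisional or ambiguous" -v && cd - Expected: FAIL (ResolutionContext / to_canonical not defined). The wider -k expression also selects the provisional-id and ambiguous-pair tests, which a plain "to_canonical" filter would skip.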

  • Step 3a: Implement ResolutionContext — add to persons.py
class ResolutionContext:
    """Resolves raw name strings to person ids; accumulates provisional persons and review data."""
    def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str]):
        self.index = alias_index
        self.name_overrides = name_overrides
        self.provisional: dict[str, Person] = {}
        self.unmatched: dict[str, list] = {}
        self.ambiguous: list[tuple] = []
        self._raw_to_pid: dict[str, str] = {}
        self.override_hits = 0

    def _unique_id(self, base: str) -> str:
        """A provisional id must never collide with a register id or another provisional."""
        used = self.index.known_ids | set(self.provisional)
        pid, n = base, 1
        while pid in used:
            n += 1
            pid = f"{base}-{n}"
        return pid

    def resolve_one(self, raw_name: str, source_row: int):
        """Return (person_id, display_name, matched: bool). '' name -> ('', '', True)."""
        name = (raw_name or "").strip()
        if not name:
            return "", "", True
        if name in self.name_overrides:
            self.override_hits += 1
            pid = self.name_overrides[name]
            return pid, self.index.display(pid) or name, True
        pid = self.index.resolve(name)
        if pid:
            return pid, self.index.display(pid) or name, True
        # provisional person (unmatched) — never reuse a register id
        self.unmatched.setdefault(name, []).append(source_row)
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False
        last, first = _last_first(name)
        pid = self._unique_id(slugify(last, first))
        self.provisional[pid] = Person(person_id=pid, last_name=last, first_name=first, provisional=True)
        self._raw_to_pid[name] = pid
        return pid, name, False

    def resolve_sender(self, raw: str, source_row: int):
        """Senders are split like receivers (REQ-PERS-01). Primary = first part; multi flagged."""
        parts = split_receivers(raw)
        if not parts:
            return "", "", True, False
        pid, name, matched = self.resolve_one(parts[0], source_row)
        for extra in parts[1:]:
            self.resolve_one(extra, source_row)  # register the others as persons too
        return pid, name, matched, len(parts) > 1

    def resolve_receivers(self, raw: str, source_row: int):
        results = []
        for part in split_receivers(raw):
            pid, name, matched = self.resolve_one(part, source_row)
            if not matched and " " in part and find_known_last_name(part) is None and len(part.split()) == 2:
                self.ambiguous.append((raw, part, source_row))
            results.append((pid, name, matched))
        return results


def _last_first(name: str):
    """Best-effort split of a free name string into (last, first) for slug/provisional building."""
    name = name.strip()
    ln = find_known_last_name(name)
    if ln:
        first = name[: -len(ln)].strip()
        return ln, first
    tokens = name.split()
    if len(tokens) >= 2:
        return tokens[-1], " ".join(tokens[:-1])
    return name, ""
  • Step 3b: Implement to_canonical — add to documents.py
import dates as _dates


def to_canonical(raw, ctx, date_overrides: dict) -> CanonicalDocument:
    pd = _dates.parse_date(raw.date, date_overrides)
    flags = []

    sender_id, sender_name, sender_matched, sender_multi = ctx.resolve_sender(raw.sender, raw.source_row)
    if raw.sender.strip() and not sender_matched:
        flags.append("unmatched_sender")
    if sender_multi:
        flags.append("multi_sender")

    receivers = ctx.resolve_receivers(raw.receivers, raw.source_row)
    if any(not matched for _, _, matched in receivers):
        flags.append("unmatched_receiver")

    if raw.date.strip() and pd.precision == _dates.Precision.UNKNOWN:
        flags.append("unparsed_date")
    if index_file_mismatch(raw.index, raw.file):
        flags.append("index_file_mismatch")

    return CanonicalDocument(
        index=raw.index, box=raw.box, folder=raw.folder,
        sender_person_id=sender_id, sender_name=sender_name,
        receiver_person_ids=[r[0] for r in receivers],
        receiver_names=[r[1] for r in receivers],
        # .name yields "DAY" for any Enum flavour; str() on a plain Enum gives "Precision.DAY".
        date_iso=pd.iso or "", date_raw=raw.date, date_precision=pd.precision.name,
        location=raw.location, tags=[raw.tags] if raw.tags else [], summary=raw.summary,
        source_row=raw.source_row, needs_review=flags,
    )
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py
git commit -m "feat(normalizer): person resolution context + to_canonical"
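
The collision guard for provisional ids is small enough to exercise on its own. The sketch below lifts the loop from `_unique_id` into a free function so the suffix behaviour can be checked without building a register. `unique_id` and the explicit `used` set are names chosen for this illustration.

```python
def unique_id(base: str, used: set) -> str:
    """Return base, or base-2, base-3, ... until the id is not in `used`."""
    pid, n = base, 1
    while pid in used:
        n += 1
        pid = f"{base}-{n}"
    return pid


used = {"cram-clara"}                   # ids already taken by the register
first = unique_id("cram-clara", used)   # collides, so it gets a suffix
used.add(first)
second = unique_id("cram-clara", used)  # next free suffix
fresh = unique_id("ruge-meta", used)    # no collision, unchanged

print(first, second, fresh)  # cram-clara-2 cram-clara-3 ruge-meta
```

Suffixes are handed out in encounter order, so the result stays deterministic as long as rows are processed in a fixed order.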

Task 15: Overrides loader + writers (FR-OVR, FR-OUT, NFR-OBSERV-01)

Files:

  • Create: tools/import-normalizer/overrides.py

  • Create: tools/import-normalizer/writers.py

  • Create: tools/import-normalizer/tests/test_writers.py

  • Step 1: Write failing tests

import csv
import openpyxl
import overrides
import writers
import documents

def test_load_overrides_missing_files(tmp_path):
    d, n = overrides.load_overrides(tmp_path / "dates.csv", tmp_path / "names.csv")
    assert d == {} and n == {}

def test_load_overrides_parsed(tmp_path):
    dp = tmp_path / "dates.csv"
    dp.write_text("raw,iso,precision\n13.5.65,1965-05-13,DAY\n", encoding="utf-8")
    np = tmp_path / "names.csv"
    np.write_text("raw,person_id\nEugenie Müller,de-gruyter-eugenie\n", encoding="utf-8")
    d, n = overrides.load_overrides(dp, np)
    assert d["13.5.65"] == ("1965-05-13", "DAY")
    assert n["Eugenie Müller"] == "de-gruyter-eugenie"

def test_write_documents_xlsx_joins_lists(tmp_path):
    doc = documents.CanonicalDocument(
        index="W-0001", receiver_person_ids=["a", "b"], receiver_names=["A", "B"],
        tags=["Brautbriefe"], date_precision="DAY", needs_review=["unparsed_date"])
    out = tmp_path / "docs.xlsx"
    writers.write_documents_xlsx([doc], out)
    wb = openpyxl.load_workbook(out)
    ws = wb.active
    header = [c.value for c in ws[1]]
    assert "receiver_person_ids" in header and "needs_review" in header
    row = {h: c.value for h, c in zip(header, ws[2])}
    assert row["receiver_person_ids"] == "a|b"
    assert row["needs_review"] == "unparsed_date"

def test_write_review_csv(tmp_path):
    out = tmp_path / "r.csv"
    writers.write_review_csv(out, ["raw", "count"], [["?", 3], ["x", 1]])
    rows = list(csv.reader(out.open(encoding="utf-8")))
    assert rows[0] == ["raw", "count"]
    assert rows[1] == ["?", "3"]

def test_write_review_csv_defangs_formula_injection(tmp_path):
    out = tmp_path / "r.csv"
    writers.write_review_csv(out, ["raw", "count"], [["=cmd|'/C calc'!A0", 1], ["-2+3", 2]])
    rows = list(csv.reader(out.open(encoding="utf-8")))
    assert rows[1][0].startswith("'=")   # leading '=' neutralised
    assert rows[2][0].startswith("'-")

def test_write_summary_sections(tmp_path):
    out = tmp_path / "s.txt"
    writers.write_summary(out, {"# INPUTS": "", "rows": 10, "# DATES": "", "unknown_date_rate": "3.2%"})
    text = out.read_text(encoding="utf-8")
    assert "INPUTS:" in text and "DATES:" in text and "  rows: 10" in text
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd - Expected: FAIL — modules not defined.

  • Step 3a: Implement overrides.py
"""Load human-supplied corrections. Missing files are not an error."""
import csv
from pathlib import Path


def load_overrides(dates_path: Path, names_path: Path):
    date_overrides: dict[str, tuple[str, str]] = {}
    name_overrides: dict[str, str] = {}
    # utf-8-sig reads plain UTF-8 too and drops the BOM that Excel's "CSV UTF-8" export adds;
    # with plain utf-8 the first header would read "\ufeffraw" and every row would be skipped.
    if Path(dates_path).exists():
        with open(dates_path, encoding="utf-8-sig", newline="") as f:
            for row in csv.DictReader(f):
                raw = (row.get("raw") or "").strip()
                if raw:
                    date_overrides[raw] = ((row.get("iso") or "").strip(), (row.get("precision") or "UNKNOWN").strip())
    if Path(names_path).exists():
        with open(names_path, encoding="utf-8-sig", newline="") as f:
            for row in csv.DictReader(f):
                raw = (row.get("raw") or "").strip()
                if raw:
                    name_overrides[raw] = (row.get("person_id") or "").strip()
    return date_overrides, name_overrides
  • Step 3b: Implement writers.py
"""Write canonical .xlsx outputs and review .csv files."""
import csv
import datetime
from pathlib import Path
import openpyxl

_PIPE = "|"
# Pinned workbook metadata so reruns are content-deterministic (NFR-IDEM-01); openpyxl
# otherwise stamps docProps with the current time on every save.
_FIXED_TS = datetime.datetime(2020, 1, 1, 0, 0, 0)


def _join(value):
    if isinstance(value, list):
        return _PIPE.join(str(v) for v in value)
    return "" if value is None else str(value)


def _csv_safe(value):
    """Neutralise spreadsheet formula injection (CWE-1236) in human-opened review CSVs."""
    s = "" if value is None else str(value)
    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r") else s


DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
               "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
               "date_precision", "location", "tags", "summary", "source_row", "needs_review"]

PERSON_COLUMNS = ["person_id", "last_name", "first_name", "maiden_name", "title", "nickname",
                  "birth_date", "birth_date_raw", "birth_place", "death_date", "death_date_raw",
                  "death_place", "spouse", "generation", "notes", "aliases", "provisional"]


def _write_xlsx(records, columns, path: Path):
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(columns)
    for rec in records:
        ws.append([_join(getattr(rec, col)) for col in columns])
    wb.properties.created = _FIXED_TS
    wb.properties.modified = _FIXED_TS
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    wb.save(path)


def write_documents_xlsx(docs, path: Path):
    _write_xlsx(docs, DOC_COLUMNS, path)


def write_persons_xlsx(people, path: Path):
    _write_xlsx(people, PERSON_COLUMNS, path)


def write_review_csv(path: Path, header: list[str], rows: list[list]):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        for row in rows:
            w.writerow([_csv_safe(c) for c in row])


def write_summary(path: Path, stats: dict):
    """Render a grouped, scannable summary. Keys beginning with '#' are section headers."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for k, v in stats.items():
        if k.startswith("#"):
            lines.append("")
            lines.append(k[1:].strip() + ":")
        else:
            lines.append(f"  {k}: {v}")
    Path(path).write_text("\n".join(lines).strip() + "\n", encoding="utf-8")
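The effect of the `_csv_safe` rule can be checked in isolation with a round trip through the `csv` module. The sketch below is a standalone copy of the rule (named `csv_safe` here, with the prefix tuple pulled out as `FORMULA_PREFIXES`; both names are illustrative) and shows which cells gain a leading apostrophe.

```python
import csv
import io

# Leading characters that make a spreadsheet treat a cell as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def csv_safe(value) -> str:
    """Illustrative copy of the _csv_safe rule: apostrophe-prefix risky cells."""
    s = "" if value is None else str(value)
    return "'" + s if s[:1] in FORMULA_PREFIXES else s


buf = io.StringIO()
writer = csv.writer(buf)
for cell in ["=cmd|'/C calc'!A0", "-2+3", "15.2.1888", None]:
    writer.writerow([csv_safe(cell)])

# Read the CSV text back to see what a reviewer's tool would receive.
cells = [row[0] if row else "" for row in csv.reader(io.StringIO(buf.getvalue()))]
print(cells)
```

One consequence worth keeping in mind: `write_review_csv` applies the rule to every cell, including `raw`. A raw value that starts with one of these characters therefore appears in the review file with the extra apostrophe, and would presumably need that apostrophe removed before being pasted into an overrides file, since the loader only strips whitespace.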
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/overrides.py tools/import-normalizer/writers.py tools/import-normalizer/tests/test_writers.py
git commit -m "feat(normalizer): overrides loader + xlsx/csv writers"

Task 16: Orchestrator normalize.py + integration test (FR-OUT, FR-TRIAGE, REQ-TRIAGE-01/03, NFR-IDEM-01)

Files:

  • Create: tools/import-normalizer/normalize.py

  • Create: tools/import-normalizer/tests/test_normalize.py

  • Step 1: Write the failing integration test (tiny workbooks built on the fly and saved under tmp_path, not the real 7,900-row file)

import openpyxl
import normalize

def _doc_wb(tmp_path):
    wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Familienarchiv"
    ws.append(["Index", "Datei", "Box", "Mappe", "BriefeschreiberIn", "EmpfängerIn",
               "Datum des Briefes", "Ort", "Schlagwort", "Inhalt"])
    ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
               "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"])
    ws.append(["W-0001x", r"..\__scan\W-0001x.pdf", "", "", "Walter de Gruyter", "Eugenie Müller", "", "", "", ""])
    ws.append(["", "", "", "", "Section banner row", "", "", "", "", ""])
    ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""])
    ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
               "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "dup"])
    p = tmp_path / "docs.xlsx"; wb.save(p); return p

def _person_wb(tmp_path):
    wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Tabelle1"
    ws.append(["Generation", "Familienname", "Vorname", "geb als", "Geburtsdatum",
               "Geburtsort", "Todesdatum", "Sterbeort", "verheiratet mit", "Bemerkung"])
    ws.append(["G 1", "de Gruyter", "Walter", "", "", "", "", "", "", ""])
    ws.append(["G 1", "de Gruyter", "Eugenie", "Müller", "", "", "", "", "", ""])
    p = tmp_path / "persons.xlsx"; wb.save(p); return p

def test_run_end_to_end(tmp_path):
    out_dir = tmp_path / "out"; review_dir = tmp_path / "review"
    stats = normalize.run(
        document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv",
        person_workbook=_person_wb(tmp_path), person_sheet="Tabelle1",
        out_dir=out_dir, review_dir=review_dir,
        date_overrides={}, name_overrides={})
    assert (out_dir / "canonical-documents.xlsx").exists()
    assert (out_dir / "canonical-persons.xlsx").exists()
    assert stats["documents_emitted"] == 3        # W-0001, C-0001, W-0001 (dup) — x and blank excluded
    assert stats["skipped_x_suffix"] == 1
    assert stats["blank_index_rows"] == 1
    assert stats["duplicate_index_rows"] == 2
    assert (review_dir / "skipped-x-suffix.csv").exists()
    assert (review_dir / "unparsed-dates.csv").exists()
    # C-0001's "Freitag 1919" is unparseable -> must appear in the review file (NFR-DATA-01)
    assert "Freitag 1919" in (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8")

    # determinism (NFR-IDEM-01): a second run yields identical canonical content + review files
    def _matrix(p):
        wb = openpyxl.load_workbook(p)
        return [[c.value for c in row] for row in wb.active.iter_rows()]
    docs1 = _matrix(out_dir / "canonical-documents.xlsx")
    persons1 = _matrix(out_dir / "canonical-persons.xlsx")
    unparsed1 = (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8")
    normalize.run(document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv",
                  person_workbook=_person_wb(tmp_path), person_sheet="Tabelle1",
                  out_dir=out_dir, review_dir=review_dir, date_overrides={}, name_overrides={})
    assert _matrix(out_dir / "canonical-documents.xlsx") == docs1
    assert _matrix(out_dir / "canonical-persons.xlsx") == persons1
    assert (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8") == unparsed1
    assert len(docs1) == 4  # header + 3 docs
  • Step 2: Run to verify it fails

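Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v) Expected: FAIL (normalize is not defined yet). The subshell is used because a failing pytest would skip a trailing `&& cd -` and leave the shell inside tools/import-normalizer.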

  • Step 3: Implement normalize.py
"""Orchestrator: read raw workbooks -> canonical outputs + review reports."""
import argparse
from collections import Counter
from pathlib import Path

import config
import ingest
import persons
import documents
import overrides as overrides_mod
import writers


def run(*, document_workbook, document_sheet, person_workbook, person_sheet,
        out_dir, review_dir, date_overrides, name_overrides) -> dict:
    out_dir, review_dir = Path(out_dir), Path(review_dir)

    # --- persons ---
    person_rows = ingest.read_sheet(person_workbook, person_sheet)
    p_fields, _ = ingest.build_header_map(person_rows[0], config.PERSON_HEADER_MAP, config.PERSON_REQUIRED_FIELDS)
    person_dicts = [{f: (row[i] if i < len(row) else "") for f, i in p_fields.items()} for row in person_rows[1:]]
    register = persons.parse_register(person_dicts)
    alias_index = persons.AliasIndex(register)
    ctx = persons.ResolutionContext(alias_index, name_overrides)

    # --- documents ---
    doc_rows = ingest.read_sheet(document_workbook, document_sheet)
    d_fields, unknown_headers = ingest.build_header_map(doc_rows[0], config.DOCUMENT_HEADER_MAP, config.DOCUMENT_REQUIRED_FIELDS)
    index_col = d_fields["index"]

    canon_docs, blank_index, skipped_x, mismatches = [], [], [], []
    unparsed_by_raw: dict[str, list] = {}
    dates_by_override = 0
    empty_count = 0
    seen_index = Counter()

    for source_row, cells in enumerate(doc_rows[1:], start=2):
        t = documents.triage(cells, index_col)
        if t is documents.Triage.EMPTY:
            empty_count += 1
            continue
        if t is documents.Triage.BLANK_INDEX:
            blank_index.append([source_row, documents.classify_blank_index(cells, d_fields),
                                " | ".join(c for c in cells if c)])
            continue
        if t is documents.Triage.X_SUFFIX:
            idx = (cells[index_col] or "").strip()
            skipped_x.append([source_row, idx, idx[:-1]])
            continue
        raw = documents.extract_row(cells, d_fields, source_row)
        seen_index[raw.index] += 1
        if raw.date.strip() and raw.date.strip() in date_overrides:
            dates_by_override += 1
        doc = documents.to_canonical(raw, ctx, date_overrides)
        if "unparsed_date" in doc.needs_review:
            unparsed_by_raw.setdefault(raw.date, []).append(source_row)
        if "index_file_mismatch" in doc.needs_review:
            mismatches.append([source_row, raw.index, raw.file])
        canon_docs.append(doc)

    # REQ-TRIAGE-01: flag EVERY occurrence of a duplicated index and report all of them.
    dup_indexes = {idx for idx, n in seen_index.items() if n > 1}
    duplicates = []
    for doc in canon_docs:
        if doc.index in dup_indexes:
            if "duplicate_index" not in doc.needs_review:
                doc.needs_review.append("duplicate_index")
            duplicates.append([doc.source_row, doc.index])

    all_people = register + list(ctx.provisional.values())

    # --- write canonical outputs ---
    writers.write_documents_xlsx(canon_docs, out_dir / "canonical-documents.xlsx")
    writers.write_persons_xlsx(all_people, out_dir / "canonical-persons.xlsx")

    # --- review files ---
    # unparsed dates: most-frequent first, with example source rows + blank override cells so a
    # corrected row can be pasted straight into overrides/dates.csv (same raw,iso,precision shape).
    unparsed_rows = sorted(
        ([raw, len(rows), " ".join(map(str, rows[:5])), "", ""] for raw, rows in unparsed_by_raw.items()),
        key=lambda r: (-r[1], r[0]))
    writers.write_review_csv(review_dir / "unparsed-dates.csv",
                             ["raw", "count", "example_rows", "suggested_iso", "suggested_precision"], unparsed_rows)

    unmatched_rows = []
    for name, rows in sorted(ctx.unmatched.items()):
        sid, score = alias_index.suggest(name)
        unmatched_rows.append([name, len(rows), " ".join(map(str, rows[:5])),
                               sid or "", f"{score:.2f}" if sid else ""])
    writers.write_review_csv(review_dir / "unmatched-names.csv",
                             ["raw", "count", "example_rows", "suggested_id", "suggested_score"], unmatched_rows)

    writers.write_review_csv(review_dir / "duplicate-index.csv", ["source_row", "index"], duplicates)
    writers.write_review_csv(review_dir / "blank-index-rows.csv", ["source_row", "kind", "content"], blank_index)
    writers.write_review_csv(review_dir / "skipped-x-suffix.csv", ["source_row", "index", "base_index"], skipped_x)
    writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous)
    writers.write_review_csv(review_dir / "index-file-mismatch.csv", ["source_row", "index", "file"], mismatches)

    dated = sum(1 for d in canon_docs if d.date_raw.strip())
    unknown = sum(1 for d in canon_docs if d.date_raw.strip() and d.date_precision == "UNKNOWN")
    unknown_rate = f"{(100 * unknown / dated):.1f}%" if dated else "0.0%"

    stats = {
        "# INPUTS": "",
        "document_rows_read": len(doc_rows) - 1,
        "register_persons": len(register),
        "unknown_headers": ", ".join(unknown_headers) or "(none)",
        "# OUTPUTS": "",
        "documents_emitted": len(canon_docs),
        "provisional_persons": len(ctx.provisional),
        "# DATES": "",
        "dated_rows": dated,
        "unparsed_dates": unknown,
        "unknown_date_rate": f"{unknown_rate} (target <=5%)",
        "distinct_unparsed_formats": len(unparsed_by_raw),
        "# NAMES": "",
        "unmatched_name_strings": len(ctx.unmatched),
        "ambiguous_receivers": len(ctx.ambiguous),
        "# ANOMALIES": "",
        "empty_rows": empty_count,
        "blank_index_rows": len(blank_index),
        "skipped_x_suffix": len(skipped_x),
        "duplicate_index_rows": len(duplicates),
        "index_file_mismatches": len(mismatches),
        "# OVERRIDES": "",
        "date_overrides_loaded": len(date_overrides),
        "name_overrides_loaded": len(name_overrides),
        "dates_resolved_by_override": dates_by_override,
        "names_resolved_by_override": ctx.override_hits,
    }
    writers.write_summary(review_dir / "summary.txt", stats)
    return stats


def main():
    parser = argparse.ArgumentParser(description="Normalize the family archive spreadsheets.")
    parser.parse_args()
    date_overrides, name_overrides = overrides_mod.load_overrides(
        config.OVERRIDES_DIR / "dates.csv", config.OVERRIDES_DIR / "names.csv")
    stats = run(
        document_workbook=config.DOCUMENT_WORKBOOK, document_sheet=config.DOCUMENT_SHEET,
        person_workbook=config.PERSON_WORKBOOK, person_sheet=config.PERSON_SHEET,
        out_dir=config.OUT_DIR, review_dir=config.REVIEW_DIR,
        date_overrides=date_overrides, name_overrides=name_overrides)
    print("Normalization complete:")
    for k, v in stats.items():
        print(f"  {k}: {v}")


if __name__ == "__main__":
    main()

Note for the implementer: duplicate-index handling is a single second pass over canon_docs (for doc in canon_docs: if doc.index in dup_indexes) — this flags AND reports every colliding occurrence including the first (REQ-TRIAGE-01), not just repeats. Do not reintroduce a per-row append in the main loop.
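The count-then-flag pattern described in the note can be sketched on its own. `Doc` below is a hypothetical stand-in for `CanonicalDocument` that carries only the three fields the pass touches, and `flag_duplicates` is an illustrative name; the rows mirror the integration fixture, where `W-0001` occurs on source rows 2 and 6.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for CanonicalDocument (only the fields used here)."""
    index: str
    source_row: int
    needs_review: list[str] = field(default_factory=list)


def flag_duplicates(docs: list[Doc]) -> list[list]:
    # First pass: count how often each index occurs.
    counts = Counter(d.index for d in docs)
    dup_indexes = {idx for idx, n in counts.items() if n > 1}
    # Second pass, in source order: flag and report every colliding row,
    # including the first occurrence.
    report = []
    for d in docs:
        if d.index in dup_indexes:
            if "duplicate_index" not in d.needs_review:
                d.needs_review.append("duplicate_index")
            report.append([d.source_row, d.index])
    return report


docs = [Doc("W-0001", 2), Doc("C-0001", 5), Doc("W-0001", 6)]
report = flag_duplicates(docs)
print(report)
```

Appending to the report inside the main loop instead would only catch the second and later occurrences, because the first row is processed before its collision is known.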

  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v && cd - Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_normalize.py
git commit -m "feat(normalizer): orchestrator + end-to-end integration test"

Task 17: README, seed overrides, and a real dry-run

Files:

  • Create: tools/import-normalizer/README.md

  • Create: tools/import-normalizer/overrides/dates.csv

  • Create: tools/import-normalizer/overrides/names.csv

  • Step 1: Seed the overrides files (header-only)

overrides/dates.csv:

raw,iso,precision

overrides/names.csv:

raw,person_id
  • Step 2: Write README.md
# Import Normalizer

Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical
dataset (`out/`) plus review reports (`review/`). See the spec:
`../../docs/import-migration/02-normalization-spec.md`.

## Setup
Requires **Python 3.12** (the code uses `enum.StrEnum`, which needs Python 3.11 or newer; 3.12 is the version this tool targets).
```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```

## Run
```bash
.venv/bin/python normalize.py
```
Outputs:
- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate)

## Iteration loop
1. **Run.** Read `review/summary.txt` for the health snapshot.
2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. |
| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename — reconcile when the PDFs arrive. |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |

**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.

## Tests
```bash
.venv/bin/python -m pytest tests/test_dates.py -v   # run files individually (never the whole suite at once)
```
  • Step 3: Run the whole test suite file-by-file to confirm green

Run each individually (per the "no full-suite" rule):

cd tools/import-normalizer
for t in config dates persons ingest documents writers normalize; do .venv/bin/python -m pytest tests/test_$t.py -q || break; done
cd -

Expected: every file reports all passed.

  • Step 4: Real dry-run against the actual import data (manual verification, not a test)

Run: cd tools/import-normalizer && .venv/bin/python normalize.py && cd - Expected: prints stats. Then inspect:

  • review/summary.txt: sanity-check the counts (≈7,600 documents emitted, register_persons ≈163).
  • review/unparsed-dates.csv: confirm, together with unknown_date_rate in summary.txt, that the UNKNOWN rate is a low single-digit percentage of dated rows (NFR-ACCUR-01 target ≤5% before overrides). If it is higher, note the dominant unhandled formats for a follow-up parser tweak.
  • Spot-check out/canonical-documents.xlsx: open the first ~20 rows; verify that date_iso/date_precision, sender_person_id, and receiver_person_ids look right (e.g. Eugenie Müller → de-gruyter-eugenie).

Record the run's summary.txt figures in ../../docs/import-migration/WORKLOG.md.
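The manual check of the unknown-date rate could also be scripted. The sketch below assumes the layout that `write_summary` produces (section headers at column 0, `key: value` lines indented by two spaces); the helper names `parse_summary` and `unknown_rate_percent` are hypothetical, and the sample text is invented for illustration rather than taken from a real run.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Collect the indented 'key: value' lines; headers and blank lines are skipped."""
    stats = {}
    for line in text.splitlines():
        if not line.startswith("  "):    # section header or blank line
            continue
        key, _, value = line.strip().partition(":")
        stats[key] = value.strip()
    return stats


def unknown_rate_percent(stats: dict[str, str]) -> float:
    # The value looks like "3.2% (target <=5%)"; keep only the leading number.
    return float(stats["unknown_date_rate"].split("%", 1)[0])


# Invented sample in the summary.txt layout.
sample = (
    "OUTPUTS:\n"
    "  documents_emitted: 3\n"
    "\n"
    "DATES:\n"
    "  dated_rows: 2\n"
    "  unknown_date_rate: 50.0% (target <=5%)\n"
)
stats = parse_summary(sample)
rate = unknown_rate_percent(stats)
verdict = "OK" if rate <= 5.0 else "above target"
print(f"unknown-date rate {rate:.1f}% -> {verdict}")
```

For a real run, the same functions would be fed `Path("review/summary.txt").read_text(encoding="utf-8")` in place of `sample`.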

  • Step 5: Commit (commit only source + seeds; out/ and review/ are gitignored)
git add tools/import-normalizer/README.md tools/import-normalizer/overrides/dates.csv tools/import-normalizer/overrides/names.csv
git commit -m "docs(normalizer): README + seed overrides"

Self-Review

Spec coverage check:

  • FR-INGEST/FR-MAP → Task 12 (header-name mapping, missing-required raises, unknown headers reported). ✓
  • FR-TRIAGE (REQ-TRIAGE-01/02/03) → Task 13 (triage by index-col, classify_blank_index banner detection) + Task 16 (single-pass duplicate flagging of all occurrences, blank-index report with kind, x-suffix skip+log). ✓
  • FR-DATE (REQ-DATE-01..06) → Tasks 2–8 (computus, feast/season, century rule, all matchers, overrides). ✓
  • FR-PERS/US-PERS-01 → Task 9; REQ-PERS-01/receiver split/AC4 ambiguous → Tasks 10, 14. ✓
  • FR-DEDUP (REQ-DEDUP-01/02) → Task 11 (maiden/married/nickname aliases, conservative; fuzzy = suggestion only). ✓
  • FR-OVR (REQ-OVR-01/02/03) → Task 15 (loader, missing-file tolerant) + Task 16 (applied + counted: dates_resolved_by_override / names_resolved_by_override) + Task 16 content-determinism assertion (two-run cell-matrix + review-file equality). ✓
  • FR-OUT/FR-PROV (REQ-OUT-01/02, REQ-PROV-01/02) → Tasks 13 (source_row, needs_review), 15 (writers), 16 (mismatch report). ✓
  • NFRs: DATA-01 (every row → output or review) covered by triage routing; OBSERV-01 → summary.txt; I18N-01 → utf-8 everywhere + diacritic map; TEST-01 → per-module tests; MAINT-01 → config tables. ✓
  • Data dictionary §6 → DOC_COLUMNS/PERSON_COLUMNS in Task 15 match the spec field list. ✓

Placeholder scan: No TBD/TODO; every code step shows complete code. The dead pass line formerly in Task 16 has been removed, and the implementer note there warns against reintroducing a per-row duplicate append. ✓

Type consistency: ParsedDate(iso, precision, raw), Precision (StrEnum → str() yields the value), Person, RawRow, CanonicalDocument, AliasIndex.resolve/display/suggest, ResolutionContext.resolve_one/resolve_receivers, to_canonical(raw, ctx, date_overrides), run(**kwargs) — names line up across tasks. ✓

Known follow-ups (out of scope for this plan): Phase-2 importer wiring (B11); comma-splitting Inhalt into extra tags (B10, Could). These are intentionally deferred.


Review feedback incorporated (2026-05-25)

Six personas reviewed this plan inline; the following changes were applied (see the session summary for detail):

  • Idempotency redefined (architect/tester/req-eng): spec G4/NFR-IDEM-01 changed from "byte-identical" to content-deterministic; Task 15 pins workbook created/modified; Task 11 builds aliases via ordered lists (no set-iteration leakage); Task 16 test now compares two runs' cell matrices + review files.
  • Duplicate-index bug fixed (developer/architect): Task 16 now flags and reports every occurrence of a duplicated index in one pass; the dead pass line was removed; the test stat (==2) is correct.
  • Provisional id collision guarded (architect): Task 14 ResolutionContext._unique_id suffixes provisional ids so they never overwrite a register person_id.
  • Date gaps closed (tester): added invalid-calendar-date → UNKNOWN test, intra-month day-range matcher (7./8. Sept.1923 → RANGE) + test, and a trailing-note-preservation test.
  • Multi-person sender (tester/req-eng, REQ-PERS-01): Task 14 resolve_sender splits the sender, keeps the primary, flags multi_sender.
  • CSV injection defanged (security): Task 15 write_review_csv neutralises leading = + - @ etc. in human-opened CSVs (+ test).
  • REQ-TRIAGE-02 / REQ-OVR-03 realized (req-eng): banner-vs-data classification in blank-index-rows.csv; override-application counts + an unknown_date_rate headline in summary.txt.
  • Ergonomics (UX): unparsed-dates.csv now carries example_rows + blank suggested_iso/precision (paste-ready); unmatched-names.csv suggestion blanks-out on no-match and rounds the score; grouped summary.txt; README documents every review file + where to source person_id.
  • Repo hygiene (devops): pinned openpyxl==3.1.5 / pytest==8.3.4; hardened the root .gitignore against the committed-.venv class of mistake; documented the Python 3.12 requirement.
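The "set-iteration leakage" mentioned under the idempotency item comes from Python's string hashing: unless `PYTHONHASHSEED` is fixed, the iteration order of a set of strings can differ between interpreter runs, which would reorder a pipe-joined column such as `aliases`. A minimal sketch of an order-preserving alternative follows; `ordered_unique` is a hypothetical helper, the alias strings are made up from the fixture names, and Task 11's actual implementation may differ.

```python
def ordered_unique(values):
    """Drop repeats while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(values))


# Hypothetical alias candidates; the repeat stands for a name reached by two rules.
aliases = ["Eugenie de Gruyter", "Eugenie Müller", "Eugenie de Gruyter"]
unique = ordered_unique(aliases)
joined = "|".join(unique)   # same string on every run, unlike "|".join(set(aliases))
print(joined)
```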