familienarchiv/docs/import-migration/03-normalizer-implementation-plan.md
Marcel 6f7aa643c9 docs(import): add normalizer implementation plan + apply persona review
17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:55:50 +02:00

# Import Normalizer Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build an offline Python tool that turns the raw family-archive spreadsheets into a clean, canonical dataset (`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs, with a deterministic overrides-and-rerun loop.

**Architecture:** A standalone Python package at `tools/import-normalizer/`. Pure, independently testable units: date parsing (`dates.py`), person/register logic (`persons.py`), spreadsheet ingest (`ingest.py`), row mapping (`documents.py`). These are orchestrated by `normalize.py`. Source workbooks are read-only; all tunables live in `config.py`. Residue (unparseable dates, unmatched names) is reported to `review/*.csv` and corrected via version-controlled `overrides/*.csv` applied on each run.

**Tech Stack:** Python 3.12, `openpyxl` (xlsx read/write), `pytest`. No third-party fuzzy library: `difflib` (stdlib) provides *suggestions only* (never auto-applied), per the conservative-matching requirement.

**Spec:** [`02-normalization-spec.md`](./02-normalization-spec.md). Requirement IDs (`FR-*`, `REQ-*`, `NFR-*`) referenced per task.

---

## File Structure
```
tools/import-normalizer/
├── config.py # paths, header maps, century rule, season/feast tables, month tables, matching config
├── dates.py # Easter computus, feast/season resolution, year expansion, parse_date()
├── persons.py # slug, Person, parse_register(), split_receivers(), AliasIndex, ResolutionContext
├── ingest.py # read_sheet(), build_header_map()
├── documents.py # RawRow, extract_row(), triage helpers, CanonicalDocument, to_canonical()
├── writers.py # write_documents_xlsx(), write_persons_xlsx(), write_review_csv(), write_summary()
├── overrides.py # load_overrides()
├── normalize.py # main() orchestrator + CLI
├── requirements.txt
├── .gitignore # .venv/ out/ review/ __pycache__/
├── README.md
├── overrides/
│   ├── dates.csv # seed header: raw,iso,precision
│   └── names.csv # seed header: raw,person_id
└── tests/
    ├── __init__.py
    ├── test_config.py
    ├── test_dates.py
    ├── test_persons.py
    ├── test_ingest.py
    ├── test_documents.py
    ├── test_writers.py
    └── test_normalize.py
```
**Test command convention** (per the "never run the full suite" rule: run targeted files):

`tools/import-normalizer/.venv/bin/python -m pytest tools/import-normalizer/tests/test_X.py -v`

All `git` commands assume CWD = repo root and the current branch `docs/import-migration`.

---
### Task 1: Project scaffold, venv, config constants
**Files:**
- Create: `tools/import-normalizer/requirements.txt`
- Create: `tools/import-normalizer/.gitignore`
- Create: `tools/import-normalizer/config.py`
- Create: `tools/import-normalizer/tests/__init__.py`
- Create: `tools/import-normalizer/tests/test_config.py`
- [ ] **Step 1: Create `requirements.txt`** (pinned — an openpyxl minor bump can change xlsx serialization and break determinism, NFR-IDEM-01)
```
openpyxl==3.1.5
pytest==8.3.4
```
- [ ] **Step 2: Create the tool-local `.gitignore`**
```
.venv/
out/
review/
__pycache__/
*.pyc
```
- [ ] **Step 2b: Harden the repo-root `.gitignore`** (the root file currently has no venv pattern, which is how `ocr-service/.venv` got committed; adding the patterns prevents that whole class of accidental commits). Append these lines to `/home/marcel/Desktop/familienarchiv/.gitignore` if not already present:
```
**/.venv/
**/__pycache__/
*.pyc
```
(Cleaning up the *already-committed* `ocr-service/.venv` via `git rm -r --cached ocr-service/.venv` is a separate task — do NOT bundle it into this branch.)
- [ ] **Step 3: Create `config.py`**
```python
"""Tunables for the import normalizer. No logic here — only data tables."""
from pathlib import Path
# --- Paths ---
BASE_DIR = Path(__file__).resolve().parent
REPO_ROOT = BASE_DIR.parent.parent
IMPORT_DIR = REPO_ROOT / "import"
DOCUMENT_WORKBOOK = IMPORT_DIR / "zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx"
DOCUMENT_SHEET = "Familienarchiv"
PERSON_WORKBOOK = IMPORT_DIR / "Personendatei 2.xlsx"
PERSON_SHEET = "Tabelle1"
OUT_DIR = BASE_DIR / "out"
REVIEW_DIR = BASE_DIR / "review"
OVERRIDES_DIR = BASE_DIR / "overrides"
# --- Header text (lowercased, whitespace-collapsed) -> canonical field ---
DOCUMENT_HEADER_MAP = {
    "index": "index",
    "datei": "file",
    "box": "box",
    "mappe": "folder",
    "briefeschreiberin": "sender",
    "empfängerin": "receivers",
    "datum des briefes": "date",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
DOCUMENT_REQUIRED_FIELDS = {"index"}
PERSON_HEADER_MAP = {
    "generation": "generation",
    "familienname": "last_name",
    "vorname": "first_name",
    "geb als": "maiden_name",
    "geburtsdatum": "birth_date",
    "geburtsort": "birth_place",
    "todesdatum": "death_date",
    "sterbeort": "death_place",
    "verheiratet mit": "spouse",
    "bemerkung": "notes",
}
PERSON_REQUIRED_FIELDS = {"last_name"}
# --- Century rule (archive 1873-1957) ---
TWO_DIGIT_19XX_MAX = 57 # 00..57 -> 1900+yy
TWO_DIGIT_18XX_MIN = 73 # 73..99 -> 1800+yy ; 58..72 -> ambiguous -> UNKNOWN
# --- Seasons -> representative month (day = 1) ---
SEASON_MONTHS = {
    "frühling": 4, "fruehling": 4, "frühjahr": 4, "fruehjahr": 4,
    "sommer": 7, "herbst": 10, "winter": 1,
}
# --- Fixed feasts -> (month, day) ---
FIXED_FEASTS = {
    "neujahr": (1, 1),
    "heiligabend": (12, 24), "heiliger abend": (12, 24),
    "weihnachten": (12, 25), "weihnacht": (12, 25), "1. weihnachtstag": (12, 25),
    "silvester": (12, 31), "sylvester": (12, 31),
}
# --- Movable feasts -> day offset from Easter Sunday ---
MOVABLE_FEASTS = {
    "karfreitag": -2,
    "ostern": 0, "ostersonntag": 0, "ostermontag": 1,
    "himmelfahrt": 39, "christi himmelfahrt": 39,
    "pfingsten": 49, "pfingstsonntag": 49, "pfingstmontag": 50,
    "fronleichnam": 60,
}
# --- Month names -> number (German + English, full + abbreviations) ---
MONTHS = {
    "januar": 1, "jan": 1, "january": 1,
    "februar": 2, "feb": 2, "febr": 2, "february": 2,
    "märz": 3, "maerz": 3, "mär": 3, "mar": 3, "march": 3,
    "april": 4, "apr": 4,
    "mai": 5, "may": 5,
    "juni": 6, "jun": 6, "june": 6,
    "juli": 7, "jul": 7, "july": 7,
    "august": 8, "aug": 8,
    "september": 9, "sep": 9, "sept": 9,
    "oktober": 10, "okt": 10, "oct": 10, "october": 10,
    "november": 11, "nov": 11,
    "dezember": 12, "dez": 12, "dec": 12, "december": 12,
}
ROMAN_MONTHS = {
    "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6,
    "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12,
}
# --- Person matching ---
KNOWN_LAST_NAMES = [
    "von der Heide", "von Massenbach", "von Geldern", "von Gelden", "von Staa",
    "de Gruyter", "Dieckmann", "Gruber", "Müller", "Wolff", "Cram",
]
FUZZY_SUGGEST_THRESHOLD = 0.82 # difflib ratio; suggestions only, never auto-applied
```
- [ ] **Step 4: Create empty `tests/__init__.py`** (empty file).
- [ ] **Step 5: Write `tests/test_config.py`**
```python
import config


def test_century_boundaries():
    assert config.TWO_DIGIT_19XX_MAX == 57
    assert config.TWO_DIGIT_18XX_MIN == 73


def test_header_maps_cover_required_fields():
    assert "index" in config.DOCUMENT_HEADER_MAP.values()
    assert "last_name" in config.PERSON_HEADER_MAP.values()


def test_feast_tables_present():
    assert config.MOVABLE_FEASTS["pfingsten"] == 49
    assert config.SEASON_MONTHS["herbst"] == 10
- [ ] **Step 6: Create the venv and install deps**
Run:
```bash
cd tools/import-normalizer && python3 -m venv .venv && .venv/bin/pip install -r requirements.txt && cd -
```
Expected: openpyxl + pytest install successfully.
- [ ] **Step 7: Run the config test**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v && cd -`
Expected: 3 passed. (Tests import `config` directly, so pytest must run with CWD = the tool dir; `conftest.py` is unnecessary because the modules are flat in that dir.)
- [ ] **Step 8: Commit**
```bash
git add .gitignore tools/import-normalizer/requirements.txt tools/import-normalizer/.gitignore tools/import-normalizer/config.py tools/import-normalizer/tests/__init__.py tools/import-normalizer/tests/test_config.py
git commit -m "feat(normalizer): scaffold tool + config tables"
```
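
The header maps in `config.py` are keyed by lowercased, whitespace-collapsed header text. The sketch below shows one way such a lookup could work. `normalize_header` and `map_headers` are hypothetical names used only for illustration; the plan's actual ingest code (`build_header_map()` in `ingest.py`) is defined in a later task and may differ. Unrecognised columns are simply skipped here, which is an assumption rather than a stated requirement.

```python
# Hypothetical illustration only: these helpers are not part of the plan.
# HEADER_MAP is a three-entry excerpt of DOCUMENT_HEADER_MAP.
HEADER_MAP = {"index": "index", "empfängerin": "receivers", "datum des briefes": "date"}


def normalize_header(text: str) -> str:
    """Lowercase and collapse runs of whitespace (including newlines) to one space."""
    return " ".join(str(text).lower().split())


def map_headers(header_row: list[str]) -> dict[str, int]:
    """Return {canonical_field: column_index} for every recognised header cell."""
    mapping = {}
    for col, cell in enumerate(header_row):
        field = HEADER_MAP.get(normalize_header(cell))
        if field is not None:
            mapping[field] = col
    return mapping


# A messy header row: odd casing, doubled spaces, an embedded line break, an unknown column.
columns = map_headers(["Index", "  Datum  des\nBriefes ", "Empfängerin", "Notiz"])
```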
---
### Task 2: Easter computus (`REQ-DATE-06`)
**Files:**
- Create: `tools/import-normalizer/dates.py`
- Create: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Write the failing test** in `tests/test_dates.py`
```python
import datetime

import dates


def test_easter_known_years():
    # Anonymous Gregorian algorithm, verified against published tables
    assert dates.easter(2024) == datetime.date(2024, 3, 31)
    assert dates.easter(2000) == datetime.date(2000, 4, 23)
    assert dates.easter(1922) == datetime.date(1922, 4, 16)
    assert dates.easter(1888) == datetime.date(1888, 4, 1)
```
- [ ] **Step 2: Run test to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v && cd -`
Expected: FAIL with `ModuleNotFoundError: No module named 'dates'` or `AttributeError: module 'dates' has no attribute 'easter'`.
- [ ] **Step 3: Create `dates.py` with the computus**
```python
"""Tolerant historical date parsing for the family archive."""
import datetime


def easter(year: int) -> datetime.date:
    """Easter Sunday (Gregorian) via the Anonymous Gregorian / Butcher algorithm."""
    a = year % 19
    b = year // 100
    c = year % 100
    d = b // 4
    e = b % 4
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i = c // 4
    k = c % 4
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month = (h + l - 7 * m + 114) // 31
    day = ((h + l - 7 * m + 114) % 31) + 1
    return datetime.date(year, month, day)
```
- [ ] **Step 4: Run test to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_easter_known_years -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): Easter computus"
```
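
Beyond the four spot-checked years, two general properties of Gregorian Easter make a cheap sanity check: the result is always a Sunday, and it always falls between March 22 and April 25. The sketch below copies the computus so it runs on its own and checks both properties over the archive range 1873-1957. It is an optional extra check, not one of the plan's required tests.

```python
import datetime


def easter(year: int) -> datetime.date:
    # Same Anonymous Gregorian computus as dates.easter(), copied so the sketch runs alone.
    a, b, c = year % 19, year // 100, year % 100
    d, e = b // 4, b % 4
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = c // 4, c % 4
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    n = h + l - 7 * m + 114
    return datetime.date(year, n // 31, n % 31 + 1)


# Archive range taken from the century-rule comment in config.py.
results = {y: easter(y) for y in range(1873, 1958)}

# date.weekday() numbers Monday as 0, so Sunday is 6.
all_sundays = all(d.weekday() == 6 for d in results.values())
in_window = all(
    datetime.date(y, 3, 22) <= d <= datetime.date(y, 4, 25) for y, d in results.items()
)
```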
---
### Task 3: Feast & season resolution (`REQ-DATE-02`, `REQ-DATE-06`)
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add the failing test** to `tests/test_dates.py`
```python
from dates import Precision


def test_resolve_feast_movable():
    assert dates.resolve_feast_or_season("Pfingsten", 1922) == ("1922-06-04", Precision.DAY)
    assert dates.resolve_feast_or_season("Ostern", 2024) == ("2024-03-31", Precision.DAY)
    assert dates.resolve_feast_or_season("Pfingstmontag", 1922) == ("1922-06-05", Precision.DAY)


def test_resolve_feast_fixed():
    assert dates.resolve_feast_or_season("Weihnachten", 1900) == ("1900-12-25", Precision.DAY)
    assert dates.resolve_feast_or_season("Neujahr", 1910) == ("1910-01-01", Precision.DAY)


def test_resolve_season():
    assert dates.resolve_feast_or_season("Herbst", 1913) == ("1913-10-01", Precision.SEASON)
    assert dates.resolve_feast_or_season("Sommer", 1910) == ("1910-07-01", Precision.SEASON)


def test_resolve_unknown_token_returns_none():
    assert dates.resolve_feast_or_season("Freitag", 1919) is None
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "resolve_" -v && cd -`
Expected: FAIL — `Precision` and `resolve_feast_or_season` not defined.
- [ ] **Step 3: Implement** — add to `dates.py` (top imports + new code)
```python
from enum import StrEnum

import config


class Precision(StrEnum):
    DAY = "DAY"
    MONTH = "MONTH"
    SEASON = "SEASON"
    YEAR = "YEAR"
    RANGE = "RANGE"
    APPROX = "APPROX"
    UNKNOWN = "UNKNOWN"


def _advent_sunday(year: int, n: int) -> datetime.date:
    """n-th Advent (1..4). 4th Advent = last Sunday on/before Dec 24."""
    dec24 = datetime.date(year, 12, 24)
    back_to_sunday = (dec24.weekday() - 6) % 7  # Mon=0..Sun=6
    fourth = dec24 - datetime.timedelta(days=back_to_sunday)
    return fourth - datetime.timedelta(days=(4 - n) * 7)


def resolve_feast_or_season(token: str, year: int):
    """Return (iso, Precision) for a known feast/season token, else None."""
    key = " ".join(token.lower().split()).strip(" .")
    if key in config.MOVABLE_FEASTS:
        d = easter(year) + datetime.timedelta(days=config.MOVABLE_FEASTS[key])
        return d.isoformat(), Precision.DAY
    if key in config.FIXED_FEASTS:
        month, day = config.FIXED_FEASTS[key]
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    advent = {"1. advent": 1, "2. advent": 2, "3. advent": 3, "4. advent": 4, "advent": 1}
    if key in advent:
        return _advent_sunday(year, advent[key]).isoformat(), Precision.DAY
    if key in config.SEASON_MONTHS:
        return datetime.date(year, config.SEASON_MONTHS[key], 1).isoformat(), Precision.SEASON
    return None
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "resolve_" -v && cd -`
Expected: PASS (all 4). (Pfingstmontag 1922 = Easter Apr 16 + 50 = June 5.)
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): feast + season resolution"
```
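
The Step 1 tests cover movable feasts, fixed feasts, and seasons, but not the Advent branch. A minimal sketch of the "last Sunday on or before Dec 24" rule follows, with the helper copied under a public name so it runs standalone. The 1922 case is the boundary case where Dec 24 is itself a Sunday; the 2024 case needs a two-day step back.

```python
import datetime


def advent_sunday(year: int, n: int) -> datetime.date:
    """n-th Advent Sunday (1..4); the 4th is the last Sunday on or before Dec 24."""
    dec24 = datetime.date(year, 12, 24)
    # weekday(): Mon=0 .. Sun=6, so this is the number of days back to the nearest Sunday.
    back_to_sunday = (dec24.weekday() - 6) % 7
    fourth = dec24 - datetime.timedelta(days=back_to_sunday)
    return fourth - datetime.timedelta(days=(4 - n) * 7)


fourth_1922 = advent_sunday(1922, 4)  # Dec 24, 1922 is a Sunday, so no step back
first_1922 = advent_sunday(1922, 1)   # three weeks earlier
first_2024 = advent_sunday(2024, 1)   # Dec 24, 2024 is a Tuesday
```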
---
### Task 4: Year expansion / century rule (`REQ-DATE-03`)
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add the failing test**
```python
def test_expand_year():
    assert dates.expand_year("1888") == 1888
    assert dates.expand_year("889") == 1889  # 3-digit -> 1DDD
    assert dates.expand_year("923") == 1923
    assert dates.expand_year("08") == 1908  # 00..57 -> 19xx
    assert dates.expand_year("17") == 1917
    assert dates.expand_year("57") == 1957
    assert dates.expand_year("73") == 1873  # 73..99 -> 18xx
    assert dates.expand_year("99") == 1899
    assert dates.expand_year("65") is None  # 58..72 ambiguous
    assert dates.expand_year("x") is None
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v && cd -`
Expected: FAIL — `expand_year` not defined.
- [ ] **Step 3: Implement** — add to `dates.py`
```python
def expand_year(token: str):
    """Expand a 2/3/4-digit year string per the 1873-1957 century rule. None if ambiguous."""
    token = token.strip()
    if not token.isdigit():
        return None
    n, v = len(token), int(token)
    if n == 4:
        return v
    if n == 3:
        return 1000 + v
    if n == 2:
        if v <= config.TWO_DIGIT_19XX_MAX:
            return 1900 + v
        if v >= config.TWO_DIGIT_18XX_MIN:
            return 1800 + v
        return None
    return None
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_expand_year -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): year expansion century rule"
```
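
To make the century rule concrete, the sketch below classifies all 100 two-digit values with the same two thresholds as `config.py`. The helper name `bucket` is hypothetical. The counts show how much of the two-digit space is deliberately left unresolved and routed to review.

```python
TWO_DIGIT_19XX_MAX = 57  # mirrors config.TWO_DIGIT_19XX_MAX
TWO_DIGIT_18XX_MIN = 73  # mirrors config.TWO_DIGIT_18XX_MIN


def bucket(yy: int) -> str:
    """Classify a two-digit year the way expand_year() does."""
    if yy <= TWO_DIGIT_19XX_MAX:
        return "19xx"
    if yy >= TWO_DIGIT_18XX_MIN:
        return "18xx"
    return "ambiguous"  # 58..72: outside the archive range in either century


counts: dict[str, int] = {}
for yy in range(100):
    counts[bucket(yy)] = counts.get(bucket(yy), 0) + 1
```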
---
### Task 5: `parse_date` dispatch + ISO + numeric forms (`FR-DATE`, `REQ-DATE-01/04/05`)
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add failing tests**
```python
def test_parse_iso_and_empty():
    assert dates.parse_date("1910-04-23") == dates.ParsedDate("1910-04-23", Precision.DAY, "1910-04-23")
    assert dates.parse_date("") == dates.ParsedDate(None, Precision.UNKNOWN, "")
    assert dates.parse_date("?") == dates.ParsedDate(None, Precision.UNKNOWN, "?")


def test_parse_numeric_forms():
    assert dates.parse_date("15.2.1888").iso == "1888-02-15"
    assert dates.parse_date("13.5.09").iso == "1909-05-13"
    assert dates.parse_date("17/6. 1916").iso == "1916-06-17"
    assert dates.parse_date("11.10.08").iso == "1908-10-11"
    assert dates.parse_date("30.1.889").iso == "1889-01-30"
    assert dates.parse_date("15.2.1888").precision == Precision.DAY


def test_parse_numeric_unparseable():
    assert dates.parse_date("8.9.").precision == Precision.UNKNOWN  # no year
    assert dates.parse_date("13.5.65").precision == Precision.UNKNOWN  # ambiguous 2-digit year


def test_parse_approx_marker_upgrades_precision():
    r = dates.parse_date("17.Nov (?) 1887")  # month-name handled in a later task; here just the marker path
    # after the marker is detected, a parsed date becomes APPROX (verified fully in Task 8)
    assert r.raw == "17.Nov (?) 1887"
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v && cd -`
Expected: FAIL — `ParsedDate` / `parse_date` not defined.
- [ ] **Step 3: Implement** — add to `dates.py`
```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedDate:
    iso: str | None
    precision: Precision
    raw: str


_LEADING_MARKERS = re.compile(
    r"^(um|ca\.?|circa|etwa|wohl|vermutlich|nach|vor|anfang|mitte|ende)\s+", re.I)


def _preprocess(raw: str):
    """Return (cleaned_string, approx_flag)."""
    s = (raw or "").strip()
    if not s:
        return "", False
    low = s.lower()
    approx = ("?" in s) or any(
        m in low for m in ("um ", "ca.", "ca ", "circa", "etwa", "wohl", "vermutlich"))
    s = re.sub(r"\(\s*\?\s*\)", " ", s)  # remove "(?)"
    s = s.replace("?", " ")
    s = re.sub(r",.*$", "", s)  # drop trailing editorial note (", 2. Brief")
    s = _LEADING_MARKERS.sub("", s)
    s = re.sub(r"\s+", " ", s).strip(" .,")
    return s, approx


_NUM_RE = re.compile(r"(\d{1,2})[./](\d{1,2})\.?\s*(\d{2,4})")


def _match_iso(s):
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        try:
            datetime.date.fromisoformat(s)
            return s, Precision.DAY
        except ValueError:
            return None
    return None


def _match_numeric(s):
    m = _NUM_RE.fullmatch(s)
    if not m:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    year = expand_year(m.group(3))
    if year is None or not (1 <= month <= 12):
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None


# Matchers are tried in order. Later tasks append to this list.
_MATCHERS = [_match_iso, _match_numeric]


def parse_date(raw: str, date_overrides: dict | None = None) -> ParsedDate:
    if date_overrides:
        key = (raw or "").strip()
        if key in date_overrides:
            iso, prec = date_overrides[key]
            return ParsedDate(iso or None, Precision(prec), raw)
    cleaned, approx = _preprocess(raw)
    if not cleaned:
        return ParsedDate(None, Precision.UNKNOWN, raw)
    for matcher in _MATCHERS:
        result = matcher(cleaned)
        if result:
            iso, precision = result
            if approx:
                precision = Precision.APPROX
            return ParsedDate(iso, precision, raw)
    return ParsedDate(None, Precision.UNKNOWN, raw)
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "parse_" -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): parse_date dispatch + iso/numeric matchers"
```
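
It can help to see what the numeric pattern captures before year expansion and calendar validation run. The sketch below reuses the `_NUM_RE` pattern verbatim under the name `NUM_RE`; `split_numeric` is a hypothetical helper for illustration. Note that `_preprocess` strips trailing dots, so the raw value `8.9.` reaches the matcher as `8.9`, which has no year token and therefore does not match.

```python
import re

# Same pattern as _NUM_RE: day, "." or "/", month, optional ".", optional spaces, year token.
NUM_RE = re.compile(r"(\d{1,2})[./](\d{1,2})\.?\s*(\d{2,4})")


def split_numeric(s: str):
    """Return the (day, month, year_token) strings, or None when the shape does not fit."""
    m = NUM_RE.fullmatch(s)
    return m.groups() if m else None


examples = {s: split_numeric(s) for s in ["15.2.1888", "17/6. 1916", "30.1.889", "8.9"]}
```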
---
### Task 6: Roman-numeral month matcher
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add failing test**
```python
def test_parse_roman_months():
    assert dates.parse_date("22.III.18").iso == "1918-03-22"
    assert dates.parse_date("19.XII.1954").iso == "1954-12-19"
    assert dates.parse_date("1.III.27").iso == "1927-03-01"
    assert dates.parse_date("22.III.18").precision == Precision.DAY
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v && cd -`
Expected: FAIL — Roman dates currently fall through to UNKNOWN.
- [ ] **Step 3: Implement** — add to `dates.py` and register the matcher
```python
_ROMAN_RE = re.compile(r"(\d{1,2})\.\s*([IVXLC]+)\.?\s*(\d{2,4})", re.I)


def _match_roman(s):
    m = _ROMAN_RE.fullmatch(s)
    if not m:
        return None
    day = int(m.group(1))
    month = config.ROMAN_MONTHS.get(m.group(2).lower())
    year = expand_year(m.group(3))
    if not month or year is None:
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None
```
Then change the matcher list line to:
```python
_MATCHERS = [_match_iso, _match_numeric, _match_roman]
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_roman_months -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): roman-numeral month matcher"
```
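
The character class `[IVXLC]+` accepts more letter runs than there are months, so the table lookup is what actually rejects non-month numerals. A small sketch of that division of labour, with `roman_month` as a hypothetical stand-in for the lookup inside `_match_roman`:

```python
# Copy of config.ROMAN_MONTHS so the sketch runs alone.
ROMAN_MONTHS = {
    "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6,
    "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12,
}


def roman_month(token: str):
    """Return the month number for a Roman numeral I..XII, else None."""
    return ROMAN_MONTHS.get(token.lower())


valid = roman_month("XII")      # a real month
lowercase = roman_month("iii")  # case-insensitive, like the re.I flag on the pattern
too_big = roman_month("XIII")   # passes the regex character class, rejected by the table
malformed = roman_month("IIII") # non-canonical spelling, also rejected
```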
---
### Task 7: Month-name matchers (day-first + English month-first)
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add failing tests**
```python
def test_parse_monthname_day_first():
    assert dates.parse_date("6.März 1888").iso == "1888-03-06"
    assert dates.parse_date("29.Sept.1891").iso == "1891-09-29"
    assert dates.parse_date("10.Oct.95").iso == "1895-10-10"
    assert dates.parse_date("9.December1889").iso == "1889-12-09"
    assert dates.parse_date("18.Dez.1916").iso == "1916-12-18"
    assert dates.parse_date("4Dezember 1936").iso == "1936-12-04"
    assert dates.parse_date("25 August 1968").iso == "1968-08-25"


def test_parse_monthname_english_month_first():
    assert dates.parse_date("April 12. 1922").iso == "1922-04-12"
    assert dates.parse_date("Oct.5. 1916").iso == "1916-10-05"
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k monthname -v && cd -`
Expected: FAIL.
- [ ] **Step 3: Implement** — add to `dates.py`. `_match_monthname_a` is day-first; `_match_monthname_b` is English month-first.
```python
_MONTH_A_RE = re.compile(r"(\d{1,2})[.\s]*([A-Za-zÄÖÜäöü]+)\.?\s*(\d{2,4})")
_MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\.?\s*(\d{2,4})")


def _lookup_month(token: str):
    return config.MONTHS.get(token.lower().strip(" ."))


def _build_day_month_year(day, month, year):
    if not month or year is None or not (1 <= month <= 12):
        return None
    try:
        return datetime.date(year, month, day).isoformat(), Precision.DAY
    except ValueError:
        return None


def _match_monthname_a(s):
    m = _MONTH_A_RE.fullmatch(s)
    if not m:
        return None
    return _build_day_month_year(int(m.group(1)), _lookup_month(m.group(2)), expand_year(m.group(3)))


def _match_monthname_b(s):
    m = _MONTH_B_RE.fullmatch(s)
    if not m:
        return None
    return _build_day_month_year(int(m.group(2)), _lookup_month(m.group(1)), expand_year(m.group(3)))
```
Then update the matcher list (order matters — `_match_monthname_a` is day-first and safe to place before the month/year matcher; `_match_monthname_b` goes *after* the month/year matcher added in Task 8, so for now append only `_a`):
```python
_MATCHERS = [_match_iso, _match_numeric, _match_roman, _match_monthname_a]
```
- [ ] **Step 4: Run — expect `_a` cases to pass, `_b` (English) still failing**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_day_first -v && cd -`
Expected: PASS.
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py::test_parse_monthname_english_month_first -v && cd -`
Expected: FAIL (`_match_monthname_b` not yet registered — it is wired in Task 8 to sit after the month/year matcher so it doesn't shadow `Mai 1895`).
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): day-first month-name matcher"
```
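
The ordering constraint on `_match_monthname_b` is easier to see with the raw regex captures. In the sketch below, `MONTH_B_RE` is the month-first pattern from this task and `MONTH_YEAR_RE` is the month/year pattern that Task 8 introduces. Run against `Mai 1895`, the month-first pattern splits the four-digit year into a day and a two-digit year, which would expand to 18 May 1895 if that matcher were tried first.

```python
import re

MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\.?\s*(\d{2,4})")  # month-first
MONTH_YEAR_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s+(\d{2,4})")               # month + year

raw = "Mai 1895"

# Month-first reading: "18" is taken as the day and "95" as a two-digit year.
as_month_first = MONTH_B_RE.fullmatch(raw).groups()

# Month/year reading: the whole "1895" is the year, which is the intended result.
as_month_year = MONTH_YEAR_RE.fullmatch(raw).groups()

# A genuine month-first date does not fit the month/year pattern, so trying
# month/year first does not steal it.
genuine_month_first = MONTH_YEAR_RE.fullmatch("April 12. 1922")
```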
---
### Task 8: Month/year, feast/season, year-only, range matchers + final ordering + overrides
**Files:**
- Modify: `tools/import-normalizer/dates.py`
- Modify: `tools/import-normalizer/tests/test_dates.py`
- [ ] **Step 1: Add failing tests**
```python
def test_parse_month_year_year_only():
    assert dates.parse_date("Mai 1895") == dates.ParsedDate("1895-05-01", Precision.MONTH, "Mai 1895")
    assert dates.parse_date("October 1903").iso == "1903-10-01"
    assert dates.parse_date("1905") == dates.ParsedDate("1905-01-01", Precision.YEAR, "1905")


def test_parse_feast_and_season_via_parse_date():
    assert dates.parse_date("Pfingsten 1922") == dates.ParsedDate("1922-06-04", Precision.DAY, "Pfingsten 1922")
    assert dates.parse_date("Herbst 1913") == dates.ParsedDate("1913-10-01", Precision.SEASON, "Herbst 1913")
    assert dates.parse_date("Pfingstsonntag 1915").precision == Precision.DAY


def test_parse_ranges():
    assert dates.parse_date("8.1.1916 - 15.3.1916") == dates.ParsedDate("1916-01-08", Precision.RANGE, "8.1.1916 - 15.3.1916")
    assert dates.parse_date("1881/82") == dates.ParsedDate("1881-01-01", Precision.RANGE, "1881/82")
    assert dates.parse_date("1945/46?").iso == "1945-01-01"  # '?' stripped -> RANGE, then APPROX
    assert dates.parse_date("1945/46?").precision == Precision.APPROX


def test_parse_approx_full():
    r = dates.parse_date("17.Nov (?) 1887")
    assert r.iso == "1887-11-17"
    assert r.precision == Precision.APPROX


def test_parse_english_month_first_now_works():
    assert dates.parse_date("April 12. 1922").iso == "1922-04-12"
    assert dates.parse_date("Mai 1895").iso == "1895-05-01"  # not shadowed by month-first matcher


def test_parse_unparseable_examples():
    assert dates.parse_date("Freitag 1919").precision == Precision.UNKNOWN


def test_parse_invalid_calendar_date_is_unknown():
    # try/except ValueError in the matchers must route impossible dates to UNKNOWN (-> review),
    # never silently clamp. This is the most likely real-data bug class at 7,600 rows.
    assert dates.parse_date("30.2.1888").precision == Precision.UNKNOWN
    assert dates.parse_date("31.4.1916").precision == Precision.UNKNOWN


def test_parse_intra_month_day_range():
    # "7./8. Sept.1923" -> start day, RANGE. Must NOT be confused with slash-date "17/6. 1916".
    assert dates.parse_date("7./8. Sept.1923") == dates.ParsedDate("1923-09-07", Precision.RANGE, "7./8. Sept.1923")
    assert dates.parse_date("17/6. 1916") == dates.ParsedDate("1916-06-17", Precision.DAY, "17/6. 1916")


def test_parse_trailing_note_stripped_but_raw_preserved():
    r = dates.parse_date("17.Nov 1887, 2. Brief")  # REQ-DATE-04
    assert r.iso == "1887-11-17"
    assert "2. Brief" in r.raw  # original string preserved verbatim
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -k "month_year or feast_and_season or ranges or approx_full or english_month_first_now or unparseable_examples" -v && cd -`
Expected: FAIL.
- [ ] **Step 3: Implement** — add matchers to `dates.py`
```python
_MONTH_YEAR_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s+(\d{2,4})")
_TOKEN_YEAR_RE = re.compile(r"(.+?)\.?\s+(\d{2,4})")
_YEAR_ONLY_RE = re.compile(r"\d{4}")
_RANGE_YY_RE = re.compile(r"(\d{4})\s*/\s*\d{2}")
_RANGE_HYPHEN_RE = re.compile(r"(.*\d)\s*[-]\s*\d.*")
# Intra-month day range, e.g. "7./8. Sept.1923": require a dot before the slash so it
# does NOT swallow slash-as-dot single dates like "17/6. 1916" (which has no dot before "/").
_RANGE_DAY_RE = re.compile(r"(\d{1,2})\./(\d{1,2})\.\s*(.+)")


def _match_month_year(s):
    m = _MONTH_YEAR_RE.fullmatch(s)
    if not m:
        return None
    month = _lookup_month(m.group(1))
    year = expand_year(m.group(2))
    if not month or year is None:
        return None
    return datetime.date(year, month, 1).isoformat(), Precision.MONTH


def _match_feast_season(s):
    m = _TOKEN_YEAR_RE.fullmatch(s)
    if not m:
        return None
    year = expand_year(m.group(2))
    if year is None:
        return None
    return resolve_feast_or_season(m.group(1), year)


def _match_year_only(s):
    if _YEAR_ONLY_RE.fullmatch(s):
        return datetime.date(int(s), 1, 1).isoformat(), Precision.YEAR
    return None


def _match_range(s):
    m = _RANGE_YY_RE.fullmatch(s)
    if m:
        return datetime.date(int(m.group(1)), 1, 1).isoformat(), Precision.RANGE
    m = _RANGE_DAY_RE.fullmatch(s)
    if m:
        first = f"{m.group(1)}.{m.group(3)}"  # "7." + "Sept.1923" -> "7.Sept.1923"
        for matcher in (_match_numeric, _match_monthname_a):
            r = matcher(first)
            if r:
                return r[0], Precision.RANGE
    m = _RANGE_HYPHEN_RE.fullmatch(s)
    if m:
        start = m.group(1).strip()
        for matcher in (_match_numeric, _match_roman, _match_monthname_a, _match_year_only):
            r = matcher(start)
            if r:
                return r[0], Precision.RANGE
    return None
```
Then replace the matcher list with the final ordering:
```python
_MATCHERS = [
    _match_iso,
    _match_range,
    _match_numeric,
    _match_roman,
    _match_monthname_a,
    _match_month_year,
    _match_monthname_b,
    _match_feast_season,
    _match_year_only,
]
```
- [ ] **Step 4: Run the full date test file**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v && cd -`
Expected: PASS (all tests, including the English month-first test from Task 7).
- [ ] **Step 5: Add an overrides test, then commit**
Append to `tests/test_dates.py`:
```python
def test_parse_date_override_wins():
    ovr = {"13.5.65": ("1965-05-13", "DAY")}
    r = dates.parse_date("13.5.65", ovr)  # ambiguous without override
    assert r == dates.ParsedDate("1965-05-13", Precision.DAY, "13.5.65")
```
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_dates.py -v && cd -`
Expected: PASS.
```bash
git add tools/import-normalizer/dates.py tools/import-normalizer/tests/test_dates.py
git commit -m "feat(normalizer): month/year, feast/season, range matchers + overrides"
```
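
`parse_date()` expects `date_overrides` as a mapping from the raw cell text to an `(iso, precision)` pair. The sketch below shows one way such a mapping could be built from a CSV with the seed header `raw,iso,precision` listed in the File Structure section. `read_date_overrides` and the sample rows are hypothetical; the plan's own loader is `load_overrides()` in `overrides.py`, which is not shown in this part of the plan and may behave differently. An empty `iso` cell is kept as an empty string here, which `parse_date()` turns into `None` via `iso or None`.

```python
import csv
import io

# Hypothetical file content for illustration; only the header row comes from the plan.
DATES_CSV = "raw,iso,precision\n13.5.65,1965-05-13,DAY\nSommer 65,,UNKNOWN\n"


def read_date_overrides(handle) -> dict[str, tuple[str, str]]:
    """Build the {raw: (iso, precision)} mapping that parse_date() accepts."""
    overrides = {}
    for row in csv.DictReader(handle):
        raw = (row["raw"] or "").strip()
        if raw:  # skip blank lines or rows without a key
            overrides[raw] = ((row["iso"] or "").strip(), (row["precision"] or "").strip())
    return overrides


overrides = read_date_overrides(io.StringIO(DATES_CSV))
```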
---
### Task 9: Person register parsing (`FR-PERS`, US-PERS-01)
**Files:**
- Create: `tools/import-normalizer/persons.py`
- Create: `tools/import-normalizer/tests/test_persons.py`
- [ ] **Step 1: Write the failing test** in `tests/test_persons.py`
```python
import persons


def test_slugify():
    assert persons.slugify("de Gruyter", "Eugenie") == "de-gruyter-eugenie"
    assert persons.slugify("Müller", "Karl Erhard") == "mueller-karl-erhard"


def test_parse_register_basic():
    rows = [
        {"generation": "G 1", "last_name": "Blomquist", "first_name": "Charlotte,Meta,Jacobi",
         "maiden_name": "Ruge", "birth_date": "30.8.1862", "birth_place": "Schülperneusiel",
         "death_date": "1934-07-23", "death_place": "Göteborg", "spouse": '"Tante Lolly"',
         "notes": "Schwester v Marie Cram"},
        {"generation": "G 2", "last_name": "Bohrmann", "first_name": "Else",
         "maiden_name": "Cram", "birth_date": "28.11.1888", "spouse": "Ludwig Bohrmann",
         "notes": "Schwester v Herbert"},
    ]
    people = persons.parse_register(rows)
    p = people[0]
    assert p.person_id == "blomquist-charlotte"
    assert p.first_name == "Charlotte"
    assert p.maiden_name == "Ruge"
    assert p.birth_date == "1862-08-30"
    assert p.nickname == "Tante Lolly"  # quoted spouse field is a nickname, not a spouse
    assert p.spouse == ""
    assert "Meta" in p.extra_given_names and "Jacobi" in p.extra_given_names
    p2 = people[1]
    assert p2.maiden_name == "Cram"
    assert p2.spouse == "Ludwig Bohrmann"
    assert p2.provisional is False
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -`
Expected: FAIL — `persons` module / symbols not defined.
- [ ] **Step 3: Implement `persons.py`**
```python
"""Person register parsing, name splitting, alias resolution."""
import re
import unicodedata
from dataclasses import dataclass, field

import config
import dates

# German umlauts are transliterated (ü -> ue) rather than stripped (ü -> u).
# Upper-case forms map to lower-case because every caller lower-cases the result anyway.
_DIACRITIC_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                                "Ä": "ae", "Ö": "oe", "Ü": "ue"})


def _strip_accents(s: str) -> str:
    # Transliterate first; NFKD would otherwise reduce "ü" to a bare "u".
    s = s.translate(_DIACRITIC_MAP)
    s = unicodedata.normalize("NFKD", s)
    return "".join(c for c in s if not unicodedata.combining(c))


def slugify(last: str, first: str) -> str:
    raw = f"{last} {first}".strip()
    raw = _strip_accents(raw).lower()
    raw = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return raw or "unknown"


@dataclass
class Person:
    person_id: str
    last_name: str = ""
    first_name: str = ""
    maiden_name: str = ""
    title: str = ""
    nickname: str = ""
    extra_given_names: list = field(default_factory=list)
    birth_date: str | None = None
    birth_date_raw: str = ""
    birth_place: str = ""
    death_date: str | None = None
    death_date_raw: str = ""
    death_place: str = ""
    spouse: str = ""
    generation: str = ""
    notes: str = ""
    aliases: list = field(default_factory=list)
    provisional: bool = False


# A spouse cell wrapped entirely in quotes holds a nickname, not a spouse.
_QUOTED_RE = re.compile(r'^[“"\']\s*(.+?)\s*[”"\']$')


def parse_register(rows: list[dict]) -> list[Person]:
    people = []
    for r in rows:
        last = (r.get("last_name") or "").strip()
        if not last:
            continue
        given_raw = (r.get("first_name") or "").strip()
        givens = [g.strip() for g in given_raw.split(",") if g.strip()]
        first = givens[0] if givens else ""
        extra = givens[1:]
        spouse_raw = (r.get("spouse") or "").strip()
        nickname = ""
        m = _QUOTED_RE.match(spouse_raw)
        if m:
            nickname = m.group(1)
            spouse_raw = ""
        birth = dates.parse_date(r.get("birth_date") or "")
        death = dates.parse_date(r.get("death_date") or "")
        people.append(Person(
            person_id=slugify(last, first),
            last_name=last, first_name=first, maiden_name=(r.get("maiden_name") or "").strip(),
            nickname=nickname, extra_given_names=extra,
            birth_date=birth.iso, birth_date_raw=(r.get("birth_date") or "").strip(),
            birth_place=(r.get("birth_place") or "").strip(),
            death_date=death.iso, death_date_raw=(r.get("death_date") or "").strip(),
            death_place=(r.get("death_place") or "").strip(),
            spouse=spouse_raw, generation=(r.get("generation") or "").strip(),
            notes=(r.get("notes") or "").strip(), provisional=False,
        ))
    # De-duplicate colliding ids with a numeric suffix: the first occurrence keeps
    # the bare slug, later ones become "<slug>-2", "<slug>-3", ...
    seen = {}
    for p in people:
        if p.person_id in seen:
            seen[p.person_id] += 1
            p.person_id = f"{p.person_id}-{seen[p.person_id]}"
        else:
            seen[p.person_id] = 1
    return people
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): person register parsing"
```
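
The order of operations in `_strip_accents` matters: umlauts are transliterated before NFKD decomposition, otherwise `ü` would lose its diaeresis and `Müller` would slug to `muller`. A minimal standalone sketch of that difference follows; the helper names `slug_translit_first` and `slug_nfkd_only` are illustrative only and not part of the plan.

```python
import re
import unicodedata

# Same umlaut table as in persons.py.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                         "Ä": "ae", "Ö": "oe", "Ü": "ue"})


def _ascii_fold(s: str) -> str:
    """Drop combining marks after NFKD decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def _hyphenate(s: str) -> str:
    """Lower-case and replace every non-alphanumeric run with one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")


def slug_translit_first(name: str) -> str:
    # Order used by the plan: transliterate umlauts, then fold other accents.
    return _hyphenate(_ascii_fold(name.translate(UMLAUTS)))


def slug_nfkd_only(name: str) -> str:
    # Hypothetical variant without the umlaut table, for comparison.
    return _hyphenate(_ascii_fold(name))


print(slug_translit_first("Müller Karl"))  # mueller-karl
print(slug_nfkd_only("Müller Karl"))       # muller-karl
print(slug_translit_first("Strauß René"))  # strauss-rene
```

The sketch assumes precomposed input (`ü` as a single code point). If a source cell ever contained the decomposed form (`u` followed by U+0308), the translate table would not match it; applying NFC normalisation first would be one way to cover that case.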
---
### Task 10: Receiver splitting (`REQ-PERS-01`, US-PERS-02 AC4)
**Files:**
- Modify: `tools/import-normalizer/persons.py`
- Modify: `tools/import-normalizer/tests/test_persons.py`
- [ ] **Step 1: Add failing tests** (ported from the Java `PersonNameParser` contract)
```python
def test_split_receivers():
    assert persons.split_receivers("Eugenie Müller") == ["Eugenie Müller"]
    assert persons.split_receivers("Walter und Eugenie de Gruyter") == ["Walter de Gruyter", "Eugenie de Gruyter"]
    assert persons.split_receivers("Hedi und Tutu (Gruber)") == ["Hedi Gruber", "Tutu Gruber"]
    assert persons.split_receivers("Clara u Familie") == ["Clara"]
    assert persons.split_receivers("Eugenie de Gruyter geb. Müller") == ["Eugenie de Gruyter"]
    assert persons.split_receivers("Herbert u Clara") == ["Herbert", "Clara"]
    assert persons.split_receivers("") == []


def test_find_known_last_name():
    assert persons.find_known_last_name("Eugenie de Gruyter") == "de Gruyter"
    assert persons.find_known_last_name("Clara") is None
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd -`
Expected: FAIL.
- [ ] **Step 3: Implement** — add to `persons.py`
```python
_GEB_RE = re.compile(r",?\s*geb\.?\s+.+$", re.I)   # trailing "geb. <maiden name>"
_PAREN_RE = re.compile(r"\(([^)]+)\)\s*$")          # trailing "(<shared last name>)"
_MULTI_RE = re.compile(r"\s+(?:und|u)\s+", re.I)    # " und " / " u " between persons


def find_known_last_name(segment: str):
    seg = segment.strip()
    for ln in config.KNOWN_LAST_NAMES:  # config lists longest-first
        if seg == ln or seg.endswith(" " + ln):
            return ln
    return None


def split_receivers(raw: str) -> list[str]:
    if not raw or not raw.strip():
        return []
    # 0. "//" separates independent receiver groups; split each group on its own.
    if "//" in raw:
        out = []
        for seg in raw.split("//"):
            out.extend(split_receivers(seg))
        return out
    # 1. Drop a trailing maiden-name clause; a single person is returned as-is.
    cleaned = _GEB_RE.sub("", raw).strip()
    if not _MULTI_RE.search(cleaned):
        return [cleaned]
    # 2. A trailing "(Name)" is a last name shared by all parts.
    shared_last = None
    pm = _PAREN_RE.search(cleaned)
    if pm:
        shared_last = pm.group(1).strip()
        cleaned = cleaned[:pm.start()].strip()
    # 3. Split on "und"/"u" and drop the filler word "Familie".
    parts = [p.strip() for p in _MULTI_RE.split(cleaned)]
    parts = [p for p in parts if p and p.lower() != "familie"]
    if not parts:
        return []
    if len(parts) == 1:
        return [parts[0]]
    if shared_last:
        return [p if " " in p else f"{p} {shared_last}" for p in parts]
    # 4. Otherwise propagate a known last name from the final part to bare first names.
    last_seg = parts[-1]
    detected = find_known_last_name(last_seg)
    if detected:
        result = []
        for p in parts[:-1]:
            if " " not in p and find_known_last_name(p) is None:
                result.append(f"{p} {detected}")
            else:
                result.append(p)
        result.append(last_seg)
        return result
    return parts
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k "split_receivers or known_last" -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): receiver splitting"
```
---
### Task 11: Alias index (`FR-DEDUP`, REQ-DEDUP-01/02)
**Files:**
- Modify: `tools/import-normalizer/persons.py`
- Modify: `tools/import-normalizer/tests/test_persons.py`
- [ ] **Step 1: Add failing tests**
```python
import config  # needed for FUZZY_SUGGEST_THRESHOLD below; add next to `import persons`


def test_alias_index_resolves_maiden_and_married():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"},
        {"last_name": "Cram", "first_name": "Clara"},
    ])
    idx = persons.AliasIndex(people)
    eugenie = people[0].person_id
    assert idx.resolve("Eugenie de Gruyter") == eugenie  # canonical
    assert idx.resolve("Eugenie Müller") == eugenie      # maiden alias
    assert idx.resolve("eugenie müller") == eugenie      # normalized
    assert idx.resolve("Nobody Unknown") is None


def test_alias_index_suggestion():
    people = persons.parse_register([{"last_name": "Wittkopf", "first_name": "Hans"}])
    idx = persons.AliasIndex(people)
    sid, score = idx.suggest("Hans Wittkop")  # typo
    assert sid == people[0].person_id and score >= config.FUZZY_SUGGEST_THRESHOLD
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k alias -v && cd -`
Expected: FAIL — `AliasIndex` not defined.
- [ ] **Step 3: Implement** — add to `persons.py`
```python
import difflib


def _norm(name: str) -> str:
    return re.sub(r"\s+", " ", _strip_accents(name).lower().replace(".", " ")).strip()


class AliasIndex:
    def __init__(self, people: list[Person]):
        self._by_alias: dict[str, str] = {}
        self._display: dict[str, str] = {}
        self.known_ids: set[str] = {p.person_id for p in people}
        first_name_ids: dict[str, list] = {}
        for p in people:
            self._display[p.person_id] = f"{p.first_name} {p.last_name}".strip()
            # Ordered, de-duplicated forms (NOT a set) so alias order is deterministic (NFR-IDEM-01).
            forms = [f"{p.first_name} {p.last_name}".strip()]
            if p.maiden_name:
                forms.append(f"{p.first_name} {p.maiden_name}".strip())
            for extra in p.extra_given_names:
                forms.append(f"{extra} {p.last_name}".strip())
            if p.nickname:
                forms.append(p.nickname)
            seen = set()
            for form in forms:
                if form in seen:
                    continue
                seen.add(form)
                key = _norm(form)
                # First register entry wins when two persons share an alias.
                if key and key not in self._by_alias:
                    self._by_alias[key] = p.person_id
                    p.aliases.append(form)
            if p.first_name:
                ids = first_name_ids.setdefault(_norm(p.first_name), [])
                if p.person_id not in ids:
                    ids.append(p.person_id)
        # first-name-only alias, only when unambiguous
        for fname, ids in first_name_ids.items():
            if len(ids) == 1 and fname not in self._by_alias:
                self._by_alias[fname] = ids[0]

    def resolve(self, name: str):
        return self._by_alias.get(_norm(name))

    def display(self, person_id: str) -> str:
        return self._display.get(person_id, "")

    def suggest(self, name: str):
        keys = list(self._by_alias.keys())
        match = difflib.get_close_matches(_norm(name), keys, n=1, cutoff=config.FUZZY_SUGGEST_THRESHOLD)
        if not match:
            return None, 0.0
        score = difflib.SequenceMatcher(None, _norm(name), match[0]).ratio()
        return self._by_alias[match[0]], score
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k alias -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): alias index with maiden/married/nickname resolution"
```
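
`AliasIndex.suggest` only proposes a candidate for the review file; nothing is matched automatically. The following stdlib-only sketch shows how `difflib` scores a typo against known aliases. The cutoff `0.85` is a hypothetical stand-in for `config.FUZZY_SUGGEST_THRESHOLD`, whose real value is defined in `config.py`.

```python
import difflib

CUTOFF = 0.85  # assumption: stand-in for config.FUZZY_SUGGEST_THRESHOLD


def suggest(name: str, known: list[str], cutoff: float = CUTOFF):
    """Return (best_match, score), or (None, 0.0) when nothing clears the cutoff."""
    match = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    if not match:
        return None, 0.0
    score = difflib.SequenceMatcher(None, name, match[0]).ratio()
    return match[0], score


known_aliases = ["hans wittkopf", "clara cram"]
typo_match, typo_score = suggest("hans wittkop", known_aliases)
no_match, no_score = suggest("nobody unknown", known_aliases)
print(typo_match, round(typo_score, 2))
print(no_match, no_score)
```

A single dropped letter in a twelve-character name still scores well above the cutoff, while an unrelated name yields no suggestion at all, which is the conservative behaviour the plan asks for.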
---
### Task 12: Spreadsheet ingest (`FR-INGEST`, `FR-MAP`, REQ-INGEST-01, REQ-MAP-01)
**Files:**
- Create: `tools/import-normalizer/ingest.py`
- Create: `tools/import-normalizer/tests/test_ingest.py`
- [ ] **Step 1: Write failing tests** (build a tiny workbook on disk with openpyxl)
```python
import datetime

import openpyxl
import pytest

import ingest


def _make_workbook(tmp_path, sheet_name, rows):
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = sheet_name
    for r in rows:
        ws.append(r)
    path = tmp_path / "wb.xlsx"
    wb.save(path)
    return path


def test_read_sheet_converts_cells(tmp_path):
    path = _make_workbook(tmp_path, "S", [
        ["Index", "Datum"],
        ["W-0001", datetime.datetime(1888, 2, 15)],
        ["W-0002", 1],
    ])
    rows = ingest.read_sheet(path, "S")
    assert rows[0] == ["Index", "Datum"]
    assert rows[1] == ["W-0001", "1888-02-15"]  # Excel date -> ISO string
    assert rows[2] == ["W-0002", "1"]           # integer -> plain string


def test_build_header_map_collapses_whitespace_and_case():
    header = ["Index", "Datum des Briefes", "EmpfängerIn", "Mystery"]
    field_map = {"index": "index", "datum des briefes": "date", "empfängerin": "receivers"}
    fields, unknown = ingest.build_header_map(header, field_map, required={"index"})
    assert fields == {"index": 0, "date": 1, "receivers": 2}
    assert unknown == ["Mystery"]


def test_build_header_map_missing_required_raises():
    with pytest.raises(ValueError, match="index"):
        ingest.build_header_map(["Box", "Ort"], {"box": "box", "ort": "location"}, required={"index"})
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd -`
Expected: FAIL — `ingest` not defined.
- [ ] **Step 3: Implement `ingest.py`**
```python
"""Read .xlsx sheets into neutral list[list[str]] and map headers to fields."""
import datetime
from pathlib import Path

import openpyxl


def _cell_to_str(value) -> str:
    if value is None:
        return ""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime.datetime):
        return value.date().isoformat()
    if isinstance(value, datetime.date):
        return value.isoformat()
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    if isinstance(value, int):
        return str(value)
    return str(value).strip()


def read_sheet(path: Path, sheet_name: str) -> list[list[str]]:
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        if sheet_name not in wb.sheetnames:
            raise ValueError(f"Sheet '{sheet_name}' not found in {path.name}; sheets: {wb.sheetnames}")
        ws = wb[sheet_name]
        return [[_cell_to_str(v) for v in row] for row in ws.iter_rows(values_only=True)]
    finally:
        # Read-only workbooks hold the file open; close it on the error path too.
        wb.close()


def _norm_header(text: str) -> str:
    return " ".join(text.lower().split())


def build_header_map(header_row: list[str], field_map: dict[str, str], required: set[str]):
    """Return (field->col_index, unknown_headers). Raise ValueError if a required field is missing."""
    fields: dict[str, int] = {}
    unknown: list[str] = []
    for idx, raw in enumerate(header_row):
        key = _norm_header(raw)
        if key in field_map:
            fields[field_map[key]] = idx
        elif raw.strip():
            unknown.append(raw)
    missing = required - set(fields)
    if missing:
        raise ValueError(f"Required header(s) missing: {sorted(missing)} (found headers: {header_row})")
    return fields, unknown
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_ingest.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/ingest.py tools/import-normalizer/tests/test_ingest.py
git commit -m "feat(normalizer): xlsx ingest + header mapping"
```
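
Header matching hinges on `_norm_header`, which lower-cases the cell and collapses every run of whitespace. Because `str.split()` without arguments splits on any Unicode whitespace, this also absorbs tabs and non-breaking spaces that spreadsheets tend to carry. A standalone sketch with a hypothetical two-entry field map (the real mapping belongs in `config.py`):

```python
def norm_header(text: str) -> str:
    """Lower-case and collapse any run of Unicode whitespace to one space."""
    return " ".join(text.lower().split())


# Assumption: illustrative subset of the header -> field mapping.
HYPOTHETICAL_FIELD_MAP = {"datum des briefes": "date", "empfängerin": "receivers"}

messy_headers = ["Datum  des\u00a0Briefes ", "\tEmpfängerIn", "Mystery"]
mapped = {h: HYPOTHETICAL_FIELD_MAP.get(norm_header(h)) for h in messy_headers}
for header, field_name in mapped.items():
    print(repr(header), "->", field_name)
```

Headers that do not normalise to a known key map to `None` here; in `build_header_map` they end up in the `unknown` list instead, so a renamed column is reported rather than silently dropped.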
---
### Task 13: Row extraction, triage & CanonicalDocument (`FR-TRIAGE`, REQ-TRIAGE-01/02/03, `FR-PROV`)
**Files:**
- Create: `tools/import-normalizer/documents.py`
- Create: `tools/import-normalizer/tests/test_documents.py`
- [ ] **Step 1: Write failing tests**
```python
import documents
from documents import Triage


def test_extract_row():
    header = {"index": 0, "file": 1, "box": 2, "folder": 3, "sender": 4,
              "receivers": 5, "date": 6, "location": 7, "tags": 8, "summary": 9}
    cells = ["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
             "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"]
    raw = documents.extract_row(cells, header, source_row=3)
    assert raw.index == "W-0001"
    assert raw.sender == "Walter de Gruyter"
    assert raw.date == "15.2.1888"
    assert raw.source_row == 3


def test_triage():
    assert documents.triage(["", "", ""]) == Triage.EMPTY
    assert documents.triage(["", "", "Walter"]) == Triage.BLANK_INDEX  # data but no index
    assert documents.triage(["W-0001x", "x"]) == Triage.X_SUFFIX
    assert documents.triage(["W-0001", "x"]) == Triage.OK


def test_classify_blank_index():
    header = {"sender": 4, "receivers": 5}
    banner = ["", "", "", "", "Brautbriefe von Walter an Eugenie", ""]
    data = ["", "", "V", "1", "", "Eugenie"]
    assert documents.classify_blank_index(banner, header) == "section_banner"
    assert documents.classify_blank_index(data, header) == "data_no_index"


def test_index_file_mismatch():
    assert documents.index_file_mismatch("W-0010x", r"..\__scan\W-0011x.pdf") is True
    assert documents.index_file_mismatch("W-0001", r"..\__scan\W-0001.pdf") is False
    assert documents.index_file_mismatch("W-0001", "") is False
```
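Note: `triage` takes the raw `cells` list and reads the index from `index_col` (default 0). That matches the header used in these tests, where `index` is column 0; the orchestrator must pass the mapped index column if the sheet orders its columns differently.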
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -`
Expected: FAIL — `documents` not defined.
- [ ] **Step 3: Implement `documents.py`** (extraction + triage + dataclasses; resolution added in Task 14)
```python
"""Document row extraction, triage, and the canonical document record."""
from dataclasses import dataclass, field
from enum import Enum, auto


class Triage(Enum):
    OK = auto()
    EMPTY = auto()
    BLANK_INDEX = auto()
    X_SUFFIX = auto()


@dataclass
class RawRow:
    source_row: int
    index: str = ""
    file: str = ""
    box: str = ""
    folder: str = ""
    sender: str = ""
    receivers: str = ""
    date: str = ""
    location: str = ""
    tags: str = ""
    summary: str = ""


@dataclass
class CanonicalDocument:
    index: str
    box: str = ""
    folder: str = ""
    sender_person_id: str = ""
    sender_name: str = ""
    receiver_person_ids: list = field(default_factory=list)
    receiver_names: list = field(default_factory=list)
    date_iso: str = ""
    date_raw: str = ""
    date_precision: str = ""
    location: str = ""
    tags: list = field(default_factory=list)
    summary: str = ""
    source_row: int = 0
    needs_review: list = field(default_factory=list)


_FIELDS = ["index", "file", "box", "folder", "sender", "receivers", "date", "location", "tags", "summary"]


def extract_row(cells: list[str], header: dict[str, int], source_row: int) -> RawRow:
    def get(field_name):
        idx = header.get(field_name)
        if idx is None or idx >= len(cells):
            return ""
        return (cells[idx] or "").strip()

    return RawRow(source_row=source_row, **{f: get(f) for f in _FIELDS})


def triage(cells: list[str], index_col: int = 0) -> Triage:
    nonempty = [c for c in cells if c and str(c).strip()]
    if not nonempty:
        return Triage.EMPTY
    index = (cells[index_col] or "").strip() if index_col < len(cells) else ""
    if not index:
        return Triage.BLANK_INDEX
    if index.endswith("x"):
        return Triage.X_SUFFIX
    return Triage.OK


def classify_blank_index(cells: list[str], header: dict[str, int]) -> str:
    """REQ-TRIAGE-02: 'section_banner' if only name columns are populated, else 'data_no_index'."""
    name_cols = {header.get("sender"), header.get("receivers")} - {None}
    populated = {i for i, c in enumerate(cells) if c and str(c).strip()}
    if populated and populated <= name_cols:
        return "section_banner"
    return "data_no_index"


def index_file_mismatch(index: str, file_path: str) -> bool:
    if not file_path.strip():
        return False
    # Source paths use Windows separators; normalise before taking the basename.
    basename = file_path.replace("\\", "/").rsplit("/", 1)[-1]
    stem = basename.rsplit(".", 1)[0]
    return stem != index
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py
git commit -m "feat(normalizer): row extraction, triage, canonical record"
```
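
`index_file_mismatch` extracts the file stem by hand because the source column holds Windows-style relative paths. As an alternative sketch (not what the plan implements), `pathlib.PureWindowsPath` gives the same result for the test cases above and accepts both separator styles on any host OS. The function name `index_file_mismatch_alt` is hypothetical.

```python
from pathlib import PureWindowsPath


def index_file_mismatch_alt(index: str, file_path: str) -> bool:
    """Alternative sketch: compare the index with the file stem via pathlib."""
    if not file_path.strip():
        return False
    # PureWindowsPath treats both "\\" and "/" as separators, regardless of host OS.
    return PureWindowsPath(file_path.strip()).stem != index


print(index_file_mismatch_alt("W-0010x", r"..\__scan\W-0011x.pdf"))  # True
print(index_file_mismatch_alt("W-0001", r"..\__scan\W-0001.pdf"))    # False
print(index_file_mismatch_alt("W-0001", "../__scan/W-0001.pdf"))     # False
```

Either variant compares case-sensitively, so `w-0001.pdf` would be flagged against index `W-0001`. Whether that is desirable depends on how consistently the scans were named.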
---
### Task 14: Resolution context + to_canonical (`FR-PERS`, `FR-DATE` integration, REQ-PROV-02)
**Files:**
- Modify: `tools/import-normalizer/persons.py`
- Modify: `tools/import-normalizer/documents.py`
- Modify: `tools/import-normalizer/tests/test_documents.py`
- [ ] **Step 1: Add failing tests** to `tests/test_documents.py`
```python
import persons
import documents


def _ctx():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Walter"},
        {"last_name": "de Gruyter", "first_name": "Eugenie", "maiden_name": "Müller"},
    ])
    return persons.ResolutionContext(persons.AliasIndex(people), name_overrides={})


def test_to_canonical_resolves_and_flags():
    ctx = _ctx()
    raw = documents.RawRow(source_row=3, index="W-0001", box="V", folder="1",
                           sender="Walter de Gruyter", receivers="Eugenie Müller",
                           date="15.2.1888", location="Rotterdam", tags="Brautbriefe",
                           summary="Geschäftsreise", file=r"..\__scan\W-0001.pdf")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "de-gruyter-walter"
    assert doc.receiver_person_ids == ["de-gruyter-eugenie"]  # matched via maiden alias
    assert doc.date_iso == "1888-02-15" and doc.date_precision == "DAY"
    assert doc.tags == ["Brautbriefe"]
    assert doc.needs_review == []


def test_to_canonical_unmatched_and_unparsed():
    ctx = _ctx()
    raw = documents.RawRow(source_row=9, index="C-0001",
                           sender="Hans Wittkopf", receivers="", date="Freitag 1919")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "wittkopf-hans"  # provisional
    assert "unmatched_sender" in doc.needs_review
    assert "unparsed_date" in doc.needs_review
    assert ctx.unmatched["Hans Wittkopf"] == [9]
    assert any(p.provisional for p in ctx.provisional.values())


def test_to_canonical_splits_multi_sender():
    # REQ-PERS-01 / IMP-11: a multi-person sender is parsed, primary kept, flagged.
    ctx = _ctx()
    raw = documents.RawRow(source_row=5, index="C-0100", sender="Walter und Eugenie de Gruyter", receivers="")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.sender_person_id == "de-gruyter-walter"  # first part is primary
    assert "multi_sender" in doc.needs_review


def test_provisional_id_never_collides_with_register():
    # A provisional built from an unmatched string must not steal a register person_id.
    people = persons.parse_register([{"last_name": "Cram", "first_name": "Clara"}])
    ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={})
    # "Clara, Cram" does not resolve (the comma survives normalisation), but
    # _last_first() splits it into last="Cram", first="Clara,", which slugs to
    # "cram-clara": exactly the register id. The guard must suffix it.
    pid, _, matched = ctx.resolve_one("Clara, Cram", source_row=1)
    assert matched is False
    assert pid == "cram-clara-2"  # suffixed away from the register id


def test_ambiguous_space_pair_flagged_not_split():
    # US-PERS-02 AC4: "Ella Anita" is kept as one provisional + flagged, never guessed into two.
    ctx = _ctx()
    raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert len(doc.receiver_person_ids) == 1  # not split
    assert any(part == "Ella Anita" for _, part, _ in ctx.ambiguous)
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -k "to_canonical" -v && cd -`
Expected: FAIL — `ResolutionContext` / `to_canonical` not defined.
- [ ] **Step 3a: Implement `ResolutionContext`** — add to `persons.py`
```python
class ResolutionContext:
    """Resolves raw name strings to person ids; accumulates provisional persons and review data."""

    def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str]):
        self.index = alias_index
        self.name_overrides = name_overrides
        self.provisional: dict[str, Person] = {}
        self.unmatched: dict[str, list] = {}
        self.ambiguous: list[tuple] = []
        self._raw_to_pid: dict[str, str] = {}
        self.override_hits = 0

    def _unique_id(self, base: str) -> str:
        """A provisional id must never collide with a register id or another provisional."""
        used = self.index.known_ids | set(self.provisional)
        pid, n = base, 1
        while pid in used:
            n += 1
            pid = f"{base}-{n}"
        return pid

    def resolve_one(self, raw_name: str, source_row: int):
        """Return (person_id, display_name, matched: bool). '' name -> ('', '', True)."""
        name = (raw_name or "").strip()
        if not name:
            return "", "", True
        # 1. A human-supplied override always wins.
        if name in self.name_overrides:
            self.override_hits += 1
            pid = self.name_overrides[name]
            return pid, self.index.display(pid) or name, True
        # 2. Exact alias lookup (canonical, maiden, nickname, unambiguous first name).
        pid = self.index.resolve(name)
        if pid:
            return pid, self.index.display(pid) or name, True
        # 3. Provisional person (unmatched); never reuse a register id.
        self.unmatched.setdefault(name, []).append(source_row)
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False
        last, first = _last_first(name)
        pid = self._unique_id(slugify(last, first))
        self.provisional[pid] = Person(person_id=pid, last_name=last, first_name=first, provisional=True)
        self._raw_to_pid[name] = pid
        return pid, name, False

    def resolve_sender(self, raw: str, source_row: int):
        """Senders are split like receivers (REQ-PERS-01). Primary = first part; multi flagged."""
        parts = split_receivers(raw)
        if not parts:
            return "", "", True, False
        pid, name, matched = self.resolve_one(parts[0], source_row)
        for extra in parts[1:]:
            self.resolve_one(extra, source_row)  # register the others as persons too
        return pid, name, matched, len(parts) > 1

    def resolve_receivers(self, raw: str, source_row: int):
        results = []
        for part in split_receivers(raw):
            pid, name, matched = self.resolve_one(part, source_row)
            # Two unmatched tokens with no known last name may be two first names: flag, do not split.
            if not matched and " " in part and find_known_last_name(part) is None and len(part.split()) == 2:
                self.ambiguous.append((raw, part, source_row))
            results.append((pid, name, matched))
        return results


def _last_first(name: str):
    """Best-effort split of a free name string into (last, first) for slug/provisional building."""
    name = name.strip()
    ln = find_known_last_name(name)
    if ln:
        first = name[: -len(ln)].strip()
        return ln, first
    tokens = name.split()
    if len(tokens) >= 2:
        return tokens[-1], " ".join(tokens[:-1])
    return name, ""
```
- [ ] **Step 3b: Implement `to_canonical`** — add to `documents.py`
```python
import dates as _dates


def to_canonical(raw, ctx, date_overrides: dict) -> CanonicalDocument:
    pd = _dates.parse_date(raw.date, date_overrides)
    flags = []
    sender_id, sender_name, sender_matched, sender_multi = ctx.resolve_sender(raw.sender, raw.source_row)
    if raw.sender.strip() and not sender_matched:
        flags.append("unmatched_sender")
    if sender_multi:
        flags.append("multi_sender")
    receivers = ctx.resolve_receivers(raw.receivers, raw.source_row)
    if any(not matched for _, _, matched in receivers):
        flags.append("unmatched_receiver")
    if raw.date.strip() and pd.precision == _dates.Precision.UNKNOWN:
        flags.append("unparsed_date")
    if index_file_mismatch(raw.index, raw.file):
        flags.append("index_file_mismatch")
    return CanonicalDocument(
        index=raw.index, box=raw.box, folder=raw.folder,
        sender_person_id=sender_id, sender_name=sender_name,
        receiver_person_ids=[r[0] for r in receivers],
        receiver_names=[r[1] for r in receivers],
        date_iso=pd.iso or "", date_raw=raw.date, date_precision=str(pd.precision),
        location=raw.location, tags=[raw.tags] if raw.tags else [], summary=raw.summary,
        source_row=raw.source_row, needs_review=flags,
    )
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/documents.py tools/import-normalizer/tests/test_documents.py
git commit -m "feat(normalizer): person resolution context + to_canonical"
```
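
The collision guard in `_unique_id` can be read in isolation: a provisional id is suffixed until it is free in the union of register ids and already-issued provisional ids. A minimal standalone sketch, where the free function `unique_id` and the sample ids are illustrative only:

```python
def unique_id(base: str, used: set[str]) -> str:
    """Return base, or base-2, base-3, ... whichever is first absent from used."""
    pid, n = base, 1
    while pid in used:
        n += 1
        pid = f"{base}-{n}"
    return pid


# Assumption: a register that already contains a de-duplicated pair.
register_ids = {"cram-clara", "cram-clara-2"}
provisional_ids: set[str] = set()

first = unique_id("cram-clara", register_ids | provisional_ids)
provisional_ids.add(first)
second = unique_id("cram-clara", register_ids | provisional_ids)
provisional_ids.add(second)
fresh = unique_id("wittkopf-hans", register_ids | provisional_ids)

print(first, second, fresh)  # cram-clara-3 cram-clara-4 wittkopf-hans
```

Because ids are issued in the order rows are processed, the suffixes stay stable across reruns only as long as the input row order is stable, which is the same assumption the determinism requirement (NFR-IDEM-01) already makes.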
---
### Task 15: Overrides loader + writers (`FR-OVR`, `FR-OUT`, NFR-OBSERV-01)
**Files:**
- Create: `tools/import-normalizer/overrides.py`
- Create: `tools/import-normalizer/writers.py`
- Create: `tools/import-normalizer/tests/test_writers.py`
- [ ] **Step 1: Write failing tests**
```python
import csv

import openpyxl

import overrides
import writers
import documents


def test_load_overrides_missing_files(tmp_path):
    d, n = overrides.load_overrides(tmp_path / "dates.csv", tmp_path / "names.csv")
    assert d == {} and n == {}


def test_load_overrides_parsed(tmp_path):
    dp = tmp_path / "dates.csv"
    dp.write_text("raw,iso,precision\n13.5.65,1965-05-13,DAY\n", encoding="utf-8")
    np = tmp_path / "names.csv"
    np.write_text("raw,person_id\nEugenie Müller,de-gruyter-eugenie\n", encoding="utf-8")
    d, n = overrides.load_overrides(dp, np)
    assert d["13.5.65"] == ("1965-05-13", "DAY")
    assert n["Eugenie Müller"] == "de-gruyter-eugenie"


def test_write_documents_xlsx_joins_lists(tmp_path):
    doc = documents.CanonicalDocument(
        index="W-0001", receiver_person_ids=["a", "b"], receiver_names=["A", "B"],
        tags=["Brautbriefe"], date_precision="DAY", needs_review=["unparsed_date"])
    out = tmp_path / "docs.xlsx"
    writers.write_documents_xlsx([doc], out)
    wb = openpyxl.load_workbook(out)
    ws = wb.active
    header = [c.value for c in ws[1]]
    assert "receiver_person_ids" in header and "needs_review" in header
    row = {h: c.value for h, c in zip(header, ws[2])}
    assert row["receiver_person_ids"] == "a|b"
    assert row["needs_review"] == "unparsed_date"


def test_write_review_csv(tmp_path):
    out = tmp_path / "r.csv"
    writers.write_review_csv(out, ["raw", "count"], [["?", 3], ["x", 1]])
    rows = list(csv.reader(out.open(encoding="utf-8")))
    assert rows[0] == ["raw", "count"]
    assert rows[1] == ["?", "3"]


def test_write_review_csv_defangs_formula_injection(tmp_path):
    out = tmp_path / "r.csv"
    writers.write_review_csv(out, ["raw", "count"], [["=cmd|'/C calc'!A0", 1], ["-2+3", 2]])
    rows = list(csv.reader(out.open(encoding="utf-8")))
    assert rows[1][0].startswith("'=")  # leading '=' neutralised
    assert rows[2][0].startswith("'-")


def test_write_summary_sections(tmp_path):
    out = tmp_path / "s.txt"
    writers.write_summary(out, {"# INPUTS": "", "rows": 10, "# DATES": "", "unknown_date_rate": "3.2%"})
    text = out.read_text(encoding="utf-8")
    assert "INPUTS:" in text and "DATES:" in text and " rows: 10" in text
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd -`
Expected: FAIL — modules not defined.
- [ ] **Step 3a: Implement `overrides.py`**
```python
"""Load human-supplied corrections. Missing files are not an error."""
import csv
from pathlib import Path


def load_overrides(dates_path: Path, names_path: Path):
    date_overrides: dict[str, tuple[str, str]] = {}
    name_overrides: dict[str, str] = {}
    if Path(dates_path).exists():
        with open(dates_path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                raw = (row.get("raw") or "").strip()
                if raw:
                    date_overrides[raw] = ((row.get("iso") or "").strip(),
                                           (row.get("precision") or "UNKNOWN").strip())
    if Path(names_path).exists():
        with open(names_path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                raw = (row.get("raw") or "").strip()
                person_id = (row.get("person_id") or "").strip()
                # Skip half-filled rows: an empty person_id would otherwise mark the
                # name as "matched" while resolving it to nobody.
                if raw and person_id:
                    name_overrides[raw] = person_id
    return date_overrides, name_overrides
```
- [ ] **Step 3b: Implement `writers.py`**
```python
"""Write canonical .xlsx outputs and review .csv files."""
import csv
import datetime
from pathlib import Path

import openpyxl

_PIPE = "|"
# Pinned workbook metadata so reruns are content-deterministic (NFR-IDEM-01); openpyxl
# otherwise stamps docProps with the current time on every save.
# Note: depending on the openpyxl version, save() may still reset `modified`, so
# determinism is asserted on cell content (see Task 16), not on file bytes.
_FIXED_TS = datetime.datetime(2020, 1, 1, 0, 0, 0)


def _join(value):
    if isinstance(value, list):
        return _PIPE.join(str(v) for v in value)
    return "" if value is None else str(value)


def _csv_safe(value):
    """Neutralise spreadsheet formula injection (CWE-1236) in human-opened review CSVs."""
    s = "" if value is None else str(value)
    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r") else s


DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
               "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
               "date_precision", "location", "tags", "summary", "source_row", "needs_review"]
PERSON_COLUMNS = ["person_id", "last_name", "first_name", "maiden_name", "title", "nickname",
                  "birth_date", "birth_date_raw", "birth_place", "death_date", "death_date_raw",
                  "death_place", "spouse", "generation", "notes", "aliases", "provisional"]


def _write_xlsx(records, columns, path: Path):
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(columns)
    for rec in records:
        # Every cell is written as text; list fields are pipe-joined.
        ws.append([_join(getattr(rec, col)) for col in columns])
    wb.properties.created = _FIXED_TS
    wb.properties.modified = _FIXED_TS
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    wb.save(path)


def write_documents_xlsx(docs, path: Path):
    _write_xlsx(docs, DOC_COLUMNS, path)


def write_persons_xlsx(people, path: Path):
    _write_xlsx(people, PERSON_COLUMNS, path)


def write_review_csv(path: Path, header: list[str], rows: list[list]):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        for row in rows:
            w.writerow([_csv_safe(c) for c in row])


def write_summary(path: Path, stats: dict):
    """Render a grouped, scannable summary. Keys beginning with '#' are section headers."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for k, v in stats.items():
        if k.startswith("#"):
            lines.append("")
            lines.append(k[1:].strip() + ":")
        else:
            lines.append(f" {k}: {v}")
    Path(path).write_text("\n".join(lines).strip() + "\n", encoding="utf-8")
```
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_writers.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/overrides.py tools/import-normalizer/writers.py tools/import-normalizer/tests/test_writers.py
git commit -m "feat(normalizer): overrides loader + xlsx/csv writers"
```
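
The formula-injection defang in `_csv_safe` can be checked end to end without touching the file system. The sketch below writes a few rows through `csv.writer` into an in-memory buffer and reads them back; the sample rows are invented for illustration, and `csv_safe` mirrors the helper above.

```python
import csv
import io

RISKY_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def csv_safe(value) -> str:
    """Prefix cells that a spreadsheet would interpret as a formula with an apostrophe."""
    s = "" if value is None else str(value)
    return "'" + s if s.startswith(RISKY_PREFIXES) else s


# Assumption: invented sample rows in the shape of a review CSV (raw value, count).
sample_rows = [["=SUM(A1:A9)", 3], ["-2+3", 1], ["Freitag 1919", 2], [None, 0]]

buf = io.StringIO(newline="")
writer = csv.writer(buf)
for row in sample_rows:
    writer.writerow([csv_safe(c) for c in row])

read_back = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))
for row in read_back:
    print(row)
```

One consequence worth knowing when reading the review files: any value that legitimately starts with `-` or `+` also gains the leading apostrophe. For human-reviewed CSVs that trade-off seems acceptable, but the apostrophe must be stripped again if a value is copied back into an overrides file.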
---
### Task 16: Orchestrator `normalize.py` + integration test (`FR-OUT`, `FR-TRIAGE`, REQ-TRIAGE-01/03, NFR-IDEM-01)
**Files:**
- Create: `tools/import-normalizer/normalize.py`
- Create: `tools/import-normalizer/tests/test_normalize.py`
- [ ] **Step 1: Write the failing integration test** (tiny in-memory fixtures, not the real 7,900-row file)
```python
import openpyxl

import normalize


def _doc_wb(tmp_path):
    wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Familienarchiv"
    ws.append(["Index", "Datei", "Box", "Mappe", "BriefeschreiberIn", "EmpfängerIn",
               "Datum des Briefes", "Ort", "Schlagwort", "Inhalt"])
    ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
               "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "Geschäftsreise"])
    ws.append(["W-0001x", r"..\__scan\W-0001x.pdf", "", "", "Walter de Gruyter", "Eugenie Müller", "", "", "", ""])
    ws.append(["", "", "", "", "Section banner row", "", "", "", "", ""])
    ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""])
    ws.append(["W-0001", r"..\__scan\W-0001.pdf", "V", "1", "Walter de Gruyter",
               "Eugenie Müller", "15.2.1888", "Rotterdam", "Brautbriefe", "dup"])
    p = tmp_path / "docs.xlsx"; wb.save(p); return p


def _person_wb(tmp_path):
    wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Tabelle1"
    ws.append(["Generation", "Familienname", "Vorname", "geb als", "Geburtsdatum",
               "Geburtsort", "Todesdatum", "Sterbeort", "verheiratet mit", "Bemerkung"])
    ws.append(["G 1", "de Gruyter", "Walter", "", "", "", "", "", "", ""])
    ws.append(["G 1", "de Gruyter", "Eugenie", "Müller", "", "", "", "", "", ""])
    p = tmp_path / "persons.xlsx"; wb.save(p); return p


def test_run_end_to_end(tmp_path):
    out_dir = tmp_path / "out"; review_dir = tmp_path / "review"
    stats = normalize.run(
        document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv",
        person_workbook=_person_wb(tmp_path), person_sheet="Tabelle1",
        out_dir=out_dir, review_dir=review_dir,
        date_overrides={}, name_overrides={})
    assert (out_dir / "canonical-documents.xlsx").exists()
    assert (out_dir / "canonical-persons.xlsx").exists()
    assert stats["documents_emitted"] == 3  # W-0001, C-0001, W-0001 (dup); x-suffix and blank-index rows excluded
    assert stats["skipped_x_suffix"] == 1
    assert stats["blank_index_rows"] == 1
    assert stats["duplicate_index_rows"] == 2
    assert (review_dir / "skipped-x-suffix.csv").exists()
    assert (review_dir / "unparsed-dates.csv").exists()
    # C-0001's "Freitag 1919" is unparseable -> must appear in the review file (NFR-DATA-01)
    assert "Freitag 1919" in (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8")

    # determinism (NFR-IDEM-01): a second run yields identical canonical content + review files
    def _matrix(p):
        wb = openpyxl.load_workbook(p)
        return [[c.value for c in row] for row in wb.active.iter_rows()]

    docs1 = _matrix(out_dir / "canonical-documents.xlsx")
    persons1 = _matrix(out_dir / "canonical-persons.xlsx")
    unparsed1 = (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8")
    normalize.run(document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv",
                  person_workbook=_person_wb(tmp_path), person_sheet="Tabelle1",
                  out_dir=out_dir, review_dir=review_dir, date_overrides={}, name_overrides={})
    assert _matrix(out_dir / "canonical-documents.xlsx") == docs1
    assert _matrix(out_dir / "canonical-persons.xlsx") == persons1
    assert (review_dir / "unparsed-dates.csv").read_text(encoding="utf-8") == unparsed1
    assert len(docs1) == 4  # header + 3 docs
```
- [ ] **Step 2: Run to verify it fails**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v && cd -`
Expected: FAIL. Collection errors with `ModuleNotFoundError: No module named 'normalize'` because the module does not exist yet.
- [ ] **Step 3: Implement `normalize.py`**
```python
"""Orchestrator: read raw workbooks -> canonical outputs + review reports."""
import argparse
from collections import Counter
from pathlib import Path

import config
import ingest
import persons
import documents
import overrides as overrides_mod
import writers


def run(*, document_workbook, document_sheet, person_workbook, person_sheet,
        out_dir, review_dir, date_overrides, name_overrides) -> dict:
    out_dir, review_dir = Path(out_dir), Path(review_dir)

    # --- persons ---
    person_rows = ingest.read_sheet(person_workbook, person_sheet)
    p_fields, _ = ingest.build_header_map(person_rows[0], config.PERSON_HEADER_MAP, config.PERSON_REQUIRED_FIELDS)
    person_dicts = [{f: (row[i] if i < len(row) else "") for f, i in p_fields.items()} for row in person_rows[1:]]
    register = persons.parse_register(person_dicts)
    alias_index = persons.AliasIndex(register)
    ctx = persons.ResolutionContext(alias_index, name_overrides)

    # --- documents ---
    doc_rows = ingest.read_sheet(document_workbook, document_sheet)
    d_fields, unknown_headers = ingest.build_header_map(doc_rows[0], config.DOCUMENT_HEADER_MAP, config.DOCUMENT_REQUIRED_FIELDS)
    index_col = d_fields["index"]
    canon_docs, blank_index, skipped_x, mismatches = [], [], [], []
    unparsed_by_raw: dict[str, list] = {}
    dates_by_override = 0
    empty_count = 0
    seen_index = Counter()
    for source_row, cells in enumerate(doc_rows[1:], start=2):
        t = documents.triage(cells, index_col)
        if t is documents.Triage.EMPTY:
            empty_count += 1
            continue
        if t is documents.Triage.BLANK_INDEX:
            blank_index.append([source_row, documents.classify_blank_index(cells, d_fields),
                                " | ".join(c for c in cells if c)])
            continue
        if t is documents.Triage.X_SUFFIX:
            idx = (cells[index_col] or "").strip()
            skipped_x.append([source_row, idx, idx[:-1]])
            continue
        raw = documents.extract_row(cells, d_fields, source_row)
        seen_index[raw.index] += 1
        if raw.date.strip() and raw.date.strip() in date_overrides:
            dates_by_override += 1
        doc = documents.to_canonical(raw, ctx, date_overrides)
        if "unparsed_date" in doc.needs_review:
            unparsed_by_raw.setdefault(raw.date, []).append(source_row)
        if "index_file_mismatch" in doc.needs_review:
            mismatches.append([source_row, raw.index, raw.file])
        canon_docs.append(doc)

    # REQ-TRIAGE-01: flag EVERY occurrence of a duplicated index and report all of them.
    dup_indexes = {idx for idx, n in seen_index.items() if n > 1}
    duplicates = []
    for doc in canon_docs:
        if doc.index in dup_indexes:
            if "duplicate_index" not in doc.needs_review:
                doc.needs_review.append("duplicate_index")
            duplicates.append([doc.source_row, doc.index])

    all_people = register + list(ctx.provisional.values())

    # --- write canonical outputs ---
    writers.write_documents_xlsx(canon_docs, out_dir / "canonical-documents.xlsx")
    writers.write_persons_xlsx(all_people, out_dir / "canonical-persons.xlsx")

    # --- review files ---
    # unparsed dates: most-frequent first, with example source rows + blank override cells so a
    # corrected row can be pasted straight into overrides/dates.csv (same raw,iso,precision shape).
    unparsed_rows = sorted(
        ([raw, len(rows), " ".join(map(str, rows[:5])), "", ""] for raw, rows in unparsed_by_raw.items()),
        key=lambda r: (-r[1], r[0]))
    writers.write_review_csv(review_dir / "unparsed-dates.csv",
                             ["raw", "count", "example_rows", "suggested_iso", "suggested_precision"], unparsed_rows)
    unmatched_rows = []
    for name, rows in sorted(ctx.unmatched.items()):
        sid, score = alias_index.suggest(name)
        unmatched_rows.append([name, len(rows), " ".join(map(str, rows[:5])),
                               sid or "", f"{score:.2f}" if sid else ""])
    writers.write_review_csv(review_dir / "unmatched-names.csv",
                             ["raw", "count", "example_rows", "suggested_id", "suggested_score"], unmatched_rows)
    writers.write_review_csv(review_dir / "duplicate-index.csv", ["source_row", "index"], duplicates)
    writers.write_review_csv(review_dir / "blank-index-rows.csv", ["source_row", "kind", "content"], blank_index)
    writers.write_review_csv(review_dir / "skipped-x-suffix.csv", ["source_row", "index", "base_index"], skipped_x)
    writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous)
    writers.write_review_csv(review_dir / "index-file-mismatch.csv", ["source_row", "index", "file"], mismatches)

    dated = sum(1 for d in canon_docs if d.date_raw.strip())
    unknown = sum(1 for d in canon_docs if d.date_raw.strip() and d.date_precision == "UNKNOWN")
    unknown_rate = f"{(100 * unknown / dated):.1f}%" if dated else "0.0%"
    stats = {
        "# INPUTS": "",
        "document_rows_read": len(doc_rows) - 1,
        "register_persons": len(register),
        "unknown_headers": ", ".join(unknown_headers) or "(none)",
        "# OUTPUTS": "",
        "documents_emitted": len(canon_docs),
        "provisional_persons": len(ctx.provisional),
        "# DATES": "",
        "dated_rows": dated,
        "unparsed_dates": unknown,
        "unknown_date_rate": f"{unknown_rate} (target <=5%)",
        "distinct_unparsed_formats": len(unparsed_by_raw),
        "# NAMES": "",
        "unmatched_name_strings": len(ctx.unmatched),
        "ambiguous_receivers": len(ctx.ambiguous),
        "# ANOMALIES": "",
        "empty_rows": empty_count,
        "blank_index_rows": len(blank_index),
        "skipped_x_suffix": len(skipped_x),
        "duplicate_index_rows": len(duplicates),
        "index_file_mismatches": len(mismatches),
        "# OVERRIDES": "",
        "date_overrides_loaded": len(date_overrides),
        "name_overrides_loaded": len(name_overrides),
        "dates_resolved_by_override": dates_by_override,
        "names_resolved_by_override": ctx.override_hits,
    }
    writers.write_summary(review_dir / "summary.txt", stats)
    return stats


def main():
    parser = argparse.ArgumentParser(description="Normalize the family archive spreadsheets.")
    parser.parse_args()
    date_overrides, name_overrides = overrides_mod.load_overrides(
        config.OVERRIDES_DIR / "dates.csv", config.OVERRIDES_DIR / "names.csv")
    stats = run(
        document_workbook=config.DOCUMENT_WORKBOOK, document_sheet=config.DOCUMENT_SHEET,
        person_workbook=config.PERSON_WORKBOOK, person_sheet=config.PERSON_SHEET,
        out_dir=config.OUT_DIR, review_dir=config.REVIEW_DIR,
        date_overrides=date_overrides, name_overrides=name_overrides)
    print("Normalization complete:")
    for k, v in stats.items():
        print(f" {k}: {v}")


if __name__ == "__main__":
    main()
```
> **Note for the implementer:** duplicate-index handling is a single second pass over `canon_docs` (`for doc in canon_docs: if doc.index in dup_indexes`) — this flags AND reports *every* colliding occurrence including the first (REQ-TRIAGE-01), not just repeats. Do not reintroduce a per-row append in the main loop.
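The difference the note describes is easiest to see in isolation. The following is a minimal sketch of the count-then-flag pattern; `Doc` is a hypothetical stand-in for `CanonicalDocument` with only the three fields the pass touches, and `flag_duplicates` is an illustrative name rather than a function in the plan.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:  # hypothetical stand-in for CanonicalDocument
    index: str
    source_row: int
    needs_review: list = field(default_factory=list)


def flag_duplicates(docs):
    """Count every index first, then flag and report all colliding rows."""
    counts = Counter(d.index for d in docs)
    dup_indexes = {idx for idx, n in counts.items() if n > 1}
    report = []
    for d in docs:
        if d.index in dup_indexes:
            if "duplicate_index" not in d.needs_review:
                d.needs_review.append("duplicate_index")
            report.append([d.source_row, d.index])  # first occurrence included
    return report


docs = [Doc("W-0001", 2), Doc("C-0001", 5), Doc("W-0001", 6)]
report = flag_duplicates(docs)
print(report)
```

A per-row check inside the main loop could only see the repeat on row 6, because row 2 has already been emitted unflagged by the time the collision is known. Counting first is what lets the first occurrence be reported too.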
- [ ] **Step 4: Run to verify it passes**
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_normalize.py -v && cd -`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_normalize.py
git commit -m "feat(normalizer): orchestrator + end-to-end integration test"
```
---
### Task 17: README, seed overrides, and a real dry-run
**Files:**
- Create: `tools/import-normalizer/README.md`
- Create: `tools/import-normalizer/overrides/dates.csv`
- Create: `tools/import-normalizer/overrides/names.csv`
- [ ] **Step 1: Seed the overrides files** (header-only)
`overrides/dates.csv`:
```
raw,iso,precision
```
`overrides/names.csv`:
```
raw,person_id
```
- [ ] **Step 2: Write `README.md`**
````markdown
# Import Normalizer
Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical
dataset (`out/`) plus review reports (`review/`). See the spec:
`../../docs/import-migration/02-normalization-spec.md`.
## Setup
Requires **Python 3.12** (the code uses `StrEnum`, which needs at least 3.11; 3.12 is the version this tool targets).
```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```
## Run
```bash
.venv/bin/python normalize.py
```
Outputs:
- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats incl. unknown-date rate)
## Iteration loop
1. **Run.** Read `review/summary.txt` for the health snapshot.
2. **Fix the residue** by editing the version-controlled overrides files, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. |
| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |

**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.
## Tests
```bash
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
```
````
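The README's copy-and-paste step for dates could also be scripted. The sketch below is one possible helper, not part of the plan: `promote_date_suggestions` is a hypothetical name, and it assumes only the column layouts stated above (`raw,count,example_rows,suggested_iso,suggested_precision` in the review file, `raw,iso,precision` in the overrides file).

```python
import csv
from pathlib import Path


def promote_date_suggestions(review_csv: Path, overrides_csv: Path) -> int:
    """Copy review rows whose suggestion cells were filled in into the overrides file.

    Hypothetical helper: returns the number of overrides added.
    """
    with open(overrides_csv, newline="", encoding="utf-8") as f:
        rows = [r for r in csv.reader(f) if r]  # header + existing overrides
    known = {r[0] for r in rows[1:]}

    added = 0
    with open(review_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            iso = row["suggested_iso"].strip()
            precision = row["suggested_precision"].strip()
            # Skip rows the reviewer left blank and raws that already have an override.
            if iso and precision and row["raw"] not in known:
                rows.append([row["raw"], iso, precision])
                known.add(row["raw"])
                added += 1

    with open(overrides_csv, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return added
```

Because it skips any `raw` that already has an override, running it twice on the same review file adds nothing the second time. If `write_review_csv` altered a leading character of `raw` when defanging the cell, that value would have to be restored by hand before it can match the source string.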
- [ ] **Step 3: Run the whole test suite file-by-file to confirm green**
Run each individually (per the "no full-suite" rule):
```bash
cd tools/import-normalizer
for t in config dates persons ingest documents writers normalize; do .venv/bin/python -m pytest tests/test_$t.py -q || break; done
cd -
```
Expected: every file reports all passed.
- [ ] **Step 4: Real dry-run against the actual import data (manual verification, not a test)**
Run: `cd tools/import-normalizer && .venv/bin/python normalize.py && cd -`
Expected: prints stats. Then inspect:
- `review/summary.txt` — sanity-check counts (≈7,600 documents emitted, register_persons ≈163).
- `review/unparsed-dates.csv`: confirm the `UNKNOWN` rate is in the low single-digit % of dated rows (NFR-ACCUR-01 target ≤5% before overrides). If higher, note the dominant unhandled formats for a follow-up parser tweak.
- Spot-check `out/canonical-documents.xlsx`: open the first ~20 rows; verify `date_iso`/`date_precision`, `sender_person_id`, and `receiver_person_ids` look right (e.g. `Eugenie Müller` → `de-gruyter-eugenie`).
Record the run's `summary.txt` figures in `../../docs/import-migration/WORKLOG.md`.
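The ≤5% check can also be read out of `summary.txt` mechanically. This is a small sketch under the assumption that the line keeps the shape `unknown_date_rate: <number>% (target <=5%)` produced by `run`; `unknown_date_rate` as a function name is hypothetical, and the sample text uses invented figures.

```python
import re


def unknown_date_rate(summary_text: str) -> float:
    """Extract the unknown-date percentage from the text of summary.txt."""
    m = re.search(r"unknown_date_rate:\s*([0-9.]+)%", summary_text)
    if m is None:
        raise ValueError("unknown_date_rate not found in summary")
    return float(m.group(1))


# Invented sample, shaped like the DATES section of summary.txt.
sample = (
    "DATES:\n"
    " dated_rows: 200\n"
    " unparsed_dates: 7\n"
    " unknown_date_rate: 3.5% (target <=5%)\n"
)
rate = unknown_date_rate(sample)
print(f"unknown-date rate {rate}% is within target: {rate <= 5.0}")
```

For the real run, the text would come from `Path("review/summary.txt").read_text(encoding="utf-8")`.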
- [ ] **Step 5: Commit** (commit only source + seeds; `out/` and `review/` are gitignored)
```bash
git add tools/import-normalizer/README.md tools/import-normalizer/overrides/dates.csv tools/import-normalizer/overrides/names.csv
git commit -m "docs(normalizer): README + seed overrides"
```
---
## Self-Review
**Spec coverage check:**
- `FR-INGEST`/`FR-MAP` → Task 12 (header-name mapping, missing-required raises, unknown headers reported). ✓
- `FR-TRIAGE` (REQ-TRIAGE-01/02/03) → Task 13 (triage by index-col, `classify_blank_index` banner detection) + Task 16 (single-pass duplicate flagging of *all* occurrences, blank-index report with `kind`, x-suffix skip+log). ✓
- `FR-DATE` (REQ-DATE-01..06) → Tasks 2–8 (computus, feast/season, century rule, all matchers, overrides). ✓
- `FR-PERS`/US-PERS-01 → Task 9; `REQ-PERS-01`/receiver split/AC4 ambiguous → Tasks 10, 14. ✓
- `FR-DEDUP` (REQ-DEDUP-01/02) → Task 11 (maiden/married/nickname aliases, conservative; fuzzy = suggestion only). ✓
- `FR-OVR` (REQ-OVR-01/02/03) → Task 15 (loader, missing-file tolerant) + Task 16 (applied + counted: `dates_resolved_by_override` / `names_resolved_by_override`) + Task 16 content-determinism assertion (two-run cell-matrix + review-file equality). ✓
- `FR-OUT`/`FR-PROV` (REQ-OUT-01/02, REQ-PROV-01/02) → Tasks 13 (source_row, needs_review), 15 (writers), 16 (mismatch report). ✓
- NFRs: DATA-01 (every row → output or review) covered by triage routing; OBSERV-01 → summary.txt; I18N-01 → utf-8 everywhere + diacritic map; TEST-01 → per-module tests; MAINT-01 → config tables. ✓
- Data dictionary §6 → `DOC_COLUMNS`/`PERSON_COLUMNS` in Task 15 match the spec field list. ✓
**Placeholder scan:** No TBD/TODO; every code step shows complete code. The dead `pass` line formerly in Task 16 has been removed; the implementer note there only warns against reintroducing a per-row append. ✓
**Type consistency:** `ParsedDate(iso, precision, raw)`, `Precision` (StrEnum → `str()` yields the value), `Person`, `RawRow`, `CanonicalDocument`, `AliasIndex.resolve/display/suggest`, `ResolutionContext.resolve_one/resolve_receivers`, `to_canonical(raw, ctx, date_overrides)`, `run(**kwargs)` — names line up across tasks. ✓
**Known follow-ups (out of scope for this plan):** Phase-2 importer wiring (`B11`); comma-splitting `Inhalt` into extra tags (`B10`, Could). These are intentionally deferred.

---
## Review feedback incorporated (2026-05-25)
Six personas reviewed this plan inline; the following changes were applied (see the session summary for detail):
- **Idempotency redefined (architect/tester/req-eng):** spec G4/NFR-IDEM-01 changed from "byte-identical" to **content-deterministic**; Task 15 pins workbook `created`/`modified`; Task 11 builds aliases via ordered lists (no set-iteration leakage); Task 16 test now compares two runs' cell matrices + review files.
- **Duplicate-index bug fixed (developer/architect):** Task 16 now flags and reports *every* occurrence of a duplicated index in one pass; the dead `pass` line was removed; the test stat (`==2`) is correct.
- **Provisional id collision guarded (architect):** Task 14 `ResolutionContext._unique_id` suffixes provisional ids so they never overwrite a register `person_id`.
- **Date gaps closed (tester):** added invalid-calendar-date → UNKNOWN test, intra-month day-range matcher (`7./8. Sept.1923` → RANGE) + test, and a trailing-note-preservation test.
- **Multi-person sender (tester/req-eng, REQ-PERS-01):** Task 14 `resolve_sender` splits the sender, keeps the primary, flags `multi_sender`.
- **CSV injection defanged (security):** Task 15 `write_review_csv` neutralises leading `= + - @` etc. in human-opened CSVs (+ test).
- **REQ-TRIAGE-02 / REQ-OVR-03 realized (req-eng):** banner-vs-data classification in `blank-index-rows.csv`; override-application counts + an `unknown_date_rate` headline in `summary.txt`.
- **Ergonomics (UX):** `unparsed-dates.csv` now carries `example_rows` + blank `suggested_iso/precision` (paste-ready); `unmatched-names.csv` suggestion blanks-out on no-match and rounds the score; grouped `summary.txt`; README documents every review file + where to source `person_id`.
- **Repo hygiene (devops):** pinned `openpyxl==3.1.5` / `pytest==8.3.4`; hardened the **root** `.gitignore` against the committed-`.venv` class of mistake; documented the Python 3.12 requirement.
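
For readers unfamiliar with the CSV-injection item above, the following sketch shows the general technique of prefixing a risky cell with a single quote so that a spreadsheet application treats it as text. It is an independent illustration: `defang` and `_RISKY_PREFIXES` are hypothetical names, and the exact character set and behaviour of the plan's `_csv_safe` in Task 15 may differ.

```python
import csv
import io

# Characters that commonly make spreadsheet apps interpret a cell as a formula.
# Assumed set for this sketch; the plan only names "= + - @ etc.".
_RISKY_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def defang(cell) -> str:
    """Return the cell as text, prefixed with ' if it starts with a risky character."""
    text = "" if cell is None else str(cell)
    return "'" + text if text.startswith(_RISKY_PREFIXES) else text


buf = io.StringIO()
csv.writer(buf).writerow([defang(c) for c in ["=SUM(A1:A2)", "15.2.1888", "@cmd"]])
line = buf.getvalue()
print(line, end="")
```

Ordinary values such as `15.2.1888` pass through unchanged, so the review files stay readable while formula-like cells lose their effect.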