2026-05-27 08:58:46 +02:00
15 changed files with 680 additions and 185 deletions
--- a/docs/import-migration/02-normalization-spec.md
+++ b/docs/import-migration/02-normalization-spec.md
@@ -176,6 +176,14 @@ letter actually said.*
  Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul,
  Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py`
  (NFR-MAINT-01).
 - **REQ-DATE-07** — **Intra-month day ranges carry an end day; half-resolved ranges are
  flagged.** For a day range like `7./8. Sept.1923`, `date_iso` holds the start day, the end
  day is resolved against the shared month/year into `date_end`, and `date_precision` =
  `RANGE`. If the **start** parses but the **end day is impossible** (e.g. `10./40.1.1917`),
  the row keeps the start and `RANGE` precision, leaves `date_end` **empty**, and is flagged
  `needs_review = range_end_unparsed` — the unparseable end is dropped honestly (surfaced for
  review), never silently invented or clamped. A `RANGE` row **may** therefore legitimately
  have an empty `date_end`; the importer must treat `date_end` as optional even on a `RANGE`.
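The half-resolved rule can be illustrated with a small standalone sketch. This is not the code in `dates.py`: `resolve_day_range`, `RangeResult`, and the regex are hypothetical names chosen for illustration, and the sketch assumes a numeric month only (`D./D.M.YYYY`), whereas the real parser also handles month names and Roman numerals.

```python
import datetime
import re
from dataclasses import dataclass

# Hypothetical pattern for illustration: "D./D.M.YYYY" with a numeric month only.
_DAY_RANGE = re.compile(r"(\d{1,2})\./(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})")


@dataclass(frozen=True)
class RangeResult:
    date_iso: str      # start day, always present
    date_end: str      # empty string when the end day did not parse
    needs_review: str  # "" or "range_end_unparsed"


def _iso_or_none(year, month, day):
    """Return the ISO date, or None if the calendar date is impossible."""
    try:
        return datetime.date(year, month, day).isoformat()
    except ValueError:
        return None


def resolve_day_range(raw):
    """Resolve start and end day against the shared month/year."""
    m = _DAY_RANGE.fullmatch(raw.strip())
    if not m:
        return None
    day_start, day_end, month, year = (int(g) for g in m.groups())
    start = _iso_or_none(year, month, day_start)
    if start is None:
        return None  # no usable start: not a (half-)resolved range
    end = _iso_or_none(year, month, day_end)
    # Half-resolved: keep the start, leave the end empty, flag for review.
    return RangeResult(start, end or "", "" if end else "range_end_unparsed")
```

Under these assumptions, `resolve_day_range("10./40.1.1917")` keeps `1917-01-10` as the start, leaves `date_end` empty, and sets the `range_end_unparsed` flag rather than clamping the end to the 31st.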
 ### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11
@@ -262,6 +270,7 @@ DB schema.
 | Field | Required | Format / values | Notes |
 | --- | --- | --- | --- |
 | `index` | yes | string | Stable key; basis for PDF matching. |
 | `file` | no | string | verbatim `Datei` value (e.g. `H-0730.pdf`); carried through for the importer to link the scanned PDF. |
 | `box` | no | string | from `Box`. |
 | `folder` | no | string | from `Mappe`. |
 | `sender_person_id` | no | person_id | resolved; empty if no sender. |
@@ -271,11 +280,12 @@ DB schema.
 | `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. |
 | `date_raw` | no | string | verbatim source date. |
 | `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. |
 | `date_end` | no | `YYYY-MM-DD` or empty | RANGE end day, set only for intra-month day ranges (e.g. `7./8. Sept.1923` → `date_iso` = start, `date_end` = end). Empty for every non-RANGE precision, for other RANGE forms (year spans, hyphenated ranges), **and** for a half-resolved RANGE whose end did not parse (see REQ-DATE-07). |
 | `location` | no | string | from `Ort`. |
 | `tags` | no | `tag\|tag` | from `Schlagwort`. |
 | `summary` | no | string | from `Inhalt`. |
 | `source_row` | yes | int | provenance (NFR-DATA-01). |
-| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). |
+| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). Flags include `unparsed_date`, `range_end_unparsed` (half-resolved RANGE, REQ-DATE-07), `unmatched_sender`, `unmatched_receiver`, `multi_sender`, `index_file_mismatch`, `duplicate_index`. |
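On the importer side, the `date_end` and `needs_review` columns could be sanity-checked per row along the following lines. This is a sketch under assumptions: `check_date_end` is a hypothetical helper, the row is assumed to be a plain dict keyed by column name, and empty cells are assumed to arrive as `None` or `""` (openpyxl yields `None` for empty cells).

```python
import datetime


def check_date_end(row):
    """Return a list of problems found in one canonical-documents row."""
    problems = []
    precision = row.get("date_precision") or ""
    date_end = row.get("date_end") or ""
    # needs_review is a pipe-joined flag list, possibly empty.
    flags = [f for f in (row.get("needs_review") or "").split("|") if f]

    if date_end:
        if precision != "RANGE":
            problems.append("date_end set on non-RANGE row")
        if "range_end_unparsed" in flags:
            problems.append("date_end set but flagged range_end_unparsed")
        try:
            datetime.date.fromisoformat(date_end)
        except ValueError:
            problems.append("date_end is not a valid ISO date")
    # An empty date_end is never a problem by itself: it is optional even on RANGE.
    return problems
```

The check deliberately accepts a `RANGE` row with an empty `date_end`, with or without the `range_end_unparsed` flag, since the field is optional by design.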
 ### 6.2 `canonical-persons.xlsx`
@@ -295,6 +305,27 @@ DB schema.
 | `aliases` | no | `a\|b\|c` | every surface form that maps here. |
 | `provisional` | yes | bool | true if created from a document string, not the register. |
 ### 6.3 `canonical-persons-tree.json`
 The de-duplicated genealogical tree (family members + their relationships) the importer
 uses to seed the family graph. Each `persons[]` entry carries a `personId` that **joins
 1:1 onto** `person_id` in `canonical-persons.xlsx`.
 | Field | Required | Format | Notes |
 | --- | --- | --- | --- |
 | `personId` | yes | slug | The register's **verbatim** `person_id` (e.g. `cram-hans-1`), propagated — never re-slugified — so collision suffixes match `canonical-persons.xlsx` exactly. Every tree `personId` exists in the register; the register is the sole slug authority. |
 | `firstName` / `lastName` / `maidenName` | first/last yes | string | name parts. |
 | `birthYear` / `deathYear` | no | int or null | year only (tree granularity). |
 | `birthPlace` / `deathPlace` | no | string or null | from the register. |
 | `generation` | no | int or null | parsed from `G n`. |
 | `notes` | no | string or null | leftover Bemerkung text after relationship extraction. |
 | `familyMember` | yes | bool | always true for tree persons. |
 A top-level `generated_at` is pinned to a fixed timestamp (`2020-01-01T00:00:00`) for
 reproducibility (NFR-IDEM-01), not a wall-clock value. `relationships[]` carry `SPOUSE_OF`
 and `PARENT_OF` edges keyed by `rowId`; `unresolved[]` lists relationship strings that did
 not match a tree person.
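A consumer of the tree could verify the 1:1 join before seeding the family graph. The sketch below is illustrative only: `check_tree_join` is a hypothetical helper, and it assumes the tree has already been loaded as a dict and the register's `person_id` column as a list of strings.

```python
from collections import Counter


def check_tree_join(tree, register_ids):
    """Return (orphans, duplicates) for the tree's personId join key.

    orphans:    tree personIds that do not exist in the register
    duplicates: personIds that occur more than once in the tree
    """
    tree_ids = [person["personId"] for person in tree["persons"]]
    register = set(register_ids)
    orphans = sorted({pid for pid in tree_ids if pid not in register})
    duplicates = sorted(pid for pid, n in Counter(tree_ids).items() if n > 1)
    return orphans, duplicates
```

Both lists being empty means every tree person resolves to a register row and no join key is repeated. For example, a tree containing `cram-hans-1` and `cram-hans-2` checked against a register with the same two ids yields `([], [])`.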
 ---
 ## 7. Prioritized Backlog (MoSCoW)
@@ -339,7 +370,7 @@ DB schema.
 | ID | Question | Why it matters | Ref | Resolution |
 | --- | --- | --- | --- | --- |
 | OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) **computed per year from Easter — never a fixed month**; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
-| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | **Confirmed:** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`. |
+| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02, REQ-DATE-07 | **Confirmed (updated #670):** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`, **and the resolved end day in `date_end`** for intra-month day ranges. A half-resolved range (start parsed, end impossible) keeps `date_end` empty and is flagged `range_end_unparsed`. |
 | OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. |
 | OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. |
 | OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). |
--- a/tools/import-normalizer/.gitignore
+++ b/tools/import-normalizer/.gitignore
@@ -1,6 +1,7 @@
 .venv/
-out/
+out/*
 !out/canonical-persons-tree.json
 !out/*.xlsx
 review/
 __pycache__/
 *.pyc
--- a/tools/import-normalizer/dates.py
+++ b/tools/import-normalizer/dates.py
@@ -66,6 +66,24 @@ class ParsedDate:
    iso: str | None
    precision: Precision
    raw: str
    end: str | None = None   # RANGE end day; None for non-RANGE precisions and for a RANGE with no resolved end
    # True only for a half-resolved RANGE: the start parsed but the end did not, so
    # the end was dropped and the row should surface in review (#670, Gap 2).
    needs_review: bool = False
@dataclass(frozen=True)
 class MatchResult:
    """Uniform return shape for every _match_* matcher.
    A matcher returns None when it does not match, or a MatchResult when it does.
    `end` is the RANGE end day (None for non-RANGE precisions and for a RANGE with no
    resolved end); `needs_review` is True only for a half-resolved RANGE whose start
    parsed but end did not.
    """
    iso: str
    precision: Precision
    end: str | None = None
    needs_review: bool = False
 _LEADING_MARKERS = re.compile(
@@ -97,7 +115,7 @@ def _match_iso(s):
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        try:
            datetime.date.fromisoformat(s)
-            return s, Precision.DAY
+            return MatchResult(s, Precision.DAY)
        except ValueError:
            return None
    return None
@@ -112,7 +130,7 @@ def _match_numeric(s):
    if year is None or not (1 <= month <= 12):
        return None
    try:
-        return datetime.date(year, month, day).isoformat(), Precision.DAY
+        return MatchResult(datetime.date(year, month, day).isoformat(), Precision.DAY)
    except ValueError:
        return None
@@ -130,7 +148,7 @@ def _match_roman(s):
    if not month or year is None:
        return None
    try:
-        return datetime.date(year, month, day).isoformat(), Precision.DAY
+        return MatchResult(datetime.date(year, month, day).isoformat(), Precision.DAY)
    except ValueError:
        return None
@@ -146,7 +164,7 @@ def _build_day_month_year(day, month, year):
    if not month or year is None or not (1 <= month <= 12):
        return None
    try:
-        return datetime.date(year, month, day).isoformat(), Precision.DAY
+        return MatchResult(datetime.date(year, month, day).isoformat(), Precision.DAY)
    except ValueError:
        return None
@@ -188,7 +206,7 @@ def _match_month_year(s):
    year = expand_year(m.group(2))
    if not month or year is None:
        return None
-    return datetime.date(year, month, 1).isoformat(), Precision.MONTH
+    return MatchResult(datetime.date(year, month, 1).isoformat(), Precision.MONTH)
 def _match_feast_season(s):
@@ -198,33 +216,44 @@ def _match_feast_season(s):
    year = expand_year(m.group(2))
    if year is None:
        return None
-    return resolve_feast_or_season(m.group(1), year)
+    resolved = resolve_feast_or_season(m.group(1), year)
    if resolved is None:
        return None
    iso, precision = resolved
    return MatchResult(iso, precision)
 def _match_year_only(s):
    if _YEAR_ONLY_RE.fullmatch(s):
-        return datetime.date(int(s), 1, 1).isoformat(), Precision.YEAR
+        return MatchResult(datetime.date(int(s), 1, 1).isoformat(), Precision.YEAR)
    return None
 def _match_range(s):
    m = _RANGE_YY_RE.fullmatch(s)
    if m:
-        return datetime.date(int(m.group(1)), 1, 1).isoformat(), Precision.RANGE
+        return MatchResult(datetime.date(int(m.group(1)), 1, 1).isoformat(), Precision.RANGE)
    m = _RANGE_DAY_RE.fullmatch(s)
    if m:
-        first = f"{m.group(1)}.{m.group(3)}"  # "7." + "Sept.1923" -> "7.Sept.1923"
-        for matcher in (_match_numeric, _match_monthname_a):
-            r = matcher(first)
-            if r:
-                return r[0], Precision.RANGE
+        day_start, day_end, rest = m.group(1), m.group(2), m.group(3)
+        # "10." + "1.1917" -> "10.1.1917"; resolve start and end day against the shared month/year
+        for matcher in (_match_numeric, _match_roman, _match_monthname_a):
+            start = matcher(f"{day_start}.{rest}")
+            if start:
                end = matcher(f"{day_end}.{rest}")
                # Half-resolved range (start parsed, end did not — e.g. the impossible
                # end day in "10./40.1.1917"): keep the start and RANGE precision, drop
                # the end, and flag needs_review so the dropped end surfaces (#670, Gap 2).
                return MatchResult(start.iso, Precision.RANGE,
                                   end.iso if end else None,
                                   needs_review=end is None)
    m = _RANGE_HYPHEN_RE.fullmatch(s)
    if m:
        start = m.group(1).strip()
        for matcher in (_match_numeric, _match_roman, _match_monthname_a, _match_year_only):
            r = matcher(start)
            if r:
-                return r[0], Precision.RANGE
+                return MatchResult(r.iso, Precision.RANGE)
    return None
@@ -253,10 +282,8 @@ def parse_date(raw: str, date_overrides: dict | None = None) -> ParsedDate:
    for matcher in _MATCHERS:
        result = matcher(cleaned)
        if result:
-            iso, precision = result
-            if approx:
-                precision = Precision.APPROX
-            return ParsedDate(iso, precision, raw)
+            precision = Precision.APPROX if approx else result.precision
+            return ParsedDate(result.iso, precision, raw, result.end, result.needs_review)
    return ParsedDate(None, Precision.UNKNOWN, raw)
--- a/tools/import-normalizer/documents.py
+++ b/tools/import-normalizer/documents.py
@@ -31,6 +31,7 @@ class RawRow:
@dataclass
 class CanonicalDocument:
    index: str
    file: str = ""
    box: str = ""
    folder: str = ""
    sender_person_id: str = ""
@@ -40,6 +41,7 @@ class CanonicalDocument:
    date_iso: str = ""
    date_raw: str = ""
    date_precision: str = ""
    date_end: str = ""
    location: str = ""
    tags: list = field(default_factory=list)
    summary: str = ""
@@ -105,15 +107,18 @@ def to_canonical(raw, ctx, date_overrides: dict, approved_themes: frozenset = fr
    if raw.date.strip() and pd.precision == _dates.Precision.UNKNOWN:
        flags.append("unparsed_date")
    if pd.needs_review:
        flags.append("range_end_unparsed")
    if index_file_mismatch(raw.index, raw.file):
        flags.append("index_file_mismatch")
    return CanonicalDocument(
-        index=raw.index, box=raw.box, folder=raw.folder,
+        index=raw.index, file=raw.file, box=raw.box, folder=raw.folder,
        sender_person_id=sender_id, sender_name=sender_name,
        receiver_person_ids=[r[0] for r in receivers],
        receiver_names=[r[1] for r in receivers],
        date_iso=pd.iso or "", date_raw=raw.date, date_precision=str(pd.precision),
        date_end=pd.end or "",
        location=raw.location, tags=_tags.generate_tags(raw.tags, raw.summary, approved_themes), summary=raw.summary,
        source_row=raw.source_row, needs_review=flags,
    )
--- a/tools/import-normalizer/out/canonical-documents.xlsx
+++ b/tools/import-normalizer/out/canonical-documents.xlsx
--- a/tools/import-normalizer/out/canonical-persons-tree.json
+++ b/tools/import-normalizer/out/canonical-persons-tree.json
--- a/tools/import-normalizer/out/canonical-persons.xlsx
+++ b/tools/import-normalizer/out/canonical-persons.xlsx
--- a/tools/import-normalizer/out/canonical-tag-tree.xlsx
+++ b/tools/import-normalizer/out/canonical-tag-tree.xlsx
--- a/tools/import-normalizer/persons_tree.py
+++ b/tools/import-normalizer/persons_tree.py
@@ -8,9 +8,14 @@ from pathlib import Path
 import config
 import dates
 import persons as _persons
 from persons import _strip_accents
 # Pinned so the committed tree JSON is reproducible and does not churn on every run
 # (NFR-IDEM-01) — mirrors writers._FIXED_TS for the xlsx exports.
 _GENERATED_AT = "2020-01-01T00:00:00"
 _MIN_YEAR = 1700
 _MAX_YEAR = 2100
 # Threshold: if parse_date parses a pure-digit string as a year outside [_MIN_YEAR, _MAX_YEAR],
@@ -175,6 +180,29 @@ def _parse_row(row_num: int, fields: dict) -> dict:
    }
 def _attach_person_ids(tree_persons: list[dict], raw_dicts: list[dict]) -> None:
    """Attach the register's verbatim person_id to each tree person, in place.
    The register (persons.parse_register) is the sole authority for person_id; it
    slugifies and suffixes colliding ids exactly once. We propagate that id rather
    than re-slugify in the tree, because re-slugifying would not reproduce the
    register's collision suffixes and so would not reconcile 1:1 with the register
    (#670, Gap 3).
    tree_persons and raw_dicts must be the same length and in the same row order —
    parse_register and _parse_row both keep exactly the rows that have a last name.
    """
    register = _persons.parse_register(raw_dicts)
    if len(tree_persons) != len(register):
        raise ValueError(
            "person_id propagation requires equal length: "
            f"{len(tree_persons)} tree persons vs {len(register)} register persons "
            "(the positional zip would otherwise silently truncate and mis-join ids)"
        )
    for tree_person, register_person in zip(tree_persons, register):
        tree_person["personId"] = register_person.person_id
 def _deduplicate(persons: list[dict]) -> tuple[list[dict], list[str]]:
    """Remove duplicate rows. Two-stage:
@@ -339,11 +367,17 @@ def main() -> None:
    # --- Pass 1: parse rows ---
    persons_raw: list[dict] = []
    raw_dicts: list[dict] = []
    for row_num, row in enumerate(rows[1:], start=2):
        field_dict = {field: (row[col] if col < len(row) else "") for field, col in fields_map.items()}
        if not field_dict.get("last_name", "").strip():
            continue
        persons_raw.append(_parse_row(row_num, field_dict))
        raw_dicts.append(field_dict)
    # Propagate the register's verbatim person_id before dedup so the tree reconciles 1:1
    # with canonical-persons.xlsx (#670, Gap 3).
    _attach_person_ids(persons_raw, raw_dicts)
    persons, skipped_msgs = _deduplicate(persons_raw)
    for msg in skipped_msgs:
@@ -387,7 +421,7 @@ def main() -> None:
        return
    output = {
-        "generated_at": datetime.datetime.now().isoformat(),
+        "generated_at": _GENERATED_AT,
        "source": Path(args.input).name,
        "stats": {
            "persons": len(persons),
--- a/tools/import-normalizer/tests/test_dates.py
+++ b/tools/import-normalizer/tests/test_dates.py
@@ -2,6 +2,18 @@ import datetime
 import dates
 from dates import Precision
 def test_matchers_return_uniform_matchresult():
    # Every matcher returns a MatchResult(iso, precision, end) — no 2- vs 3-tuple
    # length-sniffing. A non-range matcher leaves end=None; a range matcher sets it.
    day = dates._match_numeric("15.2.1888")
    assert isinstance(day, dates.MatchResult)
    assert (day.iso, day.precision, day.end) == ("1888-02-15", Precision.DAY, None)
    rng = dates._match_range("10./11.1.1917")
    assert isinstance(rng, dates.MatchResult)
    assert (rng.iso, rng.precision, rng.end) == ("1917-01-10", Precision.RANGE, "1917-01-11")
 def test_easter_known_years():
    # Anonymous Gregorian algorithm — verified against published tables
    assert dates.easter(2024) == datetime.date(2024, 3, 31)
@@ -115,10 +127,55 @@ def test_parse_invalid_calendar_date_is_unknown():
    assert dates.parse_date("31.4.1916").precision == Precision.UNKNOWN
 def test_parse_intra_month_day_range():
-    # "7./8. Sept.1923" -> start day, RANGE. Must NOT be confused with slash-date "17/6. 1916".
+    # "7./8. Sept.1923" -> start day, RANGE, end day 8th. Must NOT be confused with slash-date "17/6. 1916".
-    assert dates.parse_date("7./8. Sept.1923") == dates.ParsedDate("1923-09-07", Precision.RANGE, "7./8. Sept.1923")
+    assert dates.parse_date("7./8. Sept.1923") == dates.ParsedDate("1923-09-07", Precision.RANGE, "7./8. Sept.1923", "1923-09-08")
    assert dates.parse_date("17/6. 1916") == dates.ParsedDate("1916-06-17", Precision.DAY, "17/6. 1916")
 def test_parse_intra_month_day_range_carries_end_day():
    # the intra-month day range surfaces the END day so Phase 4 can render meta_date_end
    r = dates.parse_date("10./11.1.1917")
    assert r.iso == "1917-01-10"
    assert r.precision == Precision.RANGE
    assert r.end == "1917-01-11"
 def test_parse_roman_month_day_range():
    # "10./11.I.1917" — Roman-numeral-month range; previously fell through to UNKNOWN
    r = dates.parse_date("10./11.I.1917")
    assert r.iso == "1917-01-10"
    assert r.precision == Precision.RANGE
    assert r.end == "1917-01-11"
 def test_parse_range_invalid_end_keeps_start_flags_review():
    # "10./40.1.1917" — the 40th is an impossible end day. The start parses fine,
    # so the row stays RANGE with the start preserved, the unparseable end is dropped
    # (end is None), and the half-resolved range is flagged needs_review so the
    # dropped end surfaces honestly instead of vanishing silently (#670, Gap 2).
    r = dates.parse_date("10./40.1.1917")
    assert r.iso == "1917-01-10"
    assert r.precision == Precision.RANGE
    assert r.end is None
    assert r.needs_review is True
 def test_parse_range_valid_end_not_flagged():
    # a fully-resolved range carries its end and is NOT flagged for review
    r = dates.parse_date("10./11.1.1917")
    assert r.end == "1917-01-11"
    assert r.needs_review is False
 def test_parse_non_range_has_no_review_flag():
    # a non-range date (including an empty, UNKNOWN one) is never flagged for review by the date layer
    assert dates.parse_date("15.2.1888").needs_review is False
    assert dates.parse_date("Mai 1895").needs_review is False
    assert dates.parse_date("").needs_review is False
 def test_parse_non_range_has_no_end():
    assert dates.parse_date("15.2.1888").end is None
    assert dates.parse_date("Mai 1895").end is None
    assert dates.parse_date("").end is None
 def test_parse_trailing_note_stripped_but_raw_preserved():
    r = dates.parse_date("17.Nov 1887, 2. Brief")  # REQ-DATE-04
    assert r.iso == "1887-11-17"
--- a/tools/import-normalizer/tests/test_documents.py
+++ b/tools/import-normalizer/tests/test_documents.py
@@ -52,8 +52,59 @@ def test_to_canonical_resolves_and_flags():
    assert doc.receiver_person_ids == ["de-gruyter-eugenie"]   # matched via maiden alias
    assert doc.date_iso == "1888-02-15" and doc.date_precision == "DAY"
    assert doc.tags == ["Themen/Brautbriefe"]
    assert doc.file == r"..\__scan\W-0001.pdf"   # file name carried through for the importer
    assert doc.needs_review == []
 def test_to_canonical_carries_file_name():
    ctx = _ctx()
    raw = documents.RawRow(source_row=4, index="H-0730", sender="", receivers="",
                           file="H-0730.pdf")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.file == "H-0730.pdf"
 def test_to_canonical_range_carries_date_end():
    ctx = _ctx()
    raw = documents.RawRow(source_row=4, index="H-0730", sender="", receivers="",
                           date="10./11.1.1917")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.date_iso == "1917-01-10"
    assert doc.date_precision == "RANGE"
    assert doc.date_end == "1917-01-11"
 def test_to_canonical_non_range_has_empty_date_end():
    ctx = _ctx()
    raw = documents.RawRow(source_row=4, index="H-0730", sender="", receivers="",
                           date="15.2.1888")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.date_precision == "DAY"
    assert doc.date_end == ""
 def test_to_canonical_half_resolved_range_flags_review():
    # an impossible end day ("10./40.1.1917") keeps the start + RANGE precision but
    # drops the unparseable end; the document must surface this as a review flag
    # so the importer (#669) knows date_end is empty on a RANGE row by design.
    ctx = _ctx()
    raw = documents.RawRow(source_row=5, index="H-0731", sender="", receivers="",
                           date="10./40.1.1917")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.date_iso == "1917-01-10"
    assert doc.date_precision == "RANGE"
    assert doc.date_end == ""
    assert "range_end_unparsed" in doc.needs_review
 def test_to_canonical_full_range_not_flagged():
    ctx = _ctx()
    raw = documents.RawRow(source_row=5, index="H-0730", sender="", receivers="",
                           date="10./11.1.1917")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert doc.date_end == "1917-01-11"
    assert "range_end_unparsed" not in doc.needs_review
 def test_to_canonical_unmatched_and_unparsed():
    ctx = _ctx()
    raw = documents.RawRow(source_row=9, index="C-0001",
--- a/tools/import-normalizer/tests/test_normalize.py
+++ b/tools/import-normalizer/tests/test_normalize.py
@@ -1,3 +1,8 @@
 import json
 import subprocess
 import sys
 from pathlib import Path
 import openpyxl
 import normalize
@@ -119,3 +124,56 @@ def test_approved_themes_applied(tmp_path):
    tag_values = [ws.cell(row=r, column=tag_col + 1).value for r in range(2, ws.max_row + 1)]
    # W-0001 has Inhalt "Geschäftsreise" — should get an extra Themen/geschäftsreise tag
    assert any(v and "Themen/geschäftsreise" in v for v in tag_values)
 def _person_wb_with_collision(tmp_path):
    # Two "Hans Cram" rows force the register to suffix the colliding slug (-1/-2);
    # the tree must carry those exact suffixed ids so the join still reconciles.
    wb = openpyxl.Workbook(); ws = wb.active; ws.title = "Tabelle1"
    ws.append(["Generation", "Familienname", "Vorname", "geb als", "Geburtsdatum",
               "Geburtsort", "Todesdatum", "Sterbeort", "verheiratet mit", "Bemerkung"])
    ws.append(["G 1", "de Gruyter", "Walter", "", "", "", "", "", "", ""])
    ws.append(["G 1", "de Gruyter", "Eugenie", "Müller", "", "", "", "", "", ""])
    ws.append(["G 2", "Cram", "Hans", "", "1890", "", "", "", "", ""])
    ws.append(["G 3", "Cram", "Hans", "", "1925", "", "", "", "", ""])
    p = tmp_path / "persons.xlsx"; wb.save(p); return p
 def _generate_tree(person_wb, out_path):
    script = Path(__file__).parent.parent / "persons_tree.py"
    result = subprocess.run(
        [sys.executable, str(script), "--input", str(person_wb), "--output", str(out_path)],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stderr
    return json.loads(out_path.read_text(encoding="utf-8"))
 def test_tree_person_ids_reconcile_with_persons_xlsx(tmp_path):
    # The real #669 contract: every personId in canonical-persons-tree.json must join
    # 1:1 onto a person_id in canonical-persons.xlsx — no orphan tree id, no duplicate.
    # Both artifacts are produced from the SAME person workbook (collision included).
    person_wb = _person_wb_with_collision(tmp_path)
    out_dir = tmp_path / "out"; review_dir = tmp_path / "review"
    normalize.run(
        document_workbook=_doc_wb(tmp_path), document_sheet="Familienarchiv",
        person_workbook=person_wb, person_sheet="Tabelle1",
        out_dir=out_dir, review_dir=review_dir, date_overrides={}, name_overrides={})
    tree = _generate_tree(person_wb, tmp_path / "tree.json")
    tree_ids = [p["personId"] for p in tree["persons"]]
    wb = openpyxl.load_workbook(out_dir / "canonical-persons.xlsx")
    ws = wb.active
    header = [c.value for c in ws[1]]
    pid_col = header.index("person_id")
    register_ids = [ws.cell(row=r, column=pid_col + 1).value for r in range(2, ws.max_row + 1)]
    # tree ids are unique (no duplicate join key)
    assert len(tree_ids) == len(set(tree_ids))
    # the suffixed collision ids actually reached the tree
    assert "cram-hans-1" in tree_ids and "cram-hans-2" in tree_ids
    # every tree id resolves to exactly one register row — the join is total and 1:1
    register_counts = {pid: register_ids.count(pid) for pid in tree_ids}
    assert all(count == 1 for count in register_counts.values()), register_counts
--- a/tools/import-normalizer/tests/test_persons_tree.py
+++ b/tools/import-normalizer/tests/test_persons_tree.py
@@ -433,6 +433,64 @@ def test_parse_bemerkung_sohn_with_trailing_remark():
    assert notes == "nach Mexiko emigriert"
 def test_generated_at_is_fixed_for_reproducibility():
    # NFR-IDEM-01: a pinned timestamp so the committed tree JSON doesn't churn on every run
    assert persons_tree._GENERATED_AT == "2020-01-01T00:00:00"
 def test_attach_person_ids_propagates_register_slug():
    # the tree person must carry the register's verbatim person_id (slug), not a recomputed one
    raw_dicts = [
        {"generation": "G 1", "last_name": "de Gruyter", "first_name": "Walter",
         "maiden_name": "", "birth_date": "", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
        {"generation": "G 1", "last_name": "de Gruyter", "first_name": "Eugenie",
         "maiden_name": "Müller", "birth_date": "", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
    ]
    tree_persons = [persons_tree._parse_row(n, d) for n, d in enumerate(raw_dicts, start=2)]
    persons_tree._attach_person_ids(tree_persons, raw_dicts)
    assert tree_persons[0]["personId"] == "de-gruyter-walter"
    assert tree_persons[1]["personId"] == "de-gruyter-eugenie"
 def test_attach_person_ids_raises_on_length_divergence():
    # The propagation is a positional zip; if tree_persons and the register drift in
    # length (e.g. a future filter change), zip would silently truncate and mis-join ids.
    # The guard must fail loudly instead.
    raw_dicts = [
        {"generation": "G 1", "last_name": "de Gruyter", "first_name": "Walter",
         "maiden_name": "", "birth_date": "", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
        # second register row has a last name -> parse_register keeps it ...
        {"generation": "G 1", "last_name": "de Gruyter", "first_name": "Eugenie",
         "maiden_name": "Müller", "birth_date": "", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
    ]
    # ... but the tree side only has one person -> lengths diverge.
    tree_persons = [persons_tree._parse_row(2, raw_dicts[0])]
    import pytest
    with pytest.raises(ValueError, match="length"):
        persons_tree._attach_person_ids(tree_persons, raw_dicts)
 def test_attach_person_ids_carries_register_collision_suffix():
    # when two register rows slug-collide, the register suffixes the ids (-1, -2);
    # those exact suffixed ids must reach the tree persons, never a recomputed bare slug
    raw_dicts = [
        {"generation": "G 2", "last_name": "Cram", "first_name": "Hans",
         "maiden_name": "", "birth_date": "1890", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
        {"generation": "G 3", "last_name": "Cram", "first_name": "Hans",
         "maiden_name": "", "birth_date": "1925", "birth_place": "",
         "death_date": "", "death_place": "", "spouse": "", "notes": ""},
    ]
    tree_persons = [persons_tree._parse_row(n, d) for n, d in enumerate(raw_dicts, start=2)]
    persons_tree._attach_person_ids(tree_persons, raw_dicts)
    assert tree_persons[0]["personId"] == "cram-hans-1"
    assert tree_persons[1]["personId"] == "cram-hans-2"
 import subprocess
--- a/tools/import-normalizer/tests/test_writers.py
+++ b/tools/import-normalizer/tests/test_writers.py
@@ -31,6 +31,21 @@ def test_write_documents_xlsx_joins_lists(tmp_path):
    assert row["receiver_person_ids"] == "a|b"
    assert row["needs_review"] == "unparsed_date"
 def test_write_documents_xlsx_carries_file_and_date_end(tmp_path):
    doc = documents.CanonicalDocument(
        index="H-0730", file="H-0730.pdf", date_iso="1917-01-10",
        date_precision="RANGE", date_end="1917-01-11")
    out = tmp_path / "docs.xlsx"
    writers.write_documents_xlsx([doc], out)
    wb = openpyxl.load_workbook(out)
    ws = wb.active
    header = [c.value for c in ws[1]]
    assert "file" in header and "date_end" in header
    row = {h: c.value for h, c in zip(header, ws[2])}
    assert row["file"] == "H-0730.pdf"
    assert row["date_end"] == "1917-01-11"
 def test_write_documents_xlsx_pins_timestamp(tmp_path):
    # determinism (NFR-IDEM-01): workbook created/modified are pinned, not the current time
    doc = documents.CanonicalDocument(index="W-0001")
--- a/tools/import-normalizer/writers.py
+++ b/tools/import-normalizer/writers.py
@@ -22,9 +22,10 @@ def _csv_safe(value):
    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r", "\n") else s
-DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
+DOC_COLUMNS = ["index", "file", "box", "folder", "sender_person_id", "sender_name",
               "receiver_person_ids", "receiver_names", "date_iso", "date_raw",
-               "date_precision", "location", "tags", "summary", "source_row", "needs_review"]
+               "date_precision", "date_end", "location", "tags", "summary",
               "source_row", "needs_review"]
 PERSON_COLUMNS = ["person_id", "last_name", "first_name", "maiden_name", "title", "nickname",
                  "birth_date", "birth_date_raw", "birth_place", "death_date", "death_date_raw",