As the archive owner I want the normalizer's canonical exports to carry every field the importer needs (file, date-range end, person ids in the tree) so the loaders have a complete, joinable contract #670

Closed
opened 2026-05-26 22:17:16 +02:00 by marcel · 8 comments
Owner

Phase 1 of the "Handling the Unknowns" milestone — build this first

The new modular importer (Phase 3) loads the normalizer's canonical exports in tools/import-normalizer/out/ instead of re-parsing the raw spreadsheet. Three gaps in those exports would block downstream phases, so close them before any backend work. This is Python-only — no backend dependency — and unblocks Phases 2/3.

The three gaps (verified against the normalizer source)

Gap 1 — the document export drops the file name

RawRow.file is read from the "datei" column (documents.py:20, _FIELDS line 50) but CanonicalDocument has no file fieldto_canonical uses raw.file only for the index_file_mismatch flag (documents.py:108) then discards it, and DOC_COLUMNS (writers.py:25-27) has no file. The importer therefore can't deterministically locate the PDF; it must guess from index.

  • Fix: add file: str = "" to CanonicalDocument (~line 33); pass file=raw.file in the to_canonical(...) return (~line 111); add "file" to DOC_COLUMNS.

Gap 2 — RANGE dates lose the end day

For a range like 10./11.1.1917, the normalizer resolves a representative start day into date_iso but discards the end day, and DOC_COLUMNS has no date_end. This makes Phase 4's RANGE rendering (meta_date_end) impossible to populate.

  • Fix: capture the range end in dates.py range matching and surface it; add a date_end field to CanonicalDocument + a date_end column to DOC_COLUMNS. date_end is populated only for RANGE precision, empty otherwise.
  • Note: Roman-numeral-month ranges (e.g. 10./11.I.1917) currently fall through to UNKNOWN because the range matcher only handles numeric / month-name forms (dates.py ~line 217). Wiring the Roman-month range form is in scope for this issue, so the RANGE data exists for Phase 4's rendering.

Gap 3 — the persons-tree JSON has no joinable id

canonical-persons.xlsx keys persons by person_id (slug); canonical-persons-tree.json keys persons by rowId (row_002, …) with no person_id. Phase 3's PersonTreeImporter has nothing to join the relationship tree onto the persons loaded from the register.

  • Fix: the persons-tree export must carry the register's actual person_id. The id is assigned exactly once by the register — canonical-persons.xlsx is the authority — and that exact id must be propagated into the tree JSON (persons_tree.py), never recomputed. Re-slugifying in the tree builder would NOT reliably match the register id, because the tree JSON and the register xlsx are built from different inputs / dedup. Propagating the register's id verbatim is what makes the tree reconcile to register persons 1:1.

Confirmed already present (no change)

canonical-persons.xlsx already includes a provisional column (PERSON_COLUMNS, writers.py:31). Downstream phases consume it as-is.

Acceptance criteria

Scenario: Document export carries the file name
  Given a spreadsheet row whose "datei" cell is "H-0730.pdf"
  When the canonical documents export is generated
  Then the row has a "file" column equal to "H-0730.pdf"

Scenario: RANGE dates carry an end day
  Given a date cell "10./11.1.1917"
  When the canonical document is generated
  Then date_iso is the start day 1917-01-10
  And date_precision is "RANGE"
  And date_end is 1917-01-11
  And for any non-RANGE row date_end is empty

Scenario: Tree persons are joinable to the register
  Given a person appears in both canonical-persons.xlsx and canonical-persons-tree.json
  Then the tree entry carries the register's person_id verbatim

File-level breakdown

  • documents.pyCanonicalDocument (+file, +date_end), to_canonical (pass both), check RawRow.
  • dates.py — range matching surfaces the end day alongside the representative iso.
  • writers.py — add file and date_end to DOC_COLUMNS.
  • persons_tree.py — propagate the register's person_id per tree person. Also note: the committed canonical-persons-tree.json currently stamps generated_at with datetime.now(), which churns the file on every run — use a fixed/deterministic timestamp (as the xlsx writer already does with its _FIXED_TS) so committed artifacts are reproducible.
  • tests/ — extend test_normalize.py (and date tests) with the cases above.
  • Regenerate out/ artifacts. The regenerated out/ artifacts are now committed to the repo per the milestone decision (#669) — update .gitignore to stop excluding out/*.xlsx.

Out of scope

  • Any backend / schema / loader work (Phases 2-3).
  • Briefwechsel (dead feature).

Dependency

None (Python only). Blocks Phase 2 (schema) and Phase 3 (importer).

## Phase 1 of the "Handling the Unknowns" milestone — build this first The new modular importer (Phase 3) loads the normalizer's canonical exports in `tools/import-normalizer/out/` instead of re-parsing the raw spreadsheet. Three gaps in those exports would block downstream phases, so close them **before** any backend work. This is **Python-only** — no backend dependency — and unblocks Phases 2/3. ## The three gaps (verified against the normalizer source) ### Gap 1 — the document export drops the file name `RawRow.file` is read from the "datei" column (`documents.py:20`, `_FIELDS` line 50) but `CanonicalDocument` has **no `file` field** — `to_canonical` uses `raw.file` only for the `index_file_mismatch` flag (`documents.py:108`) then discards it, and `DOC_COLUMNS` (`writers.py:25-27`) has no `file`. The importer therefore can't deterministically locate the PDF; it must guess from `index`. - **Fix:** add `file: str = ""` to `CanonicalDocument` (~line 33); pass `file=raw.file` in the `to_canonical(...)` return (~line 111); add `"file"` to `DOC_COLUMNS`. ### Gap 2 — RANGE dates lose the end day For a range like `10./11.1.1917`, the normalizer resolves a representative start day into `date_iso` but **discards the end day**, and `DOC_COLUMNS` has no `date_end`. This makes Phase 4's RANGE rendering (`meta_date_end`) impossible to populate. - **Fix:** capture the range end in `dates.py` range matching and surface it; add a `date_end` field to `CanonicalDocument` + a `date_end` column to `DOC_COLUMNS`. `date_end` is populated only for `RANGE` precision, empty otherwise. - **Note:** Roman-numeral-month ranges (e.g. `10./11.I.1917`) currently fall through to `UNKNOWN` because the range matcher only handles numeric / month-name forms (`dates.py` ~line 217). Wiring the Roman-month range form is **in scope** for this issue, so the RANGE data exists for Phase 4's rendering. ### Gap 3 — the persons-tree JSON has no joinable id `canonical-persons.xlsx` keys persons by `person_id` (slug); `canonical-persons-tree.json` keys persons by `rowId` (`row_002`, …) with **no `person_id`**. Phase 3's PersonTreeImporter has nothing to join the relationship tree onto the persons loaded from the register. - **Fix:** the persons-tree export must carry the register's **actual** `person_id`. The id is assigned exactly once by the register — `canonical-persons.xlsx` is the authority — and that exact id must be propagated into the tree JSON (`persons_tree.py`), **never recomputed**. Re-slugifying in the tree builder would NOT reliably match the register id, because the tree JSON and the register xlsx are built from different inputs / dedup. Propagating the register's id verbatim is what makes the tree reconcile to register persons 1:1. ### Confirmed already present (no change) `canonical-persons.xlsx` already includes a `provisional` column (`PERSON_COLUMNS`, `writers.py:31`). Downstream phases consume it as-is. ## Acceptance criteria ```gherkin Scenario: Document export carries the file name Given a spreadsheet row whose "datei" cell is "H-0730.pdf" When the canonical documents export is generated Then the row has a "file" column equal to "H-0730.pdf" Scenario: RANGE dates carry an end day Given a date cell "10./11.1.1917" When the canonical document is generated Then date_iso is the start day 1917-01-10 And date_precision is "RANGE" And date_end is 1917-01-11 And for any non-RANGE row date_end is empty Scenario: Tree persons are joinable to the register Given a person appears in both canonical-persons.xlsx and canonical-persons-tree.json Then the tree entry carries the register's person_id verbatim ``` ## File-level breakdown - `documents.py` — `CanonicalDocument` (+`file`, +`date_end`), `to_canonical` (pass both), check `RawRow`. - `dates.py` — range matching surfaces the end day alongside the representative `iso`. - `writers.py` — add `file` and `date_end` to `DOC_COLUMNS`. - `persons_tree.py` — propagate the register's `person_id` per tree person. Also note: the committed `canonical-persons-tree.json` currently stamps `generated_at` with `datetime.now()`, which churns the file on every run — use a fixed/deterministic timestamp (as the xlsx writer already does with its `_FIXED_TS`) so committed artifacts are reproducible. - `tests/` — extend `test_normalize.py` (and date tests) with the cases above. - Regenerate `out/` artifacts. The regenerated `out/` artifacts are now **committed to the repo** per the milestone decision (#669) — update `.gitignore` to stop excluding `out/*.xlsx`. ## Out of scope - Any backend / schema / loader work (Phases 2-3). - Briefwechsel (dead feature). ## Dependency None (Python only). **Blocks Phase 2 (schema) and Phase 3 (importer).**
marcel added this to the Handling the Unknowns — honest uncertainty in dates & people milestone 2026-05-26 22:17:16 +02:00
marcel added the P0-criticalfeature labels 2026-05-26 22:18:27 +02:00
Author
Owner

Markus Keller — Senior Application Architect

I read documents.py, dates.py, writers.py, persons.py, and persons_tree.py to ground this. The three gaps are real and the fix direction is sound. My concern is the contract Gap 3 promises versus what the code can actually deliver.

Observations

  • Gap 3's "1:1 joinable" claim is the architecturally risky one. The register and the tree are built from two different source workbooks with two different dedup strategies and two different normalizers:
    • Register: persons.parse_register() reads DOCUMENT_WORKBOOK (config.py:9), slugs via slugify(last, first) (persons.py:75), and resolves id collisions with a -N suffix (persons.py:85-90).
    • Tree: persons_tree.py reads PERSON_WORKBOOK = Personendatei 2.xlsx (config.py:11), keys by row_NNN, and runs its own _deduplicate() (name+birthYear) at persons_tree.py:178. Names normalize through _norm_tree(), which is a different function than slugify.
    • Net effect: emitting slugify(lastName, firstName) inside the tree does not guarantee a matching id exists in the register, and does not reproduce the register's -N collision suffixes (the tree has no visibility into register collisions). A person present only in Personendatei 2.xlsx will carry a person_id that joins to nothing.
  • Gap 1 (file) and Gap 2 (date_end) are clean, additive column changes — low architectural risk.
  • This is a tool-internal data contract, not a DB schema change, so my doc-update matrix (Flyway/C4 diagrams) does not trigger. No ADR is needed for adding two columns. Good.

Recommendations

  • Be explicit that person_id in the tree is a best-effort join key, not a guaranteed foreign key. The honest contract is: emit slugify(lastName, firstName) as person_id, and accept that Phase 3's PersonTreeImporter must treat an unmatched id as "tree-only person" rather than assuming referential integrity. Encode that expectation in the issue so Phase 3 isn't built on a false assumption.
  • Add a reconciliation diagnostic now, in Phase 1. When regenerating out/, count how many tree person_ids have no register counterpart and emit it to a review file (same pattern as the existing review/*.csv). This surfaces the join-gap as data, not a Phase-3 surprise.
  • Keep date_end populated only for RANGE (as the AC states) — a sparse column with a clear single-precision contract is easier to reason about downstream than a nullable-everywhere field.

Open Decisions

  • Tree↔register join guarantee. The issue asserts a 1:1 join "so the tree reconciles to register persons 1:1," but the two pipelines are independent. Options: (a) accept best-effort slug join + a reconciliation report (cheap, honest, but Phase 3 must handle misses); (b) make the register the single source of person identity and have the tree look up ids from the already-built register rather than re-slugging (more correct, but couples the two scripts and is more work). The right answer depends on how much Phase 3 will trust this id — decide before implementing.
## Markus Keller — Senior Application Architect I read `documents.py`, `dates.py`, `writers.py`, `persons.py`, and `persons_tree.py` to ground this. The three gaps are real and the fix direction is sound. My concern is the **contract** Gap 3 promises versus what the code can actually deliver. ### Observations - **Gap 3's "1:1 joinable" claim is the architecturally risky one.** The register and the tree are built from **two different source workbooks** with **two different dedup strategies and two different normalizers**: - Register: `persons.parse_register()` reads `DOCUMENT_WORKBOOK` (`config.py:9`), slugs via `slugify(last, first)` (`persons.py:75`), and resolves id collisions with a `-N` suffix (`persons.py:85-90`). - Tree: `persons_tree.py` reads `PERSON_WORKBOOK` = `Personendatei 2.xlsx` (`config.py:11`), keys by `row_NNN`, and runs its own `_deduplicate()` (name+birthYear) at `persons_tree.py:178`. Names normalize through `_norm_tree()`, which is a *different* function than `slugify`. - Net effect: emitting `slugify(lastName, firstName)` inside the tree does **not** guarantee a matching id exists in the register, and does **not** reproduce the register's `-N` collision suffixes (the tree has no visibility into register collisions). A person present only in `Personendatei 2.xlsx` will carry a `person_id` that joins to nothing. - Gap 1 (`file`) and Gap 2 (`date_end`) are clean, additive column changes — low architectural risk. - This is a tool-internal data contract, not a DB schema change, so my doc-update matrix (Flyway/C4 diagrams) does **not** trigger. No ADR is needed for adding two columns. Good. ### Recommendations - **Be explicit that `person_id` in the tree is a *best-effort* join key, not a guaranteed foreign key.** The honest contract is: emit `slugify(lastName, firstName)` as `person_id`, and accept that Phase 3's `PersonTreeImporter` must treat an unmatched id as "tree-only person" rather than assuming referential integrity. Encode that expectation in the issue so Phase 3 isn't built on a false assumption. - **Add a reconciliation diagnostic now, in Phase 1.** When regenerating `out/`, count how many tree `person_id`s have no register counterpart and emit it to a review file (same pattern as the existing `review/*.csv`). This surfaces the join-gap as data, not a Phase-3 surprise. - Keep `date_end` populated **only** for `RANGE` (as the AC states) — a sparse column with a clear single-precision contract is easier to reason about downstream than a nullable-everywhere field. ### Open Decisions - **Tree↔register join guarantee.** The issue asserts a 1:1 join "so the tree reconciles to register persons 1:1," but the two pipelines are independent. Options: (a) accept best-effort slug join + a reconciliation report (cheap, honest, but Phase 3 must handle misses); (b) make the register the single source of person identity and have the tree *look up* ids from the already-built register rather than re-slugging (more correct, but couples the two scripts and is more work). The right answer depends on how much Phase 3 will trust this id — decide before implementing.
Author
Owner

Felix Brandt — Senior Fullstack Developer

I traced all four files and ran the AC's own date example. There's a blocker hiding in the Gap 2 acceptance criterion. TDD red-first on every gap.

Observations

  • The AC's RANGE example does not currently parse as RANGE. I ran it:
    parse_date("10./11.I.1917")  -> iso=None, precision=UNKNOWN   # the AC's literal example
    parse_date("10./11.1.1917")  -> iso=1917-01-10, precision=RANGE  # ASCII month works
    
    _RANGE_DAY_RE (dates.py:180) reconstructs "10." + "I.1917" and only retries _match_numeric and _match_monthname_a (dates.py:217) — not _match_roman. So Roman-numeral month ranges fall through to UNKNOWN. Implementing the AC verbatim first requires adding _match_roman to that retry list. The issue doesn't mention this — it's unscoped work.
  • date_end ripples through the matcher contract. Every matcher in _MATCHERS (dates.py:231) returns a 2-tuple (iso, precision), consumed at dates.py:254-259, and ParsedDate is a frozen dataclass with no end field. Adding date_end means either (a) widening every matcher to a 3-tuple, or (b) adding a 4th field to ParsedDate and only the range matcher sets it. Option (b) is the smaller, safer change — ParsedDate(iso, precision, raw, iso_end=None).
  • Gap 1 is trivial and the existing test already feeds file= into RawRow (test_documents.py:49). Just thread it through to_canonical and DOC_COLUMNS.
  • to_canonical is already at the size limit for one function; adding date_end=pd.iso_end or "" is fine, but don't grow the flag-collection block further without extracting.

Recommendations

  • Write the failing test first for each gap, in this order:
    1. test_documents.py: assert doc.file == "H-0730.pdf" and "file" in DOC_COLUMNS.
    2. test_dates.py: assert parse_date("10./11.1.1917") yields iso_end == "1917-01-11" and add the Roman case 10./11.I.1917 — watch it go red, then fix _RANGE_DAY_RE's retry list to include _match_roman. Also pin the non-RANGE contract: parse_date("15.2.1888").iso_end is None.
    3. test_persons_tree.py: assert each emitted person dict has person_id == slugify(lastName, firstName).
  • Add iso_end to ParsedDate, not a 3-tuple. Keeps all eight other matchers untouched. Only _match_range and parse_date change. Then documents.py maps pd.iso_end or "" into CanonicalDocument.date_end.
  • Define date_end for the year/hyphen ranges too, or explicitly null them. 1881/82 and 8.1.1916 - 15.3.1916 are also RANGE today. Decide whether date_end is the end year-start (1882-01-01), the parsed end day (1916-03-15), or empty for these. A test must pin whichever you pick — otherwise the column is silently inconsistent.

Open Decisions

  • What date_end holds for non-day ranges. RANGE today covers three shapes: intra-month day ranges (10./11.), year ranges (1881/82), and full hyphen ranges (8.1.1916 - 15.3.1916). The AC only specifies the day-range case. Options: parse and emit a real end for all three (most useful for Phase 4 rendering, more parsing work) vs. emit date_end only for the day-range shape and leave the others empty (matches the AC literally, less work, but Phase 4 can't render year/hyphen range ends). This is a product call about what Phase 4 needs to display.
## Felix Brandt — Senior Fullstack Developer I traced all four files and ran the AC's own date example. There's a blocker hiding in the Gap 2 acceptance criterion. TDD red-first on every gap. ### Observations - **The AC's RANGE example does not currently parse as RANGE.** I ran it: ``` parse_date("10./11.I.1917") -> iso=None, precision=UNKNOWN # the AC's literal example parse_date("10./11.1.1917") -> iso=1917-01-10, precision=RANGE # ASCII month works ``` `_RANGE_DAY_RE` (`dates.py:180`) reconstructs `"10." + "I.1917"` and only retries `_match_numeric` and `_match_monthname_a` (`dates.py:217`) — **not** `_match_roman`. So Roman-numeral month ranges fall through to UNKNOWN. Implementing the AC verbatim first requires adding `_match_roman` to that retry list. The issue doesn't mention this — it's unscoped work. - **`date_end` ripples through the matcher contract.** Every matcher in `_MATCHERS` (`dates.py:231`) returns a 2-tuple `(iso, precision)`, consumed at `dates.py:254-259`, and `ParsedDate` is a frozen dataclass with no end field. Adding `date_end` means either (a) widening every matcher to a 3-tuple, or (b) adding a 4th field to `ParsedDate` and only the range matcher sets it. Option (b) is the smaller, safer change — `ParsedDate(iso, precision, raw, iso_end=None)`. - Gap 1 is trivial and the existing test already feeds `file=` into `RawRow` (`test_documents.py:49`). Just thread it through `to_canonical` and `DOC_COLUMNS`. - `to_canonical` is already at the size limit for one function; adding `date_end=pd.iso_end or ""` is fine, but don't grow the flag-collection block further without extracting. ### Recommendations - **Write the failing test first for each gap, in this order:** 1. `test_documents.py`: assert `doc.file == "H-0730.pdf"` and `"file" in DOC_COLUMNS`. 2. `test_dates.py`: assert `parse_date("10./11.1.1917")` yields `iso_end == "1917-01-11"` **and** add the Roman case `10./11.I.1917` — watch it go red, then fix `_RANGE_DAY_RE`'s retry list to include `_match_roman`. Also pin the *non-RANGE* contract: `parse_date("15.2.1888").iso_end is None`. 3. `test_persons_tree.py`: assert each emitted person dict has `person_id == slugify(lastName, firstName)`. - **Add `iso_end` to `ParsedDate`, not a 3-tuple.** Keeps all eight other matchers untouched. Only `_match_range` and `parse_date` change. Then `documents.py` maps `pd.iso_end or ""` into `CanonicalDocument.date_end`. - **Define `date_end` for the year/hyphen ranges too, or explicitly null them.** `1881/82` and `8.1.1916 - 15.3.1916` are also `RANGE` today. Decide whether `date_end` is the end *year-start* (`1882-01-01`), the parsed end day (`1916-03-15`), or empty for these. A test must pin whichever you pick — otherwise the column is silently inconsistent. ### Open Decisions - **What `date_end` holds for non-day ranges.** `RANGE` today covers three shapes: intra-month day ranges (`10./11.`), year ranges (`1881/82`), and full hyphen ranges (`8.1.1916 - 15.3.1916`). The AC only specifies the day-range case. Options: parse and emit a real end for all three (most useful for Phase 4 rendering, more parsing work) vs. emit `date_end` only for the day-range shape and leave the others empty (matches the AC literally, less work, but Phase 4 can't render year/hyphen range ends). This is a product call about what Phase 4 needs to display.
Author
Owner

Sara Holt — Senior QA Engineer

I verified the acceptance criteria against the actual parser. One AC scenario fails on its own example today, and the other two need negative cases added.

Observations

  • AC "RANGE dates carry an end day" uses an example that currently parses to UNKNOWN. Run it yourself:
    parse_date("10./11.I.1917") -> iso=None, precision=UNKNOWN
    
    The Roman-month form is not handled by the range matcher (dates.py:217 retries numeric/monthname only). So a literal Gherkin test of the AC fails at date_iso == 1917-01-10 before ever reaching the date_end assertion. Either fix the Roman case as part of this issue, or change the AC example to an ASCII form (10./11.1.1917), which does produce RANGE. Right now the AC is not implementable as written.
  • AC 1 (file) has no negative case. The scenario only checks a populated datei. The corpus has rows with empty Datei (see test_normalize.py:13, C-0001 has no file). Need: empty dateifile column is empty string, not None/missing.
  • AC 2's "non-RANGE → empty" clause is good but untested across precisions. date_end must be empty for DAY, MONTH, SEASON, YEAR, APPROX, and UNKNOWN — not just "any non-RANGE row." That's six precisions; at minimum test DAY and YEAR.
  • AC 3 (tree person_id) has no collision case. slugify collisions get -N suffixes in the register (persons.py:85-90), but the tree's _deduplicate is independent. Two tree persons that slug identically need a defined, tested outcome.
  • Determinism gate at risk. test_normalize.py:52-63 asserts byte-identical reruns of the xlsx outputs. New file/date_end columns must preserve that. Separately, out/canonical-persons-tree.json stamps generated_at = datetime.now() (persons_tree.py:390) — regenerating it for this issue produces a non-reproducible diff. CI can't assert determinism on that file as-is.

Recommendations

  • Add these tests (red first):
    • test_documents.py: file populated AND file empty-string when datei blank.
    • test_dates.py: iso_end for 10./11.1.1917; the Roman 10./11.I.1917 case (expected to drive the matcher fix); iso_end is None for a DAY date and a YEAR date.
    • test_persons_tree.py: person_id == slugify(...) for a normal person; defined behavior for two identically-slugging persons.
    • test_writers.py: "file" and "date_end" present in DOC_COLUMNS in the expected positions.
  • Extend the end-to-end determinism test to assert the two new columns appear in the header and survive a second run identically.
  • Run only the touched test files (test_dates.py, test_documents.py, test_writers.py, test_persons_tree.py, test_normalize.py) per the project rule — never the whole suite locally.

Open Decisions

  • Fix the Roman range or rewrite the AC? The AC example 10./11.I.1917 doesn't parse as RANGE today. Option A: extend _RANGE_DAY_RE's retry to include _match_roman (more corpus coverage, in scope of "RANGE dates carry an end day," small fix). Option B: change the AC to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). Pick one so the test author has an unambiguous target.
## Sara Holt — Senior QA Engineer I verified the acceptance criteria against the actual parser. One AC scenario fails on its own example today, and the other two need negative cases added. ### Observations - **AC "RANGE dates carry an end day" uses an example that currently parses to UNKNOWN.** Run it yourself: ``` parse_date("10./11.I.1917") -> iso=None, precision=UNKNOWN ``` The Roman-month form is not handled by the range matcher (`dates.py:217` retries numeric/monthname only). So a literal Gherkin test of the AC fails at `date_iso == 1917-01-10` **before** ever reaching the `date_end` assertion. Either fix the Roman case as part of this issue, or change the AC example to an ASCII form (`10./11.1.1917`), which does produce RANGE. Right now the AC is not implementable as written. - **AC 1 (`file`) has no negative case.** The scenario only checks a populated `datei`. The corpus has rows with empty `Datei` (see `test_normalize.py:13`, C-0001 has no file). Need: empty `datei` → `file` column is empty string, not `None`/missing. - **AC 2's "non-RANGE → empty" clause is good but untested across precisions.** `date_end` must be empty for DAY, MONTH, SEASON, YEAR, APPROX, and UNKNOWN — not just "any non-RANGE row." That's six precisions; at minimum test DAY and YEAR. - **AC 3 (tree `person_id`) has no collision case.** `slugify` collisions get `-N` suffixes in the register (`persons.py:85-90`), but the tree's `_deduplicate` is independent. Two tree persons that slug identically need a defined, tested outcome. - **Determinism gate at risk.** `test_normalize.py:52-63` asserts byte-identical reruns of the xlsx outputs. New `file`/`date_end` columns must preserve that. Separately, `out/canonical-persons-tree.json` stamps `generated_at = datetime.now()` (`persons_tree.py:390`) — regenerating it for this issue produces a non-reproducible diff. CI can't assert determinism on that file as-is. ### Recommendations - **Add these tests (red first):** - `test_documents.py`: `file` populated AND `file` empty-string when `datei` blank. - `test_dates.py`: `iso_end` for `10./11.1.1917`; the Roman `10./11.I.1917` case (expected to drive the matcher fix); `iso_end is None` for a DAY date and a YEAR date. - `test_persons_tree.py`: `person_id == slugify(...)` for a normal person; defined behavior for two identically-slugging persons. - `test_writers.py`: `"file"` and `"date_end"` present in `DOC_COLUMNS` in the expected positions. - **Extend the end-to-end determinism test** to assert the two new columns appear in the header and survive a second run identically. - **Run only the touched test files** (`test_dates.py`, `test_documents.py`, `test_writers.py`, `test_persons_tree.py`, `test_normalize.py`) per the project rule — never the whole suite locally. ### Open Decisions - **Fix the Roman range or rewrite the AC?** The AC example `10./11.I.1917` doesn't parse as RANGE today. Option A: extend `_RANGE_DAY_RE`'s retry to include `_match_roman` (more corpus coverage, in scope of "RANGE dates carry an end day," small fix). Option B: change the AC to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). Pick one so the test author has an unambiguous target.
Author
Owner

Nora Steiner ("NullX") — Application Security Engineer

This is an offline Python ETL tool with no network, no auth surface, and no user-supplied input at runtime — the spreadsheet is trusted family data the owner controls. So the OWASP-style attack surface is essentially nil. Two data-integrity / output-safety notes still apply.

Observations

  • CSV formula-injection (CWE-1236) is already handled for review files via _csv_safe() (writers.py:19-22), which neutralizes leading =, +, -, @. Good. The file column from the datei cell will flow into the xlsx outputs (canonical-documents.xlsx), not the CSVs — and _csv_safe is only applied to CSV writes, not xlsx. A datei value like =cmd|... would land raw in the xlsx. Low risk (data is owner-controlled, downstream consumer is a programmatic importer, not a human double-clicking), but worth a conscious note rather than a silent gap.
  • file is later used to locate a PDF on disk (Phase 3). The issue says the importer "can't deterministically locate the PDF; it must guess from index." When Phase 3 consumes this column, the file value becomes a path. Corpus values are Windows paths like ..\__scan\W-0001.pdf (test_documents.py:8). That is path-traversal-shaped input. Phase 1 only carries the string, so nothing to fix here — but flag it loudly for Phase 3: the loader must basename-normalize and confine to the scan directory, never open(file) directly.
  • No secrets, no credentials, no logging of sensitive data in any of the touched files. Clean.

Recommendations

  • Carry file verbatim in Phase 1 (the issue is right to keep it as-is for the join), but add a one-line note to the issue / Phase 3 ticket that file is untrusted-shaped path data requiring basename + directory-confinement when it is eventually resolved to a filesystem read. Codify that as a test when Phase 3 lands (path-traversal ..\..\etc\passwd.pdf → rejected).
  • If you want defense-in-depth now at zero cost: store only the basename of file (H-0730.pdf), not the full ..\__scan\... path. The importer joins on filename anyway, and stripping the directory removes the traversal shape at the source. This also makes index_file_mismatch and the new column agree on representation.

Open Decisions

None — no live attack surface in this phase. The path-handling item is a Phase 3 control, noted above for traceability.

## Nora Steiner ("NullX") — Application Security Engineer This is an offline Python ETL tool with no network, no auth surface, and no user-supplied input at runtime — the spreadsheet is trusted family data the owner controls. So the OWASP-style attack surface is essentially nil. Two data-integrity / output-safety notes still apply. ### Observations - **CSV formula-injection (CWE-1236) is already handled for review files** via `_csv_safe()` (`writers.py:19-22`), which neutralizes leading `=`, `+`, `-`, `@`. Good. The `file` column from the `datei` cell will flow into the **xlsx** outputs (`canonical-documents.xlsx`), not the CSVs — and `_csv_safe` is only applied to CSV writes, not xlsx. A `datei` value like `=cmd|...` would land raw in the xlsx. Low risk (data is owner-controlled, downstream consumer is a programmatic importer, not a human double-clicking), but worth a conscious note rather than a silent gap. - **`file` is later used to locate a PDF on disk (Phase 3).** The issue says the importer "can't deterministically locate the PDF; it must guess from `index`." When Phase 3 consumes this column, the `file` value becomes a **path**. Corpus values are Windows paths like `..\__scan\W-0001.pdf` (`test_documents.py:8`). That is path-traversal-shaped input. Phase 1 only carries the string, so nothing to fix here — but flag it loudly for Phase 3: the loader must basename-normalize and confine to the scan directory, never `open(file)` directly. - No secrets, no credentials, no logging of sensitive data in any of the touched files. Clean. ### Recommendations - **Carry `file` verbatim in Phase 1** (the issue is right to keep it as-is for the join), but **add a one-line note to the issue / Phase 3 ticket** that `file` is untrusted-shaped path data requiring basename + directory-confinement when it is eventually resolved to a filesystem read. Codify that as a test when Phase 3 lands (path-traversal `..\..\etc\passwd.pdf` → rejected). - If you want defense-in-depth now at zero cost: store only the **basename** of `file` (`H-0730.pdf`), not the full `..\__scan\...` path. The importer joins on filename anyway, and stripping the directory removes the traversal shape at the source. This also makes `index_file_mismatch` and the new column agree on representation. ### Open Decisions _None — no live attack surface in this phase. The path-handling item is a Phase 3 control, noted above for traceability._
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

This is a local developer tool (Python venv, run by hand to regenerate out/), so there's no Compose/CI infra to size. My angle is reproducibility of the committed artifacts and how this runs in CI.

Observations

  • out/canonical-persons-tree.json is the one out/ artifact tracked in git.gitignore has out/ then !out/canonical-persons-tree.json, and a recent commit (2e59c0ef) un-ignored it intentionally. Regenerating it for Gap 3 will produce a non-reproducible diff because persons_tree.py:390 writes "generated_at": datetime.datetime.now().isoformat(). Every regen churns that timestamp even when the person data is unchanged. The xlsx writers already solved this — they pin _FIXED_TS = 2020-01-01 (writers.py:10) precisely so reruns are content-deterministic (NFR-IDEM-01). The tree JSON skipped that treatment.
  • The xlsx outputs (canonical-documents.xlsx, canonical-persons.xlsx) are gitignored, so adding file/date_end columns won't show up as a tracked diff — only the tree JSON will.
  • The issue says "Regenerate out/ artifacts" as a step but doesn't say which are committed. Only the tree JSON is.

Recommendations

  • Pin generated_at to a fixed value (or drop it) in persons_tree.py as part of this issue, matching the _FIXED_TS pattern in writers.py. Otherwise the committed tree JSON diff will be noise on every regen and you lose the ability to assert "regenerating produces no change." Cheapest version: set generated_at to the same fixed constant the xlsx writers use, or read it from a single shared source.
  • Regenerate the committed tree JSON in the same commit as the persons_tree.py change, so the tracked artifact and the code that produces it never drift. Reviewers can then python persons_tree.py and confirm a clean git diff.
  • No CI wiring needed beyond what exists — these tests run under the same pytest invocation as the rest of the tool. Keep running individual test files locally per the project rule; let CI do the full sweep.

Open Decisions

  • Keep generated_at at all? It's the only thing making the committed artifact non-deterministic. Options: (a) pin it to a fixed constant — preserves the field, gives clean diffs; (b) remove it — simplest, but loses provenance of when the snapshot was made. Both fix the churn; (a) keeps a (now-meaningless) field, (b) is honest that the file is content-addressed, not time-stamped.
## Tobias Wendt — DevOps & Platform Engineer This is a local developer tool (Python venv, run by hand to regenerate `out/`), so there's no Compose/CI infra to size. My angle is reproducibility of the committed artifacts and how this runs in CI. ### Observations - **`out/canonical-persons-tree.json` is the one `out/` artifact tracked in git** — `.gitignore` has `out/` then `!out/canonical-persons-tree.json`, and a recent commit (`2e59c0ef`) un-ignored it intentionally. Regenerating it for Gap 3 will produce a **non-reproducible diff** because `persons_tree.py:390` writes `"generated_at": datetime.datetime.now().isoformat()`. Every regen churns that timestamp even when the person data is unchanged. The xlsx writers already solved this — they pin `_FIXED_TS = 2020-01-01` (`writers.py:10`) precisely so reruns are content-deterministic (NFR-IDEM-01). The tree JSON skipped that treatment. - The xlsx outputs (`canonical-documents.xlsx`, `canonical-persons.xlsx`) are gitignored, so adding `file`/`date_end` columns won't show up as a tracked diff — only the tree JSON will. - The issue says "Regenerate `out/` artifacts" as a step but doesn't say which are committed. Only the tree JSON is. ### Recommendations - **Pin `generated_at` to a fixed value (or drop it) in `persons_tree.py`** as part of this issue, matching the `_FIXED_TS` pattern in `writers.py`. Otherwise the committed tree JSON diff will be noise on every regen and you lose the ability to assert "regenerating produces no change." Cheapest version: set `generated_at` to the same fixed constant the xlsx writers use, or read it from a single shared source. - **Regenerate the committed tree JSON in the same commit as the `persons_tree.py` change**, so the tracked artifact and the code that produces it never drift. Reviewers can then `python persons_tree.py` and confirm a clean `git diff`. - No CI wiring needed beyond what exists — these tests run under the same pytest invocation as the rest of the tool. Keep running individual test files locally per the project rule; let CI do the full sweep. ### Open Decisions - **Keep `generated_at` at all?** It's the only thing making the committed artifact non-deterministic. Options: (a) pin it to a fixed constant — preserves the field, gives clean diffs; (b) remove it — simplest, but loses provenance of when the snapshot was made. Both fix the churn; (a) keeps a (now-meaningless) field, (b) is honest that the file is content-addressed, not time-stamped.
Author
Owner

Elicit — Requirements Engineer

The issue is well-structured (verb-noun title, file-level breakdown, Gherkin AC, explicit out-of-scope, dependency stated) — this is the spec-density bar I expect. My job is to hunt ambiguity, contradiction, and hidden assumptions before code starts. I found three.

Observations

  • AMBIGUITY — "RANGE dates carry an end day" silently assumes one range shape. The codebase produces RANGE for at least three different inputs: intra-month day ranges (10./11.…), year ranges (1881/82), and full hyphen ranges (8.1.1916 - 15.3.1916). The AC only specifies the day-range case. For the others, "the end day" is undefined — 1881/82 has no end day at all. The requirement is untestable for two of three shapes as written.
  • CONTRADICTION — the AC's own example doesn't satisfy the AC. 10./11.I.1917 (the literal value in the Given) parses to precision=UNKNOWN today, so date_iso = 1917-01-10 and date_precision = "RANGE" are both false for that input. The Given and the Then can't both hold. (The ASCII variant 10./11.1.1917 works.)
  • HIDDEN ASSUMPTION — "the same slug the register assigns … reconciles to register persons 1:1." This assumes the tree and register describe the same population via the same key. They are built from two different workbooks with two different dedup passes and two different name normalizers. A 1:1 reconciliation is an assumption, not a guarantee. (Markus details the mechanics.)
  • GOOD: "Confirmed already present (no change)" for the provisional column is exactly the kind of scope-trimming that prevents gold-plating. And the out-of-scope section correctly fences off backend work and the dead Briefwechsel feature.

Recommendations

  • Rewrite AC 2 to name the range shape and pin the empty cases. Suggested:

    Given a date cell 10./11.1.1917, when the canonical document is generated, then date_iso is 1917-01-10, date_precision is RANGE, and date_end is 1917-01-11.
    And given a DAY date (15.2.1888), date_end is empty.
    And given a YEAR date (1917), date_end is empty.
    Then add a separate, explicit decision line for year/hyphen ranges (see below).

  • Add the missing-datei AC to Gap 1: empty dateifile is an empty string (matches how the corpus's C-0001 row behaves).
  • Reframe Gap 3's "1:1" as a testable, honest statement: "the tree entry carries a person_id computed by the same slugify(last, first) rule as the register; ids that have no register counterpart are reported, not assumed away." That is verifiable; "reconciles 1:1" is an unverifiable hope.

Open Decisions

  • Scope of date_end across RANGE shapes. Must date_end be populated for year ranges (1881/82) and hyphen ranges (8.1.1916 - 15.3.1916), or only for intra-month day ranges? This drives both the parser work and what Phase 4 can render. It needs a product answer, not a code answer.
  • AC example correction. Fix the Roman-month 10./11.I.1917 example (either implement Roman ranges, or switch the AC to the ASCII form) — the two readings imply different amounts of work.
## Elicit — Requirements Engineer The issue is well-structured (verb-noun title, file-level breakdown, Gherkin AC, explicit out-of-scope, dependency stated) — this is the spec-density bar I expect. My job is to hunt ambiguity, contradiction, and hidden assumptions before code starts. I found three. ### Observations - **AMBIGUITY — "RANGE dates carry an end day" silently assumes one range shape.** The codebase produces `RANGE` for at least three different inputs: intra-month day ranges (`10./11.…`), year ranges (`1881/82`), and full hyphen ranges (`8.1.1916 - 15.3.1916`). The AC only specifies the day-range case. For the others, "the end day" is undefined — `1881/82` has no end *day* at all. The requirement is untestable for two of three shapes as written. - **CONTRADICTION — the AC's own example doesn't satisfy the AC.** `10./11.I.1917` (the literal value in the Given) parses to `precision=UNKNOWN` today, so `date_iso = 1917-01-10` and `date_precision = "RANGE"` are both false for that input. The Given and the Then can't both hold. (The ASCII variant `10./11.1.1917` works.) - **HIDDEN ASSUMPTION — "the same slug the register assigns … reconciles to register persons 1:1."** This assumes the tree and register describe the same population via the same key. They are built from two different workbooks with two different dedup passes and two different name normalizers. A 1:1 reconciliation is an assumption, not a guarantee. (Markus details the mechanics.) - **GOOD:** "Confirmed already present (no change)" for the `provisional` column is exactly the kind of scope-trimming that prevents gold-plating. And the out-of-scope section correctly fences off backend work and the dead Briefwechsel feature. ### Recommendations - **Rewrite AC 2 to name the range shape and pin the empty cases.** Suggested: > Given a date cell `10./11.1.1917`, when the canonical document is generated, then `date_iso` is `1917-01-10`, `date_precision` is `RANGE`, and `date_end` is `1917-01-11`. > And given a DAY date (`15.2.1888`), `date_end` is empty. > And given a YEAR date (`1917`), `date_end` is empty. Then add a separate, explicit decision line for year/hyphen ranges (see below). - **Add the missing-`datei` AC** to Gap 1: empty `datei` → `file` is an empty string (matches how the corpus's C-0001 row behaves). - **Reframe Gap 3's "1:1" as a testable, honest statement:** "the tree entry carries a `person_id` computed by the same `slugify(last, first)` rule as the register; ids that have no register counterpart are reported, not assumed away." That is verifiable; "reconciles 1:1" is an unverifiable hope. ### Open Decisions - **Scope of `date_end` across RANGE shapes.** Must `date_end` be populated for year ranges (`1881/82`) and hyphen ranges (`8.1.1916 - 15.3.1916`), or only for intra-month day ranges? This drives both the parser work and what Phase 4 can render. It needs a product answer, not a code answer. - **AC example correction.** Fix the Roman-month `10./11.I.1917` example (either implement Roman ranges, or switch the AC to the ASCII form) — the two readings imply different amounts of work.
Author
Owner

Leonie Voss — UX & Accessibility

No concerns from my angle. This is a Python normalizer that produces offline canonical exports (xlsx/JSON) consumed by later import phases — there is no UI, no route, no component, and no user-facing rendering in scope. I checked the issue's out-of-scope section (backend and Briefwechsel explicitly excluded) and the touched files (documents.py, dates.py, writers.py, persons_tree.py) — all data plumbing, nothing visual.

One forward-looking note for when this data does surface to users: the issue mentions Phase 4 RANGE rendering via meta_date_end. When that lands, the date-range display will need a human-readable format (e.g. "10.–11. Januar 1917") with the en-dash, and must stay legible at 16px+ for the 60+ audience. That's a Phase 4 concern, not this one — flagging only so the date_end field's eventual presentation isn't an afterthought.

## Leonie Voss — UX & Accessibility No concerns from my angle. This is a Python normalizer that produces offline canonical exports (xlsx/JSON) consumed by later import phases — there is no UI, no route, no component, and no user-facing rendering in scope. I checked the issue's out-of-scope section (backend and Briefwechsel explicitly excluded) and the touched files (`documents.py`, `dates.py`, `writers.py`, `persons_tree.py`) — all data plumbing, nothing visual. One forward-looking note for when this data does surface to users: the issue mentions Phase 4 RANGE rendering via `meta_date_end`. When that lands, the date-range display will need a human-readable format (e.g. "10.–11. Januar 1917") with the en-dash, and must stay legible at 16px+ for the 60+ audience. That's a Phase 4 concern, not this one — flagging only so the `date_end` field's eventual presentation isn't an afterthought.
Author
Owner

Decision Queue — Action Required

3 decisions need your input before implementation starts. Everything else in the persona comments is a concrete recommendation needing no sign-off.

Requirements / Parser scope

  • AC example 10./11.I.1917 does not parse as RANGE today — it returns precision=UNKNOWN because the range matcher only retries numeric/monthname forms, not Roman months (dates.py:217). The ASCII form 10./11.1.1917 works. Option A: extend _RANGE_DAY_RE's retry list to include _match_roman (small fix, more corpus coverage, keeps the AC as written). Option B: change the AC example to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). (Raised by: Felix, Sara, Elicit)

  • What date_end holds for non-day RANGE shapes. RANGE today also covers year ranges (1881/82) and full hyphen ranges (8.1.1916 - 15.3.1916), where "end day" is undefined. Option A: parse and emit a real end for all three shapes (most useful for Phase 4 rendering, more parsing work). Option B: populate date_end only for intra-month day ranges and leave year/hyphen ranges empty (matches the AC literally, less work, but Phase 4 can't render those ends). This is a product call about what Phase 4 needs to display. (Raised by: Felix, Markus, Elicit)

Architecture / Data contract

  • Tree ↔ register join guarantee. The issue promises person_id "reconciles to register persons 1:1," but the tree (Personendatei 2.xlsx) and register (DOCUMENT_WORKBOOK) are built from different sources with different dedup and different name normalizers — a 1:1 join is not guaranteed. Option A: emit slugify(last, first) as a best-effort key + a reconciliation report counting unmatched ids (cheap, honest; Phase 3 must handle misses as "tree-only persons"). Option B: make the register the single source of person identity and have the tree look up ids from the already-built register instead of re-slugging (more correct, couples the two scripts, more work). (Raised by: Markus, Elicit, Sara)

Noted, no decision required

  • generated_at churn in canonical-persons-tree.json (the one committed out/ artifact) — Tobias recommends pinning it like the xlsx _FIXED_TS so regen diffs are clean. Treat as a recommendation unless you want to keep the timestamp deliberately.
  • file path-shape — Nora recommends storing the basename only and flagging path-confinement for Phase 3. Recommendation, not a blocker.
## Decision Queue — Action Required _3 decisions need your input before implementation starts. Everything else in the persona comments is a concrete recommendation needing no sign-off._ ### Requirements / Parser scope - **AC example `10./11.I.1917` does not parse as RANGE today** — it returns `precision=UNKNOWN` because the range matcher only retries numeric/monthname forms, not Roman months (`dates.py:217`). The ASCII form `10./11.1.1917` works. **Option A:** extend `_RANGE_DAY_RE`'s retry list to include `_match_roman` (small fix, more corpus coverage, keeps the AC as written). **Option B:** change the AC example to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). _(Raised by: Felix, Sara, Elicit)_ - **What `date_end` holds for non-day RANGE shapes.** `RANGE` today also covers year ranges (`1881/82`) and full hyphen ranges (`8.1.1916 - 15.3.1916`), where "end day" is undefined. **Option A:** parse and emit a real end for all three shapes (most useful for Phase 4 rendering, more parsing work). **Option B:** populate `date_end` only for intra-month day ranges and leave year/hyphen ranges empty (matches the AC literally, less work, but Phase 4 can't render those ends). This is a product call about what Phase 4 needs to display. _(Raised by: Felix, Markus, Elicit)_ ### Architecture / Data contract - **Tree ↔ register join guarantee.** The issue promises `person_id` "reconciles to register persons 1:1," but the tree (`Personendatei 2.xlsx`) and register (`DOCUMENT_WORKBOOK`) are built from different sources with different dedup and different name normalizers — a 1:1 join is not guaranteed. **Option A:** emit `slugify(last, first)` as a best-effort key + a reconciliation report counting unmatched ids (cheap, honest; Phase 3 must handle misses as "tree-only persons"). **Option B:** make the register the single source of person identity and have the tree look up ids from the already-built register instead of re-slugging (more correct, couples the two scripts, more work). _(Raised by: Markus, Elicit, Sara)_ ### Noted, no decision required - **`generated_at` churn** in `canonical-persons-tree.json` (the one committed `out/` artifact) — Tobias recommends pinning it like the xlsx `_FIXED_TS` so regen diffs are clean. Treat as a recommendation unless you want to keep the timestamp deliberately. - **`file` path-shape** — Nora recommends storing the basename only and flagging path-confinement for Phase 3. Recommendation, not a blocker.
Sign in to join this conversation.
No Label P0-critical feature
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#670