As the archive owner I want the normalizer's canonical exports to carry every field the importer needs (file, date-range end, person ids in the tree) so the loaders have a complete, joinable contract #670
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Phase 1 of the "Handling the Unknowns" milestone — build this first
The new modular importer (Phase 3) loads the normalizer's canonical exports in
tools/import-normalizer/out/instead of re-parsing the raw spreadsheet. Three gaps in those exports would block downstream phases, so close them before any backend work. This is Python-only — no backend dependency — and unblocks Phases 2/3.The three gaps (verified against the normalizer source)
Gap 1 — the document export drops the file name
RawRow.fileis read from the "datei" column (documents.py:20,_FIELDSline 50) butCanonicalDocumenthas nofilefield —to_canonicalusesraw.fileonly for theindex_file_mismatchflag (documents.py:108) then discards it, andDOC_COLUMNS(writers.py:25-27) has nofile. The importer therefore can't deterministically locate the PDF; it must guess fromindex.file: str = ""toCanonicalDocument(~line 33); passfile=raw.filein theto_canonical(...)return (~line 111); add"file"toDOC_COLUMNS.Gap 2 — RANGE dates lose the end day
For a range like
10./11.1.1917, the normalizer resolves a representative start day intodate_isobut discards the end day, andDOC_COLUMNShas nodate_end. This makes Phase 4's RANGE rendering (meta_date_end) impossible to populate.dates.pyrange matching and surface it; add adate_endfield toCanonicalDocument+ adate_endcolumn toDOC_COLUMNS.date_endis populated only forRANGEprecision, empty otherwise.10./11.I.1917) currently fall through toUNKNOWNbecause the range matcher only handles numeric / month-name forms (dates.py~line 217). Wiring the Roman-month range form is in scope for this issue, so the RANGE data exists for Phase 4's rendering.Gap 3 — the persons-tree JSON has no joinable id
canonical-persons.xlsxkeys persons byperson_id(slug);canonical-persons-tree.jsonkeys persons byrowId(row_002, …) with noperson_id. Phase 3's PersonTreeImporter has nothing to join the relationship tree onto the persons loaded from the register.person_id. The id is assigned exactly once by the register —canonical-persons.xlsxis the authority — and that exact id must be propagated into the tree JSON (persons_tree.py), never recomputed. Re-slugifying in the tree builder would NOT reliably match the register id, because the tree JSON and the register xlsx are built from different inputs / dedup. Propagating the register's id verbatim is what makes the tree reconcile to register persons 1:1.Confirmed already present (no change)
canonical-persons.xlsxalready includes aprovisionalcolumn (PERSON_COLUMNS,writers.py:31). Downstream phases consume it as-is.Acceptance criteria
File-level breakdown
documents.py—CanonicalDocument(+file, +date_end),to_canonical(pass both), checkRawRow.dates.py— range matching surfaces the end day alongside the representativeiso.writers.py— addfileanddate_endtoDOC_COLUMNS.persons_tree.py— propagate the register'sperson_idper tree person. Also note: the committedcanonical-persons-tree.jsoncurrently stampsgenerated_atwithdatetime.now(), which churns the file on every run — use a fixed/deterministic timestamp (as the xlsx writer already does with its_FIXED_TS) so committed artifacts are reproducible.tests/— extendtest_normalize.py(and date tests) with the cases above.out/artifacts. The regeneratedout/artifacts are now committed to the repo per the milestone decision (#669) — update.gitignoreto stop excludingout/*.xlsx.Out of scope
Dependency
None (Python only). Blocks Phase 2 (schema) and Phase 3 (importer).
Markus Keller — Senior Application Architect
I read
documents.py,dates.py,writers.py,persons.py, andpersons_tree.pyto ground this. The three gaps are real and the fix direction is sound. My concern is the contract Gap 3 promises versus what the code can actually deliver.Observations
persons.parse_register()readsDOCUMENT_WORKBOOK(config.py:9), slugs viaslugify(last, first)(persons.py:75), and resolves id collisions with a-Nsuffix (persons.py:85-90).persons_tree.pyreadsPERSON_WORKBOOK=Personendatei 2.xlsx(config.py:11), keys byrow_NNN, and runs its own_deduplicate()(name+birthYear) atpersons_tree.py:178. Names normalize through_norm_tree(), which is a different function thanslugify.slugify(lastName, firstName)inside the tree does not guarantee a matching id exists in the register, and does not reproduce the register's-Ncollision suffixes (the tree has no visibility into register collisions). A person present only inPersonendatei 2.xlsxwill carry aperson_idthat joins to nothing.file) and Gap 2 (date_end) are clean, additive column changes — low architectural risk.Recommendations
person_idin the tree is a best-effort join key, not a guaranteed foreign key. The honest contract is: emitslugify(lastName, firstName)asperson_id, and accept that Phase 3'sPersonTreeImportermust treat an unmatched id as "tree-only person" rather than assuming referential integrity. Encode that expectation in the issue so Phase 3 isn't built on a false assumption.out/, count how many treeperson_ids have no register counterpart and emit it to a review file (same pattern as the existingreview/*.csv). This surfaces the join-gap as data, not a Phase-3 surprise.date_endpopulated only forRANGE(as the AC states) — a sparse column with a clear single-precision contract is easier to reason about downstream than a nullable-everywhere field.Open Decisions
Felix Brandt — Senior Fullstack Developer
I traced all four files and ran the AC's own date example. There's a blocker hiding in the Gap 2 acceptance criterion. TDD red-first on every gap.
Observations
_RANGE_DAY_RE(dates.py:180) reconstructs"10." + "I.1917"and only retries_match_numericand_match_monthname_a(dates.py:217) — not_match_roman. So Roman-numeral month ranges fall through to UNKNOWN. Implementing the AC verbatim first requires adding_match_romanto that retry list. The issue doesn't mention this — it's unscoped work.date_endripples through the matcher contract. Every matcher in_MATCHERS(dates.py:231) returns a 2-tuple(iso, precision), consumed atdates.py:254-259, andParsedDateis a frozen dataclass with no end field. Addingdate_endmeans either (a) widening every matcher to a 3-tuple, or (b) adding a 4th field toParsedDateand only the range matcher sets it. Option (b) is the smaller, safer change —ParsedDate(iso, precision, raw, iso_end=None).file=intoRawRow(test_documents.py:49). Just thread it throughto_canonicalandDOC_COLUMNS.to_canonicalis already at the size limit for one function; addingdate_end=pd.iso_end or ""is fine, but don't grow the flag-collection block further without extracting.Recommendations
test_documents.py: assertdoc.file == "H-0730.pdf"and"file" in DOC_COLUMNS.test_dates.py: assertparse_date("10./11.1.1917")yieldsiso_end == "1917-01-11"and add the Roman case10./11.I.1917— watch it go red, then fix_RANGE_DAY_RE's retry list to include_match_roman. Also pin the non-RANGE contract:parse_date("15.2.1888").iso_end is None.test_persons_tree.py: assert each emitted person dict hasperson_id == slugify(lastName, firstName).iso_endtoParsedDate, not a 3-tuple. Keeps all eight other matchers untouched. Only_match_rangeandparse_datechange. Thendocuments.pymapspd.iso_end or ""intoCanonicalDocument.date_end.date_endfor the year/hyphen ranges too, or explicitly null them.1881/82and8.1.1916 - 15.3.1916are alsoRANGEtoday. Decide whetherdate_endis the end year-start (1882-01-01), the parsed end day (1916-03-15), or empty for these. A test must pin whichever you pick — otherwise the column is silently inconsistent.Open Decisions
date_endholds for non-day ranges.RANGEtoday covers three shapes: intra-month day ranges (10./11.), year ranges (1881/82), and full hyphen ranges (8.1.1916 - 15.3.1916). The AC only specifies the day-range case. Options: parse and emit a real end for all three (most useful for Phase 4 rendering, more parsing work) vs. emitdate_endonly for the day-range shape and leave the others empty (matches the AC literally, less work, but Phase 4 can't render year/hyphen range ends). This is a product call about what Phase 4 needs to display.Sara Holt — Senior QA Engineer
I verified the acceptance criteria against the actual parser. One AC scenario fails on its own example today, and the other two need negative cases added.
Observations
dates.py:217retries numeric/monthname only). So a literal Gherkin test of the AC fails atdate_iso == 1917-01-10before ever reaching thedate_endassertion. Either fix the Roman case as part of this issue, or change the AC example to an ASCII form (10./11.1.1917), which does produce RANGE. Right now the AC is not implementable as written.file) has no negative case. The scenario only checks a populateddatei. The corpus has rows with emptyDatei(seetest_normalize.py:13, C-0001 has no file). Need: emptydatei→filecolumn is empty string, notNone/missing.date_endmust be empty for DAY, MONTH, SEASON, YEAR, APPROX, and UNKNOWN — not just "any non-RANGE row." That's six precisions; at minimum test DAY and YEAR.person_id) has no collision case.slugifycollisions get-Nsuffixes in the register (persons.py:85-90), but the tree's_deduplicateis independent. Two tree persons that slug identically need a defined, tested outcome.test_normalize.py:52-63asserts byte-identical reruns of the xlsx outputs. Newfile/date_endcolumns must preserve that. Separately,out/canonical-persons-tree.jsonstampsgenerated_at = datetime.now()(persons_tree.py:390) — regenerating it for this issue produces a non-reproducible diff. CI can't assert determinism on that file as-is.Recommendations
test_documents.py:filepopulated ANDfileempty-string whendateiblank.test_dates.py:iso_endfor10./11.1.1917; the Roman10./11.I.1917case (expected to drive the matcher fix);iso_end is Nonefor a DAY date and a YEAR date.test_persons_tree.py:person_id == slugify(...)for a normal person; defined behavior for two identically-slugging persons.test_writers.py:"file"and"date_end"present inDOC_COLUMNSin the expected positions.test_dates.py,test_documents.py,test_writers.py,test_persons_tree.py,test_normalize.py) per the project rule — never the whole suite locally.Open Decisions
10./11.I.1917doesn't parse as RANGE today. Option A: extend_RANGE_DAY_RE's retry to include_match_roman(more corpus coverage, in scope of "RANGE dates carry an end day," small fix). Option B: change the AC to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). Pick one so the test author has an unambiguous target.Nora Steiner ("NullX") — Application Security Engineer
This is an offline Python ETL tool with no network, no auth surface, and no user-supplied input at runtime — the spreadsheet is trusted family data the owner controls. So the OWASP-style attack surface is essentially nil. Two data-integrity / output-safety notes still apply.
Observations
_csv_safe()(writers.py:19-22), which neutralizes leading=,+,-,@. Good. Thefilecolumn from thedateicell will flow into the xlsx outputs (canonical-documents.xlsx), not the CSVs — and_csv_safeis only applied to CSV writes, not xlsx. Adateivalue like=cmd|...would land raw in the xlsx. Low risk (data is owner-controlled, downstream consumer is a programmatic importer, not a human double-clicking), but worth a conscious note rather than a silent gap.fileis later used to locate a PDF on disk (Phase 3). The issue says the importer "can't deterministically locate the PDF; it must guess fromindex." When Phase 3 consumes this column, thefilevalue becomes a path. Corpus values are Windows paths like..\__scan\W-0001.pdf(test_documents.py:8). That is path-traversal-shaped input. Phase 1 only carries the string, so nothing to fix here — but flag it loudly for Phase 3: the loader must basename-normalize and confine to the scan directory, neveropen(file)directly.Recommendations
fileverbatim in Phase 1 (the issue is right to keep it as-is for the join), but add a one-line note to the issue / Phase 3 ticket thatfileis untrusted-shaped path data requiring basename + directory-confinement when it is eventually resolved to a filesystem read. Codify that as a test when Phase 3 lands (path-traversal..\..\etc\passwd.pdf→ rejected).file(H-0730.pdf), not the full..\__scan\...path. The importer joins on filename anyway, and stripping the directory removes the traversal shape at the source. This also makesindex_file_mismatchand the new column agree on representation.Open Decisions
None — no live attack surface in this phase. The path-handling item is a Phase 3 control, noted above for traceability.
Tobias Wendt — DevOps & Platform Engineer
This is a local developer tool (Python venv, run by hand to regenerate
out/), so there's no Compose/CI infra to size. My angle is reproducibility of the committed artifacts and how this runs in CI.Observations
out/canonical-persons-tree.jsonis the oneout/artifact tracked in git —.gitignorehasout/then!out/canonical-persons-tree.json, and a recent commit (2e59c0ef) un-ignored it intentionally. Regenerating it for Gap 3 will produce a non-reproducible diff becausepersons_tree.py:390writes"generated_at": datetime.datetime.now().isoformat(). Every regen churns that timestamp even when the person data is unchanged. The xlsx writers already solved this — they pin_FIXED_TS = 2020-01-01(writers.py:10) precisely so reruns are content-deterministic (NFR-IDEM-01). The tree JSON skipped that treatment.canonical-documents.xlsx,canonical-persons.xlsx) are gitignored, so addingfile/date_endcolumns won't show up as a tracked diff — only the tree JSON will.out/artifacts" as a step but doesn't say which are committed. Only the tree JSON is.Recommendations
generated_atto a fixed value (or drop it) inpersons_tree.pyas part of this issue, matching the_FIXED_TSpattern inwriters.py. Otherwise the committed tree JSON diff will be noise on every regen and you lose the ability to assert "regenerating produces no change." Cheapest version: setgenerated_atto the same fixed constant the xlsx writers use, or read it from a single shared source.persons_tree.pychange, so the tracked artifact and the code that produces it never drift. Reviewers can thenpython persons_tree.pyand confirm a cleangit diff.Open Decisions
generated_atat all? It's the only thing making the committed artifact non-deterministic. Options: (a) pin it to a fixed constant — preserves the field, gives clean diffs; (b) remove it — simplest, but loses provenance of when the snapshot was made. Both fix the churn; (a) keeps a (now-meaningless) field, (b) is honest that the file is content-addressed, not time-stamped.Elicit — Requirements Engineer
The issue is well-structured (verb-noun title, file-level breakdown, Gherkin AC, explicit out-of-scope, dependency stated) — this is the spec-density bar I expect. My job is to hunt ambiguity, contradiction, and hidden assumptions before code starts. I found three.
Observations
RANGEfor at least three different inputs: intra-month day ranges (10./11.…), year ranges (1881/82), and full hyphen ranges (8.1.1916 - 15.3.1916). The AC only specifies the day-range case. For the others, "the end day" is undefined —1881/82has no end day at all. The requirement is untestable for two of three shapes as written.10./11.I.1917(the literal value in the Given) parses toprecision=UNKNOWNtoday, sodate_iso = 1917-01-10anddate_precision = "RANGE"are both false for that input. The Given and the Then can't both hold. (The ASCII variant10./11.1.1917works.)provisionalcolumn is exactly the kind of scope-trimming that prevents gold-plating. And the out-of-scope section correctly fences off backend work and the dead Briefwechsel feature.Recommendations
dateiAC to Gap 1: emptydatei→fileis an empty string (matches how the corpus's C-0001 row behaves).person_idcomputed by the sameslugify(last, first)rule as the register; ids that have no register counterpart are reported, not assumed away." That is verifiable; "reconciles 1:1" is an unverifiable hope.Open Decisions
date_endacross RANGE shapes. Mustdate_endbe populated for year ranges (1881/82) and hyphen ranges (8.1.1916 - 15.3.1916), or only for intra-month day ranges? This drives both the parser work and what Phase 4 can render. It needs a product answer, not a code answer.10./11.I.1917example (either implement Roman ranges, or switch the AC to the ASCII form) — the two readings imply different amounts of work.Leonie Voss — UX & Accessibility
No concerns from my angle. This is a Python normalizer that produces offline canonical exports (xlsx/JSON) consumed by later import phases — there is no UI, no route, no component, and no user-facing rendering in scope. I checked the issue's out-of-scope section (backend and Briefwechsel explicitly excluded) and the touched files (
documents.py,dates.py,writers.py,persons_tree.py) — all data plumbing, nothing visual.One forward-looking note for when this data does surface to users: the issue mentions Phase 4 RANGE rendering via
meta_date_end. When that lands, the date-range display will need a human-readable format (e.g. "10.–11. Januar 1917") with the en-dash, and must stay legible at 16px+ for the 60+ audience. That's a Phase 4 concern, not this one — flagging only so thedate_endfield's eventual presentation isn't an afterthought.Decision Queue — Action Required
3 decisions need your input before implementation starts. Everything else in the persona comments is a concrete recommendation needing no sign-off.
Requirements / Parser scope
AC example
10./11.I.1917does not parse as RANGE today — it returnsprecision=UNKNOWNbecause the range matcher only retries numeric/monthname forms, not Roman months (dates.py:217). The ASCII form10./11.1.1917works. Option A: extend_RANGE_DAY_RE's retry list to include_match_roman(small fix, more corpus coverage, keeps the AC as written). Option B: change the AC example to the ASCII form and defer Roman ranges (less work now, but the literal corpus value stays UNKNOWN). (Raised by: Felix, Sara, Elicit)What
date_endholds for non-day RANGE shapes.RANGEtoday also covers year ranges (1881/82) and full hyphen ranges (8.1.1916 - 15.3.1916), where "end day" is undefined. Option A: parse and emit a real end for all three shapes (most useful for Phase 4 rendering, more parsing work). Option B: populatedate_endonly for intra-month day ranges and leave year/hyphen ranges empty (matches the AC literally, less work, but Phase 4 can't render those ends). This is a product call about what Phase 4 needs to display. (Raised by: Felix, Markus, Elicit)Architecture / Data contract
person_id"reconciles to register persons 1:1," but the tree (Personendatei 2.xlsx) and register (DOCUMENT_WORKBOOK) are built from different sources with different dedup and different name normalizers — a 1:1 join is not guaranteed. Option A: emitslugify(last, first)as a best-effort key + a reconciliation report counting unmatched ids (cheap, honest; Phase 3 must handle misses as "tree-only persons"). Option B: make the register the single source of person identity and have the tree look up ids from the already-built register instead of re-slugging (more correct, couples the two scripts, more work). (Raised by: Markus, Elicit, Sara)Noted, no decision required
generated_atchurn incanonical-persons-tree.json(the one committedout/artifact) — Tobias recommends pinning it like the xlsx_FIXED_TSso regen diffs are clean. Treat as a recommendation unless you want to keep the timestamp deliberately.filepath-shape — Nora recommends storing the basename only and flagging path-confinement for Phase 3. Recommendation, not a blocker.