17-task TDD plan for tools/import-normalizer/. Incorporates inline 6-persona review: content-deterministic idempotency, duplicate-index fix, provisional-id collision guard, date-parser edge cases, multi-sender split, CSV-injection defang, pinned deps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Import Migration — Worklog
Running log of each working session. Resume here. Newest entry on top.
2026-05-25 (session 3) — Implementation plan + persona review
Did:
- Wrote `03-normalizer-implementation-plan.md`: 17 bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up: date parser with Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops; ui-expert too) via parallel agents. Acted on all material findings.
Key fixes from review (see plan §"Review feedback incorporated"):
- Idempotency redefined byte-identical → content-deterministic (spec G4/NFR-IDEM-01); pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files; pinned deps + hardened root `.gitignore`.
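Two of the fixes above are easy to illustrate. The following is a minimal sketch, not the plan's actual code: the function names and the prefix list are assumptions. It shows flagging every occurrence of a duplicated index (rather than only the repeats) and defanging cells that a spreadsheet application might interpret as formulas.

```python
from collections import Counter

# Leading characters that spreadsheet apps may treat as the start of a formula.
# The exact list is an assumption, following common CSV-injection guidance.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def flag_duplicate_indices(indices):
    """Return one boolean per row: True if that row's index occurs more than once.

    Counting first and flagging second marks *every* occurrence,
    including the first one, not just the later repeats.
    """
    counts = Counter(indices)
    return [counts[index] > 1 for index in indices]


def defang_cell(value):
    """Prefix a risky cell with an apostrophe so it is read as plain text."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value
```

For example, `flag_duplicate_indices(["A1", "B2", "A1"])` marks both `A1` rows, so the review file lists the first occurrence alongside its repeat.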
Next:
- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser (Task 3/8 + Easter computus) is the meatiest piece.
2026-05-25 (session 2) — Strategy + normalizer spec
Did:
- Decided strategy with Marcel: normalize the raw sheets first, then import (higher leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw + `precision`; include person register + dedup in this effort; overrides-file + re-run loop; Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans 1873–1957; 2-digit `00–57`→19YY, `73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
- Wrote `02-normalization-spec.md` in the requirements-engineer persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
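The century rule above can be sketched as a small pure function. This is an illustration only: the name `expand_year`, the `(year, needs_review)` return shape, and the pass-through handling of 4-digit years are assumptions, not taken from the spec.

```python
from typing import Optional, Tuple


def expand_year(token: str) -> Tuple[Optional[int], bool]:
    """Expand a 2-, 3- or 4-digit year token to a full year.

    Returns (year, needs_review). The archive spans 1873-1957, so a
    2-digit year never maps to 20xx.
    """
    if not token.isdigit():
        return None, True
    value = int(token)
    if len(token) == 2:
        if value <= 57:            # 00-57 -> 19YY
            return 1900 + value, False
        if value >= 73:            # 73-99 -> 18YY
            return 1800 + value, False
        return None, True          # 58-72 is ambiguous -> flag for review
    if len(token) == 3:            # 3-digit -> 1DDD, e.g. "923" -> 1923
        return 1000 + value, False
    if len(token) == 4:            # assumption: full years pass through unchanged
        return value, False
    return None, True
```

Returning a flag instead of guessing keeps the ambiguous `58–72` band visible for the overrides-file + re-run loop.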
All 6 open questions resolved (spec §9): OQ-01 — movable feasts (Ostern, Pfingsten, …)
computed per year from Easter, never a fixed month; seasons → mid-season month
(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — x-suffix rows
skipped + logged this pass (they're transcriptions of the base letter, not yet mappable).
OQ-05 → .xlsx. OQ-06 → conservative, no silent merge.
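As a sketch of OQ-01, movable feasts can be derived from Easter Sunday, computed here with the well-known anonymous Gregorian (Meeus/Jones/Butcher) algorithm. The function and table names are hypothetical, and only the feasts and seasons named above are included; the spec's full list is not reproduced here.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday (Pfingsten = Pentecost Sunday, 49 days later).
FEAST_OFFSETS = {"Ostern": 0, "Pfingsten": 49}

# Season -> mid-season month, per the decision above (only the two named seasons).
SEASON_MONTH = {"Sommer": 7, "Herbst": 10}


def resolve_feast(name: str, year: int) -> date:
    """Resolve a movable feast to a concrete date in the given year."""
    return easter_sunday(year) + timedelta(days=FEAST_OFFSETS[name])
```

Because the feast date is recomputed per year, "Pfingsten 1923" and "Pfingsten 1943" land in different weeks, which a fixed-month mapping would get wrong.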
Git: moved off the unrelated feat/issue-356-… branch; pulled main; created clean
branch docs/import-migration and committed these docs there. (The dirty .venv
`__pycache__` + skills/implement/SKILL.md in the tree are pre-existing/environmental noise; left
uncommitted, not ours.)
Next:
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are the Musts; B3 date parser incl. Easter computus is the big one).
2026-05-25 (session 1) — Initial analysis
Did:
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote `01-findings-spreadsheet-analysis.md` with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.
Key facts established:
- Importer defaults match the ODS, not the new xlsx → wrong column mapping (IMP-01).
- 90% of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but unimported; holds the maiden-name dedup key (IMP-04/05).
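To make IMP-02 concrete, here is a minimal Python stand-in for an ISO-only date parser. It is an assumption for illustration, not the Java importer's code: anything that is not an ISO date yields nothing, which is how free-text values get dropped. The sample `Ostern 1923` is made up; `7./8. Sept.1923` is the day-range form seen in the sheet.

```python
from datetime import date


def parse_iso_only(raw: str):
    """Return a date for ISO input (YYYY-MM-DD); None for everything else."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        return None


samples = ["1923-09-07", "7./8. Sept.1923", "Ostern 1923"]
dropped = [s for s in samples if parse_iso_only(s) is None]
```

Here `dropped` holds the two free-text values, mirroring what happens to the 6,571 free-text rows.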
Decisions pending from Marcel (blockers for any code work):
- IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
- IMP-02: how to store imprecise dates: new `dateOriginal` + `precision` columns, or lossy?
- IMP-04/05: format for the person/alias mapping; import persons before documents?
- IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
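For the IMP-01 decision, the header-driven option could look roughly like the sketch below. Columns are located by header text instead of fixed position, so a reordered sheet still maps correctly. The function name and the header labels in the usage example are hypothetical; the real sheet's headers are not listed in this log.

```python
def build_column_map(header_row, wanted):
    """Map logical field names to column positions by matching header text.

    header_row: the first row of the sheet as a sequence of cell values.
    wanted:     {field_name: expected_header_text}.
    Returns (mapping, missing), where missing lists headers not found.
    """
    seen = {}
    for idx, cell in enumerate(header_row):
        if cell is None:
            continue  # skip empty header cells
        # Case- and whitespace-insensitive match; the first occurrence wins.
        seen.setdefault(str(cell).strip().lower(), idx)

    mapping, missing = {}, []
    for field, header in wanted.items():
        key = header.strip().lower()
        if key in seen:
            mapping[field] = seen[key]
        else:
            missing.append(header)
    return mapping, missing


# Hypothetical headers, for illustration only.
header = ["Index", None, "Datum", " Absender "]
wanted = {"index": "Index", "date": "Datum", "sender": "Absender", "receiver": "Empfänger"}
mapping, missing = build_column_map(header, wanted)
```

With openpyxl, the header row could come from the first tuple yielded by `ws.iter_rows(values_only=True)`. Reporting `missing` rather than silently defaulting makes a layout mismatch like IMP-01 fail loudly.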
Next:
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.