# Import Migration — Worklog

Running log of each working session. **Resume here.** Newest entry on top.

---
## 2026-05-25 (session 3) — Implementation plan + persona review
**Did:**
- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
  bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up: date
  parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops),
  plus ui-expert, via parallel agents. Acted on all material findings.

**Key fixes from review (see plan §"Review feedback incorporated"):**
- Idempotency redefined from byte-identical to **content-deterministic** (spec G4/NFR-IDEM-01):
  pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: the duplicate-index check reported only the repeats → now flags/reports every
  occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN; intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
  pinned deps + hardened root `.gitignore`.
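
Two of the fixes above are easy to picture in code. The following is a minimal sketch, not the plan's actual implementation: the function names `flag_duplicates` and `defang_cell` are hypothetical, and the apostrophe-prefix approach to CSV injection is an assumption (the plan only says review files are "defanged").

```python
from collections import Counter

# Leading characters that spreadsheet apps may interpret as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def flag_duplicates(indices):
    """Return one flag per row: True for EVERY row whose index occurs more than once.

    Counting first and flagging second is what makes the first occurrence
    show up too, instead of only the later repeats.
    """
    counts = Counter(indices)
    return [counts[i] > 1 for i in indices]


def defang_cell(value):
    """Prefix an apostrophe so a review-file cell is read as text, not a formula."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


print(flag_duplicates(["A1", "B2", "A1", "C3"]))  # [True, False, True, False]
print(defang_cell("=SUM(A1)"))                    # '=SUM(A1)
```

A single-pass "have I seen this index before?" check is the shape of bug described above: it can only ever flag the second and later rows.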

**Next:**
- Marcel reviews the plan. Then execute it (subagent-driven or inline); the date parser
  (Task 3/8 + Easter computus) is the meatiest piece.

---
## 2026-05-25 (session 2) — Strategy + normalizer spec
**Did:**
- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
  leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
  precision; include person register + dedup in this effort; overrides-file + re-run loop;
  Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY,
  `73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
  persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).

**All 6 open questions resolved (spec §9):**
- OQ-01: movable feasts (Ostern, Pfingsten, …) **computed per year from Easter**, never a
  fixed month; seasons → mid-season month (Sommer=Jul, Herbst=Oct).
- OQ-02: ranges → start+RANGE.
- OQ-03: slug ids.
- OQ-04: `x`-suffix rows **skipped + logged** this pass (they're transcriptions of the base
  letter, not yet mappable).
- OQ-05: `.xlsx`.
- OQ-06: conservative, no silent merge.
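
The century rule and the OQ-01 decision translate fairly directly into code. The sketch below is illustrative only: `expand_year`, `easter_sunday`, and `pfingsten` are hypothetical names, and the computus shown is the well-known anonymous Gregorian (Meeus/Jones/Butcher) algorithm, which may or may not be the variant the plan ends up using.

```python
from datetime import date, timedelta

# Season -> mid-season month; only the two mappings named in this log are listed.
SEASON_MONTH = {"Sommer": 7, "Herbst": 10}


def expand_year(digits):
    """Expand a 2- or 3-digit year string; None means 'ambiguous, flag for review'."""
    n = int(digits)
    if len(digits) == 3:
        return 1000 + n          # 3-digit -> 1DDD
    if len(digits) == 2:
        if n <= 57:
            return 1900 + n      # 00-57 -> 19YY
        if n >= 73:
            return 1800 + n      # 73-99 -> 18YY
        return None              # 58-72 -> flag, never guess
    raise ValueError(f"expected 2 or 3 digits, got {digits!r}")


def easter_sunday(year):
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    w = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * w) // 451
    month, day = divmod(h + w - 7 * m + 114, 31)
    return date(year, month, day + 1)


def pfingsten(year):
    """Whit Sunday (Pfingstsonntag) falls 49 days after Easter Sunday."""
    return easter_sunday(year) + timedelta(days=49)


print(expand_year("23"), expand_year("73"), expand_year("65"))  # 1923 1873 None
print(easter_sunday(1923), pfingsten(1923))  # 1923-04-01 1923-05-20
```

Because no branch adds 2000, the "never 20xx" constraint holds by construction rather than by a separate check.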

**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise, left
uncommitted, not ours.)

**Next:**
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are
  the Musts; B3 date parser incl. Easter computus is the big one).

---
## 2026-05-25 (session 1) — Initial analysis
**Did:**
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
  with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.

**Key facts established:**
- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).

**Decisions pending from Marcel (blockers for any code work):**
1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
2. IMP-02: how to store imprecise dates (new `dateOriginal` + `precision` columns, or lossy)?
3. IMP-04/05: format for the person/alias mapping; import persons before documents?
4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
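
To make the non-lossy option in decision 2 concrete, here is a rough sketch of a record that keeps the raw cell text alongside the parsed value and a precision marker. All names are hypothetical: `ArchiveDate`, its fields, and the `DAY`/`MONTH`/`YEAR` levels are assumptions for illustration; only `RANGE` and `UNKNOWN` are named in this log.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Precision(Enum):
    DAY = "DAY"          # hypothetical level
    MONTH = "MONTH"      # hypothetical level
    YEAR = "YEAR"        # hypothetical level
    RANGE = "RANGE"      # parsed value holds the start of the range
    UNKNOWN = "UNKNOWN"  # nothing could be parsed; raw text is all we have


@dataclass(frozen=True)
class ArchiveDate:
    raw: str                 # original cell text, never discarded
    parsed: Optional[date]   # best-effort parsed date, or None
    precision: Precision


def unknown(raw):
    """Build the fallback record for text that cannot be parsed as a date."""
    return ArchiveDate(raw=raw, parsed=None, precision=Precision.UNKNOWN)


# An intra-month day range keeps its start date and is marked as a range.
day_range = ArchiveDate("7./8. Sept.1923", date(1923, 9, 7), Precision.RANGE)
print(day_range.parsed, day_range.precision.value)  # 1923-09-07 RANGE
```

Keeping `raw` on every record means a parser mistake can be corrected later without going back to the source spreadsheet.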

**Next:**
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.