Files
familienarchiv/docs/import-migration/WORKLOG.md
Marcel 6f7aa643c9 docs(import): add normalizer implementation plan + apply persona review
17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:55:50 +02:00

87 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Import Migration — Worklog
Running log of each working session. **Resume here.** Newest entry on top.
---
## 2026-05-25 (session 3) — Implementation plan + persona review
**Did:**
- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date
parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops;
ui-expert too) via parallel agents. Acted on all material findings.
**Key fixes from review (see plan §"Review feedback incorporated"):**
- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01);
pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
pinned deps + hardened root `.gitignore`.
**Next:**
- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser
(Task 3/8 + Easter computus) is the meatiest piece.
---
## 2026-05-25 (session 2) — Strategy + normalizer spec
**Did:**
- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
precision; include person register + dedup in this effort; overrides-file + re-run loop;
Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans **18731957**; 2-digit `0057`→19YY,
`7399`→18YY, `5872`→flag; 3-digit→1DDD; never 20xx.
- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …)
**computed per year from Easter**, never a fixed month; seasons → mid-season month
(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows
**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable).
OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge.
**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left
uncommitted, not ours.)
**Next:**
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1B7 are
the Musts; B3 date parser incl. Easter computus is the big one).
---
## 2026-05-25 (session 1) — Initial analysis
**Did:**
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.
**Key facts established:**
- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).
**Decisions pending from Marcel (blockers for any code work):**
1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy?
3. IMP-04/05: format for the person/alias mapping; import persons before documents?
4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
**Next:**
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.