docs(import): add import-migration analysis + normalizer spec

Document the raw archive spreadsheet findings (IMP-01..12) and a requirements spec for an offline normalizer that produces a clean canonical dataset before import. Local docs only; no Gitea issue yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:32:37 +02:00
parent 8e9e3bba06
commit adfff420a5
4 changed files with 821 additions and 0 deletions
--- a/docs/import-migration/WORKLOG.md
+++ b/docs/import-migration/WORKLOG.md
@@ -0,0 +1,62 @@
+# Import Migration — Worklog
+
+Running log of each working session. **Resume here.** Newest entry on top.
+
+---
+
+## 2026-05-25 (session 2) — Strategy + normalizer spec
+
+**Did:**
+- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
+  leverage than making the Java importer tolerate every mess).
+- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
+  precision; include person register + dedup in this effort; overrides-file + re-run loop;
+  Python tool at `tools/import-normalizer/`.
+- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY,
+  `73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
+- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
+  persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
+
+**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …)
+**computed per year from Easter**, never a fixed month; seasons → mid-season month
+(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows
+**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable).
+OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge.
+
+**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
+branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
+pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left
+uncommitted, not ours.)
+
+**Next:**
+- Marcel reviews the spec.
+- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are
+  the Musts; B3 date parser incl. Easter computus is the big one).
+
+---
+
+## 2026-05-25 (session 1) — Initial analysis
+
+**Did:**
+- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
+- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
+- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
+- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
+  with 12 issues (IMP-01..IMP-12) + recommended sequencing.
+- Installed `openpyxl` into the OCR service venv for inspection.
+
+**Key facts established:**
+- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
+- **~90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
+- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).
+
+**Decisions pending from Marcel (blockers for any code work):**
+1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
+2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy?
+3. IMP-04/05: format for the person/alias mapping; import persons before documents?
+4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
+
+**Next:**
+- Get Marcel's calls on the 4 decisions above.
+- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
+- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.
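
Note on the session 2 date rules: the century rule (2-digit `00–57`→19YY, `73–99`→18YY, `58–72`→flag, 3-digit→1DDD, never 20xx) and the "computed per year from Easter" decision for movable feasts can be sketched in a few lines of Python. This is a minimal sketch, not the normalizer itself. The names `expand_year`, `easter_sunday`, `movable_feast` and `FEAST_OFFSETS` are hypothetical, as is the `(year, needs_review)` return shape. Two further assumptions are not stated in the worklog: 4-digit years pass through unchanged, and any result outside 1873–1957 is flagged rather than guessed. Easter is computed with the standard anonymous Gregorian computus.

```python
from datetime import date, timedelta

# Archive span from the worklog; results outside it are flagged (assumption).
ARCHIVE_SPAN = (1873, 1957)

# Day offsets from Easter Sunday. Only the two feasts named in the worklog.
FEAST_OFFSETS = {"Ostern": 0, "Pfingsten": 49}


def expand_year(raw):
    """Return (year, needs_review) for a raw 2-, 3- or 4-digit year string."""
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()):
        return None, True
    n = int(digits)
    if len(digits) == 2:
        if n <= 57:
            year = 1900 + n          # 00-57 -> 19YY
        elif n >= 73:
            year = 1800 + n          # 73-99 -> 18YY
        else:
            return None, True        # 58-72 is ambiguous: flag, never guess
    elif len(digits) == 3:
        year = 1000 + n              # DDD -> 1DDD
    elif len(digits) == 4:
        year = n                     # assumption: 4-digit years pass through
    else:
        return None, True
    if not ARCHIVE_SPAN[0] <= year <= ARCHIVE_SPAN[1]:
        return None, True            # covers "never 20xx"
    return year, False


def easter_sunday(year):
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def movable_feast(name, year):
    """Resolve a movable feast to a concrete date in the given year."""
    return easter_sunday(year) + timedelta(days=FEAST_OFFSETS[name])
```

For example, `expand_year("88")` resolves to 1888, `expand_year("65")` comes back flagged for review, and `movable_feast("Pfingsten", 1900)` lands on 3 June 1900 instead of a fixed month. Returning a flag instead of raising keeps ambiguous rows in the output so they can be handled through the overrides-file + re-run loop.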