familienarchiv/docs/import-migration/WORKLOG.md

# Import Migration — Worklog

Running log of each working session. **Resume here.** Newest entry on top.

---

## 2026-05-25 (session 5) — Unresolved-name classification

**Did:** Implemented [`04-unresolved-names-plan.md`](./04-unresolved-names-plan.md) subagent-driven
(5 tasks, TDD, per-task spec + code-quality review; 67 tests pass). Added `classify_name` +
`NameClass` + `build_given_names` in `persons.py`; `ResolutionContext` now records non-RESOLVABLE
names in `self.unresolved`; orchestrator writes `review/unresolved-names.csv` (replaces the noisy
`ambiguous-receivers.csv`) with per-category stats.

**Why:** `unmatched-names.csv` mixes boring non-family correspondents (expected) with genuinely
unresolvable entries. The new report isolates the latter so review focuses on ~440 real cases.

**Real-run result:** unresolved-names.csv = single_token 191 / prose 103 / unknown 74 /
collective 46 / relational 21 / ambiguous_pair **5** (distinct). The ambiguous over-flagging fix
cut `ambiguous_pair` from 303 → 5 (genuine two-given-name pairs only; `Mieze Schefold` etc. now
correctly RESOLVABLE). given-name set = register first names ∪ `config.EXTRA_GIVEN_NAMES`.

**Next:** populate `overrides/names.csv` from unresolved-names.csv (highest-count first); extend
`EXTRA_GIVEN_NAMES` if a real pair isn't flagged; still-open date work (Spanish months, 58–72 band).

---

## 2026-05-25 (session 4) — Built the normalizer (subagent-driven, all 17 tasks)

**Did:** Executed the plan subagent-driven (implementer + spec review + code-quality review per
task). The tool `tools/import-normalizer/` is **complete and passing (57 tests)**. Final
opus review: **READY** — determinism verified on the real corpus (two runs → identical cell
matrices + byte-identical review files), zero silent drops.

**Per-task code review caught & fixed real issues** (all in the committed code): leading
qualifiers `nach/vor/…` now → APPROX; English month-first matcher hardened to structurally
not shadow `Mai 1895`; person-id collision de-dup suffixes *all* members; `split_receivers`
returns `[]` for a `geb.`-only cell; boolean cells no longer coerced to `1/0`; duplicate-index
flags every occurrence; provisional ids never steal a register id; CSV-injection defanged.

**REAL DRY-RUN** (`python normalize.py` over the actual archive — outputs are gitignored):
- documents_emitted **7,582** (+225 empty +93 blank-index +42 x-suffix = 7,942 rows read, 0 dropped)
- register_persons **163**, provisional_persons **942**
- dates: DAY 6,509 / MONTH 36 / RANGE 36 / APPROX 28 / YEAR 17 / SEASON 1 / UNKNOWN 955
- **unknown_date_rate 9.2%** (of dated rows; target ≤5% pre-override, ≤0.5% after overrides)
- duplicate_index 85, index_file_mismatches 550, ambiguous_receivers 303

**⚠️ Concurrency incident:** a parallel Claude session committed reader-dashboard work to this
branch and hard-reset it mid-execution, deleting the Task 15 files and orphaning a commit.
Recovered via reflog (`reset --hard 366b4848` + `checkout 401160e3 -- <task15 files>`); no code
lost. Casualty: my *during-execution* edits to the plan/spec docs (02/03) for Tasks 5–14 were
discarded — **the committed code + tests are the source of truth**, not the plan doc, which now
reflects the pre-execution + persona-review version.

**Next steps (iterative refinement — the overrides loop, as designed):**
1. Shave the 9.2% UNKNOWN cheaply: add **Spanish month names** (Enero…Diciembre) and the
   `Mon DD-YYYY` dash form to `config.MONTHS`/the parser (Mexican-branch correspondence);
   revisit the 58–72 two-digit-year band (real `…58/59/60` dates = 1958–1960, just past the
   1873–1957 window — decide whether to extend the upper bound in `config`).
2. `?` (99×) is genuinely "date unknown" — leave UNKNOWN or add a convention.
3. Populate `overrides/dates.csv` + `overrides/names.csv` from the review CSVs and re-run.
4. README note: a leading `'`/`!` in a `review/*.csv` `raw` cell may be a CSV-defang artifact —
   match against the true source value when writing overrides.
5. Phase 2 (separate spec): wire the canonical contract into the Java `MassImportService`.

---

## 2026-05-25 (session 3) — Implementation plan + persona review

**Did:**
- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
  bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date
  parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops;
  ui-expert too) via parallel agents. Acted on all material findings.

**Key fixes from review (see plan §"Review feedback incorporated"):**
- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01);
  pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
  pinned deps + hardened root `.gitignore`.

**Next:**
- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser
  (Task 3/8 + Easter computus) is the meatiest piece.

---

## 2026-05-25 (session 2) — Strategy + normalizer spec

**Did:**
- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
  leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
  precision; include person register + dedup in this effort; overrides-file + re-run loop;
  Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY,
  `73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
  persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).

**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …)
**computed per year from Easter**, never a fixed month; seasons → mid-season month
(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows
**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable).
OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge.

**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left
uncommitted, not ours.)

**Next:**
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are
  the Musts; B3 date parser incl. Easter computus is the big one).

---

## 2026-05-25 (session 1) — Initial analysis

**Did:**
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
  with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.

**Key facts established:**
- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).

**Decisions pending from Marcel (blockers for any code work):**
1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy?
3. IMP-04/05: format for the person/alias mapping; import persons before documents?
4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?

**Next:**
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.