familienarchiv/docs/import-migration/WORKLOG.md
Marcel 97db718f81
docs(import): add unresolved-names plan + worklog entry
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:01:18 +02:00

# Import Migration — Worklog
Running log of each working session. **Resume here.** Newest entry on top.
---
## 2026-05-25 (session 5) — Unresolved-name classification
**Did:** Implemented [`04-unresolved-names-plan.md`](./04-unresolved-names-plan.md) subagent-driven
(5 tasks, TDD, per-task spec + code-quality review; 67 tests pass). Added `classify_name` +
`NameClass` + `build_given_names` in `persons.py`; `ResolutionContext` now records non-RESOLVABLE
names in `self.unresolved`; orchestrator writes `review/unresolved-names.csv` (replaces the noisy
`ambiguous-receivers.csv`) with per-category stats.
**Why:** `unmatched-names.csv` mixes boring non-family correspondents (expected) with genuinely
unresolvable entries. The new report isolates the latter so review focuses on ~440 real cases.
**Real-run result:** unresolved-names.csv = single_token 191 / prose 103 / unknown 74 /
collective 46 / relational 21 / ambiguous_pair **5** (distinct). The ambiguous over-flagging fix
cut `ambiguous_pair` from 303 → 5 (genuine two-given-name pairs only; `Mieze Schefold` etc. now
correctly RESOLVABLE). Given-name set = register first names ∪ `config.EXTRA_GIVEN_NAMES`.
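A minimal sketch of the two-given-name test behind `ambiguous_pair`, assuming a name counts as ambiguous only when it has exactly two tokens and both appear in the given-name set. The function name `is_ambiguous_pair` and the sample names are hypothetical; the real logic lives in `classify_name` in `persons.py` and may differ.

```python
def is_ambiguous_pair(name: str, given_names: set[str]) -> bool:
    """True when the name is exactly two tokens and both are known given names."""
    tokens = name.split()
    return len(tokens) == 2 and all(t.casefold() in given_names for t in tokens)


# Hypothetical sample data, not taken from the real register or config.
REGISTER_FIRST_NAMES = {"anna", "karl"}
EXTRA_GIVEN_NAMES = {"mieze"}
GIVEN_NAMES = REGISTER_FIRST_NAMES | EXTRA_GIVEN_NAMES

print(is_ambiguous_pair("Anna Karl", GIVEN_NAMES))       # both tokens are given names
print(is_ambiguous_pair("Mieze Schefold", GIVEN_NAMES))  # given name + surname
```

Under this assumption a given name followed by a surname is never flagged, which matches the `Mieze Schefold` case above.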
**Next:** populate `overrides/names.csv` from unresolved-names.csv (highest-count first); extend
`EXTRA_GIVEN_NAMES` if a real pair isn't flagged; still-open date work (Spanish months, 58–72 band).
---
## 2026-05-25 (session 4) — Built the normalizer (subagent-driven, all 17 tasks)
**Did:** Executed the plan subagent-driven (implementer + spec review + code-quality review per
task). The tool `tools/import-normalizer/` is **complete and passing (57 tests)**. Final
opus review: **READY** — determinism verified on the real corpus (two runs → identical cell
matrices + byte-identical review files), zero silent drops.
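The byte-identical half of that determinism check could be reproduced roughly as follows. This is a sketch, not the project's actual test: `digest_tree` and `same_output` are hypothetical helpers, and it assumes each run writes its review files into its own directory.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def same_output(run_a: Path, run_b: Path) -> bool:
    """True when both runs produced the same file set with identical bytes."""
    return digest_tree(run_a) == digest_tree(run_b)
```

Comparing digests per relative path catches both changed content and files that appear in only one run.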
**Per-task code review caught & fixed real issues** (all in the committed code): leading
qualifiers `nach/vor/…` now → APPROX; English month-first matcher hardened to structurally
not shadow `Mai 1895`; person-id collision de-dup suffixes *all* members; `split_receivers`
returns `[]` for a `geb.`-only cell; boolean cells no longer coerced to `1/0`; duplicate-index
flags every occurrence; provisional ids never steal a register id; CSV-injection defanged.
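For reference, CSV-injection defanging usually means prefixing any cell that a spreadsheet would evaluate as a formula. The sketch below follows the commonly recommended trigger set (`=`, `+`, `-`, `@`, tab, carriage return) and a `'` prefix; both are assumptions, and the committed writer may use a different set or prefix.

```python
# Characters that make a spreadsheet treat a cell as a formula (assumed set).
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r")


def defang(cell: str, prefix: str = "'") -> str:
    """Prefix a cell that starts with a formula trigger; leave others untouched."""
    return prefix + cell if cell.startswith(FORMULA_TRIGGERS) else cell


print(defang("=SUM(A1)"))
print(defang("Mai 1895"))
```

Because the prefix changes the stored text, a defanged `raw` value no longer equals the source cell, which matters when matching overrides.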
**REAL DRY-RUN** (`python normalize.py` over the actual archive — outputs are gitignored):
- documents_emitted **7,582** (+225 empty +93 blank-index +42 x-suffix = 7,942 rows read, 0 dropped)
- register_persons **163**, provisional_persons **942**
- dates: DAY 6,509 / MONTH 36 / RANGE 36 / APPROX 28 / YEAR 17 / SEASON 1 / UNKNOWN 955
- **unknown_date_rate 9.2%** (of dated rows; target ≤5% pre-override, ≤0.5% after overrides)
- duplicate_index 85, index_file_mismatches 550, ambiguous_receivers 303
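The "0 dropped" claim can be re-checked from the figures above with simple accounting. The helper name `dropped_rows` is hypothetical; the numbers are copied from this dry-run.

```python
def dropped_rows(rows_read: int, buckets: dict[str, int]) -> int:
    """Rows read that are not accounted for by any bucket."""
    return rows_read - sum(buckets.values())


ROW_BUCKETS = {"documents_emitted": 7582, "empty": 225, "blank_index": 93, "x_suffix": 42}
PRECISION_COUNTS = {
    "DAY": 6509, "MONTH": 36, "RANGE": 36, "APPROX": 28,
    "YEAR": 17, "SEASON": 1, "UNKNOWN": 955,
}

print(dropped_rows(7942, ROW_BUCKETS))
print(sum(PRECISION_COUNTS.values()) == ROW_BUCKETS["documents_emitted"])
```

Both checks pass on these figures: the buckets sum to the 7,942 rows read, and the precision counts sum to the 7,582 emitted documents.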
**⚠️ Concurrency incident:** a parallel Claude session committed reader-dashboard work to this
branch and hard-reset it mid-execution, deleting the Task 15 files and orphaning a commit.
Recovered via reflog (`reset --hard 366b4848` + `checkout 401160e3 -- <task15 files>`); no code
lost. Casualty: my *during-execution* edits to the plan/spec docs (02/03) for Tasks 5–14 were
discarded — **the committed code + tests are the source of truth**, not the plan doc, which now
reflects the pre-execution + persona-review version.
**Next steps (iterative refinement — the overrides loop, as designed):**
1. Shave the 9.2% UNKNOWN cheaply: add **Spanish month names** (Enero…Diciembre) and the
`Mon DD-YYYY` dash form to `config.MONTHS`/the parser (Mexican-branch correspondence);
revisit the 58–72 two-digit-year band (real `…58/59/60` dates = 1958–1960, just past the
1873–1957 window; decide whether to extend the upper bound in `config`).
2. `?` (99×) is genuinely "date unknown" — leave UNKNOWN or add a convention.
3. Populate `overrides/dates.csv` + `overrides/names.csv` from the review CSVs and re-run.
4. README note: a leading `'`/`!` in a `review/*.csv` `raw` cell may be a CSV-defang artifact —
match against the true source value when writing overrides.
5. Phase 2 (separate spec): wire the canonical contract into the Java `MassImportService`.
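Step 1 might look roughly like the sketch below. It assumes the dash form reads `<month name> <day>-<4-digit year>` (for example `Enero 5-1923`); the exact shape in the corpus is not recorded here. `SPANISH_MONTHS`, `DASH_FORM` and `parse_month_day_dash_year` are hypothetical names, and the real `config.MONTHS` structure may differ.

```python
import re
from datetime import date
from typing import Optional

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

# Month word (letters only), optional dot, day, dash, four-digit year.
DASH_FORM = re.compile(r"^\s*([^\W\d_]+)\.?\s+(\d{1,2})-(\d{4})\s*$")


def parse_month_day_dash_year(raw: str) -> Optional[date]:
    """Parse 'Enero 5-1923'-style cells; return None when the cell does not fit."""
    match = DASH_FORM.match(raw)
    if match is None:
        return None
    month = SPANISH_MONTHS.get(match.group(1).casefold())
    if month is None:
        return None
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:  # invalid calendar date, e.g. 30 February
        return None
```

Returning `None` instead of guessing keeps an unparseable cell in the UNKNOWN bucket for the overrides loop.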
---
## 2026-05-25 (session 3) — Implementation plan + persona review
**Did:**
- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date
parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops;
ui-expert too) via parallel agents. Acted on all material findings.
**Key fixes from review (see plan §"Review feedback incorporated"):**
- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01);
pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
pinned deps + hardened root `.gitignore`.
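The duplicate-index fix comes down to counting first and flagging second, so the first occurrence is flagged along with the repeats. A minimal sketch, with the hypothetical helper name `flag_duplicate_indices` and made-up index values:

```python
from collections import Counter


def flag_duplicate_indices(indices: list[str]) -> list[bool]:
    """Flag every row whose index occurs more than once, including the first."""
    counts = Counter(indices)
    return [counts[index] > 1 for index in indices]


print(flag_duplicate_indices(["A1", "B2", "A1", "C3", "A1"]))
# → [True, False, True, False, True]
```

A single-pass "seen before?" check would miss the first `A1`, which is the bug described above.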
**Next:**
- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser
(Task 3/8 + Easter computus) is the meatiest piece.
---
## 2026-05-25 (session 2) — Strategy + normalizer spec
**Did:**
- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
precision; include person register + dedup in this effort; overrides-file + re-run loop;
Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY,
`73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
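The century rule above can be sketched as a small function. The name `expand_year` and the `(year, needs_review)` return shape are assumptions for illustration; only the band boundaries come from the rule itself.

```python
from typing import Optional, Tuple


def expand_year(digits: str) -> Tuple[Optional[int], bool]:
    """Expand a 2- or 3-digit year string; return (year, needs_review)."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (2, 3):
        raise ValueError(f"expected 2 or 3 digits, got {digits!r}")
    value = int(digits)
    if len(digits) == 3:
        return 1000 + value, False   # 3-digit -> 1DDD
    if value <= 57:
        return 1900 + value, False   # 00-57 -> 19YY
    if value >= 73:
        return 1800 + value, False   # 73-99 -> 18YY
    return None, True                # 58-72 -> flag, no guess


print(expand_year("05"))   # → (1905, False)
print(expand_year("89"))   # → (1889, False)
print(expand_year("60"))   # → (None, True)
print(expand_year("923"))  # → (1923, False)
```

No branch can produce a 20xx year, and the flagged band returns no year at all rather than a guess.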
**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …)
**computed per year from Easter**, never a fixed month; seasons → mid-season month
(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows
**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable).
OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge.
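For OQ-01, one way to compute movable feasts is the anonymous Gregorian Easter algorithm plus a day offset. This is a sketch, not necessarily the computus the tool uses; `easter_sunday`, `feast_date`, `FEAST_OFFSETS` and `SEASON_MONTH` are hypothetical names. Pfingsten (Whit Sunday) falls 49 days after Easter Sunday.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day_zero_based = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day_zero_based + 1)


# Offsets in days from Easter Sunday.
FEAST_OFFSETS = {"Ostern": 0, "Pfingsten": 49}

# Mid-season months as decided in OQ-01 (only the two stated here).
SEASON_MONTH = {"Sommer": 7, "Herbst": 10}


def feast_date(name: str, year: int) -> date:
    """Date of a movable feast in the given year."""
    return easter_sunday(year) + timedelta(days=FEAST_OFFSETS[name])


print(easter_sunday(1923))            # → 1923-04-01
print(feast_date("Pfingsten", 1923))  # → 1923-05-20
```

Keying feasts by offset means a further feast is one dictionary entry, with no fixed month anywhere.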
**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left
uncommitted, not ours.)
**Next:**
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are
the Musts; B3 date parser incl. Easter computus is the big one).
---
## 2026-05-25 (session 1) — Initial analysis
**Did:**
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.
**Key facts established:**
- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).
**Decisions pending from Marcel (blockers for any code work):**
1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy?
3. IMP-04/05: format for the person/alias mapping; import persons before documents?
4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
**Next:**
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.