marcel/familienarchiv

Marcel 97db718f81

docs(import): add unresolved-names plan + worklog entry

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 16:01:18 +02:00

Import Migration — Worklog

Running log of each working session. Resume here. Newest entry on top.

2026-05-25 (session 5) — Unresolved-name classification

Did: Implemented 04-unresolved-names-plan.md subagent-driven (5 tasks, TDD, per-task spec + code-quality review; 67 tests pass). Added classify_name + NameClass + build_given_names in persons.py; ResolutionContext now records non-RESOLVABLE names in self.unresolved; orchestrator writes review/unresolved-names.csv (replaces the noisy ambiguous-receivers.csv) with per-category stats.

Why: unmatched-names.csv mixes boring non-family correspondents (expected) with genuinely unresolvable entries. The new report isolates the latter so review focuses on ~440 real cases.

Real-run result: unresolved-names.csv = single_token 191 / prose 103 / unknown 74 / collective 46 / relational 21 / ambiguous_pair 5 (distinct). The ambiguous over-flagging fix cut ambiguous_pair from 303 → 5 (genuine two-given-name pairs only; Mieze Schefold etc. now correctly RESOLVABLE). given-name set = register first names ∪ config.EXTRA_GIVEN_NAMES.

Next: populate overrides/names.csv from unresolved-names.csv (highest-count first); extend EXTRA_GIVEN_NAMES if a real pair isn't flagged; still-open date work (Spanish months, 58–72 band).

2026-05-25 (session 4) — Built the normalizer (subagent-driven, all 17 tasks)

Did: Executed the plan subagent-driven (implementer + spec review + code-quality review per task). The tool tools/import-normalizer/ is complete and passing (57 tests). Final opus review: READY — determinism verified on the real corpus (two runs → identical cell matrices + byte-identical review files), zero silent drops.

Per-task code review caught & fixed real issues (all in the committed code): leading qualifiers nach/vor/… now → APPROX; English month-first matcher hardened to structurally not shadow Mai 1895; person-id collision de-dup suffixes all members; split_receivers returns [] for a geb.-only cell; boolean cells no longer coerced to 1/0; duplicate-index flags every occurrence; provisional ids never steal a register id; CSV-injection defanged.
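The CSV-injection fix can be illustrated with a minimal sketch. The worklog does not show the normalizer's actual rule, so the prefix set, the single-quote marker, and the names `defang` and `write_review_rows` below are assumptions that follow the common spreadsheet-hardening convention, not the committed code.

```python
import csv
import io

# Characters that make a spreadsheet application treat a cell as a formula.
# Assumed set; the real tool may use a different one.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def defang(cell: str) -> str:
    """Prefix a risky cell with a single quote so it is read as plain text."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


def write_review_rows(rows: list[list[str]]) -> str:
    """Serialise rows to CSV text, defanging every cell on the way out."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([defang(cell) for cell in row])
    return buffer.getvalue()


print(write_review_rows([["=1+1", "Mieze Schefold"]]))
```

A side effect of any such scheme is that the review file no longer carries the raw value verbatim, so anything keyed on the raw cell has to be matched against the source value rather than the defanged one.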

REAL DRY-RUN (python normalize.py over the actual archive — outputs are gitignored):

documents_emitted 7,582 (+225 empty +93 blank-index +42 x-suffix = 7,942 rows read, 0 dropped)
register_persons 163, provisional_persons 942
dates: DAY 6,509 / MONTH 36 / RANGE 36 / APPROX 28 / YEAR 17 / SEASON 1 / UNKNOWN 955
unknown_date_rate 9.2% (of dated rows; target ≤5% pre-override, ≤0.5% after overrides)
duplicate_index 85, index_file_mismatches 550, ambiguous_receivers 303
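The "0 dropped" claim and the date breakdown can be cross-checked with plain arithmetic. The figures below are copied from the dry-run output above; the dictionary keys are illustrative labels, not names taken from the tool.

```python
# Row accounting: every row read must land in exactly one bucket.
rows_read = 7_942
row_buckets = {
    "documents_emitted": 7_582,
    "empty": 225,
    "blank_index": 93,
    "x_suffix": 42,
}
dropped = rows_read - sum(row_buckets.values())

# Date precision accounting: every emitted document gets exactly one precision.
precision_counts = {
    "DAY": 6_509,
    "MONTH": 36,
    "RANGE": 36,
    "APPROX": 28,
    "YEAR": 17,
    "SEASON": 1,
    "UNKNOWN": 955,
}
unclassified = row_buckets["documents_emitted"] - sum(precision_counts.values())

print(dropped, unclassified)
```

Both differences come out to zero, which is what the reported totals imply.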

⚠️ Concurrency incident: a parallel Claude session committed reader-dashboard work to this branch and hard-reset it mid-execution, deleting the Task 15 files and orphaning a commit. Recovered via reflog (reset --hard 366b4848 + checkout 401160e3 -- <task15 files>); no code lost. Casualty: my during-execution edits to the plan/spec docs (02/03) for Tasks 5–14 were discarded — the committed code + tests are the source of truth, not the plan doc, which now reflects the pre-execution + persona-review version.

Next steps (iterative refinement — the overrides loop, as designed):

Shave the 9.2% UNKNOWN cheaply: add Spanish month names (Enero…Diciembre) and the Mon DD-YYYY dash form to config.MONTHS/the parser (Mexican-branch correspondence); revisit the 58–72 two-digit-year band (real …58/59/60 dates = 1958–1960, just past the 1873–1957 window — decide whether to extend the upper bound in config).
? (99×) is genuinely "date unknown" — leave UNKNOWN or add a convention.
Populate overrides/dates.csv + overrides/names.csv from the review CSVs and re-run.
README note: a leading '/! in a review/*.csv raw cell may be a CSV-defang artifact — match against the true source value when writing overrides.
Phase 2 (separate spec): wire the canonical contract into the Java MassImportService.

2026-05-25 (session 3) — Implementation plan + persona review

Did:

Wrote 03-normalizer-implementation-plan.md: 17 bite-sized TDD tasks for tools/import-normalizer/ (Python, openpyxl), bottom-up — date parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops), plus ui-expert, via parallel agents. Acted on all material findings.

Key fixes from review (see plan §"Review feedback incorporated"):

Idempotency redefined byte-identical → content-deterministic (spec G4/NFR-IDEM-01); pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
Provisional person_id could overwrite a register id → now suffixed.
Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (7./8. Sept.1923).
Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files; pinned deps + hardened root .gitignore.
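A two-run equality check for the review files could look like the sketch below. It is not the test from the plan: `digest_tree` and `runs_match` are hypothetical helpers, and the comparison is byte-level, which suits the CSV review files. Under the content-deterministic definition the workbook would instead need a cell-by-cell comparison, which is not shown here.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def runs_match(first: Path, second: Path) -> bool:
    """True when both output directories hold the same files with the same bytes."""
    return digest_tree(first) == digest_tree(second)
```

In a test, the normalizer would be run twice into separate output directories and `runs_match` asserted on the two `review/` folders.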

Next:

Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser (Task 3/8 + Easter computus) is the meatiest piece.

2026-05-25 (session 2) — Strategy + normalizer spec

Did:

Decided strategy with Marcel: normalize the raw sheets first, then import (higher leverage than making the Java importer tolerate every mess).
Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw + precision; include person register + dedup in this effort; overrides-file + re-run loop; Python tool at tools/import-normalizer/.
Century rule fixed by Marcel: archive spans 1873–1957; 2-digit 00–57→19YY, 73–99→18YY, 58–72→flag; 3-digit→1DDD; never 20xx.
Wrote 02-normalization-spec.md in the requirements-engineer persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
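The century rule above translates almost directly into code. This is a sketch under the stated rule only: the function name `expand_year` and the choice to return `None` for "flag" are assumptions, and the real parser may represent the 58–72 band differently.

```python
from typing import Optional


def expand_year(token: str) -> Optional[int]:
    """Expand a written year per the 1873-1957 archive window.

    Returns None when the token must be flagged for review.
    """
    if not (token.isascii() and token.isdigit()):
        return None
    value = int(token)
    if len(token) == 2:
        if value <= 57:
            return 1900 + value   # 00-57 -> 19YY
        if value >= 73:
            return 1800 + value   # 73-99 -> 18YY
        return None               # 58-72 -> ambiguous band, flag
    if len(token) == 3:
        return 1000 + value       # 3-digit -> 1DDD, e.g. "923" -> 1923
    if len(token) == 4:
        return value if value < 2000 else None  # never 20xx
    return None


for sample in ("23", "95", "60", "923", "2024"):
    print(sample, expand_year(sample))
```

Returning `None` instead of guessing keeps the ambiguous band visible in the review output, which is what the overrides-file loop relies on.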

All 6 open questions resolved (spec §9): OQ-01 — movable feasts (Ostern, Pfingsten, …) computed per year from Easter, never a fixed month; seasons → mid-season month (Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — x-suffix rows skipped + logged this pass (they're transcriptions of the base letter, not yet mappable). OQ-05 → .xlsx. OQ-06 → conservative, no silent merge.
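OQ-01 can be sketched with the anonymous Gregorian computus (Meeus/Jones/Butcher). The worklog does not say which Easter algorithm the normalizer uses, so this is one standard choice, not necessarily the committed one. The names `easter_sunday`, `feast_date`, `FEAST_OFFSETS` and `SEASON_MONTH` are hypothetical, and only the feasts and seasons named above are included.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday; Pfingsten (Pentecost) is the seventh Sunday after.
FEAST_OFFSETS = {"Ostern": 0, "Pfingsten": 49}

# Mid-season months as decided in OQ-01 (only the two stated in the worklog).
SEASON_MONTH = {"Sommer": 7, "Herbst": 10}


def feast_date(name: str, year: int) -> date:
    """Resolve a movable feast to a concrete date for the given year."""
    return easter_sunday(year) + timedelta(days=FEAST_OFFSETS[name])


print(feast_date("Ostern", 1923), feast_date("Pfingsten", 1923))
```

Computing from Easter each year avoids pinning a feast to a fixed month: Easter Sunday fell on 1 April in 1923 but on 25 April in 1943.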

Git: moved off the unrelated feat/issue-356-… branch; pulled main; created clean branch docs/import-migration and committed these docs there. (The dirty .venv pycache + skills/implement/SKILL.md in the tree are pre-existing/environmental noise — left uncommitted, not ours.)

Next:

Marcel reviews the spec.
Then writing-plans → build the normalizer at tools/import-normalizer/ (backlog B1–B7 are the Musts; B3 date parser incl. Easter computus is the big one).

2026-05-25 (session 1) — Initial analysis

Did:

Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
Compared the new xlsx layout against MassImportService defaults and the old ODS.
Full statistical scan of all rows: dates, indices, senders/receivers, file column.
Wrote 01-findings-spreadsheet-analysis.md with 12 issues (IMP-01..IMP-12) + recommended sequencing.
Installed openpyxl into the OCR service venv for inspection.

Key facts established:

Importer defaults match the ODS, not the new xlsx → wrong column mapping (IMP-01).
90% of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
Person register is rich but unimported; holds the maiden-name dedup key (IMP-04/05).

Decisions pending from Marcel (blockers for any code work):

IMP-01: positional re-config of app.import.col.* vs header-driven mapping rewrite?
IMP-02: how to store imprecise dates — new dateOriginal + precision columns, or lossy?
IMP-04/05: format for the person/alias mapping; import persons before documents?
IMP-10: are x-suffix rows separate documents, attachments, or skipped?

Next:

Get Marcel's calls on the 4 decisions above.
Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.
