diff --git a/docs/import-migration/03-normalizer-implementation-plan.md b/docs/import-migration/03-normalizer-implementation-plan.md
index f315596f..8d2f8428 100644
--- a/docs/import-migration/03-normalizer-implementation-plan.md
+++ b/docs/import-migration/03-normalizer-implementation-plan.md
@@ -1759,6 +1759,14 @@ def test_write_documents_xlsx_joins_lists(tmp_path):
     assert row["receiver_person_ids"] == "a|b"
     assert row["needs_review"] == "unparsed_date"
 
+def test_write_documents_xlsx_pins_timestamp(tmp_path):
+    # determinism (NFR-IDEM-01): workbook created/modified are pinned, not the current time
+    doc = documents.CanonicalDocument(index="W-0001")
+    out = tmp_path / "d.xlsx"
+    writers.write_documents_xlsx([doc], out)
+    wb = openpyxl.load_workbook(out)
+    assert (wb.properties.created.year, wb.properties.created.month, wb.properties.created.day) == (2020, 1, 1)
+
 def test_write_review_csv(tmp_path):
     out = tmp_path / "r.csv"
     writers.write_review_csv(out, ["raw", "count"], [["?", 3], ["x", 1]])
@@ -1835,7 +1843,7 @@ def _join(value):
 def _csv_safe(value):
     """Neutralise spreadsheet formula injection (CWE-1236) in human-opened review CSVs."""
     s = "" if value is None else str(value)
-    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r") else s
+    return "'" + s if s[:1] in ("=", "+", "-", "@", "\t", "\r", "\n") else s
 
 
 DOC_COLUMNS = ["index", "box", "folder", "sender_person_id", "sender_name",
diff --git a/docs/import-migration/WORKLOG.md b/docs/import-migration/WORKLOG.md
index 2f82baf5..8431ac5e 100644
--- a/docs/import-migration/WORKLOG.md
+++ b/docs/import-migration/WORKLOG.md
@@ -4,6 +4,46 @@ Running log of each working session. **Resume here.** Newest entry on top.
 
 ---
 
+## 2026-05-25 (session 4) — Built the normalizer (subagent-driven, all 17 tasks)
+
+**Did:** Executed the plan subagent-driven (implementer + spec review + code-quality review per
+task). The tool `tools/import-normalizer/` is **complete and passing (57 tests)**. Final
+opus review: **READY** — determinism verified on the real corpus (two runs → identical cell
+matrices + byte-identical review files), zero silent drops.
+
+**Per-task code review caught & fixed real issues** (all in the committed code): leading
+qualifiers `nach/vor/…` now → APPROX; English month-first matcher hardened to structurally
+not shadow `Mai 1895`; person-id collision de-dup suffixes *all* members; `split_receivers`
+returns `[]` for a `geb.`-only cell; boolean cells no longer coerced to `1/0`; duplicate-index
+flags every occurrence; provisional ids never steal a register id; CSV-injection defanged.
+
+**REAL DRY-RUN** (`python normalize.py` over the actual archive — outputs are gitignored):
+- documents_emitted **7,582** (+225 empty +93 blank-index +42 x-suffix = 7,942 rows read, 0 dropped)
+- register_persons **163**, provisional_persons **942**
+- dates: DAY 6,509 / MONTH 36 / RANGE 36 / APPROX 28 / YEAR 17 / SEASON 1 / UNKNOWN 955
+- **unknown_date_rate 9.2%** (of dated rows; target ≤5% pre-override, ≤0.5% after overrides)
+- duplicate_index 85, index_file_mismatches 550, ambiguous_receivers 303
+
+**⚠️ Concurrency incident:** a parallel Claude session committed reader-dashboard work to this
+branch and hard-reset it mid-execution, deleting the Task 15 files and orphaning a commit.
+Recovered via reflog (`reset --hard 366b4848` + `checkout 401160e3 -- `); no code
+lost. Casualty: my *during-execution* edits to the plan/spec docs (02/03) for Tasks 5–14 were
+discarded — **the committed code + tests are the source of truth**, not the plan doc, which now
+reflects the pre-execution + persona-review version.
+
+**Next steps (iterative refinement — the overrides loop, as designed):**
+1. Shave the 9.2% UNKNOWN cheaply: add **Spanish month names** (Enero…Diciembre) and the
+   `Mon DD-YYYY` dash form to `config.MONTHS`/the parser (Mexican-branch correspondence);
+   revisit the 58–72 two-digit-year band (real `…58/59/60` dates = 1958–1960, just past the
+   1873–1957 window — decide whether to extend the upper bound in `config`).
+2. `?` (99×) is genuinely "date unknown" — leave UNKNOWN or add a convention.
+3. Populate `overrides/dates.csv` + `overrides/names.csv` from the review CSVs and re-run.
+4. README note: a leading `'`/`!` in a `review/*.csv` `raw` cell may be a CSV-defang artifact —
+   match against the true source value when writing overrides.
+5. Phase 2 (separate spec): wire the canonical contract into the Java `MassImportService`.
+
+---
+
 ## 2026-05-25 (session 3) — Implementation plan + persona review
 
 **Did:**