docs(import): add normalizer implementation plan + apply persona review

17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-25 12:55:50 +02:00
parent adfff420a5
commit 6f7aa643c9
3 changed files with 2301 additions and 2 deletions

View File

@@ -36,7 +36,9 @@ appears under many names. Importing as-is produces garbage (see `IMP-01..12`).
- G3 — **100%** of original values (raw date string, raw name string, source row number) - G3 — **100%** of original values (raw date string, raw name string, source row number)
are preserved. are preserved.
- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is - G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
**byte-identical** when re-run with unchanged inputs+overrides. **content-deterministic** when re-run with unchanged inputs+overrides: identical canonical
cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx
byte-identity is not guaranteed because the zip container stores entry metadata.)
**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future **Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the `MassImportService` as the downstream consumer. agent re-running the pipeline; and the `MassImportService` as the downstream consumer.
@@ -238,7 +240,7 @@ complete.*
| ID | Category | Requirement (measurable) | | ID | Category | Requirement (measurable) |
| --- | --- | --- | | --- | --- | --- |
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. | | NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. | | NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. | | NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. | | NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. | | NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |

File diff suppressed because it is too large Load Diff

View File

@@ -4,6 +4,30 @@ Running log of each working session. **Resume here.** Newest entry on top.
--- ---
## 2026-05-25 (session 3) — Implementation plan + persona review
**Did:**
- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date
parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops;
ui-expert too) via parallel agents. Acted on all material findings.
**Key fixes from review (see plan §"Review feedback incorporated"):**
- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01);
pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
- Provisional `person_id` could overwrite a register id → now suffixed.
- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
pinned deps + hardened root `.gitignore`.
**Next:**
- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser
(Task 3/8 + Easter computus) is the meatiest piece.
---
## 2026-05-25 (session 2) — Strategy + normalizer spec ## 2026-05-25 (session 2) — Strategy + normalizer spec
**Did:** **Did:**