docs(import): add normalizer implementation plan + apply persona review
17-task TDD plan for tools/import-normalizer/. Incorporates inline 6-persona review: content-deterministic idempotency, duplicate-index fix, provisional-id collision guard, date-parser edge cases, multi-sender split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -36,7 +36,9 @@ appears under many names. Importing as-is produces garbage (see `IMP-01..12`).
 - G3 — **100%** of original values (raw date string, raw name string, source row number)
 are preserved.
 - G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
-**byte-identical** when re-run with unchanged inputs+overrides.
+**content-deterministic** when re-run with unchanged inputs+overrides: identical canonical
+cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx
+byte-identity is not guaranteed because the zip container stores entry metadata.)

 **Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
 agent re-running the pipeline; and the `MassImportService` as the downstream consumer.
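The revised G4 wording separates content determinism from byte identity. As a minimal sketch of why the two can differ, the following uses only the standard-library `zipfile` module (an xlsx file is a zip container) to write the same payload twice with different entry timestamps. The helper names `build_archive` and `content_digest` and the sample entry are hypothetical and not taken from the normalizer.

```python
import hashlib
import io
import zipfile


def build_archive(entries: dict[str, bytes], date_time: tuple) -> bytes:
    """Write entries into an in-memory zip, stamping each one with date_time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(entries):  # fixed entry order
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, entries[name])
    return buffer.getvalue()


def content_digest(raw: bytes) -> str:
    """Hash entry names and decompressed payloads, ignoring zip metadata."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for name in sorted(archive.namelist()):
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            digest.update(archive.read(name))
    return digest.hexdigest()


# Hypothetical sample entry; only the per-entry timestamp differs between runs.
entries = {"xl/worksheets/sheet1.xml": b"<sheetData/>"}
first_run = build_archive(entries, (2026, 5, 25, 10, 0, 0))
second_run = build_archive(entries, (2026, 5, 25, 10, 0, 2))

bytes_equal = first_run == second_run
content_equal = content_digest(first_run) == content_digest(second_run)
```

Under these assumptions `bytes_equal` is false while `content_equal` is true, which is the distinction the spec change makes: equality is asserted on content, not on the container bytes.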
@@ -238,7 +240,7 @@ complete.*
 | ID | Category | Requirement (measurable) |
 | --- | --- | --- |
 | NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
-| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. |
+| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
 | NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
 | NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
 | NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
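NFR-IDEM-01 asks for stable ordering with no set-iteration leakage and for equality asserted on canonical cell matrices. One way this could look, sketched under the assumption that aliases are collected in a set per person id, is shown here. The functions `stable_aliases`, `build_rows`, and `matrix_digest` and the sample names are hypothetical illustrations, not the plan's actual code.

```python
import hashlib
import json


def stable_aliases(aliases: set[str]) -> list[str]:
    """Return aliases in a fixed order that does not depend on set iteration."""
    return sorted(aliases, key=lambda alias: (alias.casefold(), alias))


def build_rows(alias_map: dict[str, set[str]]) -> list[list[object]]:
    """One row per person, sorted by person id, aliases joined in stable order."""
    return [
        [person_id, "; ".join(stable_aliases(alias_map[person_id]))]
        for person_id in sorted(alias_map)
    ]


def matrix_digest(rows: list[list[object]]) -> str:
    """Digest a canonical cell matrix through a deterministic JSON encoding."""
    encoded = json.dumps(rows, ensure_ascii=False, separators=(",", ":"), default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# Two "runs" that receive the same data in a different insertion order.
run_a = build_rows({"p2": {"Anna M.", "A. Müller"}, "p1": {"Karl"}})
run_b = build_rows({"p1": {"Karl"}, "p2": {"A. Müller", "Anna M."}})
```

A two-run equality test can then compare `run_a == run_b` directly, or compare `matrix_digest` values when only a short fingerprint should be stored.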
2273
docs/import-migration/03-normalizer-implementation-plan.md
Normal file
File diff suppressed because it is too large
@@ -4,6 +4,30 @@ Running log of each working session. **Resume here.** Newest entry on top.

 ---

+## 2026-05-25 (session 3) — Implementation plan + persona review
+
+**Did:**
+- Wrote [`03-normalizer-implementation-plan.md`](./03-normalizer-implementation-plan.md): 17
+bite-sized TDD tasks for `tools/import-normalizer/` (Python, openpyxl), bottom-up — date
+parser w/ Easter computus first, then persons/alias, ingest, mapping, orchestrator, writers.
+- Ran a 6-persona inline review (architect, developer, tester, req-engineer, security, devops;
+ui-expert too) via parallel agents. Acted on all material findings.
+
+**Key fixes from review (see plan §"Review feedback incorporated"):**
+- Idempotency redefined byte-identical → **content-deterministic** (spec G4/NFR-IDEM-01);
+pinned workbook timestamps + deterministic alias ordering + a real two-run equality test.
+- Real bug: duplicate-index only reported repeats → now flags/reports every occurrence.
+- Provisional `person_id` could overwrite a register id → now suffixed.
+- Date parser gaps: invalid-calendar-date → UNKNOWN, intra-month day-range (`7./8. Sept.1923`).
+- Multi-person sender now split + flagged (REQ-PERS-01); CSV-injection defanged in review files;
+pinned deps + hardened root `.gitignore`.
+
+**Next:**
+- Marcel reviews the plan. Then execute it (subagent-driven or inline) — the date parser
+(Task 3/8 + Easter computus) is the meatiest piece.
+
+---
+
 ## 2026-05-25 (session 2) — Strategy + normalizer spec

 **Did:**
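The session log mentions that CSV injection is defanged in review files. A minimal sketch of one common approach follows, assuming the review files are written with Python's `csv` module and that defanging means prefixing risky cells with an apostrophe. The trigger list, `defang`, and `write_review_csv` are assumptions for illustration; the plan's actual rule set is not shown in this diff.

```python
import csv
import io

# Leading characters that spreadsheet applications may treat as a formula start.
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r")


def defang(value: object) -> str:
    """Prefix formula-like text with an apostrophe so it is shown as plain text."""
    text = "" if value is None else str(value)
    if text.startswith(FORMULA_TRIGGERS):
        return "'" + text
    return text


def write_review_csv(rows: list[list[object]]) -> str:
    """Serialize rows to CSV text, defanging every cell and quoting all fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow([defang(cell) for cell in row])
    return buffer.getvalue()


# Hypothetical review row: a formula-like name, a raw date string, an empty cell.
output = write_review_csv([["=SUM(A1:A9)", "7./8. Sept.1923", None]])
```

Because the prefix changes the cell text, a scheme like this fits human-facing review files only; the verbatim raw strings required by NFR-DATA-01 would still need to be kept unmodified elsewhere. Raw values that legitimately start with `-` or `+` would also gain the apostrophe.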
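The duplicate-index fix in the session log changes the report from "repeats only" to "every occurrence". As a hedged sketch of that behaviour, assuming each input row is a pair of source row number and index value, the grouping below returns all row numbers for any index seen more than once, including the first. The function name and sample values are hypothetical.

```python
from collections import defaultdict


def find_duplicate_indexes(rows: list[tuple[int, str]]) -> dict[str, list[int]]:
    """Map each duplicated index value to every source row where it occurs."""
    seen: dict[str, list[int]] = defaultdict(list)
    for row_number, index_value in rows:
        seen[index_value].append(row_number)
    # Keep only values that occur more than once; the first occurrence is
    # included, so no row of a duplicated index goes unreported.
    return {value: numbers for value, numbers in seen.items() if len(numbers) > 1}


# Hypothetical input: index "A-1" appears on source rows 2, 4, and 5.
duplicates = find_duplicate_indexes([(2, "A-1"), (3, "A-2"), (4, "A-1"), (5, "A-1")])
```

Collecting all rows first and filtering afterwards avoids the earlier failure mode, where the first row of a duplicated index had already been accepted by the time the repeat was noticed.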