docs(import): add normalizer implementation plan + apply persona review

17-task TDD plan for tools/import-normalizer/. Incorporates inline
6-persona review: content-deterministic idempotency, duplicate-index
fix, provisional-id collision guard, date-parser edge cases, multi-sender
split, CSV-injection defang, pinned deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-05-25 12:55:50 +02:00
parent adfff420a5
commit 6f7aa643c9
3 changed files with 2301 additions and 2 deletions

@@ -36,7 +36,9 @@ appears under many names. Importing as-is produces garbage (see `IMP-01..12`).
- G3 — **100%** of original values (raw date string, raw name string, source row number)
are preserved.
- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
**byte-identical** when re-run with unchanged inputs+overrides.
**content-deterministic** when re-run with unchanged inputs+overrides: identical canonical
cell matrices and identical review-file contents. (Workbook metadata is pinned; literal xlsx
byte-identity is not guaranteed because the zip container stores entry metadata.)
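
A minimal sketch of why G4 asserts on content rather than bytes, using only the Python standard library. It builds two one-entry zip archives with the same payload but different entry timestamps. The entry name and payload are made-up placeholders, not taken from the real workbook, and `make_zip` / `read_entries` are hypothetical helpers, not part of the normalizer.

```python
import io
import zipfile


def make_zip(payload: bytes, date_time: tuple) -> bytes:
    """Build a one-entry zip in memory with an explicit entry timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Hypothetical entry name; a real xlsx holds many such XML parts.
        info = zipfile.ZipInfo("xl/worksheets/sheet1.xml", date_time=date_time)
        info.compress_type = zipfile.ZIP_DEFLATED
        zf.writestr(info, payload)
    return buf.getvalue()


def read_entries(blob: bytes) -> dict:
    """Return {entry name: decompressed bytes}, ignoring container metadata."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in sorted(zf.namelist())}


# Placeholder payload, identical in both archives.
payload = b"<sheetData><row><c><v>placeholder</v></c></row></sheetData>"

# Same content, entry timestamps two seconds apart.
first = make_zip(payload, (2026, 5, 25, 12, 0, 0))
second = make_zip(payload, (2026, 5, 25, 12, 0, 2))

bytes_equal = first == second                                # container bytes
content_equal = read_entries(first) == read_entries(second)  # logical content
```

Here `bytes_equal` is false while `content_equal` is true, which is the distinction the goal draws between byte-identity and content-determinism.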
**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the `MassImportService` as the downstream consumer.
@@ -238,7 +240,7 @@ complete.*
| ID | Category | Requirement (measurable) |
| --- | --- | --- |
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ identical *logical* output across runs/machines: identical canonical cell matrices and review-file contents. Workbook `created`/`modified` metadata is pinned to a constant; ordering of all generated rows/aliases is stable (no set-iteration leakage). xlsx byte-identity is explicitly not required — determinism is asserted on content. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
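
One way NFR-IDEM-01 could be asserted in a test is to hash a canonical cell matrix instead of the xlsx bytes. The sketch below is an assumption about how that might look, not the plan's actual implementation: `canonical_matrix`, `content_digest`, and `stable_aliases` are hypothetical names, and the sample rows are invented for illustration.

```python
import hashlib
import json


def canonical_matrix(rows):
    """Render every cell as a string so the digest ignores cell types."""
    return [["" if cell is None else str(cell) for cell in row] for row in rows]


def content_digest(rows):
    """SHA-256 over a compact JSON encoding of the canonical matrix."""
    payload = json.dumps(
        canonical_matrix(rows), ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stable_aliases(aliases):
    """Sort explicitly; set iteration order for strings can vary per process."""
    return sorted(aliases)


# Invented sample data: the same aliases arriving in different orders.
aliases_run_a = {"Smith, John", "J. Smith", "John Smith"}
aliases_run_b = ["John Smith", "Smith, John", "J. Smith"]

header = [["row", "alias"]]
run_a = header + [[i, a] for i, a in enumerate(stable_aliases(aliases_run_a), 1)]
run_b = header + [[i, a] for i, a in enumerate(stable_aliases(aliases_run_b), 1)]

digests_match = content_digest(run_a) == content_digest(run_b)
```

Sorting before emitting rows is what keeps set-iteration order from leaking into the output; the digest then depends only on cell content, not on workbook or zip metadata.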