Import normalizer: offline tool to normalize the raw archive spreadsheets #663

Merged
marcel merged 172 commits from docs/import-migration into main 2026-05-28 15:05:51 +02:00
10 changed files with 48 additions and 3183 deletions
Showing only changes of commit 46d1f5c6d8

.gitignore

@@ -30,3 +30,6 @@ frontend/yarn.lock
**/.venv/
**/__pycache__/
*.pyc
+# Canonical import artifacts live only on the ops host (PII).
+# See tools/import-normalizer/.gitignore — load-bearing for that policy.


@@ -36,7 +36,9 @@
# accidentally share an import source. Must be
# readable by the backend container's UID
# (currently root via the OpenJDK image — any
-# world-readable directory works).
+# world-readable directory works). Canonical
+# artifacts are NOT in git (PII — ADR-025); ops
+# syncs them in beside the PDFs out-of-band.
networks:
archiv-net:
@@ -224,6 +226,10 @@ services:
# Read-only; the canonical importer only reads them from /import.
# Required — no default — so staging and prod cannot accidentally share an
# import source. CI workflows pin this per-env (see .gitea/workflows/).
+# NOTE: the canonical artifacts are NOT version-controlled (they contain real
+# family PII — see ADR-025). Ops must produce them locally from the Python
+# normalizer (tools/import-normalizer/) and sync them into this host path
+# alongside the <index>.pdf corpus before triggering an import.
volumes:
- ${IMPORT_HOST_DIR:?Set IMPORT_HOST_DIR to a host path holding the import payload (canonical artifacts + <index>.pdf files). See docs/DEPLOYMENT.md.}:/import:ro
environment:
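
The comment above asks ops to sync the canonical artifacts into the import path before triggering an import. A minimal pre-flight sketch of that check follows. It assumes the four files named in ADR-025 sit at the top level of the directory; the diff does not confirm that layout, and `missing_artifacts` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# File names as listed in ADR-025. Their location directly under the import
# directory is an assumption made for this sketch.
CANONICAL_ARTIFACTS = (
    "canonical-documents.xlsx",
    "canonical-persons.xlsx",
    "canonical-tag-tree.xlsx",
    "canonical-persons-tree.json",
)


def missing_artifacts(import_dir: Path) -> list[str]:
    """Return the canonical artifact names that are not regular files in import_dir."""
    return [name for name in CANONICAL_ARTIFACTS if not (import_dir / name).is_file()]
```

Ops could call `missing_artifacts(Path(os.environ["IMPORT_HOST_DIR"]))` on the host and abort when the returned list is non-empty.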


@@ -64,8 +64,10 @@ name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **del
The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
-`PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts**
-(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
+`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
+by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
+alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
+below). A shared `CanonicalSheetReader` maps columns **by header
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
loader calls the **owning domain's service**, never a repository (layering rule); the tree
loader uses `RelationshipService`, never the relationship repository.
@@ -173,3 +175,27 @@ Settled sub-decisions:
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
`findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
against real Postgres.
+
+---
+
+## Canonical artifacts are produced locally, NOT version-controlled
+
+The four files in `tools/import-normalizer/out/` —
+`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
+`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
+prose) and are **deliberately excluded from the git index** via
+`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
+spreadsheet by running the Python normalizer, and synced into the ops host's
+`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
+that delivers the PDFs.
+
+The contract between normalizer and importer is the **header schema** (column names,
+their types, the `Precision` enum strings, the slug shape) — not the file contents.
+`CanonicalSheetReader` maps columns by header name and fails closed
+(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
+file-level golden fixtures stay outside the repo.
+
+A future maintainer must not "fix" CI by checking these artifacts back in — they are
+PII, the regression that prompted this rule. Tests use small synthetic fixtures
+constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
+than real-corpus snapshots.
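
To illustrate the header-schema contract described above, here is a small Python sketch of reading rows by header name and failing closed on a missing header. It is not the Java `CanonicalSheetReader`: the function, the exception class, and the `title` column are hypothetical, and a sheet is modelled as plain lists rather than an xlsx file. Only `source_row` and `needs_review` are column names taken from the spec.

```python
from typing import Sequence


class ImportArtifactInvalid(Exception):
    """Hypothetical stand-in for the importer's IMPORT_ARTIFACT_INVALID error."""


def read_by_header(
    rows: Sequence[Sequence[object]],
    required: Sequence[str],
) -> list[dict[str, object]]:
    """Return one dict per data row, keyed by header name rather than column index."""
    if not rows:
        raise ImportArtifactInvalid("sheet is empty: no header row")
    header = [str(cell).strip() for cell in rows[0]]
    missing = [name for name in required if name not in header]
    if missing:
        # Fail closed: reject the whole sheet instead of guessing column positions.
        raise ImportArtifactInvalid("missing header(s): " + ", ".join(missing))
    index = {name: header.index(name) for name in required}
    return [
        # Short rows yield None for columns they do not reach.
        {name: (row[i] if i < len(row) else None) for name, i in index.items()}
        for row in rows[1:]
    ]
```

Because lookup goes through the header row, reordering or adding columns in the normalizer output does not change what the reader returns, while dropping a required column is rejected before any row is processed.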


@@ -231,7 +231,9 @@ complete.*
### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12
- **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
-`out/canonical-persons.xlsx` with the headered schemas in §6.
+`out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
+**gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
+import host alongside the PDFs out-of-band.
- **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
in the source sheet) so any value can be traced back to the original.
- **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more


@@ -42,6 +42,12 @@ built) transforms the raw xlsx + person register into a clean canonical dataset
re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**.
See the spec for the full contract.
+The canonical artifacts themselves (the `out/` files) are **produced locally and not
+version-controlled** — they contain real family PII. They are synced onto the ops host's
+`IMPORT_HOST_DIR` alongside the PDFs, out-of-band. The contract is the header schema in
+`02-normalization-spec.md` §6, not any particular file in `out/`. See ADR-025 for the full
+rationale.
+
## Status board
| ID | Issue | Severity | Status |


@@ -1,7 +1,5 @@
.venv/
-out/*
-!out/canonical-persons-tree.json
-!out/*.xlsx
+out/
review/
__pycache__/
*.pyc

File diff suppressed because it is too large