2026-05-28 15:05:51 +02:00
10 changed files with 48 additions and 3183 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -30,3 +30,6 @@ frontend/yarn.lock
 **/.venv/
 **/__pycache__/
 *.pyc
+
+# Canonical import artifacts live only on the ops host (PII).
+# See tools/import-normalizer/.gitignore — load-bearing for that policy.
--- a/docker-compose.prod.yml
+++ b/docker-compose.prod.yml
@@ -36,7 +36,9 @@
 #                               accidentally share an import source. Must be
 #                               readable by the backend container's UID
 #                               (currently root via the OpenJDK image — any
-#                               world-readable directory works).
+#                               world-readable directory works). Canonical
+#                               artifacts are NOT in git (PII — ADR-025); ops
+#                               syncs them in beside the PDFs out-of-band.

 networks:
  archiv-net:
@@ -224,6 +226,10 @@ services:
    # Read-only; the canonical importer only reads them from /import.
    # Required — no default — so staging and prod cannot accidentally share an
    # import source. CI workflows pin this per-env (see .gitea/workflows/).
+    # NOTE: the canonical artifacts are NOT version-controlled (they contain real
+    # family PII; see ADR-025). Ops must produce them locally by running the
+    # Python normalizer (tools/import-normalizer/) and sync them into this host
+    # path alongside the <index>.pdf corpus before triggering an import.
    volumes:
      - ${IMPORT_HOST_DIR:?Set IMPORT_HOST_DIR to a host path holding the import payload (canonical artifacts + <index>.pdf files). See docs/DEPLOYMENT.md.}:/import:ro
    environment:
--- a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
+++ b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
@@ -64,8 +64,10 @@ name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **del

 The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
 an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
-`PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts**
-(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
+`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
+by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
+alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
+below). A shared `CanonicalSheetReader` maps columns **by header
 name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
 loader calls the **owning domain's service**, never a repository (layering rule); the tree
 loader uses `RelationshipService`, never the relationship repository.
@@ -173,3 +175,27 @@ Settled sub-decisions:
  added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
  `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
  against real Postgres.
+
+---
+
+## Canonical artifacts are produced locally, NOT version-controlled
+
+The four files in `tools/import-normalizer/out/` —
+`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
+`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
+prose) and are **deliberately excluded from the git index** via
+`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
+spreadsheet by running the Python normalizer, and synced into the ops host's
+`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
+that delivers the PDFs.
+
+The contract between normalizer and importer is the **header schema** (column names,
+their types, the `Precision` enum strings, the slug shape) — not the file contents.
+`CanonicalSheetReader` maps columns by header name and fails closed
+(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
+file-level golden fixtures stay outside the repo.
+
+A future maintainer must not "fix" CI by checking these artifacts back in: they are
+PII, and committing them was the regression that prompted this rule. Tests use small
+synthetic fixtures constructed in-process (`DocumentImporterTest`,
+`CanonicalImportIntegrationTest`) rather than real-corpus snapshots.
--- a/docs/import-migration/02-normalization-spec.md
+++ b/docs/import-migration/02-normalization-spec.md
@@ -231,7 +231,9 @@ complete.*
 ### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12

 - **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
-  `out/canonical-persons.xlsx` with the headered schemas in §6.
+  `out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
+  **gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
+  import host alongside the PDFs out-of-band.
 - **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
  in the source sheet) so any value can be traced back to the original.
 - **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
--- a/docs/import-migration/README.md
+++ b/docs/import-migration/README.md
@@ -42,6 +42,12 @@ built) transforms the raw xlsx + person register into a clean canonical dataset
 re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**.
 See the spec for the full contract.

+The canonical artifacts themselves (the `out/` files) are **produced locally and not
+version-controlled** — they contain real family PII. They are synced onto the ops host's
+`IMPORT_HOST_DIR` alongside the PDFs, out-of-band. The contract is the header schema in
+`02-normalization-spec.md` §6, not any particular file in `out/`. See ADR-025 for the full
+rationale.
+
 ## Status board

 | ID | Issue | Severity | Status |
--- a/tools/import-normalizer/.gitignore
+++ b/tools/import-normalizer/.gitignore
@@ -1,7 +1,5 @@
 .venv/
-out/*
-!out/canonical-persons-tree.json
-!out/*.xlsx
+out/
 review/
 __pycache__/
 *.pyc
--- a/tools/import-normalizer/out/canonical-documents.xlsx
+++ b/tools/import-normalizer/out/canonical-documents.xlsx
--- a/tools/import-normalizer/out/canonical-persons-tree.json
+++ b/tools/import-normalizer/out/canonical-persons-tree.json
--- a/tools/import-normalizer/out/canonical-persons.xlsx
+++ b/tools/import-normalizer/out/canonical-persons.xlsx
--- a/tools/import-normalizer/out/canonical-tag-tree.xlsx
+++ b/tools/import-normalizer/out/canonical-tag-tree.xlsx
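
Reviewer note: the ADR text in this change says the contract is the header schema, mapped by header name and failing closed on a missing header. The sketch below illustrates that idea in Python (the normalizer's language) using plain in-memory rows. It is not the actual `CanonicalSheetReader`: the function `read_by_header` and the exception `ImportArtifactInvalid` are hypothetical names chosen for illustration, and only the `source_row` and `needs_review` header names come from the spec.

```python
class ImportArtifactInvalid(Exception):
    """Hypothetical stand-in for the importer's IMPORT_ARTIFACT_INVALID error."""


def read_by_header(rows, required_headers):
    """Map each data row to a dict keyed by header name, not by column index.

    `rows` is a list of rows (lists of cell values); the first row is assumed
    to hold the header names. Raises ImportArtifactInvalid when the sheet is
    empty or any required header is absent, so a malformed artifact is
    rejected instead of being partially imported.
    """
    if not rows:
        raise ImportArtifactInvalid("artifact has no header row")

    header = [str(cell).strip() for cell in rows[0]]
    missing = [name for name in required_headers if name not in header]
    if missing:
        raise ImportArtifactInvalid(f"missing required header(s): {missing}")

    # Resolve each required header to its column position once.
    index = {name: header.index(name) for name in required_headers}

    records = []
    for row in rows[1:]:
        # Short rows yield None for the cells they do not contain.
        records.append(
            {name: (row[i] if i < len(row) else None) for name, i in index.items()}
        )
    return records


# Synthetic, PII-free fixture built in-process; column order is deliberately
# different from the order in which the headers are requested.
sheet = [
    ["needs_review", "source_row"],
    ["", 2],
    ["date", 3],
]
records = read_by_header(sheet, ["source_row", "needs_review"])
```

Because lookup goes through the header name, reordering columns in a regenerated artifact does not change the result, while dropping a required column stops the read before any row is processed. That is the property that lets the tests rely on small synthetic fixtures rather than real-corpus files.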