chore(import): stop tracking real family PII canonical artifacts

The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-28 10:20:38 +02:00
parent 07300aeff7
commit 46d1f5c6d8
10 changed files with 48 additions and 3183 deletions

View File

@@ -64,8 +64,10 @@ name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **del
The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
an explicit dependency DAG — `TagTreeImporter``PersonRegisterImporter`
`PersonTreeImporter``DocumentImporter` — that **consume the committed canonical artifacts**
(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
`PersonTreeImporter``DocumentImporter` — that **consume the canonical artifacts produced
by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
below). A shared `CanonicalSheetReader` maps columns **by header
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
loader calls the **owning domain's service**, never a repository (layering rule); the tree
loader uses `RelationshipService`, never the relationship repository.
@@ -173,3 +175,27 @@ Settled sub-decisions:
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
`findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
against real Postgres.
---
## Canonical artifacts are produced locally, NOT version-controlled
The four files in `tools/import-normalizer/out/` —
`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
prose) and are **deliberately excluded from the git index** via
`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
spreadsheet by running the Python normalizer, and synced into the ops host's
`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
that delivers the PDFs.
The contract between normalizer and importer is the **header schema** (column names,
their types, the `Precision` enum strings, the slug shape) — not the file contents.
`CanonicalSheetReader` maps columns by header name and fails closed
(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
file-level golden fixtures stay outside the repo.
A future maintainer must not "fix" CI by checking these artifacts back in — they are
PII, the regression that prompted this rule. Tests use small synthetic fixtures
constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
than real-corpus snapshots.