chore(import): stop tracking real family PII canonical artifacts

The four files in tools/import-normalizer/out/ contain real names, addresses, and attribution prose for ~163 living/deceased family members and were committed by mistake. They are now removed from the index (kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The contract between normalizer and importer is the header schema, not the file contents — CanonicalSheetReader fails closed on a missing header, which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -64,8 +64,10 @@ name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **del
 
 The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
 an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
-`PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts**
-(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
+`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
+by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
+alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
+below). A shared `CanonicalSheetReader` maps columns **by header
 name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
 loader calls the **owning domain's service**, never a repository (layering rule); the tree
 loader uses `RelationshipService`, never the relationship repository.
@@ -173,3 +175,27 @@ Settled sub-decisions:
 added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
 `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
 against real Postgres.
+
+---
+
+## Canonical artifacts are produced locally, NOT version-controlled
+
+The four files in `tools/import-normalizer/out/` —
+`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
+`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
+prose) and are **deliberately excluded from the git index** via
+`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
+spreadsheet by running the Python normalizer, and synced into the ops host's
+`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
+that delivers the PDFs.
+
+The contract between normalizer and importer is the **header schema** (column names,
+their types, the `Precision` enum strings, the slug shape) — not the file contents.
+`CanonicalSheetReader` maps columns by header name and fails closed
+(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
+file-level golden fixtures stay outside the repo.
+
+A future maintainer must not "fix" CI by checking these artifacts back in — they are
+PII, the regression that prompted this rule. Tests use small synthetic fixtures
+constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
+than real-corpus snapshots.
@@ -231,7 +231,9 @@ complete.*
 ### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12
 
 - **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
-  `out/canonical-persons.xlsx` with the headered schemas in §6.
+  `out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
+  **gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
+  import host alongside the PDFs out-of-band.
 - **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
   in the source sheet) so any value can be traced back to the original.
 - **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
@@ -42,6 +42,12 @@ built) transforms the raw xlsx + person register into a clean canonical dataset
 re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**.
 See the spec for the full contract.
 
+The canonical artifacts themselves (the `out/` files) are **produced locally and not
+version-controlled** — they contain real family PII. They are synced onto the ops host's
+`IMPORT_HOST_DIR` alongside the PDFs, out-of-band. The contract is the header schema in
+`02-normalization-spec.md` §6, not any particular file in `out/`. See ADR-025 for the full
+rationale.
+
 ## Status board
 
 | ID | Issue | Severity | Status |
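The "header schema is the contract" idea in the ADR text can be illustrated with a small sketch. This is not the project's `CanonicalSheetReader` (a Java class whose code is not part of this commit); it is a minimal Python illustration of mapping columns by header name and failing closed on a missing header. The exception class and both function names are hypothetical, and only `source_row` and `needs_review` are column names taken from the spec text above.

```python
class ImportArtifactInvalid(Exception):
    """Hypothetical stand-in for the IMPORT_ARTIFACT_INVALID error code."""


def map_columns_by_header(header_row, required):
    """Return {header name: column index}; raise if a required header is absent."""
    index = {}
    for position, name in enumerate(header_row):
        if name is not None:
            index[str(name).strip()] = position
    missing = [name for name in required if name not in index]
    if missing:
        # Fail closed: refuse the whole artifact instead of guessing by position.
        raise ImportArtifactInvalid("missing header(s): " + ", ".join(missing))
    return index


def read_rows(rows, required):
    """Yield one dict per data row, keyed by header name rather than position."""
    rows = iter(rows)
    try:
        header_row = next(rows)
    except StopIteration:
        raise ImportArtifactInvalid("artifact has no header row") from None
    index = map_columns_by_header(header_row, required)
    for row in rows:
        yield {name: row[index[name]] for name in required}


if __name__ == "__main__":
    # Synthetic rows only; column order deliberately differs from `required`.
    sheet = [("needs_review", "source_row"), ("", 7)]
    print(list(read_rows(sheet, ["source_row", "needs_review"])))
```

Because lookup goes through the header name, reordering columns in a regenerated artifact does not affect the reader, while a renamed or dropped column is rejected up front. That is why small synthetic fixtures are enough to exercise the contract without any real-corpus file.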