Import normalizer: offline tool to normalize the raw archive spreadsheets #663
3
.gitignore
vendored
@@ -30,3 +30,6 @@ frontend/yarn.lock
 **/.venv/
 **/__pycache__/
 *.pyc
+
+# Canonical import artifacts live only on the ops host (PII).
+# See tools/import-normalizer/.gitignore — load-bearing for that policy.
@@ -36,7 +36,9 @@
 # accidentally share an import source. Must be
 # readable by the backend container's UID
 # (currently root via the OpenJDK image — any
-# world-readable directory works).
+# world-readable directory works). Canonical
+# artifacts are NOT in git (PII — ADR-025); ops
+# syncs them in beside the PDFs out-of-band.
 
 networks:
 archiv-net:
@@ -224,6 +226,10 @@ services:
 # Read-only; the canonical importer only reads them from /import.
 # Required — no default — so staging and prod cannot accidentally share an
 # import source. CI workflows pin this per-env (see .gitea/workflows/).
+# NOTE: the canonical artifacts are NOT version-controlled (they contain real
+# family PII — see ADR-025). Ops must produce them locally from the Python
+# normalizer (tools/import-normalizer/) and sync them into this host path
+# alongside the <index>.pdf corpus before triggering an import.
 volumes:
 - ${IMPORT_HOST_DIR:?Set IMPORT_HOST_DIR to a host path holding the import payload (canonical artifacts + <index>.pdf files). See docs/DEPLOYMENT.md.}:/import:ro
 environment:
@@ -64,8 +64,10 @@ name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **del
 
 The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
 an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
-`PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts**
-(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
+`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
+by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
+alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
+below). A shared `CanonicalSheetReader` maps columns **by header
 name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
 loader calls the **owning domain's service**, never a repository (layering rule); the tree
 loader uses `RelationshipService`, never the relationship repository.
@@ -173,3 +175,27 @@ Settled sub-decisions:
 added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
 `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
 against real Postgres.
+
+---
+
+## Canonical artifacts are produced locally, NOT version-controlled
+
+The four files in `tools/import-normalizer/out/` —
+`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
+`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
+prose) and are **deliberately excluded from the git index** via
+`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
+spreadsheet by running the Python normalizer, and synced into the ops host's
+`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
+that delivers the PDFs.
+
+The contract between normalizer and importer is the **header schema** (column names,
+their types, the `Precision` enum strings, the slug shape) — not the file contents.
+`CanonicalSheetReader` maps columns by header name and fails closed
+(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
+file-level golden fixtures stay outside the repo.
+
+A future maintainer must not "fix" CI by checking these artifacts back in — they are
+PII, the regression that prompted this rule. Tests use small synthetic fixtures
+constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
+than real-corpus snapshots.
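The "map by header name, fail closed" rule described in the ADR text above can be illustrated with a minimal Python sketch. This is not the Java `CanonicalSheetReader`; it only mirrors the behaviour the ADR describes. The exception class, the function names, and the `title` column are hypothetical; `source_row` and `needs_review` are taken from the spec's provenance requirements.

```python
class ImportArtifactInvalid(Exception):
    """Hypothetical stand-in for the importer's IMPORT_ARTIFACT_INVALID error."""

    code = "IMPORT_ARTIFACT_INVALID"


def map_headers(header_row, required):
    """Return {header name: column index}, failing closed on a missing header.

    Columns are located by name, so reordering or adding columns in the
    sheet does not silently shift values into the wrong field.
    """
    index = {}
    for position, name in enumerate(header_row):
        if name is None:
            continue  # skip empty header cells
        # Assumption: if a header appears twice, the first occurrence wins.
        index.setdefault(str(name).strip(), position)

    missing = [name for name in required if name not in index]
    if missing:
        # Fail closed: refuse the whole artifact rather than import partial data.
        raise ImportArtifactInvalid("missing header(s): " + ", ".join(missing))
    return index


def read_cell(row, index, header):
    """Read one cell from a data row by header name."""
    return row[index[header]]
```

With this shape, a sheet whose columns were reordered still reads correctly, while a sheet that lost a required column is rejected before any row is processed.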
@@ -231,7 +231,9 @@ complete.*
 ### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12
 
 - **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
-`out/canonical-persons.xlsx` with the headered schemas in §6.
+`out/canonical-persons.xlsx` with the headered schemas in §6. The `out/` directory is
+**gitignored** (real family PII — see ADR-025); ops syncs the regenerated files onto the
+import host alongside the PDFs out-of-band.
 - **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
 in the source sheet) so any value can be traced back to the original.
 - **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
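REQ-PROV-01 above asks for a 1-based `source_row` on every canonical document row. A minimal sketch of how that numbering might be attached follows. It assumes the source sheet has a single header row (so the first data row is sheet row 2) and that blank rows are dropped only after numbering; both are assumptions for illustration, and the function names are hypothetical rather than taken from the normalizer.

```python
def with_source_row(rows, header_rows=1):
    """Yield (source_row, row) pairs using 1-based sheet row numbers.

    Assumption: the first `header_rows` rows are headers, so data starts
    at sheet row header_rows + 1.
    """
    for offset, row in enumerate(rows[header_rows:]):
        yield header_rows + offset + 1, row


def canonical_rows(rows, header_rows=1):
    """Drop blank rows while keeping each survivor's original row number."""
    result = []
    for source_row, row in with_source_row(rows, header_rows):
        if all(cell in (None, "") for cell in row):
            continue  # numbering happened first, so gaps stay visible
        result.append({"source_row": source_row, "cells": list(row)})
    return result
```

Numbering before filtering is the point of the sketch: a value in the canonical output can then be traced to the same row a person sees when opening the original spreadsheet.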
@@ -42,6 +42,12 @@ built) transforms the raw xlsx + person register into a clean canonical dataset
 re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**.
 See the spec for the full contract.
 
+The canonical artifacts themselves (the `out/` files) are **produced locally and not
+version-controlled** — they contain real family PII. They are synced onto the ops host's
+`IMPORT_HOST_DIR` alongside the PDFs, out-of-band. The contract is the header schema in
+`02-normalization-spec.md` §6, not any particular file in `out/`. See ADR-025 for the full
+rationale.
+
 ## Status board
 
 | ID | Issue | Severity | Status |
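Because the artifacts now reach the import host out-of-band, a preflight check before triggering an import may be useful. The sketch below is one possible shape, not part of this PR. It assumes the four canonical files sit at the top level of `IMPORT_HOST_DIR` next to the PDFs; the function names are hypothetical, while the file names are the ones listed in ADR-025.

```python
import os
from pathlib import Path

# File names as listed in the ADR section on canonical artifacts.
CANONICAL_ARTIFACTS = (
    "canonical-documents.xlsx",
    "canonical-persons.xlsx",
    "canonical-tag-tree.xlsx",
    "canonical-persons-tree.json",
)


def missing_artifacts(import_dir):
    """Return the canonical artifact names not present in import_dir.

    Assumption: artifacts live directly in the directory, not in a subfolder.
    """
    root = Path(import_dir)
    return [name for name in CANONICAL_ARTIFACTS if not (root / name).is_file()]


def preflight(env=os.environ):
    """Exit with a message if the import payload looks incomplete."""
    import_dir = env.get("IMPORT_HOST_DIR")
    if not import_dir:
        raise SystemExit("IMPORT_HOST_DIR is not set")
    missing = missing_artifacts(import_dir)
    if missing:
        raise SystemExit("missing canonical artifacts: " + ", ".join(missing))
```

Calling `preflight()` on the ops host after the sync would surface a partial copy early, before the importer reports it.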
4
tools/import-normalizer/.gitignore
vendored
@@ -1,7 +1,5 @@
 .venv/
-out/*
-!out/canonical-persons-tree.json
-!out/*.xlsx
+out/
 review/
 __pycache__/
 *.pyc
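An ignore rule only stops new files from being added; it does not remove files that are already in the index. A small guard that fails when anything under `out/` is still tracked could back up the rule above. The following is a sketch under that assumption: the function names are hypothetical, and the only git command used is `git ls-files -z`.

```python
import subprocess


def tracked_under(paths, prefix="tools/import-normalizer/out/"):
    """Return the tracked paths that sit under the ignored output directory."""
    return sorted(path for path in paths if path.startswith(prefix))


def tracked_paths(repo_root="."):
    """List tracked files via `git ls-files -z` (NUL-separated, repo-relative)."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_root,
        check=True,
        capture_output=True,
        text=True,
    )
    return [path for path in result.stdout.split("\0") if path]


def assert_no_tracked_artifacts(repo_root="."):
    """Raise if any canonical artifact is still in the git index."""
    offenders = tracked_under(tracked_paths(repo_root))
    if offenders:
        raise RuntimeError("PII artifacts are tracked: " + ", ".join(offenders))
```

Run from the repository root, `assert_no_tracked_artifacts()` could serve as a CI step so that re-adding the artifacts fails loudly instead of passing review unnoticed.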
Binary file not shown.
File diff suppressed because it is too large
Binary file not shown.
Binary file not shown.