ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation
Date: 2026-05-27
Status: Accepted
Issue: #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
Milestone: Handling the Unknowns (honest uncertainty in dates & people)
Context
The "Handling the Unknowns" milestone introduces honest uncertainty into the archive:
documents whose dates are known only approximately or as a range, and people the importer
infers from raw attribution text but cannot confidently identify. Three sibling issues —
date precision (#666), name triage (#665), and the importer (#669) — each independently
planned a Flyway V69 migration that altered persons. Three V69s is a boot failure
(Flyway versions must be unique), and persons.provisional was at risk of being defined
twice.
Two durable decisions had to be made before any application code in Phases 3–6 could compile against the new schema.
Decision
1. All import/precision/attribution/identity schema lives in ONE migration with a single owner
V69__import_precision_attribution_identity_schema.sql adds every new column for this
milestone in a single, atomic, forward-only migration:
- documents: meta_date_precision (backfilled DAY where dated / UNKNOWN where not, then NOT NULL), meta_date_end, meta_date_raw, sender_text, receiver_text.
- persons: source_ref (unique index, nullable), provisional (NOT NULL DEFAULT false).
- tag: source_ref (unique index, nullable).
Integrity is pushed to the database as fail-closed CHECK constraints (the precedent is
V22's person_type allowlist):
- meta_date_precision must be one of the seven enum values.
- meta_date_end may be non-null only when precision = RANGE (one-directional, not biconditional; see Consequences).
- meta_date_end >= meta_date for ranges with both endpoints (a CHECK, not a trigger).
- meta_date_raw, sender_text, receiver_text are length-capped at 10 000 (mirrors the transcription_blocks cap in V18).
No sibling issue adds another migration that alters persons or documents in this
milestone.
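The CHECK rules above can be restated as a plain predicate. The following Python sketch is illustrative only: the real enforcement lives in the SQL migration, and the function name, the row shape, and the assumption that the length cap is inclusive are hypothetical, not taken from the codebase.

```python
from datetime import date
from typing import Optional

# The seven precision values named in this ADR.
PRECISIONS = {"DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN"}
MAX_RAW_LENGTH = 10_000  # assumed inclusive cap for the raw text columns


def violates_document_checks(
    precision: str,
    meta_date: Optional[date],
    meta_date_end: Optional[date],
    raw_texts: tuple = (),
) -> list:
    """Return the list of violated rules; an empty list means the row passes."""
    problems = []
    if precision not in PRECISIONS:
        problems.append("precision not in allowlist")
    # One-directional: an end date implies RANGE, but RANGE does not imply an end date.
    if meta_date_end is not None and precision != "RANGE":
        problems.append("end date set on a non-RANGE row")
    # Ordering is only checked when both endpoints are present.
    if meta_date is not None and meta_date_end is not None and meta_date_end < meta_date:
        problems.append("range ends before it starts")
    if any(text is not None and len(text) > MAX_RAW_LENGTH for text in raw_texts):
        problems.append("raw text exceeds length cap")
    return problems
```

A start-only RANGE row passes this predicate, which matches the one-directional rule described under Consequences.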
2. The backend DatePrecision enum is a verbatim mirror of the normalizer's Precision; the canonical output is the contract
The importer reads the Python normalizer's canonical output
(tools/import-normalizer/). The backend DatePrecision enum
(DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN) is a verbatim copy of the normalizer's
Precision(StrEnum) (dates.py). There is no translation layer: the normalizer's
output strings are persisted as-is. The same applies to source_ref, which carries the
normalizer's person_id / canonical tag_path unchanged as the re-import idempotency key.
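A minimal sketch of what "verbatim mirror" means in practice, assuming the enum's string values equal its member names (the ADR lists only the names). The class below is written with (str, Enum) so it also runs on Python versions before 3.11; the normalizer itself uses StrEnum. BACKEND_DATE_PRECISION and enum_drift are hypothetical names used for illustration.

```python
from enum import Enum


class Precision(str, Enum):
    """Stand-in for the normalizer's Precision enum (values assumed)."""
    DAY = "DAY"
    MONTH = "MONTH"
    SEASON = "SEASON"
    YEAR = "YEAR"
    RANGE = "RANGE"
    APPROX = "APPROX"
    UNKNOWN = "UNKNOWN"


# Hypothetical copy of the values the backend enum accepts.
BACKEND_DATE_PRECISION = ("DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN")


def enum_drift(normalizer_values, backend_values):
    """Return (only_in_normalizer, only_in_backend); two empty sets mean no drift."""
    normalizer, backend = set(normalizer_values), set(backend_values)
    return normalizer - backend, backend - normalizer
```

Because there is no translation layer, any non-empty result from a check like this would mean the two sides no longer agree on the persisted strings.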
3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)
The legacy MassImportService read the raw original spreadsheet by positional column
index (@Value app.import.col.*) and re-derived everything in Java (ISO-only date parsing,
name classification via findOrCreateByAlias, an ODS/XXE XML path). It is deleted.
The rebuild is a CanonicalImportOrchestrator driving four single-responsibility loaders in
an explicit dependency DAG — TagTreeImporter → PersonRegisterImporter →
PersonTreeImporter → DocumentImporter — that consume the canonical artifacts produced
by the offline Python normalizer (tools/import-normalizer/out/, synced onto the ops host
alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
below). A shared CanonicalSheetReader maps columns by header
name (not by index) and fails closed (IMPORT_ARTIFACT_INVALID) on a missing header. Each
loader calls the owning domain's service, never a repository (layering rule); the tree
loader uses RelationshipService, never the relationship repository.
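The loader ordering can be sketched as a fixed sequence in which each loader commits its own work and a failure stops the run. This is a Python illustration of the control flow only, not the Java orchestrator: run_import, the loader callables, and the "DONE" status string are assumptions (only FAILED is named in this ADR).

```python
# Loader names in the dependency order given above.
LOADER_ORDER = (
    "TagTreeImporter",
    "PersonRegisterImporter",
    "PersonTreeImporter",
    "DocumentImporter",
)


def run_import(loaders):
    """Run loaders in dependency order.

    loaders maps a loader name to a zero-argument callable. Returns
    (status, completed), where completed lists the loaders that finished
    before any failure. Earlier loaders are not rolled back.
    """
    completed = []
    for name in LOADER_ORDER:
        try:
            loaders[name]()
        except Exception:
            # Work done by earlier loaders stays in place; a re-run converges
            # because every loader is idempotent.
            return "FAILED", completed
        completed.append(name)
    return "DONE", completed
```

Persons are loaded before documents in this order, which is what lets a document row reference an already-registered person by source_ref.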
Settled sub-decisions:
- Idempotency precedence is domain-specific. Persons/tags upsert by source_ref, documents by index. Two distinct rules apply:
  - Person/Tag scalar fields = preserve human edits. On re-import a non-blank field a human changed in-app is never overwritten (blank fields are filled from canonical via the single preferHuman idiom), and provisional is monotonic-downward: once a human confirms a person (false) it never reverts to true. Because the orchestrator loads the register and tree before documents, a person already false can never be flipped provisional by a later document row that references the same source_ref, regardless of document-row order.
  - Document sender/receivers/tags = canonical-authoritative. A document's sender, receiver set, and tag set are owned by the canonical row, not the archivist. On re-import of a PLACEHOLDER document, DocumentImporter clears and re-populates receivers/tags so a row whose set shrinks prunes the removed links rather than accumulating stale ones. The "preserve human edits" rule above does not extend to these collections. The raw sender_text/receiver_text cells are always retained verbatim (a separate invariant). Note that non-PLACEHOLDER documents are skipped entirely (ALREADY_EXISTS), so once a document has a file the importer never touches it again; this bounds the authoritative-overwrite blast radius to placeholder rows.

  Verified against real Postgres in CanonicalImportIntegrationTest (reimport_preservesHumanEditedPersonField, reimport_prunesRemovedReceiverAndTag…, import_neverFlipsRegisterPersonToProvisional…).
- Name policy = Option A. The normalizer resolved attribution upstream: the document sheet carries the resolved slug in sender_person_id/receiver_person_ids and the raw cell in sender_name/receiver_names. The importer routes register-first by source_ref (provisional Person when a slug is unmatched), and always retains the raw cell in sender_text/receiver_text even when a person is linked; this is the load-bearing invariant behind the merge story. A row with no slug but raw text (prose / ? / object-noise) links no person and keeps only the raw text.
- provisional is now populated. Importer-minted persons are provisional = true; register and tree persons stay false. This is the Phase-3 contract the schema (decision 1) left at default-false.
- PDFs resolve directly by index (<index>.pdf), not by a file column. The corpus is uniform: all PDFs are named <index>.pdf flat in the import dir (e.g. W-0124.pdf, Mü-0001.pdf), so DocumentImporter resolves a document's PDF with an O(1) importDir.resolve(index + ".pdf") lookup. The redundant file column (carrying the spreadsheet's messy datei value) and the recursive directory walk that resolved it were removed (#686, which also closed #676; the O(rows×tree) walk is gone). The normalizer no longer emits file or the index_file_mismatch review flag.
- Security guards are defense-in-depth, not upstream-trust. The index is the only thing that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care that it came from our tool): isValidImportIndex rejects slash/backslash, three Unicode slash homoglyphs, any . (so <index>.pdf is the only extension and .. can never appear), null byte, and absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or more hyphens, digits, optional trailing x). A bad index skips the row with a clear SkipReason (INVALID_FILENAME_PATH_TRAVERSAL). The resolved canonical path is still asserted to stay inside the import dir as a second line of defense (a symlinked <index>.pdf cannot escape), and the %PDF magic-byte check still gates upload. These guards and their tests were ported from the file-column resolution (originally from MassImportService).
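The two idempotency rules among the sub-decisions above can be sketched side by side. This is a Python illustration of the merge logic, not the Java implementation: prefer_human mirrors the idiom named preferHuman, while merge_provisional and replace_links are hypothetical helper names, and treating whitespace-only values as blank is an assumption.

```python
def prefer_human(existing, canonical):
    """Scalar person/tag fields: keep a non-blank existing value, else fill from canonical."""
    if existing is not None and str(existing).strip():
        return existing
    return canonical


def merge_provisional(existing, incoming):
    """Monotonic-downward flag: once False (confirmed), it never returns to True."""
    return existing and incoming


def replace_links(current, canonical):
    """Document receivers/tags: the canonical row wins, so removed links are pruned."""
    return set(canonical)
```

For example, prefer_human("Anna M.", "Anna Muster") keeps the edited value, while replace_links({"a", "b"}, ["a"]) drops "b" because the canonical set shrank.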
Consequences
- RANGE is one-directional, not biconditional. A RANGE row may have a null meta_date_end (an open-ended range with only a start), because the normalizer can emit start-only ranges. A biconditional RANGE ⟺ end IS NOT NULL rule would reject valid normalizer output, so it was not adopted. Phase 4 rendering must handle a RANGE with no end gracefully.
- provisional stays false throughout the schema phase (decision 1). The column and flag exist, but no code path in that phase sets it true; the importer (Phase 3) is the only writer. This is intentional, not a half-built feature.
- A future dev must not "improve" the enum. Renaming or dropping a DatePrecision value without changing the normalizer silently breaks import idempotency and date rendering. The enum's Javadoc states this; the DB CHECK enforces validity independent of the Java enum.
- source_ref is unique + nullable. Manually created persons/tags have source_ref = NULL; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
- Forward-only. The migration is immutable once shipped (Flyway checksum model); any fix goes in a later version. There is no down-migration; rollback means restoring from the nightly pg_dump, the standard procedure.
- runImport() is non-transactional: per-loader transactions only. The orchestrator does not wrap the four loaders in a single transaction; each loader (or the per-call upsertBySourceRef / DocumentImporter.load) carries its own @Transactional boundary. A partial failure mid-run (e.g. the document loader throws after tags + persons committed) leaves the earlier loaders' data committed and the ImportStatus set to FAILED. This is acceptable precisely because the import is idempotent: re-running is safe and converges to the same state, so the operational recovery for a partial failure is simply to fix the offending artifact and re-trigger the import. No manual cleanup of half-written data is required. A future maintainer must not assume all-or-nothing semantics.
- The index pattern is corpus-specific and must be revisited if the catalog scheme grows. INDEX_PATTERN accepts only the current corpus shape: at most four Latin-1 letters (incl. umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing x. This is a conscious constraint, not a general filename validator: a future sub-collection catalogued with a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. Č or a Cyrillic id) would fail isValidImportIndex and its rows would be skipped (INVALID_FILENAME_PATH_TRAVERSAL), not imported. Likewise a real PDF that does not follow <index>.pdf produces a PLACEHOLDER (the importer logs both cases distinctly; see #686). If the catalog scheme ever changes, the pattern and its tests must be widened deliberately; do not loosen it casually, as it is the allowlist that keeps the on-disk lookup safe. Note that \d is intentionally ASCII-only: adding Pattern.UNICODE_CHARACTER_CLASS would silently widen the accepted digit set.
- A malicious/garbage index skips its row with a loud SkipReason, by design. Since #686 the index is the only on-disk lookup key. An index that fails isValidImportIndex (path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog shape) is recorded as a SkippedFile with reason INVALID_FILENAME_PATH_TRAVERSAL and the import continues with the remaining rows; nothing outside the import dir is ever read. A symlinked <index>.pdf whose canonical path escapes the import dir is the one case that still aborts the import (a DomainException from the containment assertion), because a syntactically valid index resolving outside the dir is an environment-level attack signal, not a row typo.
- PersonSummaryDTO coupling. provisional was added to the PersonSummaryDTO native interface projection; because the projection is backed by native SQL, the column had to be added to all three native SELECTs (findAllWithDocumentCount, searchWithDocumentCount, findTopByDocumentCount) or it would silently return false. Guarded by integration tests against real Postgres.
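The allowlist-plus-containment approach described in these consequences can be sketched as follows. This is a Python illustration, not the Java isValidImportIndex: the exact letter class (ASCII letters plus the Latin-1 letter ranges) is an assumption, and the sketch relies on the allowlist alone rather than reproducing the explicit homoglyph and null-byte rejections. Note that Python's \d matches Unicode digits in str patterns, so [0-9] is used to keep the digit set ASCII-only.

```python
import re
from pathlib import Path

# Assumed shape: 1-4 letters (ASCII or Latin-1), one or more hyphens,
# ASCII digits, optional trailing "x".
INDEX_PATTERN = re.compile(
    r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]{1,4}-+[0-9]+x?"
)


def is_valid_import_index(index):
    """Allowlist check: the whole string must match the catalog shape."""
    return isinstance(index, str) and INDEX_PATTERN.fullmatch(index) is not None


def resolve_pdf(import_dir, index):
    """Return the PDF path for an index, or None when the index is rejected.

    Raises RuntimeError if the resolved path leaves the import dir, for
    example through a symlinked <index>.pdf.
    """
    if not is_valid_import_index(index):
        return None  # the caller would record the skip and continue
    root = Path(import_dir).resolve()
    candidate = (root / f"{index}.pdf").resolve()
    if root not in candidate.parents:
        raise RuntimeError("resolved path escapes the import dir")
    return candidate
```

Because the pattern is an allowlist, anything it does not name (separators, dots, non-Latin-1 letters, non-ASCII digits) is rejected without needing a separate rule for each case.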
Canonical artifacts are produced locally, NOT version-controlled
The four files in tools/import-normalizer/out/ —
canonical-documents.xlsx, canonical-persons.xlsx, canonical-tag-tree.xlsx,
canonical-persons-tree.json — contain real family PII (names, addresses, attribution
prose) and are deliberately excluded from the git index via
tools/import-normalizer/.gitignore. They are regenerated locally from the source
spreadsheet by running the Python normalizer, and synced into the ops host's
IMPORT_HOST_DIR out-of-band (alongside the <index>.pdf corpus) — the same mechanism
that delivers the PDFs.
The contract between normalizer and importer is the header schema (column names,
their types, the Precision enum strings, the slug shape) — not the file contents.
CanonicalSheetReader maps columns by header name and fails closed
(IMPORT_ARTIFACT_INVALID) on a missing header, which is what locks the contract; the
file-level golden fixtures stay outside the repo.
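A header-name mapping that fails closed might look like the sketch below. It is a Python illustration of the idea, not the Java CanonicalSheetReader: ImportArtifactInvalid stands in for the IMPORT_ARTIFACT_INVALID error code, map_rows_by_header is a hypothetical name, and the sheet is modeled as a list of rows whose first row is the header.

```python
class ImportArtifactInvalid(Exception):
    """Stand-in for the IMPORT_ARTIFACT_INVALID error code."""


def map_rows_by_header(rows, required_headers):
    """Map each data row to a dict keyed by header name.

    Raises ImportArtifactInvalid when the header row is absent or any
    required header is missing, instead of guessing a column position.
    """
    if not rows:
        raise ImportArtifactInvalid("artifact has no header row")
    header = [str(cell).strip() for cell in rows[0]]
    missing = [name for name in required_headers if name not in header]
    if missing:
        raise ImportArtifactInvalid("missing header(s): " + ", ".join(missing))
    positions = {name: header.index(name) for name in required_headers}
    return [
        {name: (row[pos] if pos < len(row) else None) for name, pos in positions.items()}
        for row in rows[1:]
    ]
```

Reordering columns in an artifact does not change the result, while dropping or renaming a required column stops the import before any row is read.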
A future maintainer must not "fix" CI by checking these artifacts back in: they are
PII, and committing them is the regression that prompted this rule. Tests use small
synthetic fixtures constructed in-process (DocumentImporterTest,
CanonicalImportIntegrationTest) rather than real-corpus snapshots.