The four files in tools/import-normalizer/out/ contain real names, addresses, and attribution prose for ~163 living/deceased family members and were committed by mistake. They are now removed from the index (kept on disk for local development) and gitignored. The canonical artifacts are produced locally from the Python normalizer and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The contract between normalizer and importer is the header schema, not the file contents — CanonicalSheetReader fails closed on a missing header, which is what locks the contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
202 lines
13 KiB
Markdown
202 lines
13 KiB
Markdown
# ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation
|
||
|
||
**Date:** 2026-05-27
|
||
**Status:** Accepted
|
||
**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
|
||
**Milestone:** Handling the Unknowns — honest uncertainty in dates & people
|
||
|
||
---
|
||
|
||
## Context
|
||
|
||
The "Handling the Unknowns" milestone introduces honest uncertainty into the archive:
|
||
documents whose dates are known only approximately or as a range, and people the importer
|
||
infers from raw attribution text but cannot confidently identify. Three sibling issues —
|
||
date precision (#666), name triage (#665), and the importer (#669) — each independently
|
||
planned a Flyway `V69` migration that altered `persons`. Three `V69`s is a boot failure
|
||
(Flyway versions must be unique), and `persons.provisional` was at risk of being defined
|
||
twice.
|
||
|
||
Two durable decisions had to be made before any application code in Phases 3–6 could
|
||
compile against the new schema.
|
||
|
||
---
|
||
|
||
## Decision
|
||
|
||
### 1. All import/precision/attribution/identity schema lives in ONE migration with a single owner
|
||
|
||
`V69__import_precision_attribution_identity_schema.sql` adds every new column for this
|
||
milestone in a single, atomic, forward-only migration:
|
||
|
||
- `documents`: `meta_date_precision` (backfilled `DAY` where dated / `UNKNOWN` where not,
|
||
then `NOT NULL`), `meta_date_end`, `meta_date_raw`, `sender_text`, `receiver_text`.
|
||
- `persons`: `source_ref` (unique index, nullable), `provisional` (`NOT NULL DEFAULT false`).
|
||
- `tag`: `source_ref` (unique index, nullable).
|
||
|
||
Integrity is pushed to the database as fail-closed `CHECK` constraints (the precedent is
|
||
`V22`'s `person_type` allowlist):
|
||
|
||
- `meta_date_precision` must be one of the seven enum values.
|
||
- `meta_date_end` may be non-null **only** when precision = `RANGE` (one-directional, not
|
||
biconditional — see Consequences).
|
||
- `meta_date_end >= meta_date` for ranges with both endpoints (a `CHECK`, not a trigger).
|
||
- `meta_date_raw`, `sender_text`, `receiver_text` are length-capped at 10 000 (mirrors the
|
||
`transcription_blocks` cap in `V18`).
|
||
|
||
No sibling issue adds another migration that alters `persons` or `documents` in this
|
||
milestone.
|
||
|
||
### 2. The backend `DatePrecision` enum is a verbatim mirror of the normalizer's `Precision`; the canonical output is the contract
|
||
|
||
The importer reads the Python normalizer's canonical output
|
||
(`tools/import-normalizer/`). The backend `DatePrecision` enum
|
||
(`DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`) is a verbatim copy of the normalizer's
|
||
`Precision(StrEnum)` (`dates.py`). There is **no translation layer**: the normalizer's
|
||
output strings are persisted as-is. The same applies to `source_ref`, which carries the
|
||
normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.
|
||
|
||
### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)
|
||
|
||
The legacy `MassImportService` read the *raw* original spreadsheet by positional column
|
||
index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing,
|
||
name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.
|
||
|
||
The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
|
||
an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
|
||
`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
|
||
by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
|
||
alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
|
||
below). A shared `CanonicalSheetReader` maps columns **by header
|
||
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
|
||
loader calls the **owning domain's service**, never a repository (layering rule); the tree
|
||
loader uses `RelationshipService`, never the relationship repository.
|
||
|
||
Settled sub-decisions:
|
||
|
||
- **Idempotency precedence is domain-specific.** Persons/tags upsert by `source_ref`,
|
||
documents by `index`. Two distinct rules apply:
|
||
- **Person/Tag scalar fields = preserve human edits.** On re-import a non-blank field a human
|
||
changed in-app is never overwritten (blank fields are filled from canonical via the single
|
||
`preferHuman` idiom), and `provisional` is monotonic-downward — once a human confirms a
|
||
person (`false`) it never reverts to `true`. Because the orchestrator loads the register and
|
||
tree *before* documents, a person already `false` can never be flipped provisional by a
|
||
later document row that references the same `source_ref`, regardless of document-row order.
|
||
- **Document sender/receivers/tags = canonical-authoritative.** A document's sender, receiver
|
||
set, and tag set are owned by the canonical row, not the archivist. On re-import of a
|
||
PLACEHOLDER document `DocumentImporter` clears and re-populates `receivers`/`tags` so a row
|
||
whose set *shrinks* prunes the removed links rather than accumulating stale ones. The
|
||
"preserve human edits" rule above does **not** extend to these collections. The raw
|
||
`sender_text`/`receiver_text` cells are always retained verbatim (a separate invariant).
|
||
Note non-PLACEHOLDER documents are skipped entirely (`ALREADY_EXISTS`), so once a document
|
||
has a file the importer never touches it again — this bounds the authoritative-overwrite
|
||
blast radius to placeholder rows.
|
||
Verified against real Postgres in `CanonicalImportIntegrationTest`
|
||
(`reimport_preservesHumanEditedPersonField`, `reimport_prunesRemovedReceiverAndTag…`,
|
||
`import_neverFlipsRegisterPersonToProvisional…`).
|
||
- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet
|
||
carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in
|
||
`sender_name` / `receiver_names`. The importer routes register-first by `source_ref`
|
||
(provisional `Person` when a slug is unmatched), and **always retains the raw cell** in
|
||
`sender_text` / `receiver_text` even when a person is linked — the load-bearing invariant
|
||
behind the merge story. A row with no slug but raw text (prose / `?` / object-noise) links
|
||
no person and keeps only the raw text.
|
||
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
|
||
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
|
||
left at default-`false`.
|
||
- **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
|
||
uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
|
||
`Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
|
||
`importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
|
||
spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
|
||
removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
|
||
longer emits `file` or the `index_file_mismatch` review flag.
|
||
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
|
||
that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
|
||
our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
|
||
`.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
|
||
absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or
|
||
more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
|
||
`SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
|
||
to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
|
||
escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
|
||
ported from the file-column resolution (originally from `MassImportService`).
|
||
|
||
---
|
||
|
||
## Consequences
|
||
|
||
- **RANGE is one-directional, not biconditional.** A `RANGE` row may have a null
|
||
`meta_date_end` (an open-ended range with only a start), because the normalizer can emit
|
||
start-only ranges. A biconditional `RANGE ⟺ end IS NOT NULL` rule would reject valid
|
||
normalizer output, so it was rejected. Phase 4 rendering must handle a `RANGE` with no end
|
||
gracefully.
|
||
- **`provisional` stays `false` throughout this phase.** The column and flag exist, but no
|
||
code path sets it `true`; the importer (Phase 3) is the only writer. This is intentional,
|
||
not a half-built feature.
|
||
- **A future dev must not "improve" the enum.** Renaming or dropping a `DatePrecision` value
|
||
without changing the normalizer silently breaks import idempotency and date rendering. The
|
||
enum's Javadoc states this; the DB `CHECK` enforces validity independent of the Java enum.
|
||
- **`source_ref` is unique + nullable.** Manually created persons/tags have `source_ref =
|
||
NULL`; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
|
||
- **Forward-only.** The migration is immutable once shipped (Flyway checksum model); any fix
|
||
goes in a later version. There is no down-migration — rollback means restoring from the
|
||
nightly `pg_dump`, the standard procedure.
|
||
- **`runImport()` is non-transactional — per-loader transactions only.** The orchestrator
|
||
does not wrap the four loaders in a single transaction; each loader (or the per-call
|
||
`upsertBySourceRef` / `DocumentImporter.load`) carries its own `@Transactional` boundary. A
|
||
partial failure mid-run (e.g. the document loader throws after tags + persons committed)
|
||
leaves the earlier loaders' data committed and the `ImportStatus` set to `FAILED`. This is
|
||
acceptable precisely because the import is idempotent: re-running is safe and converges to
|
||
the same state, so the operational recovery for a partial failure is simply to fix the
|
||
offending artifact and re-trigger the import — no manual cleanup of half-written data is
|
||
required. A future maintainer must not assume all-or-nothing semantics.
|
||
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
|
||
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
|
||
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
|
||
conscious constraint, not a general filename validator: a future sub-collection catalogued with
|
||
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
|
||
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
|
||
not imported. Likewise a real PDF that does not follow `<index>.pdf` produces a `PLACEHOLDER`
|
||
(the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
|
||
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
|
||
allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
|
||
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
|
||
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
|
||
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
|
||
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
|
||
shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
|
||
import continues with the remaining rows — nothing outside the import dir is ever read. A
|
||
symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
|
||
aborts the import (a `DomainException` from the containment assertion), because a syntactically
|
||
valid index resolving outside the dir is an environment-level attack signal, not a row typo.
|
||
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
|
||
interface projection; because the projection is backed by native SQL, the column had to be
|
||
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
|
||
`findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
|
||
against real Postgres.
|
||
|
||
---
|
||
|
||
## Canonical artifacts are produced locally, NOT version-controlled
|
||
|
||
The four files in `tools/import-normalizer/out/` —
|
||
`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
|
||
`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
|
||
prose) and are **deliberately excluded from the git index** via
|
||
`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
|
||
spreadsheet by running the Python normalizer, and synced into the ops host's
|
||
`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
|
||
that delivers the PDFs.
|
||
|
||
The contract between normalizer and importer is the **header schema** (column names,
|
||
their types, the `Precision` enum strings, the slug shape) — not the file contents.
|
||
`CanonicalSheetReader` maps columns by header name and fails closed
|
||
(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
|
||
file-level golden fixtures stay outside the repo.
|
||
|
||
A future maintainer must not "fix" CI by checking these artifacts back in — they are
|
||
PII, the regression that prompted this rule. Tests use small synthetic fixtures
|
||
constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
|
||
than real-corpus snapshots.
|