# ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation

**Date:** 2026-05-27
**Status:** Accepted
**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
**Milestone:** Handling the Unknowns — honest uncertainty in dates & people

---

## Context

The "Handling the Unknowns" milestone introduces honest uncertainty into the archive: documents whose dates are known only approximately or as a range, and people the importer infers from raw attribution text but cannot confidently identify.

Three sibling issues — date precision (#666), name triage (#665), and the importer (#669) — each independently planned a Flyway `V69` migration that altered `persons`. Three `V69`s is a boot failure (Flyway versions must be unique), and `persons.provisional` was at risk of being defined twice. Two durable decisions had to be made before any application code in Phases 3–6 could compile against the new schema.

---

## Decision

### 1. All import/precision/attribution/identity schema lives in ONE migration with a single owner

`V69__import_precision_attribution_identity_schema.sql` adds every new column for this milestone in a single, atomic, forward-only migration:

- `documents`: `meta_date_precision` (backfilled `DAY` where dated / `UNKNOWN` where not, then `NOT NULL`), `meta_date_end`, `meta_date_raw`, `sender_text`, `receiver_text`.
- `persons`: `source_ref` (unique index, nullable), `provisional` (`NOT NULL DEFAULT false`).
- `tag`: `source_ref` (unique index, nullable).

Integrity is pushed to the database as fail-closed `CHECK` constraints (the precedent is `V22`'s `person_type` allowlist):

- `meta_date_precision` must be one of the seven enum values.
- `meta_date_end` may be non-null **only** when precision = `RANGE` (one-directional, not biconditional — see Consequences).
- `meta_date_end >= meta_date` for ranges with both endpoints (a `CHECK`, not a trigger).
- `meta_date_raw`, `sender_text`, `receiver_text` are length-capped at 10 000 (mirrors the `transcription_blocks` cap in `V18`).

No sibling issue adds another migration that alters `persons` or `documents` in this milestone.

### 2. The backend `DatePrecision` enum is a verbatim mirror of the normalizer's `Precision`; the canonical output is the contract

The importer reads the Python normalizer's canonical output (`tools/import-normalizer/`). The backend `DatePrecision` enum (`DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`) is a verbatim copy of the normalizer's `Precision(StrEnum)` (`dates.py`). There is **no translation layer**: the normalizer's output strings are persisted as-is. The same applies to `source_ref`, which carries the normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.

### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)

The legacy `MassImportService` read the *raw* original spreadsheet by positional column index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing, name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.

The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` → `PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts** (`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each loader calls the **owning domain's service**, never a repository (layering rule); the tree loader uses `RelationshipService`, never the relationship repository.

Settled sub-decisions:

- **Idempotency precedence is domain-specific.** Persons/tags upsert by `source_ref`, documents by `index`.
  Two distinct rules apply:
  - **Person/Tag scalar fields = preserve human edits.** On re-import a non-blank field a human changed in-app is never overwritten (blank fields are filled from canonical via the single `preferHuman` idiom), and `provisional` is monotonic-downward — once a human confirms a person (`false`) it never reverts to `true`. Because the orchestrator loads the register and tree *before* documents, a person already `false` can never be flipped provisional by a later document row that references the same `source_ref`, regardless of document-row order.
  - **Document sender/receivers/tags = canonical-authoritative.** A document's sender, receiver set, and tag set are owned by the canonical row, not the archivist. On re-import of a PLACEHOLDER document `DocumentImporter` clears and re-populates `receivers`/`tags` so a row whose set *shrinks* prunes the removed links rather than accumulating stale ones. The "preserve human edits" rule above does **not** extend to these collections. The raw `sender_text`/`receiver_text` cells are always retained verbatim (a separate invariant). Note non-PLACEHOLDER documents are skipped entirely (`ALREADY_EXISTS`), so once a document has a file the importer never touches it again — this bounds the authoritative-overwrite blast radius to placeholder rows.

  Verified against real Postgres in `CanonicalImportIntegrationTest` (`reimport_preservesHumanEditedPersonField`, `reimport_prunesRemovedReceiverAndTag…`, `import_neverFlipsRegisterPersonToProvisional…`).
- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in `sender_name` / `receiver_names`. The importer routes register-first by `source_ref` (provisional `Person` when a slug is unmatched), and **always retains the raw cell** in `sender_text` / `receiver_text` even when a person is linked — the load-bearing invariant behind the merge story.
  A row with no slug but raw text (prose / `?` / object-noise) links no person and keeps only the raw text.
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`; register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1) left at default-`false`.
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as hostile (CWE-22 does not care it came from our tool): its basename is validated (`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte, absolute path) and resolved only inside the import dir with canonical-path containment, so a traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and their tests were ported from `MassImportService` **before** it was deleted.

---

## Consequences

- **RANGE is one-directional, not biconditional.** A `RANGE` row may have a null `meta_date_end` (an open-ended range with only a start), because the normalizer can emit start-only ranges. A biconditional `RANGE ⟺ end IS NOT NULL` rule would reject valid normalizer output, so it was rejected. Phase 4 rendering must handle a `RANGE` with no end gracefully.
- **`provisional` stays `false` throughout the schema phase (decisions 1–2).** The column and flag exist, but no code path in that phase sets it `true`; the importer (Phase 3, decision 3) is the only writer. This is intentional, not a half-built feature.
- **A future dev must not "improve" the enum.** Renaming or dropping a `DatePrecision` value without changing the normalizer silently breaks import idempotency and date rendering. The enum's Javadoc states this; the DB `CHECK` enforces validity independent of the Java enum.
- **`source_ref` is unique + nullable.** Manually created persons/tags have `source_ref = NULL`; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
- **Forward-only.** The migration is immutable once shipped (Flyway checksum model); any fix goes in a later version. There is no down-migration — rollback means restoring from the nightly `pg_dump`, the standard procedure.
- **`runImport()` is non-transactional — per-loader transactions only.** The orchestrator does not wrap the four loaders in a single transaction; each loader (or the per-call `upsertBySourceRef` / `DocumentImporter.load`) carries its own `@Transactional` boundary. A partial failure mid-run (e.g. the document loader throws after tags + persons committed) leaves the earlier loaders' data committed and the `ImportStatus` set to `FAILED`. This is acceptable precisely because the import is idempotent: re-running is safe and converges to the same state, so the operational recovery for a partial failure is simply to fix the offending artifact and re-trigger the import — no manual cleanup of half-written data is required. A future maintainer must not assume all-or-nothing semantics.
- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or symlink-escape in a row's file path is treated as an attack signal: the import aborts rather than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision (2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently tolerated.
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native interface projection; because the projection is backed by native SQL, the column had to be added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`, `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests against real Postgres.
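
The date and text constraints from decision 1 can be restated as a small predicate, which may help when reasoning about which normalizer rows the database will accept. This is a minimal Python sketch, not the migration itself: `check_violations`, the violation labels, and `MAX_TEXT_LEN` are hypothetical names, the `Precision` class is a stand-in for the normalizer's `Precision(StrEnum)`, and the 10 000 cap is assumed to count characters.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Precision(str, Enum):
    """Stand-in for the normalizer's Precision(StrEnum); same seven values."""
    DAY = "DAY"
    MONTH = "MONTH"
    SEASON = "SEASON"
    YEAR = "YEAR"
    RANGE = "RANGE"
    APPROX = "APPROX"
    UNKNOWN = "UNKNOWN"


MAX_TEXT_LEN = 10_000  # assumed to be a character count


def check_violations(
    precision: str,
    meta_date: Optional[date],
    meta_date_end: Optional[date],
    raw_text: Optional[str] = None,
) -> list[str]:
    """Return the (hypothetical) labels of every rule the row would break."""
    violations = []
    # Allowlist: the stored string must be one of the seven enum values.
    if precision not in {p.value for p in Precision}:
        violations.append("precision_allowlist")
    # One-directional: an end date implies RANGE, but RANGE does not imply an end.
    if meta_date_end is not None and precision != Precision.RANGE.value:
        violations.append("end_requires_range")
    # Ordering applies only when both endpoints are present.
    if meta_date is not None and meta_date_end is not None and meta_date_end < meta_date:
        violations.append("end_before_start")
    # Length cap on the raw text columns.
    if raw_text is not None and len(raw_text) > MAX_TEXT_LEN:
        violations.append("text_too_long")
    return violations
```

Under these assumptions a start-only `RANGE` row passes, while a `DAY` row that carries an end date is rejected, which is the one-directional behaviour described under Consequences.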
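
The two idempotency rules from decision 3 behave differently on re-import, and a small model makes the difference concrete. The sketch below is written in Python for brevity although the importer itself is Java; `prefer_human`, `merge_provisional`, and `reimport_links` are hypothetical helpers, the `"UPLOADED"` status is an invented placeholder for any non-PLACEHOLDER state, and treating every non-blank stored value as a human edit is an assumption the ADR does not confirm.

```python
from typing import Optional


def prefer_human(existing: Optional[str], canonical: Optional[str]) -> Optional[str]:
    """Person/tag scalar rule: keep a non-blank stored value, else fill from canonical."""
    if existing is not None and existing.strip():
        return existing
    return canonical


def merge_provisional(existing: Optional[bool], incoming: bool) -> bool:
    """Monotonic-downward flag: once False it stays False.

    `existing` is None when the person does not exist yet.
    """
    if existing is None:
        return incoming
    return existing and incoming


def reimport_links(status: str, current: set, canonical: set) -> set:
    """Document collection rule: canonical-authoritative for placeholders only."""
    if status != "PLACEHOLDER":
        # Documents that already have a file are skipped (ALREADY_EXISTS).
        return set(current)
    # Clear and re-populate, so links removed upstream are pruned.
    return set(canonical)
```

In this model a confirmed person (`False`) referenced by a later unmatched document row stays confirmed, and a placeholder document whose canonical receiver set shrinks loses the removed link instead of keeping a stale one.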
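
The filename guards from decision 3 combine a basename check with canonical-path containment. The following Python sketch shows one way the two layers could fit together; it is not the Java `isValidImportFilename`. The three code points in `SLASH_HOMOGLYPHS` are a guess, since the ADR does not name them, and `resolve_inside` and `has_pdf_magic` are hypothetical helper names.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Assumption: the ADR says "three Unicode slash homoglyphs" without listing them.
SLASH_HOMOGLYPHS = ("\u2044", "\u2215", "\uff0f")


def is_valid_import_filename(name: str) -> bool:
    """Basename check: reject anything that could act as a path."""
    if not name or "\x00" in name:
        return False
    if PurePosixPath(name).is_absolute() or PureWindowsPath(name).is_absolute():
        return False
    if "/" in name or "\\" in name:
        return False
    if any(ch in name for ch in SLASH_HOMOGLYPHS):
        return False
    if ".." in name:
        return False
    return True


def resolve_inside(import_dir: Path, name: str) -> Path:
    """Resolve `name` under `import_dir`, failing closed on any escape."""
    if not is_valid_import_filename(name):
        raise ValueError("invalid import filename")
    base = import_dir.resolve()
    # resolve() follows symlinks, so a symlink pointing outside is caught below.
    candidate = (base / name).resolve()
    if base not in candidate.parents:
        raise ValueError("path escapes the import directory")
    return candidate


def has_pdf_magic(data: bytes) -> bool:
    """Magic-byte gate: the content must start with %PDF."""
    return data.startswith(b"%PDF")
```

Raising an exception rather than returning a skip marker mirrors the fail-closed choice recorded under Consequences, where a path escape aborts the whole import.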