familienarchiv/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md

# ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation

**Date:** 2026-05-27
**Status:** Accepted
**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
**Milestone:** Handling the Unknowns — honest uncertainty in dates & people

---

## Context

The "Handling the Unknowns" milestone introduces honest uncertainty into the archive:
documents whose dates are known only approximately or as a range, and people the importer
infers from raw attribution text but cannot confidently identify. Three sibling issues —
date precision (#666), name triage (#665), and the importer (#669) — each independently
planned a Flyway `V69` migration that altered `persons`. Three `V69`s is a boot failure
(Flyway versions must be unique), and `persons.provisional` was at risk of being defined
twice.

Two durable decisions had to be made before any application code in Phases 3–6 could
compile against the new schema.

---

## Decision

### 1. All import/precision/attribution/identity schema lives in ONE migration with a single owner

`V69__import_precision_attribution_identity_schema.sql` adds every new column for this
milestone in a single, atomic, forward-only migration:

- `documents`: `meta_date_precision` (backfilled `DAY` where dated / `UNKNOWN` where not,
  then `NOT NULL`), `meta_date_end`, `meta_date_raw`, `sender_text`, `receiver_text`.
- `persons`: `source_ref` (unique index, nullable), `provisional` (`NOT NULL DEFAULT false`).
- `tag`: `source_ref` (unique index, nullable).

Integrity is pushed to the database as fail-closed `CHECK` constraints (the precedent is
`V22`'s `person_type` allowlist):

- `meta_date_precision` must be one of the seven enum values.
- `meta_date_end` may be non-null **only** when precision = `RANGE` (one-directional, not
  biconditional — see Consequences).
- `meta_date_end >= meta_date` for ranges with both endpoints (a `CHECK`, not a trigger).
- `meta_date_raw`, `sender_text`, `receiver_text` are length-capped at 10 000 (mirrors the
  `transcription_blocks` cap in `V18`).

No sibling issue adds another migration that alters `persons` or `documents` in this
milestone.

### 2. The backend `DatePrecision` enum is a verbatim mirror of the normalizer's `Precision`; the canonical output is the contract

The importer reads the Python normalizer's canonical output
(`tools/import-normalizer/`). The backend `DatePrecision` enum
(`DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`) is a verbatim copy of the normalizer's
`Precision(StrEnum)` (`dates.py`). There is **no translation layer**: the normalizer's
output strings are persisted as-is. The same applies to `source_ref`, which carries the
normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.

### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)

The legacy `MassImportService` read the *raw* original spreadsheet by positional column
index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing,
name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.

The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
`PersonTreeImporter` → `DocumentImporter` — that **consume the canonical artifacts produced
by the offline Python normalizer** (`tools/import-normalizer/out/`, synced onto the ops host
alongside the PDFs — see "Canonical artifacts are produced locally, NOT version-controlled"
below). A shared `CanonicalSheetReader` maps columns **by header
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
loader calls the **owning domain's service**, never a repository (layering rule); the tree
loader uses `RelationshipService`, never the relationship repository.

Settled sub-decisions:

- **Idempotency precedence is domain-specific.** Persons/tags upsert by `source_ref`,
  documents by `index`. Two distinct rules apply:
  - **Person/Tag scalar fields = preserve human edits.** On re-import a non-blank field a human
    changed in-app is never overwritten (blank fields are filled from canonical via the single
    `preferHuman` idiom), and `provisional` is monotonic-downward — once a human confirms a
    person (`false`) it never reverts to `true`. Because the orchestrator loads the register and
    tree *before* documents, a person already `false` can never be flipped provisional by a
    later document row that references the same `source_ref`, regardless of document-row order.
  - **Document sender/receivers/tags = canonical-authoritative.** A document's sender, receiver
    set, and tag set are owned by the canonical row, not the archivist. On re-import of a
    PLACEHOLDER document `DocumentImporter` clears and re-populates `receivers`/`tags` so a row
    whose set *shrinks* prunes the removed links rather than accumulating stale ones. The
    "preserve human edits" rule above does **not** extend to these collections. The raw
    `sender_text`/`receiver_text` cells are always retained verbatim (a separate invariant).
    Note non-PLACEHOLDER documents are skipped entirely (`ALREADY_EXISTS`), so once a document
    has a file the importer never touches it again — this bounds the authoritative-overwrite
    blast radius to placeholder rows.
  Verified against real Postgres in `CanonicalImportIntegrationTest`
  (`reimport_preservesHumanEditedPersonField`, `reimport_prunesRemovedReceiverAndTag…`,
  `import_neverFlipsRegisterPersonToProvisional…`).
- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet
  carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in
  `sender_name` / `receiver_names`. The importer routes register-first by `source_ref`
  (provisional `Person` when a slug is unmatched), and **always retains the raw cell** in
  `sender_text` / `receiver_text` even when a person is linked — the load-bearing invariant
  behind the merge story. A row with no slug but raw text (prose / `?` / object-noise) links
  no person and keeps only the raw text.
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
  register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
  left at default-`false`.
- **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
  uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
  `Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
  `importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
  spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
  removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
  longer emits `file` or the `index_file_mismatch` review flag.
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
  that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
  our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
  `.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
  absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or
  more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
  `SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
  to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
  escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
  ported from the file-column resolution (originally from `MassImportService`).

---

## Consequences

- **RANGE is one-directional, not biconditional.** A `RANGE` row may have a null
  `meta_date_end` (an open-ended range with only a start), because the normalizer can emit
  start-only ranges. A biconditional `RANGE ⟺ end IS NOT NULL` rule would reject valid
  normalizer output, so it was rejected. Phase 4 rendering must handle a `RANGE` with no end
  gracefully.
- **`provisional` stays `false` throughout this phase.** The column and flag exist, but no
  code path sets it `true`; the importer (Phase 3) is the only writer. This is intentional,
  not a half-built feature.
- **A future dev must not "improve" the enum.** Renaming or dropping a `DatePrecision` value
  without changing the normalizer silently breaks import idempotency and date rendering. The
  enum's Javadoc states this; the DB `CHECK` enforces validity independent of the Java enum.
- **`source_ref` is unique + nullable.** Manually created persons/tags have `source_ref =
  NULL`; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
- **Forward-only.** The migration is immutable once shipped (Flyway checksum model); any fix
  goes in a later version. There is no down-migration — rollback means restoring from the
  nightly `pg_dump`, the standard procedure.
- **`runImport()` is non-transactional — per-loader transactions only.** The orchestrator
  does not wrap the four loaders in a single transaction; each loader (or the per-call
  `upsertBySourceRef` / `DocumentImporter.load`) carries its own `@Transactional` boundary. A
  partial failure mid-run (e.g. the document loader throws after tags + persons committed)
  leaves the earlier loaders' data committed and the `ImportStatus` set to `FAILED`. This is
  acceptable precisely because the import is idempotent: re-running is safe and converges to
  the same state, so the operational recovery for a partial failure is simply to fix the
  offending artifact and re-trigger the import — no manual cleanup of half-written data is
  required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
  `INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
  umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
  conscious constraint, not a general filename validator: a future sub-collection catalogued with
  a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
  fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
  not imported. Likewise a real PDF that does not follow `<index>.pdf` produces a `PLACEHOLDER`
  (the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
  pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
  allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
  `Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
  the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
  (path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
  shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
  import continues with the remaining rows — nothing outside the import dir is ever read. A
  symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
  aborts the import (a `DomainException` from the containment assertion), because a syntactically
  valid index resolving outside the dir is an environment-level attack signal, not a row typo.
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
  interface projection; because the projection is backed by native SQL, the column had to be
  added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
  `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
  against real Postgres.

---

## Canonical artifacts are produced locally, NOT version-controlled

The four files in `tools/import-normalizer/out/` —
`canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`,
`canonical-persons-tree.json` — contain real family PII (names, addresses, attribution
prose) and are **deliberately excluded from the git index** via
`tools/import-normalizer/.gitignore`. They are regenerated locally from the source
spreadsheet by running the Python normalizer, and synced into the ops host's
`IMPORT_HOST_DIR` out-of-band (alongside the `<index>.pdf` corpus) — the same mechanism
that delivers the PDFs.

The contract between normalizer and importer is the **header schema** (column names,
their types, the `Precision` enum strings, the slug shape) — not the file contents.
`CanonicalSheetReader` maps columns by header name and fails closed
(`IMPORT_ARTIFACT_INVALID`) on a missing header, which is what locks the contract; the
file-level golden fixtures stay outside the repo.

A future maintainer must not "fix" CI by checking these artifacts back in — they are
PII, the regression that prompted this rule. Tests use small synthetic fixtures
constructed in-process (`DocumentImporterTest`, `CanonicalImportIntegrationTest`) rather
than real-corpus snapshots.