Files
familienarchiv/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
Marcel 74987062f4
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 6m56s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:03:57 +02:00

11 KiB
Raw Blame History

ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation

Date: 2026-05-27 Status: Accepted Issue: #671 (schema, decisions 12); #669 (importer architecture, decision 3) Milestone: Handling the Unknowns — honest uncertainty in dates & people


Context

The "Handling the Unknowns" milestone introduces honest uncertainty into the archive: documents whose dates are known only approximately or as a range, and people the importer infers from raw attribution text but cannot confidently identify. Three sibling issues — date precision (#666), name triage (#665), and the importer (#669) — each independently planned a Flyway V69 migration that altered persons. Three V69s is a boot failure (Flyway versions must be unique), and persons.provisional was at risk of being defined twice.

Two durable decisions had to be made before any application code in Phases 36 could compile against the new schema.


Decision

1. All import/precision/attribution/identity schema lives in ONE migration with a single owner

V69__import_precision_attribution_identity_schema.sql adds every new column for this milestone in a single, atomic, forward-only migration:

  • documents: meta_date_precision (backfilled DAY where dated / UNKNOWN where not, then NOT NULL), meta_date_end, meta_date_raw, sender_text, receiver_text.
  • persons: source_ref (unique index, nullable), provisional (NOT NULL DEFAULT false).
  • tag: source_ref (unique index, nullable).

Integrity is pushed to the database as fail-closed CHECK constraints (the precedent is V22's person_type allowlist):

  • meta_date_precision must be one of the seven enum values.
  • meta_date_end may be non-null only when precision = RANGE (one-directional, not biconditional — see Consequences).
  • meta_date_end >= meta_date for ranges with both endpoints (a CHECK, not a trigger).
  • meta_date_raw, sender_text, receiver_text are length-capped at 10 000 (mirrors the transcription_blocks cap in V18).

No sibling issue adds another migration that alters persons or documents in this milestone.

2. The backend DatePrecision enum is a verbatim mirror of the normalizer's Precision; the canonical output is the contract

The importer reads the Python normalizer's canonical output (tools/import-normalizer/). The backend DatePrecision enum (DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN) is a verbatim copy of the normalizer's Precision(StrEnum) (dates.py). There is no translation layer: the normalizer's output strings are persisted as-is. The same applies to source_ref, which carries the normalizer's person_id / canonical tag_path unchanged as the re-import idempotency key.

3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)

The legacy MassImportService read the raw original spreadsheet by positional column index (@Value app.import.col.*) and re-derived everything in Java (ISO-only date parsing, name classification via findOrCreateByAlias, an ODS/XXE XML path). It is deleted.

The rebuild is a CanonicalImportOrchestrator driving four single-responsibility loaders in an explicit dependency DAG — TagTreeImporterPersonRegisterImporterPersonTreeImporterDocumentImporter — that consume the committed canonical artifacts (tools/import-normalizer/out/). A shared CanonicalSheetReader maps columns by header name (not by index) and fails closed (IMPORT_ARTIFACT_INVALID) on a missing header. Each loader calls the owning domain's service, never a repository (layering rule); the tree loader uses RelationshipService, never the relationship repository.

Settled sub-decisions:

  • Idempotency precedence is domain-specific. Persons/tags upsert by source_ref, documents by index. Two distinct rules apply:
    • Person/Tag scalar fields = preserve human edits. On re-import a non-blank field a human changed in-app is never overwritten (blank fields are filled from canonical via the single preferHuman idiom), and provisional is monotonic-downward — once a human confirms a person (false) it never reverts to true. Because the orchestrator loads the register and tree before documents, a person already false can never be flipped provisional by a later document row that references the same source_ref, regardless of document-row order.
    • Document sender/receivers/tags = canonical-authoritative. A document's sender, receiver set, and tag set are owned by the canonical row, not the archivist. On re-import of a PLACEHOLDER document DocumentImporter clears and re-populates receivers/tags so a row whose set shrinks prunes the removed links rather than accumulating stale ones. The "preserve human edits" rule above does not extend to these collections. The raw sender_text/receiver_text cells are always retained verbatim (a separate invariant). Note non-PLACEHOLDER documents are skipped entirely (ALREADY_EXISTS), so once a document has a file the importer never touches it again — this bounds the authoritative-overwrite blast radius to placeholder rows. Verified against real Postgres in CanonicalImportIntegrationTest (reimport_preservesHumanEditedPersonField, reimport_prunesRemovedReceiverAndTag…, import_neverFlipsRegisterPersonToProvisional…).
  • Name policy = Option A. The normalizer resolved attribution upstream: the document sheet carries the resolved slug in sender_person_id / receiver_person_ids and the raw cell in sender_name / receiver_names. The importer routes register-first by source_ref (provisional Person when a slug is unmatched), and always retains the raw cell in sender_text / receiver_text even when a person is linked — the load-bearing invariant behind the merge story. A row with no slug but raw text (prose / ? / object-noise) links no person and keeps only the raw text.
  • provisional is now populated. Importer-minted persons are provisional = true; register and tree persons stay false. This is the Phase-3 contract the schema (decision 1) left at default-false.
  • PDFs resolve directly by index (<index>.pdf), not by a file column. The corpus is uniform — all PDFs are named <index>.pdf flat in the import dir (e.g. W-0124.pdf, Mü-0001.pdf) — so DocumentImporter resolves a document's PDF with an O(1) importDir.resolve(index + ".pdf") lookup. The redundant file column (carrying the spreadsheet's messy datei value) and the recursive directory walk that resolved it were removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no longer emits file or the index_file_mismatch review flag.
  • Security guards are defense-in-depth, not upstream-trust. The index is the only thing that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from our tool): isValidImportIndex rejects slash/backslash, three Unicode slash homoglyphs, any . (so <index>.pdf is the only extension and .. can never appear), null byte, and absolute paths, and requires a strict catalog shape (14 Latin letters incl. umlauts, one or more hyphens, digits, optional trailing x). A bad index skips the row with a clear SkipReason (INVALID_FILENAME_PATH_TRAVERSAL). The resolved canonical path is still asserted to stay inside the import dir as a second line of defense (a symlinked <index>.pdf cannot escape), and the %PDF magic-byte check still gates upload. These guards and their tests were ported from the file-column resolution (originally from MassImportService).

Consequences

  • RANGE is one-directional, not biconditional. A RANGE row may have a null meta_date_end (an open-ended range with only a start), because the normalizer can emit start-only ranges. A biconditional RANGE ⟺ end IS NOT NULL rule would reject valid normalizer output, so it was rejected. Phase 4 rendering must handle a RANGE with no end gracefully.
  • provisional stays false throughout this phase. The column and flag exist, but no code path sets it true; the importer (Phase 3) is the only writer. This is intentional, not a half-built feature.
  • A future dev must not "improve" the enum. Renaming or dropping a DatePrecision value without changing the normalizer silently breaks import idempotency and date rendering. The enum's Javadoc states this; the DB CHECK enforces validity independent of the Java enum.
  • source_ref is unique + nullable. Manually created persons/tags have source_ref = NULL; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
  • Forward-only. The migration is immutable once shipped (Flyway checksum model); any fix goes in a later version. There is no down-migration — rollback means restoring from the nightly pg_dump, the standard procedure.
  • runImport() is non-transactional — per-loader transactions only. The orchestrator does not wrap the four loaders in a single transaction; each loader (or the per-call upsertBySourceRef / DocumentImporter.load) carries its own @Transactional boundary. A partial failure mid-run (e.g. the document loader throws after tags + persons committed) leaves the earlier loaders' data committed and the ImportStatus set to FAILED. This is acceptable precisely because the import is idempotent: re-running is safe and converges to the same state, so the operational recovery for a partial failure is simply to fix the offending artifact and re-trigger the import — no manual cleanup of half-written data is required. A future maintainer must not assume all-or-nothing semantics.
  • A malicious/garbage index skips its row with a loud SkipReason, by design. Since #686 the index is the only on-disk lookup key. An index that fails isValidImportIndex (path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog shape) is recorded as a SkippedFile with reason INVALID_FILENAME_PATH_TRAVERSAL and the import continues with the remaining rows — nothing outside the import dir is ever read. A symlinked <index>.pdf whose canonical path escapes the import dir is the one case that still aborts the import (a DomainException from the containment assertion), because a syntactically valid index resolving outside the dir is an environment-level attack signal, not a row typo.
  • PersonSummaryDTO coupling. provisional was added to the PersonSummaryDTO native interface projection; because the projection is backed by native SQL, the column had to be added to all three native SELECTs (findAllWithDocumentCount, searchWithDocumentCount, findTopByDocumentCount) or it would silently return false. Guarded by integration tests against real Postgres.