- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts; raw spreadsheet no longer parsed by Java) with the settled Option-A name policy, human-edit-preserve precedence, provisional contract, and ported security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the orchestrator, the four loaders, and CanonicalSheetReader, with the loader dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms; refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer → place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator + loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed. The ImportStatus/SkippedFile generated schemas already match the new types byte-for-byte (same fields + SkipReason enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation

**Date:** 2026-05-27
**Status:** Accepted
**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
**Milestone:** Handling the Unknowns — honest uncertainty in dates & people

---

## Context

The "Handling the Unknowns" milestone introduces honest uncertainty into the archive:
documents whose dates are known only approximately or as a range, and people the importer
infers from raw attribution text but cannot confidently identify. Three sibling issues
(date precision #666, name triage #665, and the importer #669) each independently
planned a Flyway `V69` migration that altered `persons`. Three `V69`s would be a boot failure
(Flyway versions must be unique), and `persons.provisional` was at risk of being defined
twice.

Two durable decisions (1 and 2 below) had to be made before any application code in
Phases 3–6 could compile against the new schema. Decision 3 records the importer
architecture (Phase 3, #669) that builds on them.

---

## Decision

### 1. All import/precision/attribution/identity schema lives in ONE migration with a single owner

`V69__import_precision_attribution_identity_schema.sql` adds every new column for this
milestone in a single, atomic, forward-only migration:

- `documents`: `meta_date_precision` (backfilled `DAY` where dated / `UNKNOWN` where not,
  then `NOT NULL`), `meta_date_end`, `meta_date_raw`, `sender_text`, `receiver_text`.
- `persons`: `source_ref` (unique index, nullable), `provisional` (`NOT NULL DEFAULT false`).
- `tag`: `source_ref` (unique index, nullable).

Integrity is pushed to the database as fail-closed `CHECK` constraints (the precedent is
`V22`'s `person_type` allowlist):

- `meta_date_precision` must be one of the seven enum values.
- `meta_date_end` may be non-null **only** when precision = `RANGE` (one-directional, not
  biconditional; see Consequences).
- `meta_date_end >= meta_date` for ranges with both endpoints (a `CHECK`, not a trigger).
- `meta_date_raw`, `sender_text`, `receiver_text` are length-capped at 10 000 (mirrors the
  `transcription_blocks` cap in `V18`).

No sibling issue adds another migration that alters `persons` or `documents` in this
milestone.

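
The constraints above live in the database, not in application code. As a reading aid only, the same rules can be restated as a plain Java predicate. This is a minimal sketch under stated assumptions: `DateFieldRules` and `isValid` are hypothetical names that do not come from the migration or the codebase, the seven precision strings are taken from decision 2, and Java's `String.length()` only approximates however the database measures the 10 000 cap.

```java
import java.time.LocalDate;
import java.util.Set;

/** Hypothetical mirror of the V69 CHECK constraints, for illustration only. */
final class DateFieldRules {
    // The seven allowed precision strings (see decision 2).
    static final Set<String> PRECISIONS =
            Set.of("DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN");
    static final int MAX_TEXT_LENGTH = 10_000;

    private DateFieldRules() {}

    /** Returns true when the given values would satisfy all four rules listed above. */
    static boolean isValid(String precision, LocalDate metaDate,
                           LocalDate metaDateEnd, String metaDateRaw) {
        // Rule 1: precision is NOT NULL and must be in the allowlist.
        if (precision == null || !PRECISIONS.contains(precision)) {
            return false;
        }
        // Rule 2 (one-directional): an end date implies RANGE,
        // but RANGE does not imply an end date.
        if (metaDateEnd != null && !precision.equals("RANGE")) {
            return false;
        }
        // Rule 3: when both endpoints exist, the end may not precede the start.
        if (metaDate != null && metaDateEnd != null && metaDateEnd.isBefore(metaDate)) {
            return false;
        }
        // Rule 4: free-text columns are length-capped (shown for meta_date_raw only).
        return metaDateRaw == null || metaDateRaw.length() <= MAX_TEXT_LENGTH;
    }
}
```

In this sketch a `RANGE` row with a start and no end passes, while a `DAY` row that carries an end date fails, which is the asymmetry the second constraint describes.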

### 2. The backend `DatePrecision` enum is a verbatim mirror of the normalizer's `Precision`; the canonical output is the contract

The importer reads the Python normalizer's canonical output
(`tools/import-normalizer/`). The backend `DatePrecision` enum
(`DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`) is a verbatim copy of the normalizer's
`Precision(StrEnum)` (`dates.py`). There is **no translation layer**: the normalizer's
output strings are persisted as-is. The same applies to `source_ref`, which carries the
normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.

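
A rough sketch of what "no translation layer" can look like on the Java side follows. It assumes the canonical strings are spelled exactly like the enum constant names; `fromCanonical` is a hypothetical helper, not a method known to exist in the codebase.

```java
/** Sketch of an enum that mirrors the normalizer's seven Precision values verbatim. */
enum DatePrecision {
    DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN;

    /**
     * Hypothetical helper: resolves a canonical string without any mapping table.
     * Enum.valueOf throws IllegalArgumentException for an unknown name, so drift
     * between the two enums fails loudly instead of being coerced to a default.
     */
    static DatePrecision fromCanonical(String value) {
        return DatePrecision.valueOf(value);
    }
}
```

Because the lookup is by constant name, renaming a constant on either side makes previously valid artifacts fail to load, which is the coupling the Consequences section warns about.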

### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)

The legacy `MassImportService` read the *raw* original spreadsheet by positional column
index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing,
name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.

The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
an explicit dependency DAG (`TagTreeImporter` → `PersonRegisterImporter` →
`PersonTreeImporter` → `DocumentImporter`) that **consume the committed canonical artifacts**
(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
loader calls the **owning domain's service**, never a repository (layering rule); the tree
loader uses `RelationshipService`, never the relationship repository.

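
The header-name lookup described above could be sketched roughly as follows. This is not the real `CanonicalSheetReader`: `HeaderIndex` and its methods are hypothetical, rows are modelled as plain lists of strings, and an `IllegalStateException` stands in for however the `IMPORT_ARTIFACT_INVALID` error is actually raised.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of header-name column lookup with fail-closed validation. */
final class HeaderIndex {
    private final Map<String, Integer> columnByHeader = new HashMap<>();

    /** Builds the lookup and fails closed if any required header is absent. */
    HeaderIndex(List<String> headerRow, List<String> requiredHeaders) {
        for (int i = 0; i < headerRow.size(); i++) {
            columnByHeader.put(headerRow.get(i).trim(), i);
        }
        for (String required : requiredHeaders) {
            if (!columnByHeader.containsKey(required)) {
                // Stand-in for the IMPORT_ARTIFACT_INVALID error code.
                throw new IllegalStateException(
                        "IMPORT_ARTIFACT_INVALID: missing header '" + required + "'");
            }
        }
    }

    /** Returns the cell under the named header, or "" when the row is shorter. */
    String cell(List<String> row, String header) {
        Integer index = columnByHeader.get(header);
        if (index == null) {
            throw new IllegalArgumentException("Unknown header: " + header);
        }
        return index < row.size() ? row.get(index) : "";
    }
}
```

With a lookup like this, reordering the columns in an artifact does not change what a loader reads, whereas a positional index silently picks up the wrong cell.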

Settled sub-decisions:

- **Idempotency precedence = preserve human edits.** Persons/tags upsert by `source_ref`,
  documents by `index`. On re-import, a non-blank field that a human changed in-app is never
  overwritten (blank fields are filled from canonical), and `provisional` is monotonic: once
  a human confirms a person (`false`) it never reverts to `true`. Verified against real
  Postgres in `CanonicalImportIntegrationTest`.
- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet
  carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in
  `sender_name` / `receiver_names`. The importer routes register-first by `source_ref`
  (provisional `Person` when a slug is unmatched), and **always retains the raw cell** in
  `sender_text` / `receiver_text` even when a person is linked. This is the load-bearing
  invariant behind the merge story. A row with no slug but raw text (prose / `?` /
  object-noise) links no person and keeps only the raw text.
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
  register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
  left at default-`false`.
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
  hostile (CWE-22 does not care that it came from our tool): its basename is validated
  (`isValidImportFilename`: slash/backslash, three Unicode slash homoglyphs, `..`, null byte,
  absolute path) and resolved only inside the import dir with canonical-path containment, so a
  traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and
  their tests were ported from `MassImportService` **before** it was deleted.

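
The re-import precedence in the first sub-decision can be illustrated with two small pure functions. This is a sketch under assumptions, not the loaders' actual code: `ReimportMerge`, `mergeField`, and `mergeProvisional` are hypothetical names, and the sketch simplifies by treating every non-blank stored value as protected, since it cannot tell a human edit from an earlier import. A `null` stored flag models a person that does not exist yet.

```java
/** Hypothetical helpers illustrating the "preserve human edits" precedence. */
final class ReimportMerge {
    private ReimportMerge() {}

    /** Keeps any non-blank stored value; fills a blank one from the canonical artifact. */
    static String mergeField(String stored, String canonical) {
        return (stored == null || stored.isBlank()) ? canonical : stored;
    }

    /**
     * Monotonic flag: a new record takes the importer's value, an existing record
     * can move from true to false but never from false back to true.
     */
    static boolean mergeProvisional(Boolean storedProvisional, boolean importerWantsProvisional) {
        if (storedProvisional == null) {
            return importerWantsProvisional; // first import of this source_ref
        }
        return storedProvisional && importerWantsProvisional;
    }
}
```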
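
The security guards in the last sub-decision might look roughly like the sketch below. It is a hypothetical restatement, not the ported code: the ADR does not name the three slash homoglyphs, so the code points used here are an assumption, `resolveInside` and `looksLikePdf` are invented names, and `Path.normalize()` is only a lexical check where the real guard is described as canonical-path containment (which would also resolve symlinks). The exception message reuses the `INVALID_FILENAME_PATH_TRAVERSAL` skip reason purely as a label.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical restatement of the import file guards; the real code may differ. */
final class ImportFileGuards {
    // Assumption: common slash look-alikes; the ADR does not list the exact code points.
    private static final char[] SLASH_LIKE = {'/', '\\', '\u2044', '\u2215', '\uFF0F'};
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    private ImportFileGuards() {}

    /** Rejects separators, slash homoglyphs, "..", null bytes, and absolute paths. */
    static boolean isValidImportFilename(String name) {
        if (name == null || name.isBlank()) {
            return false;
        }
        if (name.contains("..") || name.indexOf('\0') >= 0) {
            return false;
        }
        for (char c : SLASH_LIKE) {
            if (name.indexOf(c) >= 0) {
                return false;
            }
        }
        try {
            return !Path.of(name).isAbsolute();
        } catch (InvalidPathException e) {
            return false; // characters the platform cannot represent in a path
        }
    }

    /** Resolves the name under importDir and re-checks containment after normalization. */
    static Path resolveInside(Path importDir, String name) {
        if (!isValidImportFilename(name)) {
            throw new IllegalArgumentException("INVALID_FILENAME_PATH_TRAVERSAL");
        }
        Path base = importDir.toAbsolutePath().normalize();
        // Lexical containment only; a canonical-path check would also resolve symlinks.
        Path candidate = base.resolve(name).normalize();
        if (!candidate.startsWith(base)) {
            throw new IllegalArgumentException("INVALID_FILENAME_PATH_TRAVERSAL");
        }
        return candidate;
    }

    /** True when the first four bytes are the ASCII signature "%PDF". */
    static boolean looksLikePdf(byte[] head) {
        return head != null
                && head.length >= PDF_MAGIC.length
                && Arrays.equals(head, 0, PDF_MAGIC.length, PDF_MAGIC, 0, PDF_MAGIC.length);
    }
}
```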
---
## Consequences

- **RANGE is one-directional, not biconditional.** A `RANGE` row may have a null
  `meta_date_end` (an open-ended range with only a start), because the normalizer can emit
  start-only ranges. A biconditional `RANGE ⟺ end IS NOT NULL` rule would reject valid
  normalizer output, so it was not adopted. Phase 4 rendering must handle a `RANGE` with no
  end gracefully.
- **`provisional` stays `false` throughout the schema phase.** The migration adds the column
  and flag, but no code path in that phase sets it `true`; the importer (Phase 3, decision 3)
  is the only writer. This is intentional, not a half-built feature.
- **A future dev must not "improve" the enum.** Renaming or dropping a `DatePrecision` value
  without changing the normalizer silently breaks import idempotency and date rendering. The
  enum's Javadoc states this; the DB `CHECK` enforces validity independent of the Java enum.
- **`source_ref` is unique + nullable.** Manually created persons/tags have
  `source_ref = NULL`; Postgres allows multiple NULLs under a plain unique index, so no
  backfill is needed.
- **Forward-only.** The migration is immutable once shipped (Flyway checksum model); any fix
  goes in a later version. There is no down-migration; rollback means restoring from the
  nightly `pg_dump`, the standard procedure.
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
  interface projection; because the projection is backed by native SQL, the column had to be
  added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
  `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
  against real Postgres.