diff --git a/docs/GLOSSARY.md b/docs/GLOSSARY.md
index 99da1775..1fefb7af 100644
--- a/docs/GLOSSARY.md
+++ b/docs/GLOSSARY.md
@@ -25,6 +25,11 @@ _Not to be confused with [AppUser](#appuser-appuser)_ — `Person` is a historic
 
 **UserGroup** (`UserGroup`) — a named permission bundle assigned to one or more `AppUser`s. A user's effective permissions are the union of all permissions across all groups they belong to.
 
+**source_ref** (`Person.sourceRef`, `Tag.sourceRef`) — the import normalizer's stable identity for a `Person` (its `person_id`) or `Tag` (its canonical `tag_path`). It is the join key linking normalized records to documents and the idempotency key for re-import; null for manually created records and unique among non-null values.
+
+**provisional person** (`Person.provisional`) — a `Person` the importer inferred from raw attribution text but could not confidently match to a known individual. The flag lets the persons directory surface uncertainty honestly rather than fabricate a confident identity; it defaults to `false` and is set `true` only by the importer.
+_Not to be confused with `family_member`_ — `provisional` expresses import confidence, while `family_member` is a genealogical fact about whether the person belongs to the family tree.
+
 ---
 
 ## Document-Related Terms
@@ -36,6 +41,10 @@ _See also [TranscriptionBlock](#transcriptionblock-transcriptionblock)._
 
 **Document** (`Document`) — a single archival item (letter, postcard, photograph) with a file stored in MinIO/S3 and associated metadata (sender, receivers, date, tags, transcription blocks).
 
+**date precision** (`Document.metaDatePrecision`, enum `DatePrecision`) — how exactly a document's date is known, one of `DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`. A verbatim mirror of the import normalizer's `Precision` enum so honest dates can be rendered (`APPROX` → "ca.", `RANGE` uses `meta_date_end`) instead of fabricating a false `DAY`-level date.
`UNKNOWN` is the explicit value for undated documents.
+
+**raw attribution** (`Document.senderText`, `Document.receiverText`, `Document.metaDateRaw`) — the original spreadsheet cell text for a document's sender, receiver, and date, preserved verbatim even after a `Person` or normalized date is linked. It keeps provenance intact and enables an "as written in the original" view.
+
 **DocumentVersion** (`DocumentVersion`) — an append-only snapshot of a `Document`'s metadata at a point in time. Append-only by convention; no consumer-facing create or update endpoint exists. The entity uses Lombok `@Data` (which generates setters), so immutability is enforced by application convention, not at the Java level.
 
 **Tag** (`Tag`) — a hierarchical category that can be applied to `Document`s. Tags are self-referencing via a `parent_id` foreign key, forming a tree structure.
diff --git a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
new file mode 100644
index 00000000..0feb670b
--- /dev/null
+++ b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
@@ -0,0 +1,83 @@
+# ADR-025 — Canonical Import Output as Contract & Single-Migration Schema Foundation
+
+**Date:** 2026-05-27
+**Status:** Accepted
+**Issue:** #671
+**Milestone:** Handling the Unknowns — honest uncertainty in dates & people
+
+---
+
+## Context
+
+The "Handling the Unknowns" milestone introduces honest uncertainty into the archive:
+documents whose dates are known only approximately or as a range, and people the importer
+infers from raw attribution text but cannot confidently identify. Three sibling issues —
+date precision (#666), name triage (#665), and the importer (#669) — each independently
+planned a Flyway `V69` migration that altered `persons`. Three `V69`s is a boot failure
+(Flyway versions must be unique), and `persons.provisional` was at risk of being defined
+twice.
+
+Two durable decisions had to be made before any application code in Phases 3–6 could
+compile against the new schema.
+
+---
+
+## Decision
+
+### 1. All import/precision/attribution/identity schema lives in ONE migration with a single owner
+
+`V69__import_precision_attribution_identity_schema.sql` adds every new column for this
+milestone in a single, atomic, forward-only migration:
+
+- `documents`: `meta_date_precision` (backfilled `DAY` where dated / `UNKNOWN` where not,
+  then `NOT NULL`), `meta_date_end`, `meta_date_raw`, `sender_text`, `receiver_text`.
+- `persons`: `source_ref` (unique index, nullable), `provisional` (`NOT NULL DEFAULT false`).
+- `tag`: `source_ref` (unique index, nullable).
+
+Integrity is pushed to the database as fail-closed `CHECK` constraints (the precedent is
+`V22`'s `person_type` allowlist):
+
+- `meta_date_precision` must be one of the seven enum values.
+- `meta_date_end` may be non-null **only** when precision = `RANGE` (one-directional, not
+  biconditional — see Consequences).
+- `meta_date_end >= meta_date` for ranges with both endpoints (a `CHECK`, not a trigger).
+- `meta_date_raw`, `sender_text`, `receiver_text` are length-capped at 10 000 (mirrors the
+  `transcription_blocks` cap in `V18`).
+
+No sibling issue adds another migration that alters `persons` or `documents` in this
+milestone.
+
+### 2. The backend `DatePrecision` enum is a verbatim mirror of the normalizer's `Precision`; the canonical output is the contract
+
+The importer reads the Python normalizer's canonical output
+(`tools/import-normalizer/`). The backend `DatePrecision` enum
+(`DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`) is a verbatim copy of the normalizer's
+`Precision(StrEnum)` (`dates.py`). There is **no translation layer**: the normalizer's
+output strings are persisted as-is. The same applies to `source_ref`, which carries the
+normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.
+
+---
+
+## Consequences
+
+- **RANGE is one-directional, not biconditional.** A `RANGE` row may have a null
+  `meta_date_end` (an open-ended range with only a start), because the normalizer can emit
+  start-only ranges. A biconditional `RANGE ⟺ end IS NOT NULL` rule would reject valid
+  normalizer output, so it was rejected. Phase 4 rendering must handle a `RANGE` with no end
+  gracefully.
+- **`provisional` stays `false` throughout this phase.** The column and flag exist, but no
+  code path sets it `true`; the importer (Phase 3) is the only writer. This is intentional,
+  not a half-built feature.
+- **A future dev must not "improve" the enum.** Renaming or dropping a `DatePrecision` value
+  without changing the normalizer silently breaks import idempotency and date rendering. The
+  enum's Javadoc states this; the DB `CHECK` enforces validity independent of the Java enum.
+- **`source_ref` is unique + nullable.** Manually created persons/tags have `source_ref =
+  NULL`; Postgres allows multiple NULLs under a plain unique index, so no backfill is needed.
+- **Forward-only.** The migration is immutable once shipped (Flyway checksum model); any fix
+  goes in a later version. There is no down-migration — rollback means restoring from the
+  nightly `pg_dump`, the standard procedure.
+- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
+  interface projection; because the projection is backed by native SQL, the column had to be
+  added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
+  `findTopByDocumentCount`) or it would silently return `false`. Guarded by integration tests
+  against real Postgres.
diff --git a/docs/architecture/db/db-orm.puml b/docs/architecture/db/db-orm.puml
index a6e64aa3..7b03c156 100644
--- a/docs/architecture/db/db-orm.puml
+++ b/docs/architecture/db/db-orm.puml
@@ -1,6 +1,6 @@
 @startuml db-orm
-' Schema source: Flyway V1–V60 (excl. 
V37, V43 — intentionally removed)
-' Schema as of: V60 (2026-05-06)
+' Schema source: Flyway V1–V69 (excl. V37, V43 — intentionally removed)
+' Schema as of: V69 (2026-05-27)
 ' ⚠ This is a versioned snapshot. Update when the schema changes significantly.
 
 hide circle
@@ -88,6 +88,11 @@ package "Documents" {
     summary : TEXT
     transcription : TEXT
     meta_date : DATE
+    meta_date_precision : VARCHAR(16) NOT NULL
+    meta_date_end : DATE
+    meta_date_raw : TEXT
+    sender_text : TEXT
+    receiver_text : TEXT
     meta_location : VARCHAR(255)
     meta_document_location : VARCHAR(255)
     archive_box : VARCHAR(255)
@@ -182,6 +187,8 @@ package "Persons" {
     birth_year : INTEGER
     death_year : INTEGER
     family_member : BOOLEAN NOT NULL
+    source_ref : VARCHAR(255) UNIQUE
+    provisional : BOOLEAN NOT NULL
   }
 
   entity person_name_aliases {
@@ -217,6 +224,7 @@ package "Tags" {
     name : VARCHAR(255) NOT NULL UNIQUE
     parent_id : UUID <>
     color : VARCHAR(20)
+    source_ref : VARCHAR(255) UNIQUE
   }
 }
 
diff --git a/docs/architecture/db/db-relationships.puml b/docs/architecture/db/db-relationships.puml
index c3100cfa..d6f4b542 100644
--- a/docs/architecture/db/db-relationships.puml
+++ b/docs/architecture/db/db-relationships.puml
@@ -1,7 +1,9 @@
 @startuml db-relationships
-' Schema source: Flyway V1–V60 (excl. V37, V43 — intentionally removed)
-' Schema as of: V60 (2026-05-06)
+' Schema source: Flyway V1–V69 (excl. V37, V43 — intentionally removed)
+' Schema as of: V69 (2026-05-27)
 ' ⚠ This is a versioned snapshot. Update when the schema changes significantly.
+' Note: V69 adds columns only (persons.source_ref, tag.source_ref, document
+' precision/attribution fields); no new FK relationships, so this diagram is unchanged.
 
 hide circle
 skinparam linetype ortho