All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m41s
CI / OCR Service Tests (pull_request) Successful in 23s
CI / Backend Unit Tests (pull_request) Successful in 3m51s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 23s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m4s
@Elicit on PR #693: two doc gaps that block traceability on this PR. 1. docs/GLOSSARY.md: add a Stammbaum section with the layout vocabulary introduced by #689 and #361 — Stammbaum, seeded rank, sibling block, loose spouse, parented, anchor index, intra-family marriage, marriage dot, canonical fixture. Removes the Pending placeholder. 2. docs/adr/026: commit the AC3 reachability probe (the SQL that returned "0 of 942 unseeded persons match the predicate" in May 2026) directly into the ADR. A future architect re-evaluating the deferral can rerun it verbatim — reproducibility of the decision is itself a requirement. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
172 lines
18 KiB
Markdown
172 lines
18 KiB
Markdown
# Familienarchiv — Glossary
|
||
|
||
Domain-specific and overloaded terms used in this codebase.
|
||
Each entry: **Term** — definition (≤ 2 sentences). Where two terms are easily confused, a _Not to be confused with_ note follows.
|
||
|
||
For architecture context see [`docs/architecture/c4-diagrams.md`](architecture/c4-diagrams.md).
|
||
For domain package structure see [`docs/ARCHITECTURE.md`](ARCHITECTURE.md) _(coming: DOC-2)_.
|
||
|
||
---
|
||
|
||
## Identity Terms
|
||
|
||
**AppUser** (`AppUser`) — a real person who can log into the system (a family member or administrator). `AppUser` records carry login credentials, group memberships, and notification history.
|
||
_Not to be confused with [Person](#person-person)_ — an AppUser is never recorded as a document sender, receiver, or historical individual.
|
||
|
||
**Reader** — an `AppUser` whose effective permissions include `READ_ALL` but neither `WRITE_ALL` nor `ANNOTATE_ALL`. Readers see a dedicated dashboard (`isReader = !canWrite && !canAnnotate`) focused on browsing documents, persons, and stories rather than contribution tasks. A user who also holds `BLOG_WRITE` is still classified as a Reader and additionally sees a drafts module.
|
||
_Not to be confused with [AppUser](#appuser-appuser)_ — Reader is a permission-derived role, not an entity.
|
||
|
||
**Permission** — a discrete capability string assigned to a `UserGroup` (e.g. `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`). Enforced via the `@RequirePermission` AOP annotation on controller methods, checked at runtime by `PermissionAspect`; not via Spring Security's `@PreAuthorize`.
|
||
|
||
**Person** (`Person`) — a historical individual in the family archive (sender, receiver of letters, person mentioned in transcriptions). NEVER has a login account and NEVER appears as an `AppUser`.
|
||
_Not to be confused with [AppUser](#appuser-appuser)_ — `Person` is a historical record; `AppUser` is someone who can log in today.
|
||
|
||
**PersonNameAlias** (`PersonNameAlias`) — an alternate or historical name form associated with a `Person` (e.g. maiden name, nickname, abbreviated form). Used to locate `Person` records during mass import via `PersonNameAliasType`.
|
||
|
||
**UserGroup** (`UserGroup`) — a named permission bundle assigned to one or more `AppUser`s. A user's effective permissions are the union of all permissions across all groups they belong to.
|
||
|
||
**source_ref** (`Person.sourceRef`, `Tag.sourceRef`) — the import normalizer's stable identity for a `Person` (its `person_id`) or `Tag` (its canonical `tag_path`). It is the join key linking normalized records to documents and the idempotency key for re-import; null for manually created records and unique among non-null values.
|
||
|
||
**provisional person** (`Person.provisional`) — a `Person` the importer inferred from raw attribution text but could not confidently match to a known individual. The flag lets the persons directory surface uncertainty honestly rather than fabricate a confident identity; it defaults to `false` and is set `true` only by the importer.
|
||
_Not to be confused with `family_member`_ — `provisional` expresses import confidence, while `family_member` is a genealogical fact about whether the person belongs to the family tree.
|
||
|
||
---
|
||
|
||
## Document-Related Terms
|
||
|
||
**Annotation** (`DocumentAnnotation`) — a free-form polygon or shape drawn over a document page image to highlight a region of interest. Always scoped to a specific page of a `Document`; stored as a polygon (JSONB).
|
||
_See also [TranscriptionBlock](#transcriptionblock-transcriptionblock)._
|
||
|
||
**Comment** (`DocumentComment`, table `document_comments`) — a threaded discussion message attached to a `Document`. Always scoped to a `Document`; optionally further contextualized by a specific `DocumentAnnotation` or `TranscriptionBlock`.
|
||
|
||
**Document** (`Document`) — a single archival item (letter, postcard, photograph) with a file stored in MinIO/S3 and associated metadata (sender, receivers, date, tags, transcription blocks).
|
||
|
||
**date precision** (`Document.metaDatePrecision`, enum `DatePrecision`) — how exactly a document's date is known, one of `DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`. A verbatim mirror of the import normalizer's `Precision` enum so honest dates can be rendered (`APPROX` → "ca.", `RANGE` uses `meta_date_end`) instead of fabricating a false `DAY`-level date. `UNKNOWN` is the explicit value for undated documents.
|
||
|
||
**raw attribution** (`Document.senderText`, `Document.receiverText`, `Document.metaDateRaw`) — the original spreadsheet cell text for a document's sender, receiver, and date, preserved verbatim even after a `Person` or normalized date is linked. It keeps provenance intact and enables an "as written in the original" view.
|
||
|
||
**DocumentVersion** (`DocumentVersion`) — an append-only snapshot of a `Document`'s metadata at a point in time. Append-only by convention; no consumer-facing create or update endpoint exists. The entity uses Lombok `@Data` (which generates setters), so immutability is enforced by application convention, not at the Java level.
|
||
|
||
**Tag** (`Tag`) — a hierarchical category that can be applied to `Document`s. Tags are self-referencing via a `parent_id` foreign key, forming a tree structure.
|
||
|
||
**TranscriptionBlock** (`TranscriptionBlock`) — a paragraph-level segment of a `Document`'s transcribed text, with a polygon region (stored as JSONB) identifying its position on the page. One document can have many blocks across multiple pages.
|
||
_See also [Annotation](#annotation-documentannotation)._
|
||
|
||
---
|
||
|
||
## Workflow Terms
|
||
|
||
**DocumentStatus lifecycle** — the ordered states a `Document` moves through:
|
||
`PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED`
|
||
- `PLACEHOLDER`: created during mass import; no file attached yet.
|
||
- `UPLOADED`: a file has been stored in MinIO/S3.
|
||
- `TRANSCRIBED`: all transcription blocks have been marked done.
|
||
- `REVIEWED`: a reviewer has approved the transcription.
|
||
- `ARCHIVED`: the document is finalized and read-only.
|
||
|
||
**Canonical import** — an asynchronous batch process (`CanonicalImportOrchestrator`) that consumes the normalizer's committed canonical artifacts and creates `Tag`s, `Person`s (register + tree), family relationships, and `Document`s. Four idempotent loaders run in a fixed dependency order — `TagTreeImporter` → `PersonRegisterImporter` → `PersonTreeImporter` → `DocumentImporter` — each calling the owning domain's service. Re-running it never duplicates rows (upsert by `source_ref` / document `index`) and never overwrites a human-edited field. Only one import can run at a time (`IMPORT_ALREADY_RUNNING` error if attempted concurrently); a missing or malformed artifact fails closed (`IMPORT_ARTIFACT_INVALID`). Replaced the legacy raw-spreadsheet `MassImportService` (see ADR-025).
|
||
|
||
**canonical artifact** — one of the four files the normalizer (`tools/import-normalizer/`) emits and commits to `tools/import-normalizer/out/`: `canonical-tag-tree.xlsx`, `canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx`. They are the contract the backend importer reads (mapped by header name); the semantic transformation (German-date parsing, name classification) lives only in the normalizer, never in Java.
|
||
|
||
**CanonicalSheetReader** — the value-level POI helper that opens a canonical `.xlsx`, maps the header row to column indices by name (replacing the brittle positional column config), splits pipe-delimited list columns, and throws `IMPORT_ARTIFACT_INVALID` on a missing required header rather than NPE-ing on a null index.
|
||
|
||
**SkippedFile** (`ImportStatus.SkippedFile`) — a file that was presented for import but not processed, recorded with a `filename` and a `reason` code. Possible reasons: `INVALID_FILENAME_PATH_TRAVERSAL` (the file-column basename failed the path-traversal guard), `INVALID_PDF_SIGNATURE` (magic-byte validation failed), `S3_UPLOAD_FAILED` (file upload to MinIO/S3 threw an exception), `FILE_READ_ERROR` (the file could not be opened for reading), or `ALREADY_EXISTS` (a document with the same `index` already exists in the archive with a status other than `PLACEHOLDER`).
|
||
|
||
**skipped count** — the total number of `SkippedFile` entries accumulated during a single import run (`ImportStatus.skipped()`). Shown in the amber warning section of the Import Status Card in the admin UI; a value of zero suppresses the section entirely.
|
||
|
||
**Transcription queue** — the set of `Document`s and `TranscriptionBlock`s awaiting work, computed on-the-fly from `Document`/`Block` status. Three views: segmentation queue, transcription queue, ready-to-read queue. NOT a persistent entity — no `transcription_queues` table exists.
|
||
_See also [DocumentStatus lifecycle](#documentstatus-lifecycle)._
|
||
|
||
---
|
||
|
||
## OCR-Specific Terms
|
||
|
||
**HTR** — Handwritten Text Recognition. Recognizes cursive and historical handwriting (contrasted with OCR for printed/typewritten text). The primary mode used for letters in this archive.
|
||
|
||
**Kurrent** — Old German cursive handwriting style, the primary historical script appearing in letters from the 1899–1950 period covered by this archive.
|
||
|
||
**OCR** — Optical Character Recognition. Recognizes printed or typewritten text. Used for typed documents; HTR is used for handwritten ones.
|
||
|
||
**OcrJob** (`OcrJob`, table `ocr_jobs`) — a first-class persistent entity tracking a batch OCR run across one or more documents (`OcrJobDocument`, table `ocr_job_documents`). Distinct from the concept of "running OCR on a single document." Lifecycle: `PENDING → RUNNING → DONE / FAILED` (see `OcrJobStatus`).
|
||
|
||
**SenderModel** (`SenderModel`, table `sender_models`) — a fine-tuned Kraken HTR model trained on a specific historical correspondent's handwriting. Both an OCR-service concept (the model weights) and a persistent entity linking a `Person` to the path of their trained model file.
|
||
|
||
**Sütterlin** — A specific standardized style of Kurrent taught in German schools from 1915 to 1941.
|
||
|
||
**Illegible word** — a word whose recognition confidence falls below the configured threshold; replaced with the literal token `[unleserlich]` in the rendered block text and counted in the `ocr_illegible_words_total` Prometheus counter.
|
||
|
||
**Models-ready gauge** — the `ocr_models_ready` Prometheus gauge, flipped from `0` to `1` once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker. Used both for the `/health` endpoint and as the supervised signal for the `ocr_models_ready < 1 for 2m` alert.
|
||
|
||
**Recognition model accuracy** — the accuracy reported by `ketos train` for the recognition (text-line) model, exposed as `ocr_model_accuracy{kind="recognition"}`. Sourced from `_parse_best_checkpoint` on the highest-scoring checkpoint after training.
|
||
|
||
**Segmentation model accuracy** — the accuracy reported by `ketos segtrain` for the baseline layout analysis (`blla`) model, exposed as `ocr_model_accuracy{kind="segmentation"}`. Distinct from recognition accuracy because the two models are trained and improved independently.
|
||
|
||
---
|
||
|
||
## Stammbaum (Family-Tree Layout) Terms
|
||
|
||
**Stammbaum** `[user-facing]` — the genealogy / family-tree view of the archive, accessible at `/stammbaum`. Renders every `Person` as a node positioned by `PersonRelationship` edges (`PARENT_OF`, `SPOUSE_OF`) into rows that correspond to generations. The browser-side layout pipeline lives at `frontend/src/lib/person/genealogy/`.
|
||
_See also [PersonRelationship](#person-person)._
|
||
|
||
**seeded rank** (`Person.generation`) — the imported generation index on a `Person` (G 0 = founders, increasing downward), used as a strict row anchor in `buildLayout.ts`. The iterative fallback heuristic never overrides a seeded rank, and spouse-pulldown never pulls a seeded rank — only unseeded nodes (no `generation`) flow through the heuristic.
|
||
|
||
**sibling block** — a layout unit holding the children of a single parent-set at one generation, used inside `buildLayout.ts`. Each block has a center computed from the parents' midpoint; blocks are then packed left-to-right within a generation row. Two adjacent sibling blocks at the same rank can be merged if a `SPOUSE_OF` edge crosses them (intra-family marriage, AC2).
|
||
|
||
**loose spouse** — a person at a given generation who is a spouse of someone in a sibling block but is not themselves a parented child of anyone in the graph. Loose spouses are attached adjacent to their parented partner (right side per Leonie's UX rule) so the spouse line stays short.
|
||
_Not to be confused with [parented](#parented-layout)_ — loose is the absence of parent edges into the graph.
|
||
|
||
**parented** `[layout]` — a layout flag on a sibling-block member indicating that the person has at least one `PARENT_OF` edge incoming from a node already in the graph at the prior generation. Parented members are the layout anchors of their block (the block is centred so the average index of parented members sits under the parents' midpoint); non-parented members (loose spouses) ride along on the side.
|
||
|
||
**anchor index** — within a sibling block, the average position of `parented` member indices. The block is shifted horizontally so this index, multiplied by `NODE_W + COL_GAP`, lines up under the midpoint of the block's parents — keeping every parent-child connector orthogonal (90°).
|
||
|
||
**intra-family marriage** — a `SPOUSE_OF` edge where both endpoints are parented members of *different* sibling blocks at the same rank (i.e. both have parents in the graph, but the parent sets differ). Layout merges the two blocks so the spouses sit adjacent at the join boundary; latent in current data (0 cases in the May-2026 canonical snapshot) but covered by a synthetic regression test in `buildLayout.test.ts`.
|
||
|
||
**marriage dot** — the SVG circle drawn at the midpoint of a `SPOUSE_OF` connector in the Stammbaum tree (`StammbaumTree.svelte`). Radius is `r=6` (12 px diameter) so the marker meets WCAG 1.4.11 (3:1 non-text contrast) when it stacks to disambiguate multiple marriages on the same focal person.
|
||
|
||
**canonical fixture** (Stammbaum) — `frontend/src/lib/person/genealogy/__fixtures__/stammbaum.json`, a pinned `/api/network` snapshot used by `buildLayout.test.ts` for structural-property assertions against real data. Captured locally via `frontend/scripts/capture-network-fixture.mjs` with explicit credentials and a localhost backend; never invoked from CI. Sanity-gated by `validateFixture.ts` (≥ 50 nodes / ≥ 5 generations / ≥ 1 SPOUSE_OF edge / ≥ 1 multi-spouse person).
|
||
|
||
---
|
||
|
||
## Other Domain Terms
|
||
|
||
**Aktivität / Aktivitäten** `[user-facing]` — the family activity feed accessible at `/aktivitaeten`. Shows recent documents, transcriptions, comments, and Geschichten as a chronological timeline.
|
||
_See also [Chronik](#chronik-internal)._
|
||
|
||
**Briefwechsel** `[user-facing]` — the bilateral conversation timeline between two `Person`s, derived from `Document` sender/receiver relationships. Accessible at `/briefwechsel`. Not a persistent entity — data is computed from existing `Document` records.
|
||
_See also [Derived domain](#derived-domain)._
|
||
|
||
**Chronik** `[internal]` — the conceptual and code-level name for the unified activity feed (per ADR-003 `003-chronik-unified-activity-feed.md`). Used in code, architecture documents, and ADRs. The user-facing label for the same concept is [Aktivität](#aktivitat--aktivitaten-user-facing).
|
||
|
||
**Geschichte** (`Geschichte`) `[user-facing]` — a narrative story or article published in the archive, linking `Person`s and `Document`s. Lifecycle: `DRAFT → PUBLISHED` (see `GeschichteStatus`). DRAFT stories are hidden from users without the `BLOG_WRITE` permission.
|
||
|
||
**Notification** (`Notification`) — an in-app message delivered to an `AppUser`. No email or SMS delivery exists today. Delivered via Server-Sent Events (`SseEmitterRegistry`) and persisted in the `notifications` table.
|
||
|
||
**Audit log** (`AuditLog`, table `audit_log`) — an append-only event store recording domain-level activity (document edits, user actions, etc.). Append-only by application convention; a `REVOKE UPDATE, DELETE` is attempted at the DB layer (see migrations V46, V47) but is a no-op if the application role is the table owner in PostgreSQL. Do not rely on DB-enforced immutability — the constraint is application-layer only.
|
||
|
||
---
|
||
|
||
## Architectural Terms
|
||
|
||
**Cross-cutting** — code that lives in `lib/shared/` (frontend) or cross-domain packages (backend) because it has no entity of its own, no user-facing CRUD, AND is used by two or more domains OR is framework infrastructure (error handling, API client, i18n utilities).
|
||
|
||
**Derived domain** — a Tier-2 frontend domain that has its own UI but no backend entities of its own. Data is computed from Tier-1 domain records. Current derived domains: `conversation` (from `Document` sender/receivers) and `activity` (from audit, notifications, document events).
|
||
_See also [Briefwechsel](#briefwechsel-user-facing)._
|
||
|
||
**Domain** — a Tier-1 bounded context with its own entities, controller, service, repository, and DTOs. Backend domains: `document`, `person`, `tag`, `user`, `geschichte`, `notification`, `ocr`, `audit`, `dashboard`. Frontend domains mirror this structure under `src/lib/`.
|
||
|
||
---
|
||
|
||
## Infrastructure Terms
|
||
|
||
**archiv-app** — the bucket-scoped MinIO service account the backend uses to read and write the `familienarchiv` bucket. Distinct from the MinIO root account (`archiv`, used only by the bootstrap container for admin operations). Defined and provisioned in [`infra/minio/bootstrap.sh`](../infra/minio/bootstrap.sh) and consumed by the backend as `S3_ACCESS_KEY` in [`docker-compose.prod.yml`](../docker-compose.prod.yml). The attached `archiv-app-policy` grants `s3:GetObject/PutObject/DeleteObject` on `familienarchiv/*` and `s3:ListBucket/GetBucketLocation` on the bucket only — not the built-in `readwrite` policy which would grant `s3:*` on all buckets.
|
||
_See also [ADR-010 — MinIO stays self-hosted, not Hetzner OBS](./adr/010-minio-self-hosted-not-hetzner-obs.md)._
|
||
|
||
---
|
||
|
||
## Pending Terms
|
||
|
||
_Terms flagged as potentially ambiguous that have not yet been formally defined here. Add an entry above and remove it from this list when resolved._
|
||
|
||
- Terms surfaced by Epic 1 audit findings (#388–#392) — review audit reports under `docs/audits/` when available and add any term flagged as ambiguous.
|
||
- `OcrBatchService` vs `OcrAsyncRunner` — both handle async OCR orchestration; their division of responsibility should be clarified here.
|