All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m52s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
193 lines
21 KiB
Markdown
193 lines
21 KiB
Markdown
# Familienarchiv — Glossary
|
||
|
||
Domain-specific and overloaded terms used in this codebase.
|
||
Each entry: **Term** — definition (≤ 2 sentences). Where two terms are easily confused, a _Not to be confused with_ note follows.
|
||
|
||
For architecture context see [`docs/architecture/c4-diagrams.md`](architecture/c4-diagrams.md).
|
||
For domain package structure see [`docs/ARCHITECTURE.md`](ARCHITECTURE.md) _(coming: DOC-2)_.
|
||
|
||
---
|
||
|
||
## Identity Terms
|
||
|
||
**AppUser** (`AppUser`) — a real person who can log into the system (a family member or administrator). `AppUser` records carry login credentials, group memberships, and notification history.
|
||
_Not to be confused with [Person](#person-person)_ — an AppUser is never recorded as a document sender, receiver, or historical individual.
|
||
|
||
**Reader** — an `AppUser` whose effective permissions include `READ_ALL` but neither `WRITE_ALL` nor `ANNOTATE_ALL`. Readers see a dedicated dashboard (`isReader = !canWrite && !canAnnotate`) focused on browsing documents, persons, and stories rather than contribution tasks. A user who also holds `BLOG_WRITE` is still classified as a Reader and additionally sees a drafts module.
|
||
_Not to be confused with [AppUser](#appuser-appuser)_ — Reader is a permission-derived role, not an entity.
|
||
|
||
**Permission** — a discrete capability string assigned to a `UserGroup` (e.g. `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`). Enforced via the `@RequirePermission` AOP annotation on controller methods, checked at runtime by `PermissionAspect`; not via Spring Security's `@PreAuthorize`.
|
||
|
||
**Person** (`Person`) — a historical individual in the family archive (sender, receiver of letters, person mentioned in transcriptions). NEVER has a login account and NEVER appears as an `AppUser`.
|
||
_Not to be confused with [AppUser](#appuser-appuser)_ — `Person` is a historical record; `AppUser` is someone who can log in today.
|
||
|
||
**PersonNameAlias** (`PersonNameAlias`) — an alternate or historical name form associated with a `Person` (e.g. maiden name, nickname, abbreviated form). Used to locate `Person` records during mass import via `PersonNameAliasType`.
|
||
|
||
**UserGroup** (`UserGroup`) — a named permission bundle assigned to one or more `AppUser`s. A user's effective permissions are the union of all permissions across all groups they belong to.
|
||
|
||
**source_ref** (`Person.sourceRef`, `Tag.sourceRef`) — the import normalizer's stable identity for a `Person` (its `person_id`) or `Tag` (its canonical `tag_path`). It is the join key linking normalized records to documents and the idempotency key for re-import; null for manually created records and unique among non-null values.
|
||
|
||
**provisional person** (`Person.provisional`) — a `Person` the importer inferred from raw attribution text but could not confidently match to a known individual. The flag lets the persons directory surface uncertainty honestly rather than fabricate a confident identity; it defaults to `false` and is set `true` only by the importer.
|
||
_Not to be confused with `family_member`_ — `provisional` expresses import confidence, while `family_member` is a genealogical fact about whether the person belongs to the family tree.
|
||
|
||
---
|
||
|
||
## Document-Related Terms
|
||
|
||
**Annotation** (`DocumentAnnotation`) — a free-form polygon or shape drawn over a document page image to highlight a region of interest. Always scoped to a specific page of a `Document`; stored as a polygon (JSONB).
|
||
_See also [TranscriptionBlock](#transcriptionblock-transcriptionblock)._
|
||
|
||
**Comment** (`DocumentComment`, table `document_comments`) — a threaded discussion message attached to a `Document`. Always scoped to a `Document`; optionally further contextualized by a specific `DocumentAnnotation` or `TranscriptionBlock`.
|
||
|
||
**Document** (`Document`) — a single archival item (letter, postcard, photograph) with a file stored in MinIO/S3 and associated metadata (sender, receivers, date, tags, transcription blocks).
|
||
|
||
**date precision** (`Document.metaDatePrecision`, enum `DatePrecision`) — how exactly a document's date is known, one of `DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN`. A verbatim mirror of the import normalizer's `Precision` enum so honest dates can be rendered (`APPROX` → "ca.", `RANGE` uses `meta_date_end`) instead of fabricating a false `DAY`-level date. `UNKNOWN` is the explicit value for undated documents.
|
||
|
||
**raw attribution** (`Document.senderText`, `Document.receiverText`, `Document.metaDateRaw`) — the original spreadsheet cell text for a document's sender, receiver, and date, preserved verbatim even after a `Person` or normalized date is linked. It keeps provenance intact and enables an "as written in the original" view.
|
||
|
||
**auto-generated title** (`DocumentTitleFactory`) — a `Document` title composed by the formula `{index} – {dateLabel} – {location}` (index = `originalFilename`; date label honest at the row's precision; location omitted when blank). On edit, an unchanged auto-title follows a corrected date/location forward (exact old-vs-new match in `DocumentService.updateDocument`); a hand-written title is kept verbatim. `POST /api/admin/backfill-titles` rewrites already-stale ones in one sweep using a grammar heuristic (`DocumentTitleBackfillMatcher`).
|
||
_Not to be confused with a hand-written title_ — only a title that still equals what the factory builds is treated as machine-generated and rewritten; prose is left untouched.
|
||
|
||
**DocumentVersion** (`DocumentVersion`) — an append-only snapshot of a `Document`'s metadata at a point in time. Append-only by convention; no consumer-facing create or update endpoint exists. The entity uses Lombok `@Data` (which generates setters), so immutability is enforced by application convention, not at the Java level.
|
||
|
||
**Tag** (`Tag`) — a hierarchical category that can be applied to `Document`s. Tags are self-referencing via a `parent_id` foreign key, forming a tree structure.
|
||
|
||
**TranscriptionBlock** (`TranscriptionBlock`) — a paragraph-level segment of a `Document`'s transcribed text, with a polygon region (stored as JSONB) identifying its position on the page. One document can have many blocks across multiple pages.
|
||
_See also [Annotation](#annotation-documentannotation)._
|
||
|
||
---
|
||
|
||
## Workflow Terms
|
||
|
||
**DocumentStatus lifecycle** — the ordered states a `Document` moves through:
|
||
`PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED`
|
||
|
||
- `PLACEHOLDER`: created during mass import; no file attached yet.
|
||
- `UPLOADED`: a file has been stored in MinIO/S3.
|
||
- `TRANSCRIBED`: all transcription blocks have been marked done.
|
||
- `REVIEWED`: a reviewer has approved the transcription.
|
||
- `ARCHIVED`: the document is finalized and read-only.
|
||
|
||
**Canonical import** — an asynchronous batch process (`CanonicalImportOrchestrator`) that consumes the normalizer's committed canonical artifacts and creates `Tag`s, `Person`s (register + tree), family relationships, and `Document`s. Four idempotent loaders run in a fixed dependency order — `TagTreeImporter` → `PersonRegisterImporter` → `PersonTreeImporter` → `DocumentImporter` — each calling the owning domain's service. Re-running it never duplicates rows (upsert by `source_ref` / document `index`) and never overwrites a human-edited field. Only one import can run at a time (`IMPORT_ALREADY_RUNNING` error if attempted concurrently); a missing or malformed artifact fails closed (`IMPORT_ARTIFACT_INVALID`). Replaced the legacy raw-spreadsheet `MassImportService` (see ADR-025).
|
||
|
||
**canonical artifact** — one of the four files the normalizer (`tools/import-normalizer/`) emits and commits to `tools/import-normalizer/out/`: `canonical-tag-tree.xlsx`, `canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx`. They are the contract the backend importer reads (mapped by header name); the semantic transformation (German-date parsing, name classification) lives only in the normalizer, never in Java.
|
||
|
||
**CanonicalSheetReader** — the value-level POI helper that opens a canonical `.xlsx`, maps the header row to column indices by name (replacing the brittle positional column config), splits pipe-delimited list columns, and throws `IMPORT_ARTIFACT_INVALID` on a missing required header rather than NPE-ing on a null index.
|
||
|
||
**SkippedFile** (`ImportStatus.SkippedFile`) — a file that was presented for import but not processed, recorded with a `filename` and a `reason` code. Possible reasons: `INVALID_FILENAME_PATH_TRAVERSAL` (the file-column basename failed the path-traversal guard), `INVALID_PDF_SIGNATURE` (magic-byte validation failed), `S3_UPLOAD_FAILED` (file upload to MinIO/S3 threw an exception), `FILE_READ_ERROR` (the file could not be opened for reading), or `ALREADY_EXISTS` (a document with the same `index` already exists in the archive with a status other than `PLACEHOLDER`).
|
||
|
||
**skipped count** — the total number of `SkippedFile` entries accumulated during a single import run (`ImportStatus.skipped()`). Shown in the amber warning section of the Import Status Card in the admin UI; a value of zero suppresses the section entirely.
|
||
|
||
**Transcription queue** — the set of `Document`s and `TranscriptionBlock`s awaiting work, computed on-the-fly from `Document`/`Block` status. Three views: segmentation queue, transcription queue, ready-to-read queue. NOT a persistent entity — no `transcription_queues` table exists.
|
||
_See also [DocumentStatus lifecycle](#documentstatus-lifecycle)._
|
||
|
||
---
|
||
|
||
## OCR-Specific Terms
|
||
|
||
**HTR** — Handwritten Text Recognition. Recognizes cursive and historical handwriting (contrasted with OCR for printed/typewritten text). The primary mode used for letters in this archive.
|
||
|
||
**Kurrent** — Old German cursive handwriting style, the primary historical script appearing in letters from the 1899–1950 period covered by this archive.
|
||
|
||
**OCR** — Optical Character Recognition. Recognizes printed or typewritten text. Used for typed documents; HTR is used for handwritten ones.
|
||
|
||
**OcrJob** (`OcrJob`, table `ocr_jobs`) — a first-class persistent entity tracking a batch OCR run across one or more documents (`OcrJobDocument`, table `ocr_job_documents`). Distinct from the concept of "running OCR on a single document." Lifecycle: `PENDING → RUNNING → DONE / FAILED` (see `OcrJobStatus`).
|
||
|
||
**SenderModel** (`SenderModel`, table `sender_models`) — a fine-tuned Kraken HTR model trained on a specific historical correspondent's handwriting. Both an OCR-service concept (the model weights) and a persistent entity linking a `Person` to the path of their trained model file.
|
||
|
||
**Sütterlin** — A specific standardized style of Kurrent taught in German schools from 1915 to 1941.
|
||
|
||
**Illegible word** — a word whose recognition confidence falls below the configured threshold; replaced with the literal token `[unleserlich]` in the rendered block text and counted in the `ocr_illegible_words_total` Prometheus counter.
|
||
|
||
**Models-ready gauge** — the `ocr_models_ready` Prometheus gauge, flipped from `0` to `1` once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker. Used both for the `/health` endpoint and as the supervised signal for the `ocr_models_ready < 1 for 2m` alert.
|
||
|
||
**Recognition model accuracy** — the accuracy reported by `ketos train` for the recognition (text-line) model, exposed as `ocr_model_accuracy{kind="recognition"}`. Sourced from `_parse_best_checkpoint` on the highest-scoring checkpoint after training.
|
||
|
||
**Segmentation model accuracy** — the accuracy reported by `ketos segtrain` for the baseline layout analysis (`blla`) model, exposed as `ocr_model_accuracy{kind="segmentation"}`. Distinct from recognition accuracy because the two models are trained and improved independently.
|
||
|
||
---
|
||
|
||
## Stammbaum (Family-Tree Layout) Terms
|
||
|
||
**Stammbaum** `[user-facing]` — the genealogy / family-tree view of the archive, accessible at `/stammbaum`. Renders every `Person` as a node positioned by `PersonRelationship` edges (`PARENT_OF`, `SPOUSE_OF`) into rows that correspond to generations. The browser-side layout pipeline lives at `frontend/src/lib/person/genealogy/`.
|
||
_See also [PersonRelationship](#person-person)._
|
||
|
||
**seeded rank** (`Person.generation`) — the imported generation index on a `Person` (G 0 = founders, increasing downward), used as a strict row anchor in `buildLayout.ts`. The iterative fallback heuristic never overrides a seeded rank, and spouse-pulldown never pulls a seeded rank — only unseeded nodes (no `generation`) flow through the heuristic.
|
||
|
||
**family forest** — the model the Stammbaum horizontal layout reasons over (ADR-030, `familyForest.ts`): a forest of **units** rather than per-generation rows. Replaces the old per-generation "sibling block" packer. The canonical fixture is ~24 root units over 62 nodes.
|
||
|
||
**unit** `[layout]` — one bloodline carrier (the **primary**) plus the spouse(s) absorbed into its run, rendered as one adjacent row of cards. `members[0]` is the primary; the rest are spouses in marriage-year order (#361). A lone person is a unit of one. A unit's children are the units anchored by the couple's offspring. The unit — not the individual — is the node the tidy-tree packs.
|
||
|
||
**tidy tree** — the bottom-up Reingold–Tilford contour packer (`tidyTree.ts`) that assigns each unit's horizontal `x`: lay out child subtrees first, pack them so their contours clear by `COL_GAP` at every level, then centre the unit over the span of its children. Contours are indexed by absolute generation level, so unrelated roots at different generations share x-columns. `x` comes from structure; `y` still comes from rank (`assignRanks`, #689).
|
||
|
||
**structural owner** — for a couple, the spouse that keeps the bloodline (hierarchy) position: lower `birthYear`, then stable `id` (`pickStructuralOwner` in `familyForest.ts`). The other spouse is absorbed into the owner's run. Reused by the cross-link, cycle, and intra-family paths so the rule is defined once.
|
||
|
||
**loose spouse** — a person who marries into the graph with no `PARENT_OF` edges of their own. They are absorbed into their partner's unit run (no ancestor subtree), but any children of theirs still anchor through the couple unit.
|
||
|
||
**bloodline** — the set of people reachable from a root unit via structural-owner `PARENT_OF` edges; renders as one contiguous horizontal band with no foreign node interleaved (the contiguity invariant that fixed the smeared-bloodline bug, #724).
|
||
|
||
**cross-link** `[layout]` — a `PARENT_OF` edge whose child is positioned in a spouse's run elsewhere (a cross-level intra-family marriage). The connector draws it with a distinct `2 6` dash at reduced opacity — never the `4 4` ended-marriage cadence — with geometry still landing on the child (WCAG 1.4.1).
|
||
|
||
**intra-family marriage** — a `SPOUSE_OF` edge where both endpoints have parents in the graph. The couple is always exactly adjacent in the owner's run; when the two spouses' parents sit at the same structural level the displaced parent edge stays solid (the adjacency case), otherwise it renders as a cross-link. The canonical fixture has two such marriages (Walter⚭Eugenie, Clara⚭Herbert), covered in `buildLayout.test.ts`.
|
||
|
||
**marriage dot** — the SVG circle drawn at the midpoint of a `SPOUSE_OF` connector in the Stammbaum tree (`StammbaumTree.svelte`). Radius is `r=6` (12 px diameter) so the marker meets WCAG 1.4.11 (3:1 non-text contrast) when it stacks to disambiguate multiple marriages on the same focal person.
|
||
|
||
**canonical fixture** (Stammbaum) — `frontend/src/lib/person/genealogy/__fixtures__/stammbaum.json`, a pinned `/api/network` snapshot used by `buildLayout.test.ts` for structural-property assertions against real data. Captured locally via `frontend/scripts/capture-network-fixture.mjs` with explicit credentials and a localhost backend; never invoked from CI. Sanity-gated by `validateFixture.ts` (≥ 50 nodes / ≥ 5 generations / ≥ 1 SPOUSE_OF edge / ≥ 1 multi-spouse person).
|
||
|
||
**pan/zoom view state** `[#692]` — the `{ x, y, z }` triple (`PanZoomState` in `frontend/src/lib/person/genealogy/panZoom.ts`) describing the Stammbaum canvas position: `x`/`y` are pan offsets applied to the SVG viewBox centre, `z` is the zoom factor (clamped 0.25–10). Mirrored into the shareable URL as `?cx&cy&z` and seeded server-side from those params. See [ADR-027](adr/027-stammbaum-custom-viewbox-pan-zoom.md).
|
||
|
||
**fit-to-screen** `[user-facing, #692]` — the Stammbaum control (`⤢`) and initial state that frames the whole tree in the viewport. Because the base viewBox already encloses the layout at `z=1`, fit-to-screen is simply the default view `{x:0, y:0, z:1}`.
|
||
|
||
**lineage highlight** `[user-facing, #703]` — the focus+dim layer bound to the Stammbaum side panel: while a person is selected, that person, their full pedigree upward, their full descendant tree downward, and the spouses of all those blood people render at full strength while everyone else is dimmed (opacity, not a hue swap). Connectors dim unless both joined people are active. Computed by the pure traversal in `frontend/src/lib/person/genealogy/layout/highlightLineage.ts`.
|
||
|
||
---
|
||
|
||
## Other Domain Terms
|
||
|
||
**Aktivität / Aktivitäten** `[user-facing]` — the family activity feed accessible at `/aktivitaeten`. Shows recent documents, transcriptions, comments, and Geschichten as a chronological timeline.
|
||
_See also [Chronik](#chronik-internal)._
|
||
|
||
**Chronik** `[internal]` — the conceptual and code-level name for the unified activity feed (per ADR-003 `003-chronik-unified-activity-feed.md`). Used in code, architecture documents, and ADRs. The user-facing label for the same concept is [Aktivität](#aktivitat--aktivitaten-user-facing).
|
||
|
||
**Geschichte** (`Geschichte`) `[user-facing]` — a narrative story or article published in the archive, linking `Person`s and `Document`s. Lifecycle: `DRAFT → PUBLISHED` (see `GeschichteStatus`). DRAFT stories are hidden from users without the `BLOG_WRITE` permission.
|
||
|
||
**Notification** (`Notification`) — an in-app message delivered to an `AppUser`. No email or SMS delivery exists today. Delivered via Server-Sent Events (`SseEmitterRegistry`) and persisted in the `notifications` table.
|
||
|
||
**Audit log** (`AuditLog`, table `audit_log`) — an append-only event store recording domain-level activity (document edits, user actions, etc.). Append-only by application convention; a `REVOKE UPDATE, DELETE` is attempted at the DB layer (see migrations V46, V47) but is a no-op if the application role is the table owner in PostgreSQL. Do not rely on DB-enforced immutability — the constraint is application-layer only.
|
||
|
||
---
|
||
|
||
## Architectural Terms
|
||
|
||
**Cross-cutting** — code that lives in `lib/shared/` (frontend) or cross-domain packages (backend) because it has no entity of its own, no user-facing CRUD, AND is used by two or more domains OR is framework infrastructure (error handling, API client, i18n utilities).
|
||
|
||
**Derived domain** — a Tier-2 frontend domain that has its own UI but no backend entities of its own. Data is computed from Tier-1 domain records. The current derived domain is `activity` (from audit, notifications, document events).
|
||
|
||
**Domain** — a Tier-1 bounded context with its own entities, controller, service, repository, and DTOs. Backend domains: `document`, `person`, `tag`, `user`, `geschichte`, `notification`, `ocr`, `audit`, `dashboard`. Frontend domains mirror this structure under `src/lib/`.
|
||
|
||
---
|
||
|
||
## NL Search Terms
|
||
|
||
**NlSearch** — the natural-language document search feature. Users type a plain-German query (e.g. "Was hat Walter im Krieg an Emma geschrieben?"); the backend parses it via Ollama, resolves person names to database UUIDs, and delegates to the standard `DocumentService.searchDocuments()` path. Endpoint: `POST /api/search/nl`.
|
||
|
||
**NlQueryInterpretation** — the structured result of parsing a natural-language query. Contains: `resolvedPersons` (persons whose names unambiguously matched one DB record), `ambiguousPersons` (all candidates when a name matched more than one person), `keywords` (LLM-extracted search terms), `dateFrom`/`dateTo` (extracted date range), `rawQuery` (the original user input), and `keywordsApplied` (whether keyword FTS was used in the search).
|
||
|
||
**PersonHint** — a lightweight `{id, displayName}` pair used in `NlQueryInterpretation` to describe a resolved or ambiguous person without exposing the full `Person` entity to the frontend.
|
||
|
||
---
|
||
|
||
## Infrastructure Terms
|
||
|
||
**archiv-app** — the bucket-scoped MinIO service account the backend uses to read and write the `familienarchiv` bucket. Distinct from the MinIO root account (`archiv`, used only by the bootstrap container for admin operations). Defined and provisioned in [`infra/minio/bootstrap.sh`](../infra/minio/bootstrap.sh) and consumed by the backend as `S3_ACCESS_KEY` in [`docker-compose.prod.yml`](../docker-compose.prod.yml). The attached `archiv-app-policy` grants `s3:GetObject/PutObject/DeleteObject` on `familienarchiv/*` and `s3:ListBucket/GetBucketLocation` on the bucket only — not the built-in `readwrite` policy which would grant `s3:*` on all buckets.
|
||
_See also [ADR-010 — MinIO stays self-hosted, not Hetzner OBS](./adr/010-minio-self-hosted-not-hetzner-obs.md)._
|
||
|
||
---
|
||
|
||
## Pending Terms
|
||
|
||
_Terms flagged as potentially ambiguous that have not yet been formally defined here. Add an entry above and remove it from this list when resolved._
|
||
|
||
- Terms surfaced by Epic 1 audit findings (#388–#392) — review audit reports under `docs/audits/` when available and add any term flagged as ambiguous.
|
||
- `OcrBatchService` vs `OcrAsyncRunner` — both handle async OCR orchestration; their division of responsibility should be clarified here.
|