diff --git a/backend/api_tests/Admin-Auth.http b/backend/api_tests/Admin-Auth.http index 7bf2f8de..1beb6aeb 100644 --- a/backend/api_tests/Admin-Auth.http +++ b/backend/api_tests/Admin-Auth.http @@ -28,4 +28,18 @@ Authorization: Basic Gast_User gast ###Groups #GET GET http://localhost:8080/api/admin/tags -Authorization: Basic admin admin123 \ No newline at end of file +Authorization: Basic admin admin123 + +### One-time backfill: re-sync already-stale auto-titles (#726) +# RUNBOOK: a one-shot ADMIN maintenance call, NOT part of normal operation. Run it ONCE +# after deploying #726 to clean the existing backlog of stale titles (e.g. a title still +# showing "2028" after the date was corrected to "1928"). It is synchronous and idempotent +# — a second run returns {"count": 0} and writes nothing. Hit the backend DIRECTLY on +# port 8080 (NOT through the SvelteKit proxy) so the sweep can't trip the proxy timeout. +# Returns {"count": }. +POST http://localhost:8080/api/admin/backfill-titles +Authorization: Basic admin admin123 + +### NEGATIVE TEST: a non-admin must NOT be able to trigger the backfill -> 403 Forbidden +POST http://localhost:8080/api/admin/backfill-titles +Authorization: Basic Gast_User gast \ No newline at end of file diff --git a/docs/GLOSSARY.md b/docs/GLOSSARY.md index 200ecabb..8bb508ab 100644 --- a/docs/GLOSSARY.md +++ b/docs/GLOSSARY.md @@ -45,6 +45,9 @@ _See also [TranscriptionBlock](#transcriptionblock-transcriptionblock)._ **raw attribution** (`Document.senderText`, `Document.receiverText`, `Document.metaDateRaw`) — the original spreadsheet cell text for a document's sender, receiver, and date, preserved verbatim even after a `Person` or normalized date is linked. It keeps provenance intact and enables an "as written in the original" view. 
+**auto-generated title** (`DocumentTitleFactory`) — a `Document` title composed by the formula `{index} – {dateLabel} – {location}` (index = `originalFilename`; date label honest at the row's precision; location omitted when blank). On edit, an unchanged auto-title follows a corrected date/location forward (exact old-vs-new match in `DocumentService.updateDocument`); a hand-written title is kept verbatim. `POST /api/admin/backfill-titles` rewrites already-stale ones in one sweep using a grammar heuristic (`DocumentTitleBackfillMatcher`). +_Not to be confused with a hand-written title_ — only a title that still equals what the factory builds is treated as machine-generated and rewritten; prose is left untouched. + **DocumentVersion** (`DocumentVersion`) — an append-only snapshot of a `Document`'s metadata at a point in time. Append-only by convention; no consumer-facing create or update endpoint exists. The entity uses Lombok `@Data` (which generates setters), so immutability is enforced by application convention, not at the Java level. **Tag** (`Tag`) — a hierarchical category that can be applied to `Document`s. Tags are self-referencing via a `parent_id` foreign key, forming a tree structure. 
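Reviewer note (not part of the patch): the glossary entry above describes the composition `{index} – {dateLabel} – {location}` with the location omitted when blank. A minimal sketch of that rule follows, assuming a hypothetical `AutoTitleSketch` class that takes plain strings; the real `DocumentTitleFactory.build(Document)` takes a `Document` and is not shown in this diff. Skipping a blank date label is also an assumption here, inferred from the ADR's remark that the auto-title always carries at least the index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for DocumentTitleFactory; names and signature are illustrative only. */
final class AutoTitleSketch {

    // " – " with an en dash (U+2013), as in the glossary formula.
    private static final String SEPARATOR = " \u2013 ";

    /** Joins index, date label and location; blank optional segments are skipped. */
    static String build(String index, String dateLabel, String location) {
        List<String> parts = new ArrayList<>();
        parts.add(index); // the index (originalFilename) is always present
        if (dateLabel != null && !dateLabel.isBlank()) {
            parts.add(dateLabel); // ASSUMPTION: a missing date label is simply left out
        }
        if (location != null && !location.isBlank()) {
            parts.add(location); // location is kept verbatim, omitted only when blank
        }
        return String.join(SEPARATOR, parts);
    }
}
```

With the hypothetical index `B-0042`, `build("B-0042", "Juni 1916", "Berlin")` joins all three segments, while a blank location leaves only the index and the date label.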
diff --git a/docs/adr/031-document-title-shared-factory-and-save-time-regeneration.md b/docs/adr/031-document-title-shared-factory-and-save-time-regeneration.md new file mode 100644 index 00000000..e3e97392 --- /dev/null +++ b/docs/adr/031-document-title-shared-factory-and-save-time-regeneration.md @@ -0,0 +1,112 @@ +# ADR-031 — The document title is a shared `document`-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill + +**Date:** 2026-06-04 +**Status:** Accepted +**Issue:** #726 (auto-sync document titles with date/location: save-time + one-time backfill) +**Milestone:** — + +--- + +## Context + +A document title was a string built **once**, at import time, by a private +`DocumentImporter.buildTitle()` composing `{index} – {dateLabel} – {location}` (index = +`originalFilename`, date label honest at the row's precision via `DocumentTitleFormatter`, +location verbatim). Nothing rebuilt it afterwards. When an archivist later corrected a date +or location in the edit form, the title kept its stale value (e.g. it still read `2028` +after the date was fixed to `1928`), because the edit form round-trips the stored title +verbatim and `updateDocument` simply re-persisted it. + +Two distinct problems live here: + +1. **Going forward**, an edit to date/location must flow into a title that was machine-built + — but must never overwrite a title a human wrote. +2. **The existing backlog** of already-stale titles must be cleaned once. For these rows the + pre-edit state is gone, so there is no exact value to compare against. + +The composition formula also existed only inside `importing`, which is the wrong owner: a +title is a `document` concern, and three call sites (import, save-time, backfill) must share +one rule or they will drift. + +## Decision + +### 1. 
One formula, owned by the `document` package (`DocumentTitleFactory`) + +Extract the composition into `DocumentTitleFactory` (a `@Component` in the `document` +package) with `build(Document)`. `DocumentImporter` (package `importing`) now consumes it. +`DocumentTitleFormatter` moves into `document` alongside the factory (it stays +package-private; `importing` reaches the formula only through the factory). The direction is +deliberate: `document` owns the rule, `importing` depends on it — not the reverse. The +German date *label* remains the deliberate Java/TS dual implementation pinned by +`docs/date-label-fixtures.json` (#666); this ADR touches the **composition** only and does +not collapse the frontend `formatDocumentDate`. + +### 2. Save-time regeneration is an EXACT match, not a heuristic + +In `DocumentService.updateDocument` only (bulk edit is out of scope), capture +`autoTitleBefore = titleFactory.build(doc)` from the **currently-persisted** state *before* +any setter runs. Then: + +- if the **submitted** title equals `autoTitleBefore`, it was the machine value → rebuild + from the new state; +- otherwise keep the submitted title verbatim (hand-written or freshly typed). + +This is an exact old-vs-new comparison — no false positives, no false negatives — relying on +the edit form round-tripping an untouched title verbatim. `projectedState` mirrors the +existing setter asymmetry exactly: `documentDate`/`location` overwrite unconditionally (a +null clears them), while precision/end/raw are taken from the DTO only when non-null and +otherwise kept from the entity. A blank submission is never persisted (the title is always +present) — it falls back to the rebuilt auto-title, which always carries at least the index. + +### 3. 
The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint + +`POST /api/admin/backfill-titles` (synchronous, under `AdminController`'s class-level +`@RequirePermission(Permission.ADMIN)`) sweeps every document and, for each whose stored +title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is +gone, the test (`DocumentTitleBackfillMatcher`, used **only** here) is a grammar heuristic: +after stripping the **literal** index prefix, the remainder must be exactly the index, a +known date-label form (+ an optional trailing location), or a lone segment equal to the +document's current location. Prose is left untouched; anything malformed fails closed. + +The backfill saves via `documentRepository.save` directly and **never** routes through +`updateDocument` — following the `backfillFileHashes` precedent — so a mechanical rename does +not snapshot the whole corpus into `document_versions`. It is idempotent (a second run +rewrites nothing) and logs one SLF4J-parameterized `scanned/updated/skipped` line; the +response is `BackfillResult(count)`. + +### 4. Edit-form feedback (FR-005) + +A localized helper line (de/en/es) under the title input explains that the title is built +from date/place and that a hand-edit is preserved, wired via `aria-describedby` and shown +only on the single-document edit form. A live preview was considered and declined. + +## Consequences + +- The three call sites can never diverge — there is exactly one formula + (`NFR-MAINT-001`). Save-time cost is a string build + compare; the backfill is one + synchronous transactional sweep over a low-thousands corpus. +- Security: the index is compared **literally** (`String.startsWith` / `Pattern.quote`) + because `originalFilename` is user-controlled and may carry regex metacharacters — an + unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The + date-label sub-patterns use only bounded, non-nested quantifiers. 
+- **File-replaced documents are treated as manual, by design.** The index is + `originalFilename`, which `updateDocument` reassigns to the uploaded file's name on a + file-replace. After a replace the stored title no longer matches `build(currentState)`, so + neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading + `originalFilename` rather than adding a dedicated `catalogIndex` column. +- The save-time heuristic risk is zero (exact match); the backfill heuristic can, by its + documented FR-004 rule, treat `{index} – {valid date label} – {anything}` as machine-built + and rewrite the trailing segment. This is the accepted trade for cleaning the backlog + without the lost pre-edit state. + +## Alternatives considered + +- **A dedicated `catalogIndex` column** instead of overloading `originalFilename` — rejected; + it adds a migration and a second source of truth for the index for no current benefit, and + the file-replace fail-safe is acceptable. +- **A heuristic at save-time too** (instead of the exact match) — rejected; the stored title + is available pre-edit, so an exact comparison is strictly better (no false positives). +- **A live title preview in the edit form** — rejected (FR-005); a static helper line is + calmer for the 60+ audience and avoids a second client-side mirror of the formula. +- **Collapsing the frontend `formatDocumentDate` into the backend** — out of scope; the + Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture. 
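Reviewer note (not part of the patch): the save-time rule in Decision 2 of the ADR above can be read as a three-way choice on the submitted title. The sketch below is one way to express it, assuming a hypothetical helper `resolveTitle` that receives the two auto-titles as already-built strings; the actual logic lives in `DocumentService.updateDocument` and may be structured differently.

```java
/** Hypothetical illustration of the exact old-vs-new match; not the real service code. */
final class SaveTimeTitleSketch {

    /**
     * @param submittedTitle  title sent by the edit form
     * @param autoTitleBefore factory output for the persisted state, captured before any setter runs
     * @param autoTitleAfter  factory output for the state after the edit is applied
     */
    static String resolveTitle(String submittedTitle, String autoTitleBefore, String autoTitleAfter) {
        if (submittedTitle == null || submittedTitle.isBlank()) {
            return autoTitleAfter; // a blank title is never persisted; fall back to the rebuilt one
        }
        if (submittedTitle.equals(autoTitleBefore)) {
            return autoTitleAfter; // untouched machine title: follow the corrected date/location
        }
        return submittedTitle; // hand-written or freshly typed: keep verbatim
    }
}
```

For the ADR's own example, a title still reading `2028` that equals `autoTitleBefore` would come back as the rebuilt `1928` title, whereas any prose title is returned unchanged.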
diff --git a/docs/architecture/c4/l3-backend-3b-document-management.puml b/docs/architecture/c4/l3-backend-3b-document-management.puml index 65049e7e..4db3c41f 100644 --- a/docs/architecture/c4/l3-backend-3b-document-management.puml +++ b/docs/architecture/c4/l3-backend-3b-document-management.puml @@ -9,15 +9,17 @@ ContainerDb(minio, "Object Storage", "MinIO (S3-compatible)") System_Boundary(backend, "API Backend (Spring Boot)") { Component(docCtrl, "DocumentController", "Spring MVC — /api/documents", "CRUD for documents: search, get by ID, update metadata, upload/download file, batch metadata updates, and per-month density aggregation for the timeline filter widget.") - Component(adminCtrl, "AdminController", "Spring MVC — /api/admin", "Triggers the asynchronous canonical import (requires ADMIN permission). Reports import state (IDLE/RUNNING/DONE/FAILED).") - Component(docSvc, "DocumentService", "Spring Service", "Core document business logic: store, update, search. Resolves persons and tags, delegates file I/O to FileService, builds dynamic JPA Specifications, and integrates with audit logging.") + Component(adminCtrl, "AdminController", "Spring MVC — /api/admin", "Triggers the asynchronous canonical import (requires ADMIN permission). Reports import state (IDLE/RUNNING/DONE/FAILED). Hosts the one-shot maintenance backfills (versions, file-hashes, titles) — synchronous, ADMIN-only.") + Component(docSvc, "DocumentService", "Spring Service", "Core document business logic: store, update, search. On update, regenerates an unchanged auto-title from the new date/location (exact old-vs-new match, #726); exposes backfillTitles() to clean already-stale titles in one sweep. Resolves persons and tags, delegates file I/O to FileService, builds dynamic JPA Specifications, and integrates with audit logging.") Component(fileSvc, "FileService", "Spring Service", "Wraps AWS SDK v2 S3Client. 
Uploads files with UUID-keyed paths, computes SHA-256 hash, downloads with content-type detection, and generates presigned URLs for OCR access.") Component(importOrch, "CanonicalImportOrchestrator", "Spring Service — @Async", "Runs the four canonical loaders in an explicit dependency DAG (TagTree → PersonRegister → PersonTree → Document). Smoke-checks all four artifacts before starting, owns the IDLE/RUNNING/DONE/FAILED state machine, fails closed on a malformed artifact.") Component(tagTreeLoader, "TagTreeImporter", "Spring Component", "Upserts the tag hierarchy from canonical-tag-tree.xlsx via TagService (by canonical tag_path).") Component(personRegLoader, "PersonRegisterImporter", "Spring Component", "Upserts register persons from canonical-persons.xlsx via PersonService (by normalizer person_id).") Component(personTreeLoader, "PersonTreeImporter", "Spring Component", "Upserts tree persons + relationships from canonical-persons-tree.json via PersonService and RelationshipService.") - Component(docLoader, "DocumentImporter", "Spring Component", "Loads canonical-documents.xlsx: routes attribution register-first (raw cell always retained in sender_text/receiver_text), parses clean dates, builds an honest precision-aware title via DocumentTitleFormatter, keeps the S3 upload + thumbnail plumbing, and resolves each PDF by index (importDir/.pdf) guarded by strict index validation + canonical-path containment + %PDF magic-byte check (no recursive walk).") - Component(titleFmt, "DocumentTitleFormatter", "Pure helper", "Formats the date label baked into an import title at exactly the data's precision (MONTH -> 'Juni 1916', never a fabricated day). 
Mirrors the frontend formatDocumentDate; both are pinned to docs/date-label-fixtures.json (#666).") + Component(docLoader, "DocumentImporter", "Spring Component", "Loads canonical-documents.xlsx: routes attribution register-first (raw cell always retained in sender_text/receiver_text), parses clean dates, builds the title via DocumentTitleFactory, keeps the S3 upload + thumbnail plumbing, and resolves each PDF by index (importDir/.pdf) guarded by strict index validation + canonical-path containment + %PDF magic-byte check (no recursive walk).") + Component(titleFactory, "DocumentTitleFactory", "Spring Component", "Single source of truth for the auto-title {index} – {dateLabel} – {location} (#726). The document package owns this formula; importer, save-time regeneration, and the backfill all build through it so they never diverge.") + Component(titleFmt, "DocumentTitleFormatter", "Pure helper (document pkg)", "Formats the date label at exactly the data's precision (MONTH -> 'Juni 1916', never a fabricated day). Mirrors the frontend formatDocumentDate; both are pinned to docs/date-label-fixtures.json (#666).") + Component(titleMatcher, "DocumentTitleBackfillMatcher", "Pure helper", "Backfill-only heuristic deciding whether a STORED title is machine-generated (overwritable) vs hand-written prose. Index matched literally (no regex injection / ReDoS); fail-closed.") Component(sheetReader, "CanonicalSheetReader", "POI helper", "Maps a canonical .xlsx by header name (no positional indices), splits pipe-delimited list columns, fails closed (IMPORT_ARTIFACT_INVALID) on a missing required header.") Component(minioConf, "MinioConfig", "Spring @Configuration", "Creates the S3Client and S3Presigner beans with path-style access for MinIO. 
Validates MinIO connectivity on startup.") Component(docRepo, "DocumentRepository", "Spring Data JPA", "Queries documents with Specification-based dynamic search, full-text search with ranking and match highlighting, and transcription pipeline queue projections.") @@ -44,7 +46,11 @@ Rel(importOrch, docLoader, "4. Loads documents") Rel(tagTreeLoader, sheetReader, "Reads canonical .xlsx") Rel(personRegLoader, sheetReader, "Reads canonical .xlsx") Rel(docLoader, sheetReader, "Reads canonical .xlsx") -Rel(docLoader, titleFmt, "Builds honest title date") +Rel(docLoader, titleFactory, "Builds the auto-title") +Rel(docSvc, titleFactory, "Regenerates auto-title (save-time + backfill)") +Rel(docSvc, titleMatcher, "Backfill overwrite test") +Rel(titleFactory, titleFmt, "Formats the honest date label") +Rel(adminCtrl, docSvc, "backfillTitles() / backfillFileHashes()") Rel(tagTreeLoader, tagSvc, "Upserts tags by source_ref") Rel(personRegLoader, personSvc, "Upserts persons by source_ref") Rel(personTreeLoader, personSvc, "Upserts persons by source_ref")
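Reviewer note (not part of the patch): the diagram describes `DocumentTitleBackfillMatcher` as matching the index literally and failing closed. The sketch below illustrates that idea only. The class name, the method signature, and above all the date-label grammar are assumptions: a bare four-digit year stands in for the real set of label forms, which this diff does not show.

```java
import java.util.regex.Pattern;

/** Hypothetical, simplified take on the backfill overwrite test; not the real matcher. */
final class BackfillMatcherSketch {

    private static final String SEP = " \u2013 "; // " – " with an en dash

    // ASSUMPTION: placeholder grammar. Bounded quantifier, no nesting.
    private static final Pattern YEAR_LABEL = Pattern.compile("\\d{4}");

    static boolean looksMachineBuilt(String storedTitle, String index, String currentLocation) {
        if (storedTitle == null || index == null || index.isEmpty()) {
            return false; // fail closed on missing input
        }
        // Literal comparison: the index is user-controlled and never compiled as a regex.
        if (!storedTitle.startsWith(index)) {
            return false;
        }
        String rest = storedTitle.substring(index.length());
        if (rest.isEmpty()) {
            return true; // the title is just the index
        }
        if (!rest.startsWith(SEP)) {
            return false;
        }
        // Split into at most two segments: date label, then an optional trailing location.
        String[] segments = rest.substring(SEP.length()).split(Pattern.quote(SEP), 2);
        if (segments.length == 2 && segments[1].isBlank()) {
            return false; // dangling separator: malformed, fail closed
        }
        if (segments.length == 1 && segments[0].equals(currentLocation)) {
            return true; // lone segment equal to the document's current location
        }
        return YEAR_LABEL.matcher(segments[0]).matches();
    }
}
```

The index `a+b(1).pdf` in the checks is a made-up filename chosen because it contains regex metacharacters: compiled as a pattern it would also match `aab1.pdf`, which the literal `startsWith` comparison rejects.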