familienarchiv/docs/adr/031-document-title-shared-factory-and-save-time-regeneration.md
Marcel cf457cb96f
docs(document): ADR-031 + glossary/c4/api_tests for auto-title sync (#726)
ADR-031 records the shared document-package title factory, the exact-match save-time
regeneration, and the grammar-heuristic one-time backfill (with the ReDoS / no-version-spam
/ file-replace-is-manual decisions). Adds an "auto-generated title" glossary entry, extends
the document-management c4 diagram with DocumentTitleFactory / DocumentTitleBackfillMatcher
and the backfill flows, and documents POST /api/admin/backfill-titles in Admin-Auth.http as
a one-shot ADMIN call hitting port 8080 directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:44:56 +02:00

ADR-031 — The document title is a shared document-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill

Date: 2026-06-04
Status: Accepted
Issue: #726 (auto-sync document titles with date/location: save-time + one-time backfill)
Milestone:


Context

A document title was a string built once, at import time, by a private DocumentImporter.buildTitle() that composed {index} {dateLabel} {location}: the index is originalFilename, the date label is rendered by DocumentTitleFormatter at no more than the row's precision, and the location is taken verbatim. Nothing rebuilt the title afterwards. When an archivist later corrected a date or location in the edit form, the title kept its stale value (e.g. it still read 2028 after the date was fixed to 1928), because the edit form round-trips the stored title verbatim and updateDocument simply re-persisted it.

Two distinct problems live here:

  1. Going forward, an edit to date/location must flow into a title that was machine-built — but must never overwrite a title a human wrote.
  2. The existing backlog of already-stale titles must be cleaned once. For these rows the pre-edit state is gone, so there is no exact value to compare against.

The composition formula also existed only inside importing, which is the wrong owner: a title is a document concern, and three call sites (import, save-time, backfill) must share one rule or they will drift.

Decision

1. One formula, owned by the document package (DocumentTitleFactory)

Extract the composition into DocumentTitleFactory (a @Component in the document package) with build(Document). DocumentImporter (package importing) now consumes it. DocumentTitleFormatter moves into document alongside the factory (it stays package-private; importing reaches the formula only through the factory). The direction is deliberate: document owns the rule, importing depends on it — not the reverse. The German date label remains the deliberate Java/TS dual implementation pinned by docs/date-label-fixtures.json (#666); this ADR touches the composition only and does not collapse the frontend formatDocumentDate.
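
The composition rule can be sketched in plain Java as follows. This is a minimal illustration rather than the project's code: the `TitleParts` record, its field names, the single-space separator, and the rule of skipping blank segments are all assumptions. The real factory is a Spring `@Component` that takes a `Document`, and the date label would come from DocumentTitleFormatter; here it is passed in as an already-formatted string.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Document fields the title depends on. */
record TitleParts(String index, String dateLabel, String location) {}

/** Sketch of one shared composition rule: {index} {dateLabel} {location}. */
final class TitleFactorySketch {

    /** Joins the non-blank parts with single spaces (assumed separator). */
    static String build(TitleParts parts) {
        List<String> segments = new ArrayList<>();
        for (String s : new String[] {parts.index(), parts.dateLabel(), parts.location()}) {
            if (s != null && !s.isBlank()) {
                segments.add(s); // segments are used verbatim, not trimmed
            }
        }
        return String.join(" ", segments);
    }
}
```

Because import, save-time, and backfill would all call the same `build`, a change to the formula lands in one place only.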

2. Save-time regeneration is an EXACT match, not a heuristic

In DocumentService.updateDocument only (bulk edit is out of scope), capture autoTitleBefore = titleFactory.build(doc) from the currently-persisted state before any setter runs. Then:

  • if the submitted title equals autoTitleBefore, it was the machine value → rebuild from the new state;
  • otherwise keep the submitted title verbatim (hand-written or freshly typed).

This is an exact old-vs-new comparison, with no false positives and no false negatives, and it relies on the edit form round-tripping an untouched title verbatim. The new state used for the rebuild, projectedState, mirrors the existing setter asymmetry exactly: documentDate/location overwrite unconditionally (a null clears them), while precision/end/raw are taken from the DTO only when non-null and otherwise kept from the entity. A blank submission is never persisted (the title is always present); it falls back to the rebuilt auto-title, which always carries at least the index.
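
The save-time decision reduces to a small pure function. The sketch below assumes the two auto-titles have already been built (one from the persisted state, one from the projected post-edit state). The class and method names are hypothetical, and the real logic lives inside DocumentService.updateDocument.

```java
/** Sketch of the save-time rule: regenerate only if the submitted title is the old machine value. */
final class SaveTimeTitleSketch {

    /**
     * @param submittedTitle  title sent by the edit form (may be null or blank)
     * @param autoTitleBefore title built from the persisted state before any setter ran
     * @param autoTitleAfter  title built from the projected post-edit state
     */
    static String resolve(String submittedTitle, String autoTitleBefore, String autoTitleAfter) {
        if (submittedTitle == null || submittedTitle.isBlank()) {
            return autoTitleAfter; // a blank title is never persisted
        }
        if (submittedTitle.equals(autoTitleBefore)) {
            return autoTitleAfter; // untouched machine title: follow the edit
        }
        return submittedTitle;     // hand-written or freshly typed: keep verbatim
    }

    /** Setter asymmetry for precision/end/raw: a null DTO value keeps the entity value. */
    static <T> T keepIfNull(T dtoValue, T entityValue) {
        return dtoValue != null ? dtoValue : entityValue;
    }
}
```

For example, `resolve("A 2028 Berlin", "A 2028 Berlin", "A 1928 Berlin")` follows the corrected date, while a submitted "Brief an Oma" is returned unchanged.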

3. The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint

POST /api/admin/backfill-titles (synchronous, under AdminController's class-level @RequirePermission(Permission.ADMIN)) sweeps every document and, for each whose stored title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is gone, the test (DocumentTitleBackfillMatcher, used only here) is a grammar heuristic: the title must either be exactly the index, or, after stripping the literal index prefix, leave a remainder that is a known date-label form (optionally followed by a trailing location) or a lone segment equal to the document's current location. Prose is left untouched; anything malformed fails closed.
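
The overwrite test could look roughly like the sketch below. The date-label grammar shown (a four-digit year, optionally preceded by `d.M.`) is a placeholder assumption and not the project's actual set of label forms, which are pinned by docs/date-label-fixtures.json. The class and method names are hypothetical as well.

```java
import java.util.regex.Pattern;

/** Sketch of a fail-closed grammar test for "was this title machine-built?". */
final class BackfillMatcherSketch {

    // Placeholder grammar (assumption): "1928" or "12.3.1928", then an optional
    // trailing segment. All quantifiers are bounded and non-nested.
    private static final Pattern DATE_LABEL_THEN_OPTIONAL_LOCATION =
            Pattern.compile("(?:\\d{1,2}\\.\\d{1,2}\\.)?\\d{4}(?: .{1,200})?");

    static boolean isMachineBuilt(String title, String index, String currentLocation) {
        if (title == null || index == null || index.isEmpty()) {
            return false;                       // malformed input fails closed
        }
        if (title.equals(index)) {
            return true;                        // title is exactly the index
        }
        if (!title.startsWith(index + " ")) {
            return false;                       // literal prefix check, no regex on the index
        }
        String remainder = title.substring(index.length() + 1);
        if (currentLocation != null && !currentLocation.isBlank()
                && remainder.equals(currentLocation)) {
            return true;                        // lone segment equal to the current location
        }
        return DATE_LABEL_THEN_OPTIONAL_LOCATION.matcher(remainder).matches();
    }
}
```

With this placeholder grammar, a title such as "a(b+).pdf Brief an Oma" is treated as prose and left alone, while "a(b+).pdf 1928 Berlin" qualifies for a rebuild.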

The backfill saves via documentRepository.save directly and never routes through updateDocument — following the backfillFileHashes precedent — so a mechanical rename does not snapshot the whole corpus into document_versions. It is idempotent (a second run rewrites nothing) and logs one SLF4J-parameterized scanned/updated/skipped line; the response is BackfillResult(count).
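
The idempotency property can be illustrated with an in-memory sweep. Everything here is a simplified assumption: `Doc` and `SweepCounts` are hypothetical types, the list stands in for the repository, the direct field assignment stands in for documentRepository.save, and counting an already-in-sync title as "skipped" is a guess at how the counters are defined.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical minimal document: only the fields this sketch needs. */
final class Doc {
    final String index;
    final String dateLabel;
    String title;

    Doc(String index, String dateLabel, String title) {
        this.index = index;
        this.dateLabel = dateLabel;
        this.title = title;
    }
}

/** Hypothetical counters mirroring the scanned/updated/skipped log line. */
record SweepCounts(int scanned, int updated, int skipped) {}

final class BackfillSweepSketch {

    static SweepCounts run(List<Doc> docs, Predicate<Doc> mayOverwrite, Function<Doc, String> build) {
        int updated = 0;
        int skipped = 0;
        for (Doc d : docs) {
            if (!mayOverwrite.test(d)) {
                skipped++;                 // prose or malformed: leave untouched
                continue;
            }
            String rebuilt = build.apply(d);
            if (rebuilt.equals(d.title)) {
                skipped++;                 // already in sync: nothing to write
                continue;
            }
            d.title = rebuilt;             // stands in for a direct repository save
            updated++;
        }
        return new SweepCounts(docs.size(), updated, skipped);
    }
}
```

Running the sweep twice over the same list updates rows only on the first pass; the second pass finds every rebuildable title already equal to its rebuilt value.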

4. Edit-form feedback (FR-005)

A localized helper line (de/en/es) under the title input explains that the title is built from date/place and that a hand-edit is preserved, wired via aria-describedby and shown only on the single-document edit form. A live preview was considered and declined.

Consequences

  • The three call sites can never diverge — there is exactly one formula (NFR-MAINT-001). Save-time cost is a string build + compare; the backfill is one synchronous transactional sweep over a low-thousands corpus.
  • Security: the index is compared literally (String.startsWith / Pattern.quote) because originalFilename is user-controlled and may carry regex metacharacters — an unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The date-label sub-patterns use only bounded, non-nested quantifiers.
  • File-replaced documents are treated as manual, by design. The index is originalFilename, which updateDocument reassigns to the uploaded file's name on a file-replace. After a replace the stored title no longer matches build(currentState), so neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading originalFilename rather than adding a dedicated catalogIndex column.
  • Save-time carries no misclassification risk (exact match); the backfill heuristic, by its documented FR-004 rule, treats {index} {valid date label} {anything} as machine-built and so can rewrite the trailing segment. This is the accepted trade-off for cleaning the backlog when the pre-edit state has been lost.
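
The security point about literal comparison can be made concrete with a short sketch. The class and method names are hypothetical; `Pattern.quote` and `String.startsWith` are standard JDK calls. The example filename is invented for illustration.

```java
import java.util.regex.Pattern;

/** Sketch: why a user-controlled index must never be spliced into a regex unquoted. */
final class LiteralIndexSketch {

    /** Unsafe: the index is interpreted as a regular expression. */
    static boolean startsWithIndexUnquoted(String title, String index) {
        return Pattern.compile("^" + index).matcher(title).find();
    }

    /** Safe: Pattern.quote turns the index into a literal. */
    static boolean startsWithIndexQuoted(String title, String index) {
        return Pattern.compile("^" + Pattern.quote(index)).matcher(title).find();
    }

    /** Simplest safe option when only a prefix check is needed. */
    static boolean startsWithIndexLiteral(String title, String index) {
        return title.startsWith(index);
    }
}
```

With the index `scan(1).pdf`, the unquoted variant fails to match the title "scan(1).pdf 1928" because the parentheses become a capture group, yet it does match the unrelated "scan1Xpdf". The quoted and literal variants behave as intended in both cases.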

Alternatives considered

  • A dedicated catalogIndex column instead of overloading originalFilename — rejected; it adds a migration and a second source of truth for the index for no current benefit, and the file-replace fail-safe is acceptable.
  • A heuristic at save-time too (instead of the exact match) — rejected; the stored title is available pre-edit, so an exact comparison is strictly better (no false positives).
  • A live title preview in the edit form — rejected (FR-005); a static helper line is calmer for the 60+ audience and avoids a second client-side mirror of the formula.
  • Collapsing the frontend formatDocumentDate into the backend — out of scope; the Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture.