Files
familienarchiv/docs/adr/031-document-title-shared-factory-and-save-time-regeneration.md
Marcel cf457cb96f
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m32s
CI / OCR Service Tests (pull_request) Successful in 26s
CI / Backend Unit Tests (pull_request) Successful in 3m35s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m6s
docs(document): ADR-031 + glossary/c4/api_tests for auto-title sync (#726)
ADR-031 records the shared document-package title factory, the exact-match save-time
regeneration, and the grammar-heuristic one-time backfill (with the ReDoS / no-version-spam
/ file-replace-is-manual decisions). Adds an "auto-generated title" glossary entry, extends
the document-management c4 diagram with DocumentTitleFactory / DocumentTitleBackfillMatcher
and the backfill flows, and documents POST /api/admin/backfill-titles in Admin-Auth.http as
a one-shot ADMIN call hitting port 8080 directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:44:56 +02:00

113 lines
6.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-031 — The document title is a shared `document`-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill
**Date:** 2026-06-04
**Status:** Accepted
**Issue:** #726 (auto-sync document titles with date/location: save-time + one-time backfill)
**Milestone:**
---
## Context
A document title was a string built **once**, at import time, by a private
`DocumentImporter.buildTitle()` composing `{index} {dateLabel} {location}` (index =
`originalFilename`, date label honest at the row's precision via `DocumentTitleFormatter`,
location verbatim). Nothing rebuilt it afterwards. When an archivist later corrected a date
or location in the edit form, the title kept its stale value (e.g. it still read `2028`
after the date was fixed to `1928`), because the edit form round-trips the stored title
verbatim and `updateDocument` simply re-persisted it.
Two distinct problems live here:
1. **Going forward**, an edit to date/location must flow into a title that was machine-built
— but must never overwrite a title a human wrote.
2. **The existing backlog** of already-stale titles must be cleaned once. For these rows the
pre-edit state is gone, so there is no exact value to compare against.
The composition formula also existed only inside `importing`, which is the wrong owner: a
title is a `document` concern, and three call sites (import, save-time, backfill) must share
one rule or they will drift.
## Decision
### 1. One formula, owned by the `document` package (`DocumentTitleFactory`)
Extract the composition into `DocumentTitleFactory` (a `@Component` in the `document`
package) with `build(Document)`. `DocumentImporter` (package `importing`) now consumes it.
`DocumentTitleFormatter` moves into `document` alongside the factory (it stays
package-private; `importing` reaches the formula only through the factory). The direction is
deliberate: `document` owns the rule, `importing` depends on it — not the reverse. The
German date *label* remains the deliberate Java/TS dual implementation pinned by
`docs/date-label-fixtures.json` (#666); this ADR touches the **composition** only and does
not collapse the frontend `formatDocumentDate`.
### 2. Save-time regeneration is an EXACT match, not a heuristic
In `DocumentService.updateDocument` only (bulk edit is out of scope), capture
`autoTitleBefore = titleFactory.build(doc)` from the **currently-persisted** state *before*
any setter runs. Then:
- if the **submitted** title equals `autoTitleBefore`, it was the machine value → rebuild
from the new state;
- otherwise keep the submitted title verbatim (hand-written or freshly typed).
This is an exact old-vs-new comparison — no false positives, no false negatives — relying on
the edit form round-tripping an untouched title verbatim. `projectedState` mirrors the
existing setter asymmetry exactly: `documentDate`/`location` overwrite unconditionally (a
null clears them), while precision/end/raw are taken from the DTO only when non-null and
otherwise kept from the entity. A blank submission is never persisted (the title is always
present) — it falls back to the rebuilt auto-title, which always carries at least the index.
### 3. The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint
`POST /api/admin/backfill-titles` (synchronous, under `AdminController`'s class-level
`@RequirePermission(Permission.ADMIN)`) sweeps every document and, for each whose stored
title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is
gone, the test (`DocumentTitleBackfillMatcher`, used **only** here) is a grammar heuristic:
after stripping the **literal** index prefix, the remainder must be exactly the index, a
known date-label form (+ an optional trailing location), or a lone segment equal to the
document's current location. Prose is left untouched; anything malformed fails closed.
The backfill saves via `documentRepository.save` directly and **never** routes through
`updateDocument` — following the `backfillFileHashes` precedent — so a mechanical rename does
not snapshot the whole corpus into `document_versions`. It is idempotent (a second run
rewrites nothing) and logs one SLF4J-parameterized `scanned/updated/skipped` line; the
response is `BackfillResult(count)`.
### 4. Edit-form feedback (FR-005)
A localized helper line (de/en/es) under the title input explains that the title is built
from date/place and that a hand-edit is preserved, wired via `aria-describedby` and shown
only on the single-document edit form. A live preview was considered and declined.
## Consequences
- The three call sites can never diverge — there is exactly one formula
(`NFR-MAINT-001`). Save-time cost is a string build + compare; the backfill is one
synchronous transactional sweep over a low-thousands corpus.
- Security: the index is compared **literally** (`String.startsWith` / `Pattern.quote`)
because `originalFilename` is user-controlled and may carry regex metacharacters — an
unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The
date-label sub-patterns use only bounded, non-nested quantifiers.
- **File-replaced documents are treated as manual, by design.** The index is
`originalFilename`, which `updateDocument` reassigns to the uploaded file's name on a
file-replace. After a replace the stored title no longer matches `build(currentState)`, so
neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading
`originalFilename` rather than adding a dedicated `catalogIndex` column.
- The save-time heuristic risk is zero (exact match); the backfill heuristic can, by its
documented FR-004 rule, treat `{index} {valid date label} {anything}` as machine-built
and rewrite the trailing segment. This is the accepted trade for cleaning the backlog
without the lost pre-edit state.
## Alternatives considered
- **A dedicated `catalogIndex` column** instead of overloading `originalFilename` — rejected;
it adds a migration and a second source of truth for the index for no current benefit, and
the file-replace fail-safe is acceptable.
- **A heuristic at save-time too** (instead of the exact match) — rejected; the stored title
is available pre-edit, so an exact comparison is strictly better (no false positives).
- **A live title preview in the edit form** — rejected (FR-005); a static helper line is
calmer for the 60+ audience and avoids a second client-side mirror of the formula.
- **Collapsing the frontend `formatDocumentDate` into the backend** — out of scope; the
Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture.