ADR-031 records the shared document-package title factory, the exact-match save-time regeneration, and the grammar-heuristic one-time backfill (with the ReDoS / no-version-spam / file-replace-is-manual decisions). Adds an "auto-generated title" glossary entry, extends the document-management c4 diagram with DocumentTitleFactory / DocumentTitleBackfillMatcher and the backfill flows, and documents POST /api/admin/backfill-titles in Admin-Auth.http as a one-shot ADMIN call hitting port 8080 directly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-031 — The document title is a shared document-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill
Date: 2026-06-04
Status: Accepted
Issue: #726 (auto-sync document titles with date/location: save-time + one-time backfill)
Milestone: —
Context
A document title was a string built once, at import time, by a private
DocumentImporter.buildTitle() composing {index} – {dateLabel} – {location} (index =
originalFilename, date label honest at the row's precision via DocumentTitleFormatter,
location verbatim). Nothing rebuilt it afterwards. When an archivist later corrected a date
or location in the edit form, the title kept its stale value (e.g. it still read 2028
after the date was fixed to 1928), because the edit form round-trips the stored title
verbatim and updateDocument simply re-persisted it.
Two distinct problems live here:
- Going forward, an edit to date/location must flow into a title that was machine-built — but must never overwrite a title a human wrote.
- The existing backlog of already-stale titles must be cleaned once. For these rows the pre-edit state is gone, so there is no exact value to compare against.
The composition formula also existed only inside importing, which is the wrong owner: a
title is a document concern, and three call sites (import, save-time, backfill) must share
one rule or they will drift.
Decision
1. One formula, owned by the document package (DocumentTitleFactory)
Extract the composition into DocumentTitleFactory (a @Component in the document
package) with build(Document). DocumentImporter (package importing) now consumes it.
DocumentTitleFormatter moves into document alongside the factory (it stays
package-private; importing reaches the formula only through the factory). The direction is
deliberate: document owns the rule, importing depends on it — not the reverse. The
German date label remains the deliberate Java/TS dual implementation pinned by
docs/date-label-fixtures.json (#666); this ADR touches the composition only and does
not collapse the frontend formatDocumentDate.
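A minimal sketch of the composition rule may help make the shared formula concrete. It is not the project's DocumentTitleFactory: the class name, the String-based signature, and the choice to skip blank optional parts are assumptions made for illustration, and the date label is taken as an already-formatted string rather than produced by DocumentTitleFormatter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the shared composition rule; not the project's actual class. */
final class TitleFactorySketch {
    // Separator of "{index} – {dateLabel} – {location}": an en dash surrounded by spaces.
    static final String SEP = " \u2013 ";

    /**
     * Joins index, date label and location into one title.
     * ASSUMPTION: blank or null optional parts are skipped, so the result always
     * carries at least the index. The location is appended verbatim.
     */
    static String build(String index, String dateLabel, String location) {
        List<String> parts = new ArrayList<>();
        parts.add(index);
        if (dateLabel != null && !dateLabel.isBlank()) {
            parts.add(dateLabel);
        }
        if (location != null && !location.isBlank()) {
            parts.add(location);
        }
        return String.join(SEP, parts);
    }

    public static void main(String[] args) {
        System.out.println(build("A-017.pdf", "1928", "Berlin"));
        System.out.println(build("A-017.pdf", null, null));
    }
}
```

Because import, save-time regeneration and the backfill would all call the same build method, a change to the separator or the part order would reach all three call sites at once.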
2. Save-time regeneration is an EXACT match, not a heuristic
In DocumentService.updateDocument only (bulk edit is out of scope), capture
autoTitleBefore = titleFactory.build(doc) from the currently-persisted state before
any setter runs. Then:
- if the submitted title equals autoTitleBefore, it was the machine value → rebuild from the new state;
- otherwise keep the submitted title verbatim (hand-written or freshly typed).
This is an exact old-vs-new comparison — no false positives, no false negatives — relying on
the edit form round-tripping an untouched title verbatim. projectedState mirrors the
existing setter asymmetry exactly: documentDate/location overwrite unconditionally (a
null clears them), while precision/end/raw are taken from the DTO only when non-null and
otherwise kept from the entity. A blank submission is never persisted (the title is always
present) — it falls back to the rebuilt auto-title, which always carries at least the index.
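The save-time rule described above can be sketched as a small pure function. The names resolveTitle and keepIfNull are hypothetical, and the two auto-titles are passed in as plain strings instead of being built from the entity; the sketch only illustrates the decision order (blank fallback, exact match, otherwise verbatim) and the "take from the DTO only when non-null" projection.

```java
/** Hypothetical sketch of the save-time decision; not the actual DocumentService code. */
final class SaveTimeTitleSketch {

    /**
     * @param submitted       title sent back by the edit form
     * @param autoTitleBefore factory output for the persisted state, captured before any setter runs
     * @param autoTitleAfter  factory output for the projected post-edit state
     */
    static String resolveTitle(String submitted, String autoTitleBefore, String autoTitleAfter) {
        if (submitted == null || submitted.isBlank()) {
            return autoTitleAfter;   // a blank title is never persisted
        }
        if (submitted.equals(autoTitleBefore)) {
            return autoTitleAfter;   // untouched machine title: rebuild from the new state
        }
        return submitted;            // hand-written or freshly typed: keep verbatim
    }

    /** Projection rule for precision/end/raw: the DTO value wins only when it is non-null. */
    static <T> T keepIfNull(T fromDto, T fromEntity) {
        return fromDto != null ? fromDto : fromEntity;
    }

    public static void main(String[] args) {
        String before = "A-017.pdf \u2013 2028 \u2013 Berlin";
        String after = "A-017.pdf \u2013 1928 \u2013 Berlin";
        System.out.println(resolveTitle(before, before, after));
        System.out.println(resolveTitle("Letter to Anna", before, after));
    }
}
```

The comparison is a plain String.equals, so a title that differs from the machine value by a single character counts as hand-written and is preserved.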
3. The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint
POST /api/admin/backfill-titles (synchronous, under AdminController's class-level
@RequirePermission(Permission.ADMIN)) sweeps every document and, for each whose stored
title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is
gone, the test (DocumentTitleBackfillMatcher, used only here) is a grammar heuristic:
either the title is exactly the index, or, after stripping the literal index prefix, the
remainder is a known date-label form (+ an optional trailing location) or a lone segment
equal to the document's current location. Prose is left untouched; anything malformed
fails closed.
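One way to picture the overwrite test is the sketch below. It is not the project's DocumentTitleBackfillMatcher: the method name mayOverwrite is hypothetical, and the two date-label shapes (a bare four-digit year and a numeric day.month.year) are placeholders chosen for illustration, since the real label forms are defined by DocumentTitleFormatter and its fixtures. The sketch reads the accepted shapes as: the index alone, index plus date label, index plus date label plus a trailing segment, or index plus the current location.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the backfill overwrite test; date-label shapes are assumed. */
final class BackfillMatcherSketch {
    static final String SEP = " \u2013 ";

    // ASSUMPTION: illustrative label shapes only. Bounded, non-nested quantifiers.
    static final Pattern DATE_LABEL =
            Pattern.compile("\\d{4}|\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    static boolean isDateLabel(String segment) {
        return DATE_LABEL.matcher(segment).matches();
    }

    static boolean mayOverwrite(String title, String index, String currentLocation) {
        if (title == null || index == null || index.isEmpty()) {
            return false;                       // malformed input fails closed
        }
        if (!title.startsWith(index)) {
            return false;                       // literal comparison, no regex on user input
        }
        String rest = title.substring(index.length());
        if (rest.isEmpty()) {
            return true;                        // the title is exactly the index
        }
        if (!rest.startsWith(SEP)) {
            return false;
        }
        // Limit 2: everything after the first separator stays in one trailing segment.
        String[] seg = rest.substring(SEP.length()).split(Pattern.quote(SEP), 2);
        if (seg[0].isBlank()) {
            return false;
        }
        if (seg.length == 1) {
            return isDateLabel(seg[0]) || seg[0].equals(currentLocation);
        }
        return isDateLabel(seg[0]) && !seg[1].isBlank();
    }

    public static void main(String[] args) {
        System.out.println(mayOverwrite("A(1)+.pdf \u2013 1928 \u2013 Berlin", "A(1)+.pdf", "Berlin"));
        System.out.println(mayOverwrite("A(1)+.pdf \u2013 Brief an Anna", "A(1)+.pdf", "Berlin"));
    }
}
```

Anything that does not fit one of the accepted shapes returns false, which is the fail-closed behaviour: a prose title is never rewritten by accident.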
The backfill saves via documentRepository.save directly and never routes through
updateDocument — following the backfillFileHashes precedent — so a mechanical rename does
not snapshot the whole corpus into document_versions. It is idempotent (a second run
rewrites nothing) and logs one SLF4J-parameterized scanned/updated/skipped line; the
response is BackfillResult(count).
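The sweep and its idempotence can be illustrated with an in-memory sketch. Everything here is hypothetical: Doc stands in for the entity, the overwrite test and the rebuild are injected as lambdas, and assigning the title stands in for documentRepository.save. Counting an already-correct title as skipped is also an assumption.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical in-memory sketch of the one-time sweep; no repository or transaction involved. */
final class BackfillSweepSketch {

    static final class Doc {
        final String index;
        final String dateLabel;
        String title;

        Doc(String index, String dateLabel, String title) {
            this.index = index;
            this.dateLabel = dateLabel;
            this.title = title;
        }
    }

    /** Returns {scanned, updated, skipped}. */
    static int[] sweep(List<Doc> docs, Predicate<Doc> mayOverwrite, Function<Doc, String> rebuild) {
        int scanned = 0;
        int updated = 0;
        int skipped = 0;
        for (Doc doc : docs) {
            scanned++;
            String fresh = mayOverwrite.test(doc) ? rebuild.apply(doc) : null;
            if (fresh == null || fresh.equals(doc.title)) {
                skipped++;              // prose title, or already up to date
                continue;
            }
            doc.title = fresh;          // stand-in for a direct repository save
            updated++;
        }
        return new int[] {scanned, updated, skipped};
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("A-1", "1928", "A-1 \u2013 2028"),
                new Doc("A-2", "1930", "Letter to Anna"));
        int[] first = sweep(docs, d -> d.title.startsWith(d.index), d -> d.index + " \u2013 " + d.dateLabel);
        int[] second = sweep(docs, d -> d.title.startsWith(d.index), d -> d.index + " \u2013 " + d.dateLabel);
        System.out.println(first[1] + " updated, then " + second[1] + " updated");
    }
}
```

A second run finds every machine-built title already equal to its rebuilt value, so it writes nothing.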
4. Edit-form feedback (FR-005)
A localized helper line (de/en/es) under the title input explains that the title is built
from date/place and that a hand-edit is preserved, wired via aria-describedby and shown
only on the single-document edit form. A live preview was considered and declined.
Consequences
- The three call sites can never diverge: there is exactly one formula (NFR-MAINT-001). Save-time cost is a string build + compare; the backfill is one synchronous transactional sweep over a low-thousands corpus.
- Security: the index is compared literally (String.startsWith / Pattern.quote) because originalFilename is user-controlled and may carry regex metacharacters; an unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The date-label sub-patterns use only bounded, non-nested quantifiers.
- File-replaced documents are treated as manual, by design. The index is originalFilename, which updateDocument reassigns to the uploaded file's name on a file-replace. After a replace the stored title no longer matches build(currentState), so neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading originalFilename rather than adding a dedicated catalogIndex column.
- The save-time heuristic risk is zero (exact match); the backfill heuristic can, by its documented FR-004 rule, treat {index} – {valid date label} – {anything} as machine-built and rewrite the trailing segment. This is the accepted trade for cleaning the backlog without the lost pre-edit state.
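The security consequence above (literal comparison of the index) can be demonstrated with a short sketch. The class and method names are hypothetical; only String.startsWith, Pattern.quote, Pattern.compile and Matcher.lookingAt from the JDK are used. It contrasts the two safe forms with an unquoted pattern built from a filename.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical sketch: why a user-controlled filename must never be used as a raw regex. */
final class LiteralIndexSketch {

    /** Safe: plain string comparison, no regex engine involved. */
    static boolean startsWithLiteral(String title, String index) {
        return title.startsWith(index);
    }

    /** Safe: Pattern.quote turns every metacharacter in the filename into a literal. */
    static boolean startsWithQuoted(String title, String index) {
        return Pattern.compile(Pattern.quote(index)).matcher(title).lookingAt();
    }

    /** Unsafe, shown for contrast only: the filename is interpreted as a pattern. */
    static boolean startsWithUnquoted(String title, String index) {
        try {
            return Pattern.compile(index).matcher(title).lookingAt();
        } catch (PatternSyntaxException e) {
            return false; // e.g. an unclosed '[' in the filename
        }
    }

    public static void main(String[] args) {
        String index = "scan (1).pdf";
        String title = index + " \u2013 1928";
        System.out.println(startsWithLiteral(title, index));
        System.out.println(startsWithQuoted(title, index));
        System.out.println(startsWithUnquoted(title, index));
    }
}
```

With the filename "scan (1).pdf", the unquoted pattern treats the parentheses as a group and the dot as a wildcard, so it fails to match its own title. A filename containing nested quantifiers could additionally trigger heavy backtracking, which the literal forms avoid.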
Alternatives considered
- A dedicated catalogIndex column instead of overloading originalFilename: rejected; it adds a migration and a second source of truth for the index for no current benefit, and the file-replace fail-safe is acceptable.
- A heuristic at save-time too (instead of the exact match): rejected; the stored title is available pre-edit, so an exact comparison is strictly better (no false positives).
- A live title preview in the edit form: rejected (FR-005); a static helper line is calmer for the 60+ audience and avoids a second client-side mirror of the formula.
- Collapsing the frontend formatDocumentDate into the backend: out of scope; the Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture.