ADR-031 records the shared document-package title factory, the exact-match save-time regeneration, and the grammar-heuristic one-time backfill (with the ReDoS / no-version-spam / file-replace-is-manual decisions). Adds an "auto-generated title" glossary entry, extends the document-management C4 diagram with DocumentTitleFactory / DocumentTitleBackfillMatcher and the backfill flows, and documents POST /api/admin/backfill-titles in Admin-Auth.http as a one-shot ADMIN call hitting port 8080 directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# ADR-031 — The document title is a shared `document`-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill

**Date:** 2026-06-04
**Status:** Accepted
**Issue:** #726 (auto-sync document titles with date/location: save-time + one-time backfill)
**Milestone:** —

---

## Context

A document title was a string built **once**, at import time, by a private
`DocumentImporter.buildTitle()` composing `{index} – {dateLabel} – {location}` (index =
`originalFilename`; date label rendered by `DocumentTitleFormatter` at the row's own
precision; location verbatim). Nothing rebuilt it afterwards. When an archivist later
corrected a date or location in the edit form, the title kept its stale value (e.g. it still
read `2028` after the date was fixed to `1928`), because the edit form round-trips the stored
title verbatim and `updateDocument` simply re-persisted it.

Two distinct problems live here:

1. **Going forward**, an edit to date/location must flow into a title that was machine-built,
   but must never overwrite a title a human wrote.
2. **The existing backlog** of already-stale titles must be cleaned once. For these rows the
   pre-edit state is gone, so there is no exact value to compare against.

The composition formula also existed only inside `importing`, which is the wrong owner: a
title is a `document` concern, and three call sites (import, save-time, backfill) must share
one rule or they will drift.

## Decision

### 1. One formula, owned by the `document` package (`DocumentTitleFactory`)

Extract the composition into `DocumentTitleFactory` (a `@Component` in the `document`
package) with `build(Document)`. `DocumentImporter` (package `importing`) now consumes it.
`DocumentTitleFormatter` moves into `document` alongside the factory (it stays
package-private; `importing` reaches the formula only through the factory). The direction is
deliberate: `document` owns the rule and `importing` depends on it, not the reverse. The
German date *label* remains the deliberate Java/TS dual implementation pinned by
`docs/date-label-fixtures.json` (#666); this ADR touches the **composition** only and does
not collapse the frontend `formatDocumentDate`.

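The composition rule can be pictured with a minimal sketch. Everything here is illustrative: `TitleInputs` is a hypothetical stand-in for the real `Document` entity, the date label is passed in pre-formatted instead of going through `DocumentTitleFormatter`, and skipping absent segments is an assumption inferred from the statement that the title always carries at least the index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Document entity; only the title inputs are modelled. */
record TitleInputs(String originalFilename, String dateLabel, String location) {}

/** Sketch of the shared composition rule (the real class is a Spring @Component). */
final class DocumentTitleFactorySketch {
    // En dash with surrounding spaces, as in {index} – {dateLabel} – {location}.
    static final String SEPARATOR = " \u2013 ";

    /** Joins the index with whichever optional segments are present (assumed behaviour). */
    static String build(TitleInputs doc) {
        List<String> segments = new ArrayList<>();
        segments.add(doc.originalFilename());      // the index is always present
        if (isPresent(doc.dateLabel())) {
            segments.add(doc.dateLabel());         // already formatted at the row's precision
        }
        if (isPresent(doc.location())) {
            segments.add(doc.location());          // location is used verbatim
        }
        return String.join(SEPARATOR, segments);
    }

    private static boolean isPresent(String s) {
        return s != null && !s.isBlank();
    }
}
```

Under these assumptions, `build(new TitleInputs("B-0042", "1928", "Berlin"))` yields the index, date label and location joined by spaced en dashes, and a row with neither date nor location yields the bare index.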
### 2. Save-time regeneration is an EXACT match, not a heuristic

In `DocumentService.updateDocument` only (bulk edit is out of scope), capture
`autoTitleBefore = titleFactory.build(doc)` from the **currently-persisted** state *before*
any setter runs. Then:

- if the **submitted** title equals `autoTitleBefore`, it was the machine value → rebuild
  from the new state;
- otherwise keep the submitted title verbatim (hand-written or freshly typed).

This is an exact old-vs-new comparison with no false positives and no false negatives, and it
relies on the edit form round-tripping an untouched title verbatim. `projectedState`, the
post-edit state the title is rebuilt from, mirrors the existing setter asymmetry exactly:
`documentDate`/`location` overwrite unconditionally (a null clears them), while
precision/end/raw are taken from the DTO only when non-null and otherwise kept from the
entity. A blank submission is never persisted (the title is always present); it falls back
to the rebuilt auto-title, which always carries at least the index.

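The save-time rule above can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in `DocumentService`: `DocState`, `EditDto`, `project` and `resolveTitle` are hypothetical names, and only `precision` is modelled from the precision/end/raw group.

```java
import java.time.LocalDate;

/** Hypothetical slice of the persisted entity state. */
record DocState(LocalDate documentDate, String location, String precision) {}

/** Hypothetical slice of the edit-form DTO. */
record EditDto(LocalDate documentDate, String location, String precision) {}

final class SaveTimeTitleSketch {

    /** Mirrors the setter asymmetry: date/location overwrite, precision only when non-null. */
    static DocState project(DocState entity, EditDto dto) {
        return new DocState(
                dto.documentDate(),                 // null clears the date
                dto.location(),                     // null clears the location
                dto.precision() != null ? dto.precision() : entity.precision());
    }

    /**
     * @param submitted       title sent back by the edit form
     * @param autoTitleBefore factory output for the persisted state, captured before any setter
     * @param autoTitleAfter  factory output for the projected post-edit state
     */
    static String resolveTitle(String submitted, String autoTitleBefore, String autoTitleAfter) {
        if (submitted == null || submitted.isBlank()) {
            return autoTitleAfter;   // a blank title is never persisted
        }
        if (submitted.equals(autoTitleBefore)) {
            return autoTitleAfter;   // untouched machine value: follow the new date/location
        }
        return submitted;            // hand-written or freshly typed: keep verbatim
    }
}
```

Keeping the decision a pure function of three strings makes the exact-match property easy to unit-test without a persistence layer.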
### 3. The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint

`POST /api/admin/backfill-titles` (synchronous, under `AdminController`'s class-level
`@RequirePermission(Permission.ADMIN)`) sweeps every document and, for each whose stored
title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is
gone, the test (`DocumentTitleBackfillMatcher`, used **only** here) is a grammar heuristic:
the title must be exactly the index, or, after stripping the **literal** index prefix, the
remainder must be a known date-label form (+ an optional trailing location) or a lone
segment equal to the document's current location. Prose is left untouched; anything
malformed fails closed.

The backfill saves via `documentRepository.save` directly and **never** routes through
`updateDocument`, following the `backfillFileHashes` precedent, so a mechanical rename does
not snapshot the whole corpus into `document_versions`. It is idempotent (a second run
rewrites nothing) and logs one SLF4J-parameterized `scanned/updated/skipped` line; the
response is `BackfillResult(count)`.

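A rough sketch of the overwrite test follows. It is not the actual `DocumentTitleBackfillMatcher`: the class and method names are hypothetical, and the two date-label forms (a four-digit year and `d.M.yyyy`) are placeholders, since the real label grammar is pinned by `docs/date-label-fixtures.json` and is not reproduced in this ADR.

```java
import java.util.regex.Pattern;

final class BackfillMatcherSketch {
    static final String SEP = " \u2013 ";   // spaced en dash between title segments

    // Placeholder label forms (assumption): bounded, non-nested quantifiers only.
    private static final Pattern DATE_LABEL =
            Pattern.compile("\\d{4}|\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    /** Returns true only when the stored title looks machine-built; fails closed otherwise. */
    static boolean isMachineBuilt(String title, String index, String currentLocation) {
        if (title == null || index == null || index.isEmpty()) {
            return false;
        }
        if (!title.startsWith(index)) {      // literal comparison, never a regex on the index
            return false;
        }
        String rest = title.substring(index.length());
        if (rest.isEmpty()) {
            return true;                     // title is exactly the index
        }
        if (!rest.startsWith(SEP)) {
            return false;                    // malformed remainder
        }
        String body = rest.substring(SEP.length());
        int next = body.indexOf(SEP);
        String first = next < 0 ? body : body.substring(0, next);
        if (DATE_LABEL.matcher(first).matches()) {
            return true;                     // date label, optionally followed by a location
        }
        // A lone segment counts only if it equals the document's current location.
        return currentLocation != null && !currentLocation.isBlank()
                && body.equals(currentLocation);
    }
}
```

With these placeholder forms, a title such as the index followed by `2028` and a stale location passes the test and would be rebuilt, while the index followed by free prose does not.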
### 4. Edit-form feedback (FR-005)

A localized helper line (de/en/es) under the title input explains that the title is built
from date/place and that a hand-edit is preserved, wired via `aria-describedby` and shown
only on the single-document edit form. A live preview was considered and declined.

## Consequences

- The three call sites cannot diverge, because there is exactly one formula
  (`NFR-MAINT-001`). Save-time cost is a string build + compare; the backfill is one
  synchronous transactional sweep over a low-thousands corpus.
- Security: the index is compared **literally** (`String.startsWith` / `Pattern.quote`)
  because `originalFilename` is user-controlled and may carry regex metacharacters; an
  unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The
  date-label sub-patterns use only bounded, non-nested quantifiers.
- **File-replaced documents are treated as manual, by design.** The index is
  `originalFilename`, which `updateDocument` reassigns to the uploaded file's name on a
  file-replace. After a replace the stored title no longer matches `build(currentState)`, so
  neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading
  `originalFilename` rather than adding a dedicated `catalogIndex` column.
- Save-time carries no heuristic risk (it is an exact match); the backfill heuristic can, by
  its documented FR-004 rule, treat `{index} – {valid date label} – {anything}` as
  machine-built and rewrite the trailing segment. This is the accepted trade for cleaning
  the backlog without the lost pre-edit state.

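The security point about comparing the index literally can be illustrated with a short sketch. The class and method names are hypothetical; the `java.util.regex` calls are standard. It contrasts the two safe options named above with the unsafe variant that compiles the user-controlled filename as a pattern.

```java
import java.util.regex.Pattern;

final class LiteralIndexSketch {

    /** Safe: plain string comparison, no regex engine involved. */
    static boolean startsWithLiteral(String title, String index) {
        return title.startsWith(index);
    }

    /** Safe: the index is quoted, so every character matches only itself. */
    static boolean startsWithQuoted(String title, String index) {
        return Pattern.compile(Pattern.quote(index)).matcher(title).lookingAt();
    }

    /** Unsafe, shown for contrast only: the filename is interpreted as a regex. */
    static boolean startsWithUnquoted(String title, String index) {
        return Pattern.compile(index).matcher(title).lookingAt();
    }
}
```

With an index of `scan.1928`, the unquoted variant also accepts a title beginning `scanX1928` because `.` matches any character, whereas both safe variants reject it. An index such as `scan(1` makes the unquoted variant throw `PatternSyntaxException`, and a filename with nested quantifiers could trigger catastrophic backtracking.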
## Alternatives considered

- **A dedicated `catalogIndex` column** instead of overloading `originalFilename`: rejected;
  it adds a migration and a second source of truth for the index for no current benefit, and
  the file-replace fail-safe is acceptable.
- **A heuristic at save-time too** (instead of the exact match): rejected; the persisted
  state is still available before the edit is applied, so an exact comparison is strictly
  better (no false positives).
- **A live title preview in the edit form**: rejected (FR-005); a static helper line is
  calmer for the 60+ audience and avoids a second client-side mirror of the formula.
- **Collapsing the frontend `formatDocumentDate` into the backend**: out of scope; the
  Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture.