docs(document): ADR-031 + glossary/c4/api_tests for auto-title sync (#726)
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m32s
CI / OCR Service Tests (pull_request) Successful in 26s
CI / Backend Unit Tests (pull_request) Successful in 3m35s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m6s
ADR-031 records the shared document-package title factory, the exact-match save-time regeneration, and the grammar-heuristic one-time backfill (with the ReDoS / no-version-spam / file-replace-is-manual decisions).

Adds an "auto-generated title" glossary entry, extends the document-management c4 diagram with DocumentTitleFactory / DocumentTitleBackfillMatcher and the backfill flows, and documents POST /api/admin/backfill-titles in Admin-Auth.http as a one-shot ADMIN call hitting port 8080 directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,112 @@
# ADR-031: The document title is a shared `document`-package factory, re-synced by an exact match on save and a grammar heuristic on a one-time backfill

**Date:** 2026-06-04
**Status:** Accepted
**Issue:** #726 (auto-sync document titles with date/location: save-time + one-time backfill)
**Milestone:** —

---

## Context

A document title was a string built **once**, at import time, by a private
`DocumentImporter.buildTitle()` composing `{index} – {dateLabel} – {location}` (index =
`originalFilename`, date label limited to the row's precision via `DocumentTitleFormatter`,
location verbatim). Nothing rebuilt it afterwards. When an archivist later corrected a date
or location in the edit form, the title kept its stale value (e.g. it still read `2028`
after the date was fixed to `1928`), because the edit form round-trips the stored title
verbatim and `updateDocument` simply re-persisted it.

Two distinct problems live here:

1. **Going forward**, an edit to date/location must flow into a title that was machine-built,
   but must never overwrite a title a human wrote.
2. **The existing backlog** of already-stale titles must be cleaned once. For these rows the
   pre-edit state is gone, so there is no exact value to compare against.

The composition formula also existed only inside `importing`, which is the wrong owner: a
title is a `document` concern, and three call sites (import, save-time, backfill) must share
one rule or they will drift.

## Decision

### 1. One formula, owned by the `document` package (`DocumentTitleFactory`)

Extract the composition into `DocumentTitleFactory` (a `@Component` in the `document`
package) with `build(Document)`. `DocumentImporter` (package `importing`) now consumes it.
`DocumentTitleFormatter` moves into `document` alongside the factory (it stays
package-private; `importing` reaches the formula only through the factory). The direction is
deliberate: `document` owns the rule and `importing` depends on it, not the reverse. The
German date *label* remains the deliberate Java/TS dual implementation pinned by
`docs/date-label-fixtures.json` (#666); this ADR touches the **composition** only and does
not collapse the frontend `formatDocumentDate`.

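A minimal sketch of the composition rule, assuming Java 16+ records. `DocSketch` and `TitleFactorySketch` are hypothetical stand-ins, not the real `Document` or `DocumentTitleFactory`; the date label is passed in as a ready-made string instead of being produced by `DocumentTitleFormatter`, and dropping absent segments is an assumption inferred from the title shapes this ADR mentions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, trimmed-down stand-in for the Document entity. */
record DocSketch(String originalFilename, String dateLabel, String location) {}

/** Sketch of the single composition rule: {index} – {dateLabel} – {location}. */
final class TitleFactorySketch {
    // En dash surrounded by spaces, written as an escape to stay encoding-safe.
    static final String SEPARATOR = " \u2013 ";

    static String build(DocSketch doc) {
        List<String> parts = new ArrayList<>();
        parts.add(doc.originalFilename());          // the index is always present
        if (isPresent(doc.dateLabel())) {
            parts.add(doc.dateLabel());             // assumed: label already formatted
        }
        if (isPresent(doc.location())) {
            parts.add(doc.location());              // location is used verbatim
        }
        return String.join(SEPARATOR, parts);
    }

    private static boolean isPresent(String s) {
        return s != null && !s.isBlank();
    }
}
```

Under these assumptions, import, save-time regeneration, and the backfill would all call the same `build` method, so the three call sites cannot format the title differently.
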
### 2. Save-time regeneration is an EXACT match, not a heuristic

In `DocumentService.updateDocument` only (bulk edit is out of scope), capture
`autoTitleBefore = titleFactory.build(doc)` from the **currently-persisted** state *before*
any setter runs. Then:

- if the **submitted** title equals `autoTitleBefore`, it was the machine value, so rebuild
  it from the new state;
- otherwise keep the submitted title verbatim (hand-written or freshly typed).

This is an exact old-vs-new comparison with no false positives and no false negatives; it
relies on the edit form round-tripping an untouched title verbatim. The new state
(`projectedState`) mirrors the existing setter asymmetry exactly: `documentDate`/`location`
overwrite unconditionally (a null clears them), while precision/end/raw are taken from the
DTO only when non-null and otherwise kept from the entity. A blank submission is never
persisted (the title is always present); it falls back to the rebuilt auto-title, which
always carries at least the index.

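The save-time decision can be sketched as a pure function. `SaveTimeTitleSketch` and `resolveTitle` are hypothetical names; the real logic lives inside `DocumentService.updateDocument`, and both auto-titles are assumed to have been produced by the factory beforehand.

```java
final class SaveTimeTitleSketch {
    /**
     * @param submittedTitle  title coming back from the edit form
     * @param autoTitleBefore factory output for the persisted state, captured before any setter ran
     * @param autoTitleAfter  factory output for the projected (post-edit) state
     * @return the title to persist
     */
    static String resolveTitle(String submittedTitle, String autoTitleBefore, String autoTitleAfter) {
        if (submittedTitle == null || submittedTitle.isBlank()) {
            return autoTitleAfter;      // a blank title is never persisted
        }
        if (submittedTitle.equals(autoTitleBefore)) {
            return autoTitleAfter;      // untouched machine title: rebuild from the new state
        }
        return submittedTitle;          // hand-written or freshly typed: keep verbatim
    }
}
```

The comparison is a plain `String.equals`, so this path needs no pattern matching at all.
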
### 3. The one-time backlog cleanup is a grammar heuristic, behind an ADMIN endpoint

`POST /api/admin/backfill-titles` (synchronous, under `AdminController`'s class-level
`@RequirePermission(Permission.ADMIN)`) sweeps every document and, for each whose stored
title passes the overwrite test, rebuilds it via the factory. Because the pre-edit state is
gone, the test (`DocumentTitleBackfillMatcher`, used **only** here) is a grammar heuristic:
the title must start with the **literal** index, and what follows must be nothing at all
(the title is exactly the index), a known date-label form (optionally followed by a trailing
location), or a lone segment equal to the document's current location. Prose is left
untouched; anything malformed fails closed.

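The overwrite test above might look roughly like the following. This is a sketch under stated assumptions, not the real `DocumentTitleBackfillMatcher`: the class name, the method name, and above all the three date-label forms (year, month.year, day.month.year) are hypothetical placeholders for whatever forms the formatter actually emits.

```java
import java.util.regex.Pattern;

final class BackfillMatcherSketch {
    private static final String SEP = " \u2013 ";   // en dash with spaces

    // Hypothetical label forms; bounded quantifiers only, no nesting.
    private static final Pattern DATE_LABEL =
            Pattern.compile("\\d{4}|\\d{1,2}\\.\\d{4}|\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    static boolean looksMachineBuilt(String title, String index, String currentLocation) {
        if (title == null || index == null || index.isEmpty()) {
            return false;                            // malformed input fails closed
        }
        if (!title.startsWith(index)) {
            return false;                            // literal comparison, no regex on user input
        }
        String rest = title.substring(index.length());
        if (rest.isEmpty()) {
            return true;                             // the title is exactly the index
        }
        if (!rest.startsWith(SEP)) {
            return false;                            // e.g. index "A-12" vs title "A-123"
        }
        String[] segments = rest.substring(SEP.length()).split(Pattern.quote(SEP), 2);
        String first = segments[0];
        if (first.isEmpty()) {
            return false;                            // dangling separator
        }
        if (DATE_LABEL.matcher(first).matches()) {
            return true;                             // date label, optional trailing segment
        }
        // A lone segment counts only if it equals the document's current location.
        return segments.length == 1 && first.equals(currentLocation);
    }
}
```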
The backfill saves via `documentRepository.save` directly and **never** routes through
`updateDocument`, following the `backfillFileHashes` precedent, so a mechanical rename does
not snapshot the whole corpus into `document_versions`. It is idempotent (a second run
rewrites nothing) and logs one SLF4J-parameterized `scanned/updated/skipped` line; the
response is `BackfillResult(count)`.

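The idempotency claim can be illustrated with an in-memory sweep. Everything here is hypothetical scaffolding: `Doc` stands in for the entity, the predicate for the overwrite test, and the function for the factory. The real backfill persists through `documentRepository.save`, and counting an already-in-sync title as skipped is an assumption of this sketch.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class BackfillSweepSketch {
    /** Hypothetical mutable stand-in for the entity. */
    static final class Doc {
        String title;
        Doc(String title) { this.title = title; }
    }

    /** Counters matching the scanned/updated/skipped log line. */
    record Result(int scanned, int updated, int skipped) {}

    static Result sweep(List<Doc> docs, Predicate<Doc> overwritable, Function<Doc, String> factory) {
        int updated = 0;
        int skipped = 0;
        for (Doc doc : docs) {
            String rebuilt = factory.apply(doc);
            if (!overwritable.test(doc) || rebuilt.equals(doc.title)) {
                skipped++;              // prose, or already in sync
                continue;
            }
            doc.title = rebuilt;        // real code would save via the repository here
            updated++;
        }
        return new Result(docs.size(), updated, skipped);
    }
}
```

Because a title that already equals the rebuilt value is skipped, a second run over the same documents updates nothing.
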
### 4. Edit-form feedback (FR-005)

A localized helper line (de/en/es) under the title input explains that the title is built
from date/place and that a hand-edit is preserved, wired via `aria-describedby` and shown
only on the single-document edit form. A live preview was considered and declined.

## Consequences

- The three call sites can never diverge: there is exactly one formula
  (`NFR-MAINT-001`). Save-time cost is a string build + compare; the backfill is one
  synchronous transactional sweep over a low-thousands corpus.
- Security: the index is compared **literally** (`String.startsWith` / `Pattern.quote`)
  because `originalFilename` is user-controlled and may carry regex metacharacters; an
  unquoted pattern would be a ReDoS / regex-injection vector (CWE-1333 / CWE-625). The
  date-label sub-patterns use only bounded, non-nested quantifiers.
- **File-replaced documents are treated as manual, by design.** The index is
  `originalFilename`, which `updateDocument` reassigns to the uploaded file's name on a
  file-replace. After a replace the stored title no longer matches `build(currentState)`, so
  neither save-time nor backfill rewrites it. This is the accepted fail-safe of overloading
  `originalFilename` rather than adding a dedicated `catalogIndex` column.
- Save-time carries no heuristic risk (exact match); the backfill heuristic can, by its
  documented FR-004 rule, treat `{index} – {valid date label} – {anything}` as machine-built
  and rewrite the trailing segment. This is the accepted trade for cleaning the backlog
  when the pre-edit state is lost.

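To make the security bullet concrete, the sketch below contrasts an unquoted and a quoted prefix check. `LiteralIndexSketch` and both method names are hypothetical; `Pattern.quote` and `Pattern.compile` are the standard `java.util.regex` calls. It demonstrates metacharacter handling only, not a full ReDoS reproduction.

```java
import java.util.regex.Pattern;

final class LiteralIndexSketch {
    /** Unsafe: the user-controlled filename is interpreted as regex source. */
    static boolean unsafeStartsWith(String title, String index) {
        return Pattern.compile("^" + index).matcher(title).find();
    }

    /** Safe: the filename is quoted, so every character matches only itself. */
    static boolean quotedStartsWith(String title, String index) {
        return Pattern.compile("^" + Pattern.quote(index)).matcher(title).find();
    }
}
```

With an index such as `scan(1.pdf`, the unquoted variant throws `PatternSyntaxException` (unclosed group), and with `scan.pdf` it wrongly accepts a title starting with `scanXpdf`. The quoted variant, like `String.startsWith`, does neither.
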
## Alternatives considered

- **A dedicated `catalogIndex` column** instead of overloading `originalFilename`: rejected;
  it adds a migration and a second source of truth for the index for no current benefit, and
  the file-replace fail-safe is acceptable.
- **A heuristic at save-time too** (instead of the exact match): rejected; the stored title
  is available pre-edit, so an exact comparison is strictly better (no false positives).
- **A live title preview in the edit form**: rejected (FR-005); a static helper line is
  calmer for the 60+ audience and avoids a second client-side mirror of the formula.
- **Collapsing the frontend `formatDocumentDate` into the backend**: out of scope; the
  Java/TS date-label split is the deliberate #666 design, pinned by a shared fixture.