docs(adr): record the index pattern as a corpus-specific constraint

Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.
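
The constraint described above can be illustrated with a small sketch. It is not the project's actual `INDEX_PATTERN`, whose regex is not shown in this commit. The `SHAPE` string, the class name `IndexPatternSketch`, and the helper `accepts` are hypothetical. They assume the shape is one to four Latin-1 letters, one or more hyphens, one or more ASCII digits, and an optional trailing `x`. The sketch compiles the same expression with and without `Pattern.UNICODE_CHARACTER_CLASS` to show how that flag widens `\d`.

```java
import java.util.regex.Pattern;

class IndexPatternSketch {
    // Hypothetical stand-in for INDEX_PATTERN (assumed shape, not the real regex):
    // 1-4 Latin-1 letters, one or more hyphens, ASCII digits, optional trailing 'x'.
    static final String SHAPE =
        "[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]{1,4}-+\\d+x?";

    // Default flags: \d matches only the ASCII digits 0-9.
    static final Pattern ASCII_DIGITS = Pattern.compile(SHAPE);

    // With UNICODE_CHARACTER_CLASS, \d matches any Unicode decimal digit.
    static final Pattern UNICODE_DIGITS =
        Pattern.compile(SHAPE, Pattern.UNICODE_CHARACTER_CLASS);

    // True only if the whole index string matches the pattern.
    static boolean accepts(Pattern p, String index) {
        return p.matcher(index).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Br-12",         // current corpus shape
            "M\u00FC-7x",    // umlaut prefix, trailing x
            "Abcde-1",       // 5-letter prefix
            "1-23",          // digit-led id
            "\u010C-1",      // non-Latin-1 letter
            "Br-12\u0663"    // ends in an Arabic-Indic digit
        };
        for (String s : samples) {
            System.out.println(s + " ascii=" + accepts(ASCII_DIGITS, s)
                + " unicode=" + accepts(UNICODE_DIGITS, s));
        }
    }
}
```

Under these assumptions, the last sample is the only one on which the two patterns disagree, which is the silent widening the ADR entry warns about.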

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-05-27 21:20:21 +02:00
parent b3f4fb2d8d
commit 7354c3d332


@@ -149,6 +149,17 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
conscious constraint, not a general filename validator: a future sub-collection catalogued with
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
not imported. Likewise, a real PDF whose filename does not follow `<index>.pdf` produces a `PLACEHOLDER`
(the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
allowlist that keeps the on-disk lookup safe. Note that `\d` is intentionally ASCII-only; adding
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog