docs(adr): record the index pattern as a corpus-specific constraint

Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-27 21:20:21 +02:00
committed by marcel
parent f5e2241fe0
commit 34e0eec1ba

View File

@@ -149,6 +149,17 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
conscious constraint, not a general filename validator: a future sub-collection catalogued with
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
not imported. Likewise a real PDF that does not follow `<index>.pdf` produces a `PLACEHOLDER`
(the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog