docs(adr): record the index pattern as a corpus-specific constraint

Address PR #687 review concern (Elicit): add an ADR-025 Consequences entry noting that INDEX_PATTERN accepts only the current corpus shape (<=4 Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin-1 letter), since such rows would otherwise be skipped, not imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@@ -149,6 +149,17 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
conscious constraint, not a general filename validator: a future sub-collection catalogued with
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
not imported. Likewise a real PDF that does not follow `<index>.pdf` produces a `PLACEHOLDER`
(the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
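
Reviewer's sketch (not part of the diff): the `INDEX_PATTERN` entry added in this hunk can be illustrated with a small Java example. The real pattern is not shown in this commit, so the regex below is only one plausible reading of the described shape (one to four Latin-1 letters, one or more hyphens, ASCII digits, optional trailing `x`). The class name `IndexPatternSketch`, the `accepts` helper, and the sample ids are all hypothetical. The point is to show why the three growth cases would be rejected and how `Pattern.UNICODE_CHARACTER_CLASS` widens `\d`.

```java
import java.util.regex.Pattern;

public class IndexPatternSketch {
    // ASSUMPTION: a guessed reconstruction of the corpus shape, not the project's INDEX_PATTERN.
    // Latin-1 letters: ASCII letters plus U+00C0..U+00FF without the multiplication/division signs.
    static final String LATIN1_LETTER =
            "[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]";
    static final String REGEX = LATIN1_LETTER + "{1,4}-+\\d+x?";

    // Default flags: \d matches only the ASCII digits 0-9.
    static final Pattern ASCII_DIGITS = Pattern.compile(REGEX);

    // With UNICODE_CHARACTER_CLASS, \d matches any Unicode decimal digit.
    static final Pattern UNICODE_DIGITS =
            Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true only if the whole index matches the pattern.
    static boolean accepts(Pattern pattern, String index) {
        return pattern.matcher(index).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "M\u00FC-12",   // umlaut prefix, current corpus shape
            "Abcd-7x",      // four letters, trailing x
            "Abcde-1",      // 5-letter prefix
            "7-12",         // digit-led id
            "\u010C-3",     // C with caron, outside Latin-1
            "Ab-\u0663"     // Arabic-Indic digit three
        };
        for (String s : samples) {
            System.out.println(s
                    + " ascii=" + accepts(ASCII_DIGITS, s)
                    + " unicode=" + accepts(UNICODE_DIGITS, s));
        }
    }
}
```

Under this assumed regex, the 5-letter, digit-led, and non-Latin-1 samples are rejected by both patterns. The last sample is accepted only when `UNICODE_CHARACTER_CLASS` is set, which is the silent widening the entry warns about.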