From 7354c3d33241a5192d4e41bb1b32041c27ed7ae2 Mon Sep 17 00:00:00 2001
From: Marcel
Date: Wed, 27 May 2026 21:20:21 +0200
Subject: [PATCH] docs(adr): record the index pattern as a corpus-specific constraint

Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be
revisited deliberately if the catalog scheme grows (5-letter prefix,
digit-led id, non-Latin letter), since such rows would otherwise be
skipped, not imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7
---
 ...l-import-and-single-migration-schema-foundation.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
index 05bf8b0f..27c34092 100644
--- a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
+++ b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
@@ -149,6 +149,17 @@ Settled sub-decisions:
   the same state, so the operational recovery for a partial failure is simply to fix the
   offending artifact and re-trigger the import — no manual cleanup of half-written data is
   required. A future maintainer must not assume all-or-nothing semantics.
+- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
+  `INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
+  umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
+  conscious constraint, not a general filename validator: a future sub-collection catalogued with
+  a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
+  fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
+  not imported. Likewise a real PDF that does not follow `.pdf` produces a `PLACEHOLDER`
+  (the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
+  pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
+  allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
+  `Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
 - **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
   the index is the only on-disk lookup key. An index that fails `isValidImportIndex` (path
   separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog