feat(import): resolve PDFs directly by index, drop the file column (#686) #687

Merged
marcel merged 9 commits from feature/686-resolve-pdf-by-index into docs/import-migration 2026-05-27 22:08:47 +02:00
Showing only changes of commit 7354c3d332

@@ -149,6 +149,17 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
conscious constraint, not a general filename validator: a future sub-collection catalogued with
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
not imported. Likewise, a real PDF whose name does not follow `<index>.pdf` produces a `PLACEHOLDER`
(the importer logs both cases distinctly; see #686). If the catalog scheme ever changes, the
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
allowlist that keeps the on-disk lookup safe. Note that `\d` is intentionally ASCII-only: adding
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog