docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT

File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-05-27 21:03:57 +02:00
committed by marcel
parent 32d9a33550
commit 658277e97c
2 changed files with 33 additions and 14 deletions


@@ -102,12 +102,23 @@ Settled sub-decisions:
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
left at default-`false`.
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
hostile (CWE-22 does not care it came from our tool): its basename is validated
(`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte,
absolute path) and resolved only inside the import dir with canonical-path containment, so a
traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and
their tests were ported from `MassImportService` **before** it was deleted.
- **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
`Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
`importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
longer emits `file` or the `index_file_mismatch` review flag.
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
`.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or
more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
`SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
ported from the file-column resolution (originally from `MassImportService`).
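
The index validation, the O(1) lookup and the containment assertion described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `IndexResolutionSketch`, the method signatures, the letter-count bound in the pattern, and the use of `IllegalStateException` in place of the project's `DomainException` are all assumptions. The strict allow-list pattern stands in for the individual deny rules (separators, homoglyphs, `.`, null byte, absolute paths), since none of those characters can match it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch; names and signatures are illustrative assumptions.
final class IndexResolutionSketch {

    // Assumed catalog shape: a few Latin letters (incl. umlauts), one or more
    // hyphens, digits, optional trailing 'x'. The {1,4} bound is an assumption.
    private static final Pattern CATALOG_SHAPE = Pattern.compile(
            "[A-Za-z\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF]{1,4}-+[0-9]+x?");

    // Allow-list check: anything outside the catalog shape is rejected, which
    // also excludes '/', '\\', '.', NUL and slash look-alikes.
    static boolean isValidImportIndex(String index) {
        return index != null && CATALOG_SHAPE.matcher(index).matches();
    }

    // Returns the PDF for a valid index, or empty when the row should be
    // skipped (invalid index or no such file). Throws when a syntactically
    // valid index resolves outside the import dir, e.g. through a symlink.
    static Optional<Path> resolvePdf(Path importDir, String index) throws IOException {
        if (!isValidImportIndex(index)) {
            return Optional.empty(); // caller would record a skipped row
        }
        Path candidate = importDir.resolve(index + ".pdf"); // O(1), no directory walk
        if (!Files.isRegularFile(candidate)) {
            return Optional.empty();
        }
        // Second line of defense: compare canonical (symlink-resolved) paths.
        Path realDir = importDir.toRealPath();
        Path realFile = candidate.toRealPath();
        if (!realFile.startsWith(realDir)) {
            // Stand-in for the project's DomainException (assumption).
            throw new IllegalStateException("resolved PDF escapes the import dir: " + index);
        }
        return Optional.of(realFile);
    }

    // Magic-byte gate: the file must start with the four bytes "%PDF".
    static boolean hasPdfMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(4);
            return head.length == 4
                    && head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F';
        }
    }
}
```

In this sketch an empty `Optional` corresponds to the per-row skip, while the exception corresponds to the import-wide abort; a caller would run `hasPdfMagic` on the resolved path before uploading.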
---
@@ -138,11 +149,14 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or
symlink-escape in a row's file path is treated as an attack signal: the import aborts rather
than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision
(2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently
tolerated.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
import continues with the remaining rows — nothing outside the import dir is ever read. A
symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
aborts the import (a `DomainException` from the containment assertion), because a syntactically
valid index resolving outside the dir is an environment-level attack signal, not a row typo.
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
interface projection; because the projection is backed by native SQL, the column had to be
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,