docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 6m56s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s

File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-27 21:03:57 +02:00
parent 508a5e555e
commit 74987062f4
2 changed files with 33 additions and 14 deletions

View File

@@ -577,10 +577,15 @@ python3 -m venv .venv && .venv/bin/pip install -r requirements.txt # once, on
# writes the four canonical artifacts into ./out/ # writes the four canonical artifacts into ./out/
``` ```
**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/` **Dev:** place all four canonical artifacts **plus** the PDFs into `./import/`
at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`). at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
The orchestrator smoke-checks that all four artifacts are present before starting and fails Each PDF must be named `<index>.pdf` (e.g. `W-0124.pdf`, `Mü-0001.pdf`) and live flat in the
closed (`IMPORT_ARTIFACT_INVALID`) if any is missing. import dir: since #686 the importer resolves a document's PDF directly by its index
(`importDir/<index>.pdf`), not via a `datei`/`file` column — the recursive directory walk and
its basename/homoglyph guards are gone, replaced by strict index validation plus a
canonical-path containment assertion (a document whose `<index>.pdf` is absent simply becomes a
`PLACEHOLDER`). The orchestrator smoke-checks that all four artifacts are present before
starting and fails closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
**Staging/production:** **Staging/production:**

View File

@@ -102,12 +102,23 @@ Settled sub-decisions:
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`; - **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1) register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
left at default-`false`. left at default-`false`.
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as - **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
hostile (CWE-22 does not care it came from our tool): its basename is validated uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
(`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte, `Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
absolute path) and resolved only inside the import dir with canonical-path containment, so a `importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
their tests were ported from `MassImportService` **before** it was deleted. removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
longer emits `file` or the `index_file_mismatch` review flag.
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
`.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
absolute paths, and requires a strict catalog shape (14 Latin letters incl. umlauts, one or
more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
`SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
ported from the file-column resolution (originally from `MassImportService`).
--- ---
@@ -138,11 +149,14 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics. required. A future maintainer must not assume all-or-nothing semantics.
- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or - **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
symlink-escape in a row's file path is treated as an attack signal: the import aborts rather the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision (path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
(2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
tolerated. import continues with the remaining rows — nothing outside the import dir is ever read. A
symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
aborts the import (a `DomainException` from the containment assertion), because a syntactically
valid index resolving outside the dir is an environment-level attack signal, not a row typo.
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native - **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
interface projection; because the projection is backed by native SQL, the column had to be interface projection; because the projection is backed by native SQL, the column had to be
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`, added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,