docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
File resolution is now by index (<index>.pdf), not the datei/file column. Update the ADR-025 security sub-decision and consequence (the recursive walk and file column are gone; a bad index skips its row with a loud SkipReason, a symlink-escape still aborts via the containment assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in the import dir). Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -102,12 +102,23 @@ Settled sub-decisions:
|
||||
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
|
||||
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
|
||||
left at default-`false`.
|
||||
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
|
||||
hostile (CWE-22 does not care it came from our tool): its basename is validated
|
||||
(`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte,
|
||||
absolute path) and resolved only inside the import dir with canonical-path containment, so a
|
||||
traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and
|
||||
their tests were ported from `MassImportService` **before** it was deleted.
|
||||
- **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
|
||||
uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
|
||||
`Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
|
||||
`importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
|
||||
spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
|
||||
removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
|
||||
longer emits `file` or the `index_file_mismatch` review flag.
|
||||
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
|
||||
that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
|
||||
our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
|
||||
`.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
|
||||
absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or
|
||||
more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
|
||||
`SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
|
||||
to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
|
||||
escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
|
||||
ported from the file-column resolution (originally from `MassImportService`).
|
||||
|
||||
---
|
||||
|
||||
@@ -138,11 +149,14 @@ Settled sub-decisions:
|
||||
the same state, so the operational recovery for a partial failure is simply to fix the
|
||||
offending artifact and re-trigger the import — no manual cleanup of half-written data is
|
||||
required. A future maintainer must not assume all-or-nothing semantics.
|
||||
- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or
|
||||
symlink-escape in a row's file path is treated as an attack signal: the import aborts rather
|
||||
than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision
|
||||
(2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently
|
||||
tolerated.
|
||||
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
|
||||
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
|
||||
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
|
||||
shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
|
||||
import continues with the remaining rows — nothing outside the import dir is ever read. A
|
||||
symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
|
||||
aborts the import (a `DomainException` from the containment assertion), because a syntactically
|
||||
valid index resolving outside the dir is an environment-level attack signal, not a row typo.
|
||||
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
|
||||
interface projection; because the projection is backed by native SQL, the column had to be
|
||||
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
|
||||
|
||||
Reference in New Issue
Block a user