diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md index 0e1fe07e..79fe5b16 100644 --- a/docs/DEPLOYMENT.md +++ b/docs/DEPLOYMENT.md @@ -577,10 +577,15 @@ python3 -m venv .venv && .venv/bin/pip install -r requirements.txt # once, on # writes the four canonical artifacts into ./out/ ``` -**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/` +**Dev:** place all four canonical artifacts **plus** the PDFs into `./import/` at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`). -The orchestrator smoke-checks that all four artifacts are present before starting and fails -closed (`IMPORT_ARTIFACT_INVALID`) if any is missing. +Each PDF must be named `.pdf` (e.g. `W-0124.pdf`, `Mü-0001.pdf`) and live flat in the +import dir: since #686 the importer resolves a document's PDF directly by its index +(`importDir/.pdf`), not via a `datei`/`file` column — the recursive directory walk and +its basename/homoglyph guards are gone, replaced by strict index validation plus a +canonical-path containment assertion (a document whose `.pdf` is absent simply becomes a +`PLACEHOLDER`). The orchestrator smoke-checks that all four artifacts are present before +starting and fails closed (`IMPORT_ARTIFACT_INVALID`) if any is missing. **Staging/production:** diff --git a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md index 3515413b..05bf8b0f 100644 --- a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md +++ b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md @@ -102,12 +102,23 @@ Settled sub-decisions: - **`provisional` is now populated.** Importer-minted persons are `provisional = true`; register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1) left at default-`false`. -- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as - hostile (CWE-22 does not care it came from our tool): its basename is validated - (`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte, - absolute path) and resolved only inside the import dir with canonical-path containment, so a - traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and - their tests were ported from `MassImportService` **before** it was deleted. +- **PDFs resolve directly by index (`.pdf`), not by a `file` column.** The corpus is + uniform — all PDFs are named `.pdf` flat in the import dir (e.g. `W-0124.pdf`, + `Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1) + `importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the + spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were + removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no + longer emits `file` or the `index_file_mismatch` review flag. +- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing + that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from + our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any + `.` (so `.pdf` is the only extension and `..` can never appear), null byte, and + absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or + more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear + `SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted + to stay inside the import dir as a second line of defense (a symlinked `.pdf` cannot + escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were + ported from the file-column resolution (originally from `MassImportService`). --- @@ -138,11 +149,14 @@ Settled sub-decisions: the same state, so the operational recovery for a partial failure is simply to fix the offending artifact and re-trigger the import — no manual cleanup of half-written data is required. A future maintainer must not assume all-or-nothing semantics. -- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or - symlink-escape in a row's file path is treated as an attack signal: the import aborts rather - than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision - (2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently - tolerated. +- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686 + the index is the only on-disk lookup key. An index that fails `isValidImportIndex` + (path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog + shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the + import continues with the remaining rows — nothing outside the import dir is ever read. A + symlinked `.pdf` whose canonical path escapes the import dir is the one case that still + aborts the import (a `DomainException` from the containment assertion), because a syntactically + valid index resolving outside the dir is an environment-level attack signal, not a row typo. - **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native interface projection; because the projection is backed by native SQL, the column had to be added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,