docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
File resolution is now by index (<index>.pdf), not the datei/file column. Update the ADR-025 security sub-decision and consequence (the recursive walk and file column are gone; a bad index skips its row with a loud SkipReason, a symlink-escape still aborts via the containment assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in the import dir). Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -577,10 +577,15 @@ python3 -m venv .venv && .venv/bin/pip install -r requirements.txt # once, on
|
|||||||
# writes the four canonical artifacts into ./out/
|
# writes the four canonical artifacts into ./out/
|
||||||
```
|
```
|
||||||
|
|
||||||
**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/`
|
**Dev:** place all four canonical artifacts **plus** the PDFs into `./import/`
|
||||||
at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
|
at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
|
||||||
The orchestrator smoke-checks that all four artifacts are present before starting and fails
|
Each PDF must be named `<index>.pdf` (e.g. `W-0124.pdf`, `Mü-0001.pdf`) and live flat in the
|
||||||
closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
|
import dir: since #686 the importer resolves a document's PDF directly by its index
|
||||||
|
(`importDir/<index>.pdf`), not via a `datei`/`file` column — the recursive directory walk and
|
||||||
|
its basename/homoglyph guards are gone, replaced by strict index validation plus a
|
||||||
|
canonical-path containment assertion (a document whose `<index>.pdf` is absent simply becomes a
|
||||||
|
`PLACEHOLDER`). The orchestrator smoke-checks that all four artifacts are present before
|
||||||
|
starting and fails closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
|
||||||
|
|
||||||
**Staging/production:**
|
**Staging/production:**
|
||||||
|
|
||||||
|
|||||||
@@ -102,12 +102,23 @@ Settled sub-decisions:
|
|||||||
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
|
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
|
||||||
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
|
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
|
||||||
left at default-`false`.
|
left at default-`false`.
|
||||||
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
|
- **PDFs resolve directly by index (`<index>.pdf`), not by a `file` column.** The corpus is
|
||||||
hostile (CWE-22 does not care it came from our tool): its basename is validated
|
uniform — all PDFs are named `<index>.pdf` flat in the import dir (e.g. `W-0124.pdf`,
|
||||||
(`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte,
|
`Mü-0001.pdf`) — so `DocumentImporter` resolves a document's PDF with an O(1)
|
||||||
absolute path) and resolved only inside the import dir with canonical-path containment, so a
|
`importDir.resolve(index + ".pdf")` lookup. The redundant `file` column (carrying the
|
||||||
traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and
|
spreadsheet's messy `datei` value) and the recursive directory walk that resolved it were
|
||||||
their tests were ported from `MassImportService` **before** it was deleted.
|
removed (#686, which also closed #676 — the O(rows×tree) walk is gone). The normalizer no
|
||||||
|
longer emits `file` or the `index_file_mismatch` review flag.
|
||||||
|
- **Security guards are defense-in-depth, not upstream-trust.** The `index` is the only thing
|
||||||
|
that drives the on-disk lookup, so it is treated as hostile (CWE-22 does not care it came from
|
||||||
|
our tool): `isValidImportIndex` rejects slash/backslash, three Unicode slash homoglyphs, any
|
||||||
|
`.` (so `<index>.pdf` is the only extension and `..` can never appear), null byte, and
|
||||||
|
absolute paths, and requires a strict catalog shape (1–4 Latin letters incl. umlauts, one or
|
||||||
|
more hyphens, digits, optional trailing `x`). A bad index skips the row with a clear
|
||||||
|
`SkipReason` (`INVALID_FILENAME_PATH_TRAVERSAL`). The resolved canonical path is still asserted
|
||||||
|
to stay inside the import dir as a second line of defense (a symlinked `<index>.pdf` cannot
|
||||||
|
escape), and the `%PDF` magic-byte check still gates upload. These guards and their tests were
|
||||||
|
ported from the file-column resolution (originally from `MassImportService`).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -138,11 +149,14 @@ Settled sub-decisions:
|
|||||||
the same state, so the operational recovery for a partial failure is simply to fix the
|
the same state, so the operational recovery for a partial failure is simply to fix the
|
||||||
offending artifact and re-trigger the import — no manual cleanup of half-written data is
|
offending artifact and re-trigger the import — no manual cleanup of half-written data is
|
||||||
required. A future maintainer must not assume all-or-nothing semantics.
|
required. A future maintainer must not assume all-or-nothing semantics.
|
||||||
- **Path-escape aborts the whole import (fail-closed), by design.** A path-traversal or
|
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
|
||||||
symlink-escape in a row's file path is treated as an attack signal: the import aborts rather
|
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
|
||||||
than recording the row as a `SkippedFile` and continuing. This is a deliberate owner decision
|
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog
|
||||||
(2026-05-27) over a per-file skip — a malicious path must surface loudly, not be silently
|
shape) is recorded as a `SkippedFile` with reason `INVALID_FILENAME_PATH_TRAVERSAL` and the
|
||||||
tolerated.
|
import continues with the remaining rows — nothing outside the import dir is ever read. A
|
||||||
|
symlinked `<index>.pdf` whose canonical path escapes the import dir is the one case that still
|
||||||
|
aborts the import (a `DomainException` from the containment assertion), because a syntactically
|
||||||
|
valid index resolving outside the dir is an environment-level attack signal, not a row typo.
|
||||||
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
|
- **`PersonSummaryDTO` coupling.** `provisional` was added to the `PersonSummaryDTO` native
|
||||||
interface projection; because the projection is backed by native SQL, the column had to be
|
interface projection; because the projection is backed by native SQL, the column had to be
|
||||||
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
|
added to all three native `SELECT`s (`findAllWithDocumentCount`, `searchWithDocumentCount`,
|
||||||
|
|||||||
Reference in New Issue
Block a user