docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 6m56s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
File resolution is now by index (<index>.pdf), not the datei/file column.

Update the ADR-025 security sub-decision and consequence (the recursive walk and file column are gone; a bad index skips its row with a loud SkipReason, a symlink-escape still aborts via the containment assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
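
To make the three outcomes described in the commit message concrete, here is a minimal Python sketch of index-based resolution. It is not the project's importer: the names `resolve_pdf`, `Resolution`, `ContainmentError` and the index grammar in `INDEX_PATTERN` are hypothetical and chosen only so that the examples `W-0124` and `Mü-0001` validate. It shows a bad index being skipped, a missing `<index>.pdf` becoming a placeholder, and a symlink that leaves the import dir aborting the run.

```python
import os
import re
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple

# ASSUMPTION: letters, a hyphen, then digits. The real index grammar is not
# given in this commit; this pattern merely accepts "W-0124" and "Mü-0001".
INDEX_PATTERN = re.compile(r"[^\W\d_]+-\d+")


class Resolution(Enum):
    FOUND = "found"
    PLACEHOLDER = "placeholder"          # valid index, no <index>.pdf on disk
    SKIPPED_BAD_INDEX = "skipped"        # row is skipped, import continues


class ContainmentError(RuntimeError):
    """Raised when the resolved path leaves the import dir (aborts the run)."""


def resolve_pdf(import_dir: Path, index: str) -> Tuple[Resolution, Optional[Path]]:
    # 1. Strict index validation: anything with separators, dots, etc. is rejected.
    if not INDEX_PATTERN.fullmatch(index):
        return Resolution.SKIPPED_BAD_INDEX, None

    # 2. Direct lookup, flat in the import dir; no directory walk.
    candidate = import_dir / f"{index}.pdf"
    if not os.path.lexists(candidate):
        return Resolution.PLACEHOLDER, None

    # 3. Canonical-path containment: follow symlinks, then check the result
    #    is still inside the canonical import dir.
    root = import_dir.resolve()
    real = candidate.resolve()
    if not real.is_relative_to(root):    # Python 3.9+
        raise ContainmentError(f"{candidate} resolves outside {root}")

    if not real.is_file():               # e.g. dangling symlink inside the dir
        return Resolution.PLACEHOLDER, None
    return Resolution.FOUND, real
```

Under these assumptions, `resolve_pdf(Path("/import"), "W-0124")` returns `FOUND` with the canonical path when `/import/W-0124.pdf` exists, while an index such as `../secret` never reaches the filesystem because validation rejects it first.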
@@ -577,10 +577,15 @@ python3 -m venv .venv && .venv/bin/pip install -r requirements.txt # once, on
 # writes the four canonical artifacts into ./out/
 ```
 
-**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/`
+**Dev:** place all four canonical artifacts **plus** the PDFs into `./import/`
 at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
-The orchestrator smoke-checks that all four artifacts are present before starting and fails
-closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
+Each PDF must be named `<index>.pdf` (e.g. `W-0124.pdf`, `Mü-0001.pdf`) and live flat in the
+import dir: since #686 the importer resolves a document's PDF directly by its index
+(`importDir/<index>.pdf`), not via a `datei`/`file` column — the recursive directory walk and
+its basename/homoglyph guards are gone, replaced by strict index validation plus a
+canonical-path containment assertion (a document whose `<index>.pdf` is absent simply becomes a
+`PLACEHOLDER`). The orchestrator smoke-checks that all four artifacts are present before
+starting and fails closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
 
 **Staging/production:**
 