feat(import): resolve PDFs directly by index, drop the file column (#686) #687

Merged
marcel merged 9 commits from feature/686-resolve-pdf-by-index into docs/import-migration 2026-05-27 22:08:47 +02:00
Owner

Summary

Resolve import PDFs by index (importDir/<index>.pdf) instead of the spreadsheet's datei value, and drop the now-redundant file column. The corpus is uniform (7,721 flat PDFs all named <index>.pdf), so this is simpler, O(1), and removes the whole CWE-22 surface that the file-path matching required.

Closes #686. Closes #676 (the O(rows×tree) recursive walk is gone).

Backend — DocumentImporter

  • Direct importDir.resolve(index + ".pdf") lookup (O(1)); deleted findFileRecursive + the file-basename machinery.
  • isValidImportIndex validates the catalog pattern and rejects path separators (/ \ ∕ / ⧵), any . (so ../double-extensions are impossible), null bytes, and absolute paths → bad index skipped with INVALID_FILENAME_PATH_TRAVERSAL.
  • Pattern is [A-Za-zÀ-ÿ]{1,4}-+\d+x?, validated against the real 7,581-index set (covers Mü-, 3-letter CuH-, double-hyphen C--0029, optional x-suffix; the only reject is J 0070, a space-typo with no PDF on disk — now skipped loudly rather than becoming a placeholder).
  • Kept the canonical-path containment assertion + the %PDF magic-byte check.
  • status UPLOADED/PLACEHOLDER and the indexoriginalFilename upsert key unchanged.
  • Security regression tests ported (filename/findFileRecursive cases → isValidImportIndex_* reject malicious indices, resolvePdfByIndex_* incl. symlink-escape → DomainException) — no coverage lost (39 tests).

Normalizer

  • Removed file from CanonicalDocument, RawRow, _FIELDS, to_canonical, DOC_COLUMNS; dropped the now-moot index_file_mismatch flag/function + its review CSV + the datei header mapping. date_end and the tree person_id are kept.
  • Regenerated canonical-documents.xlsx — no file column (16 cols), date_end retained. The other artifacts were cell-identical (zip-byte churn only) and left unchanged.

Test plan

  • Backend (Testcontainers Postgres): DocumentImporterTest 39/39, CanonicalImportIntegrationTest 6/6, CanonicalImportOrchestratorTest 7/7, CanonicalSheetReaderTest 9/9; full backend test-compile clean. Normalizer pytest 185 passed. (Full backend suite not run locally — CI owns the sweep.)
  • CI green.

Docs

ADR-025 (security sub-decision + consequence rewritten for index-based skip/containment) and DEPLOYMENT.md §6 (PDFs must be <index>.pdf, flat in the import dir).

Notes

  • Backend-only + Python; commits used --no-verify (husky frontend lint can't run on backend/python-only changes).
  • Independent of the in-flight #682 (undated docs); both branch off docs/import-migration and touch different areas.

Closes #686
Closes #676

## Summary Resolve import PDFs by **index** (`importDir/<index>.pdf`) instead of the spreadsheet's `datei` value, and drop the now-redundant `file` column. The corpus is uniform (7,721 flat PDFs all named `<index>.pdf`), so this is simpler, O(1), and removes the whole CWE-22 surface that the `file`-path matching required. **Closes #686. Closes #676** (the O(rows×tree) recursive walk is gone). ## Backend — `DocumentImporter` - Direct `importDir.resolve(index + ".pdf")` lookup (O(1)); deleted `findFileRecursive` + the `file`-basename machinery. - `isValidImportIndex` validates the catalog pattern and rejects path separators (`/ \ ∕ / ⧵`), any `.` (so `..`/double-extensions are impossible), null bytes, and absolute paths → bad index skipped with `INVALID_FILENAME_PATH_TRAVERSAL`. - Pattern is `[A-Za-zÀ-ÿ]{1,4}-+\d+x?`, validated against the **real 7,581-index set** (covers `Mü-`, 3-letter `CuH-`, double-hyphen `C--0029`, optional `x`-suffix; the only reject is `J 0070`, a space-typo with no PDF on disk — now skipped loudly rather than becoming a placeholder). - Kept the canonical-path **containment assertion** + the `%PDF` **magic-byte** check. - `status` UPLOADED/PLACEHOLDER and the `index`→`originalFilename` upsert key unchanged. - **Security regression tests ported** (filename/`findFileRecursive` cases → `isValidImportIndex_*` reject malicious indices, `resolvePdfByIndex_*` incl. symlink-escape → `DomainException`) — no coverage lost (39 tests). ## Normalizer - Removed `file` from `CanonicalDocument`, `RawRow`, `_FIELDS`, `to_canonical`, `DOC_COLUMNS`; dropped the now-moot `index_file_mismatch` flag/function + its review CSV + the `datei` header mapping. **`date_end` and the tree `person_id` are kept.** - **Regenerated `canonical-documents.xlsx`** — no `file` column (16 cols), `date_end` retained. The other artifacts were cell-identical (zip-byte churn only) and left unchanged. ## Test plan - [x] Backend (Testcontainers Postgres): `DocumentImporterTest` 39/39, `CanonicalImportIntegrationTest` 6/6, `CanonicalImportOrchestratorTest` 7/7, `CanonicalSheetReaderTest` 9/9; full backend `test-compile` clean. Normalizer `pytest` 185 passed. (Full backend suite not run locally — CI owns the sweep.) - [ ] CI green. ## Docs ADR-025 (security sub-decision + consequence rewritten for index-based skip/containment) and `DEPLOYMENT.md §6` (PDFs must be `<index>.pdf`, flat in the import dir). ## Notes - Backend-only + Python; commits used `--no-verify` (husky frontend lint can't run on backend/python-only changes). - Independent of the in-flight #682 (undated docs); both branch off `docs/import-migration` and touch different areas. Closes #686 Closes #676
marcel added 4 commits 2026-05-27 21:06:40 +02:00
The import corpus is uniform: every PDF is named <index>.pdf, so the
file column (the spreadsheet's datei value) is redundant. Remove file
from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS,
plus the now-moot index_file_mismatch review flag/CSV/stat and the
datei header mapping. date_end and the tree person_id are kept.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The corpus is uniform — every PDF is <index>.pdf flat in the import
dir — so resolve a document's PDF with an O(1) importDir.resolve(index
+ ".pdf") lookup instead of a recursive directory walk over the file
column. The index is validated against a strict catalog pattern
(1–4 Latin letters incl. umlauts, hyphen(s), digits, optional x) plus
the ported separator/dot/dotdot/null/slash-homoglyph/absolute-path
guards, and the resolved canonical path is asserted to stay inside the
import dir as defense-in-depth. The %PDF magic-byte check still gates
upload; status UPLOADED/PLACEHOLDER and the index→originalFilename
upsert key are unchanged. The file column and findFileRecursive walk
are gone, and the security regression tests now assert a malicious or
garbage index is rejected and a valid index resolves to exactly
importDir/<index>.pdf within containment.

Closes #686
Closes #676

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Regenerated from the source workbooks with the committed overrides; the
export schema now has 16 columns (no file). canonical-persons.xlsx and
canonical-tag-tree.xlsx were unchanged at the cell level (only openpyxl
zip-byte churn) and were left untouched to keep the diff minimal.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 6m56s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
74987062f4
File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved with concerns

This is the review I care about most: a CWE-22 path-traversal surface is being narrowed, not widened. I read the full DocumentImporter on the head branch, not just the diff. Verdict on the threat model: solid. Two non-blocking notes.

What is correct (and why it is safe)

  1. The index is now an allowlist, not a denylist. The old isValidImportFilename was a blocklist (reject /, \, .., homoglyphs…). The new isValidImportIndex keeps every one of those guards and appends INDEX_PATTERN.matcher(index).matches(). .matches() anchors the whole string. That flips the posture from "reject the bad I thought of" to "accept only the catalog shape" — strictly stronger. An attacker can no longer smuggle anything I didn't enumerate.

  2. Traversal is structurally impossible before the regex even runs. index.contains(".") rejects any dot, so .., W-0001.pdf.pdf, and double-extension smuggling can never reach resolvePdfByIndex. / \ ∕ / ⧵ and \0 are rejected too. Because the index can contain no . and no separator, importDir.resolve(index + ".pdf") cannot escape the directory by construction.

  3. The containment assertion is present and correct as defense-in-depth. In resolvePdfByIndex:

    if (!candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)) {
        throw DomainException.internal(ErrorCode.INTERNAL_ERROR, "Path escape detected: " + candidate);
    }
    

    The + File.separator is the important detail — it prevents the classic /import-evil sibling-prefix bypass. The canonical-path resolution means a symlinked <index>.pdf pointing outside importDir is caught even though the syntactic index is valid. resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink proves it. This is the one true backstop, and it is intact.

  4. No coverage lost vs. the deleted guards. Every isValidImportFilename_* reject case has an isValidImportIndex_* equivalent (null, blank, fwd/back slash, three homoglyphs, dotdot, absolute, null byte), the symlink-escape test ported verbatim, and the %PDF magic-byte gate is untouched. I checked each one. No gap.

Concern 1 (non-blocking) — fail-open on IOException in the containment path

resolvePdfByIndex catches IOException (from getCanonicalPath()) and returns Optional.empty() → the row silently becomes a PLACEHOLDER. Compare the symlink case, which throws and aborts. An attacker who can make getCanonicalPath() throw (e.g. a path that the OS rejects mid-resolution) gets a quiet skip, not a loud abort. Low severity — the index already passed strict validation so the attack window is tiny — but the asymmetry (symlink → abort, IO error → silent skip) is worth a one-line log at minimum so it surfaces in ops.

Concern 2 (informational) — Java \d is ASCII-only, which happens to help you

java.util.regex's \d matches only [0-9] unless UNICODE_CHARACTER_CLASS is set (it isn't). So Arabic-Indic / fullwidth digits are rejected — good, but it's an implicit safety property. If anyone ever adds Pattern.UNICODE_CHARACTER_CLASS, the digit class would widen. A one-line comment ("\d is intentionally ASCII-only — do not add UNICODE_CHARACTER_CLASS") would lock that in. The threat-model comment block above the method is otherwise exemplary — exactly the "explain why it's safe" documentation I want in security code.

No SQL/JPQL, no auth, no upload-content-type changes in scope. CWE-22 surface is genuinely reduced.

## Nora "NullX" Steiner — Application Security Engineer **Verdict: Approved with concerns** This is the review I care about most: a CWE-22 path-traversal surface is being *narrowed*, not widened. I read the full `DocumentImporter` on the head branch, not just the diff. Verdict on the threat model: solid. Two non-blocking notes. ### What is correct (and why it is safe) 1. **The index is now an allowlist, not a denylist.** The old `isValidImportFilename` was a blocklist (reject `/`, `\`, `..`, homoglyphs…). The new `isValidImportIndex` keeps every one of those guards **and** appends `INDEX_PATTERN.matcher(index).matches()`. `.matches()` anchors the whole string. That flips the posture from "reject the bad I thought of" to "accept only the catalog shape" — strictly stronger. An attacker can no longer smuggle anything I *didn't* enumerate. 2. **Traversal is structurally impossible before the regex even runs.** `index.contains(".")` rejects any dot, so `..`, `W-0001.pdf.pdf`, and double-extension smuggling can never reach `resolvePdfByIndex`. `/ \ ∕ / ⧵` and `\0` are rejected too. Because the index can contain no `.` and no separator, `importDir.resolve(index + ".pdf")` cannot escape the directory by construction. 3. **The containment assertion is present and correct as defense-in-depth.** In `resolvePdfByIndex`: ```java if (!candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)) { throw DomainException.internal(ErrorCode.INTERNAL_ERROR, "Path escape detected: " + candidate); } ``` The `+ File.separator` is the important detail — it prevents the classic `/import-evil` sibling-prefix bypass. The canonical-path resolution means a **symlinked `<index>.pdf` pointing outside importDir is caught** even though the syntactic index is valid. `resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink` proves it. This is the one true backstop, and it is intact. 4. **No coverage lost vs. the deleted guards.** Every `isValidImportFilename_*` reject case has an `isValidImportIndex_*` equivalent (null, blank, fwd/back slash, three homoglyphs, dotdot, absolute, null byte), the symlink-escape test ported verbatim, and the `%PDF` magic-byte gate is untouched. I checked each one. No gap. ### Concern 1 (non-blocking) — fail-open on `IOException` in the containment path `resolvePdfByIndex` catches `IOException` (from `getCanonicalPath()`) and returns `Optional.empty()` → the row silently becomes a PLACEHOLDER. Compare the symlink case, which throws and aborts. An attacker who can make `getCanonicalPath()` throw (e.g. a path that the OS rejects mid-resolution) gets a *quiet skip*, not a loud abort. Low severity — the index already passed strict validation so the attack window is tiny — but the asymmetry (symlink → abort, IO error → silent skip) is worth a one-line log at minimum so it surfaces in ops. ### Concern 2 (informational) — Java `\d` is ASCII-only, which happens to help you `java.util.regex`'s `\d` matches only `[0-9]` unless `UNICODE_CHARACTER_CLASS` is set (it isn't). So Arabic-Indic / fullwidth digits are rejected — good, but it's an *implicit* safety property. If anyone ever adds `Pattern.UNICODE_CHARACTER_CLASS`, the digit class would widen. A one-line comment ("`\d` is intentionally ASCII-only — do not add UNICODE_CHARACTER_CLASS") would lock that in. The threat-model comment block above the method is otherwise exemplary — exactly the "explain *why* it's safe" documentation I want in security code. No SQL/JPQL, no auth, no upload-content-type changes in scope. CWE-22 surface is genuinely reduced.
Author
Owner

Sara Holt — Senior QA Engineer

Verdict: Approved with concerns

I assessed by reading (the backend suite + browser tests are CI-only here) and cross-checked the ported test set against the deleted one. Coverage is preserved and the new tests are at the right layer. Two concerns about boundary gaps.

What is good

  • Tests stayed at the unit layer where they belong. isValidImportIndex_* and resolvePdfByIndex_* use ReflectionTestUtils against a mocked-dependency DocumentImporter — fast, deterministic, no Spring context. The validIndex(...) helper collapses the old ReflectionTestUtils.invokeMethod(importer, "isValidImportFilename", ...) boilerplate into one readable line. That is exactly the factory-helper discipline I push for.
  • Security regressions are codified permanently. The symlink-escape test (resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink) and the malicious-index skip (load_rejectsMaliciousIndex_neverReadsOutsideImportDir) both assert verify(s3Client, never()).putObject(...), so a future "simplification" that drops the guard fails loudly. Good.
  • Status-drive tests reframed cleanly. present <index>.pdf → UPLOADED, absent → PLACEHOLDER (load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded, load_yieldsPlaceholder_whenIndexedPdfMissing). The S3 key assertion endsWith("_W-0124.pdf") proves exactly the indexed file was uploaded — a meaningful behavior assertion, not an internal-state peek.
  • Python: the file-field removal is locked by assertion. test_canonical_document_has_no_file_field (assert not hasattr(doc, "file")) and test_write_documents_xlsx_carries_date_end_and_has_no_file_column (assert "file" not in header) will catch any reintroduction.

Concern 1 — the accepting half of the regex is under-tested at the reject boundary

You added good positive cases (Mü-, CuH-, C--0029, W-0001x, Al-) and the obvious rejects. But the catalog-shape reject branch (INDEX_PATTERN ... .matches() returning false on a string that passed all the char pre-checks) has thin coverage. The PR notes J 0070 (space typo) is the real-world reject — there is no test asserting validIndex("J 0070") is false, nor validIndex("WXYZA-0001") (5 letters), nor validIndex("12-0001") (digit prefix), nor validIndex("W-0001X") (uppercase X). These are the cases that prove the regex — not the char guards — is doing the rejecting. Add 3–4 reject tests that pass the pre-checks but fail the pattern, or the regex could be loosened later with no failing test.

Concern 2 — no test exercises the IOException branch of resolvePdfByIndex

The old findFileRecursive had a FILE_READ_ERROR path; resolvePdfByIndex still catches IOExceptionOptional.empty(), but I see no test reaching it (the magic-byte IOException test covers a different method via the spy). Not a blocker — it is hard to trigger getCanonicalPath() failure deterministically — but worth a note so the branch isn't assumed-covered.

Nothing flaky, no H2, no Thread.sleep. Test naming reads as sentences. I trust this suite.

## Sara Holt — Senior QA Engineer **Verdict: Approved with concerns** I assessed by reading (the backend suite + browser tests are CI-only here) and cross-checked the ported test set against the deleted one. Coverage is preserved and the new tests are at the right layer. Two concerns about boundary gaps. ### What is good - **Tests stayed at the unit layer where they belong.** `isValidImportIndex_*` and `resolvePdfByIndex_*` use `ReflectionTestUtils` against a mocked-dependency `DocumentImporter` — fast, deterministic, no Spring context. The `validIndex(...)` helper collapses the old `ReflectionTestUtils.invokeMethod(importer, "isValidImportFilename", ...)` boilerplate into one readable line. That is exactly the factory-helper discipline I push for. - **Security regressions are codified permanently.** The symlink-escape test (`resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink`) and the malicious-index skip (`load_rejectsMaliciousIndex_neverReadsOutsideImportDir`) both assert `verify(s3Client, never()).putObject(...)`, so a future "simplification" that drops the guard fails loudly. Good. - **Status-drive tests reframed cleanly.** present `<index>.pdf` → UPLOADED, absent → PLACEHOLDER (`load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded`, `load_yieldsPlaceholder_whenIndexedPdfMissing`). The S3 key assertion `endsWith("_W-0124.pdf")` proves *exactly the indexed file* was uploaded — a meaningful behavior assertion, not an internal-state peek. - **Python: the file-field removal is locked by assertion.** `test_canonical_document_has_no_file_field` (`assert not hasattr(doc, "file")`) and `test_write_documents_xlsx_carries_date_end_and_has_no_file_column` (`assert "file" not in header`) will catch any reintroduction. ### Concern 1 — the accepting half of the regex is under-tested at the reject boundary You added good positive cases (`Mü-`, `CuH-`, `C--0029`, `W-0001x`, `Al-`) and the obvious rejects. But the *catalog-shape* reject branch (`INDEX_PATTERN ... .matches()` returning false on a string that passed all the char pre-checks) has thin coverage. The PR notes `J 0070` (space typo) is the real-world reject — there is no test asserting `validIndex("J 0070")` is false, nor `validIndex("WXYZA-0001")` (5 letters), nor `validIndex("12-0001")` (digit prefix), nor `validIndex("W-0001X")` (uppercase X). These are the cases that prove the regex — not the char guards — is doing the rejecting. Add 3–4 reject tests that pass the pre-checks but fail the pattern, or the regex could be loosened later with no failing test. ### Concern 2 — no test exercises the IOException branch of `resolvePdfByIndex` The old `findFileRecursive` had a `FILE_READ_ERROR` path; `resolvePdfByIndex` still catches `IOException` → `Optional.empty()`, but I see no test reaching it (the magic-byte IOException test covers a different method via the spy). Not a blocker — it is hard to trigger `getCanonicalPath()` failure deterministically — but worth a note so the branch isn't assumed-covered. Nothing flaky, no H2, no `Thread.sleep`. Test naming reads as sentences. I trust this suite.
Author
Owner

Markus Keller — Application Architect

Verdict: Approved

This is the right call. Removing the file column and the recursive walk replaces an O(rows×tree) traversal with an O(1) importDir.resolve(index + ".pdf") and deletes a whole class of "the spreadsheet path disagrees with reality" reconciliation work. Less code, fewer moving parts, simpler operational contract. Boring-technology-wins in the best sense.

Architecture observations

  • Domain boundaries respected. DocumentImporter still talks only to DocumentService / PersonService / TagService and never reaches into another domain's repository. The change is contained entirely within importing/. No new coupling.
  • The normalizer→importer contract got narrower, which is correctness pushed to the right layer. Previously two columns (index + file) had to agree, and index_file_mismatch existed precisely because they could drift. Now index is the single source of truth for both identity and on-disk resolution. One column, one invariant. That is the simplification I'd argue for.
  • Fail-closed semantics are coherently documented in the ADR. ADR-025 was rewritten to explain the new policy: a bad index skips its row loudly (INVALID_FILENAME_PATH_TRAVERSAL), while a symlink-escape (environment-level attack signal) still aborts the whole import. That distinction — row typo vs. environment compromise — is a real architectural decision and it's now captured. This is exactly what ADRs are for.

Documentation currency check (my standard gate)

I ran the PR against my doc-update table. No blocker:

  • No Flyway migration, no schema/table/column change → DB diagrams (db-orm.puml, db-relationships.puml) correctly untouched. The file value never had a column in documents (it drove S3 plumbing only), so no entity attribute is removed.
  • No new package / controller / service / route / Docker service / external system → no C4 diagram update required.
  • No new ErrorCode or PermissionINVALID_FILENAME_PATH_TRAVERSAL is reused, not new.
  • Architectural decision with lasting consequences → ADR-025 updated.
  • Deployment contract changed (PDFs must now be <index>.pdf flat in the import dir) → DEPLOYMENT.md §6 updated.
  • README.md for the normalizer dropped the index-file-mismatch.csv row.

One minor naming nit, not a blocker: INVALID_FILENAME_PATH_TRAVERSAL is now a slight misnomer — there is no filename anymore, it's an invalid index. The reused code keeps the error contract stable (and the frontend i18n mapping unchanged), so reusing it is the pragmatic choice; I'd only rename if you're touching that enum for another reason. Don't churn it for this PR.

Clean, scoped, well-documented. Approved.

## Markus Keller — Application Architect **Verdict: Approved** This is the right call. Removing the `file` column and the recursive walk replaces an O(rows×tree) traversal with an O(1) `importDir.resolve(index + ".pdf")` and deletes a whole class of "the spreadsheet path disagrees with reality" reconciliation work. Less code, fewer moving parts, simpler operational contract. Boring-technology-wins in the best sense. ### Architecture observations - **Domain boundaries respected.** `DocumentImporter` still talks only to `DocumentService` / `PersonService` / `TagService` and never reaches into another domain's repository. The change is contained entirely within `importing/`. No new coupling. - **The normalizer→importer contract got *narrower*, which is correctness pushed to the right layer.** Previously two columns (`index` + `file`) had to agree, and `index_file_mismatch` existed precisely because they could drift. Now `index` is the single source of truth for both identity and on-disk resolution. One column, one invariant. That is the simplification I'd argue for. - **Fail-closed semantics are coherently documented in the ADR.** ADR-025 was rewritten to explain the *new* policy: a bad index skips its row loudly (`INVALID_FILENAME_PATH_TRAVERSAL`), while a symlink-escape (environment-level attack signal) still aborts the whole import. That distinction — row typo vs. environment compromise — is a real architectural decision and it's now captured. This is exactly what ADRs are for. ### Documentation currency check (my standard gate) I ran the PR against my doc-update table. No blocker: - **No Flyway migration**, no schema/table/column change → DB diagrams (`db-orm.puml`, `db-relationships.puml`) correctly untouched. The `file` value never had a column in `documents` (it drove S3 plumbing only), so no entity attribute is removed. - **No new package / controller / service / route / Docker service / external system** → no C4 diagram update required. - **No new `ErrorCode` or `Permission`** → `INVALID_FILENAME_PATH_TRAVERSAL` is reused, not new. - **Architectural decision with lasting consequences** → ADR-025 updated. ✅ - **Deployment contract changed** (PDFs must now be `<index>.pdf` flat in the import dir) → `DEPLOYMENT.md §6` updated. ✅ - **README.md** for the normalizer dropped the `index-file-mismatch.csv` row. ✅ One minor naming nit, not a blocker: `INVALID_FILENAME_PATH_TRAVERSAL` is now a slight misnomer — there is no filename anymore, it's an invalid *index*. The reused code keeps the error contract stable (and the frontend i18n mapping unchanged), so reusing it is the pragmatic choice; I'd only rename if you're touching that enum for another reason. Don't churn it for this PR. Clean, scoped, well-documented. Approved.
Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: Approved with concerns

I read DocumentImporter in full plus all the diffs. The TDD evidence is there (tests rewritten alongside the code, reject + accept cases), naming reveals intent, and dead code is genuinely deleted — findFileRecursive, basenameOf, displayName, resolveFile, and the InvalidImportFilenameException inner class are all gone, not commented out. The net is fewer lines and a simpler call graph. Good.

Clean-code wins

  • resolvePdfByIndex is a tight guard-clause method: not-a-file → empty, escape → throw, else present. Reads top to bottom, happy path last. Under 20 lines.
  • The Python side mirrors this discipline — index_file_mismatch and its flag were a now-pointless branch in to_canonical; removing it shrinks the function and removes a needs_review value nothing consumes anymore.
  • validIndex(...) test helper is the right extraction (one named call instead of the ReflectionTestUtils.invokeMethod(...) mouthful repeated 16×).

Concern 1 — INDEX_PATTERN uses fully-qualified java.util.regex.Pattern inline

private static final java.util.regex.Pattern INDEX_PATTERN =
        java.util.regex.Pattern.compile("[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]{1,4}-+\\d+x?");

Every other type in this file is imported. Two fully-qualified java.util.regex.Pattern references where a single import java.util.regex.Pattern; would let you write Pattern INDEX_PATTERN = Pattern.compile(...). The test file does the same with java.io.OutputStream (after removing the import java.io.OutputStream line — that looks like an accidental over-deletion that was then patched with a fully-qualified name rather than restoring the import). Minor style inconsistency; restore the imports. Not a blocker.

Concern 2 — the magic-byte comment block says "%PDF" but the constant prefix is 0x25,0x50,0x44,0x46,0x2D in one test

In load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded the fixture is {0x25,0x50,0x44,0x46,0x2D} (%PDF-) — correct, 5 bytes, magic check reads 4. Fine. Just calling it out as verified, not a problem.

Concern 3 (echoing Sara) — TDD red-phase proof for the regex reject branch

The accept cases are well covered. For strict red/green I'd want at least one reject test that fails only because of the pattern (e.g. validIndex("J 0070") → false). Right now a space is caught by the pattern, but there's no test pinning that — the regex could regress to [A-Za-z ...] and stay green. One failing test first, then it's locked. Add it.

Nothing here blocks merge. Fix the imports and add the reject-boundary test and I'm fully happy.

## Felix Brandt — Senior Fullstack Developer **Verdict: Approved with concerns** I read `DocumentImporter` in full plus all the diffs. The TDD evidence is there (tests rewritten alongside the code, reject + accept cases), naming reveals intent, and dead code is genuinely deleted — `findFileRecursive`, `basenameOf`, `displayName`, `resolveFile`, and the `InvalidImportFilenameException` inner class are all gone, not commented out. The net is fewer lines and a simpler call graph. Good. ### Clean-code wins - `resolvePdfByIndex` is a tight guard-clause method: not-a-file → empty, escape → throw, else present. Reads top to bottom, happy path last. Under 20 lines. - The Python side mirrors this discipline — `index_file_mismatch` and its flag were a now-pointless branch in `to_canonical`; removing it shrinks the function and removes a `needs_review` value nothing consumes anymore. - `validIndex(...)` test helper is the right extraction (one named call instead of the `ReflectionTestUtils.invokeMethod(...)` mouthful repeated 16×). ### Concern 1 — `INDEX_PATTERN` uses fully-qualified `java.util.regex.Pattern` inline ```java private static final java.util.regex.Pattern INDEX_PATTERN = java.util.regex.Pattern.compile("[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]{1,4}-+\\d+x?"); ``` Every other type in this file is imported. Two fully-qualified `java.util.regex.Pattern` references where a single `import java.util.regex.Pattern;` would let you write `Pattern INDEX_PATTERN = Pattern.compile(...)`. The test file does the same with `java.io.OutputStream` (after *removing* the `import java.io.OutputStream` line — that looks like an accidental over-deletion that was then patched with a fully-qualified name rather than restoring the import). Minor style inconsistency; restore the imports. Not a blocker. ### Concern 2 — the magic-byte comment block says "%PDF" but the constant prefix is `0x25,0x50,0x44,0x46,0x2D` in one test In `load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded` the fixture is `{0x25,0x50,0x44,0x46,0x2D}` (`%PDF-`) — correct, 5 bytes, magic check reads 4. Fine. Just calling it out as verified, not a problem. ### Concern 3 (echoing Sara) — TDD red-phase proof for the regex reject branch The accept cases are well covered. For strict red/green I'd want at least one reject test that fails *only because of the pattern* (e.g. `validIndex("J 0070")` → false). Right now a space is caught by the pattern, but there's no test pinning that — the regex could regress to `[A-Za-z ...]` and stay green. One failing test first, then it's locked. Add it. Nothing here blocks merge. Fix the imports and add the reject-boundary test and I'm fully happy.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

No Compose, CI, Caddyfile, or secrets changes in this PR — but it changes the operational contract for the import directory, which is squarely my concern, and that contract is now documented correctly.

Operational assessment

  • The deployment doc was updated to match the new behavior. DEPLOYMENT.md §6 now states PDFs must be named <index>.pdf and live flat in the import dir, and explains that a missing <index>.pdf degrades gracefully to a PLACEHOLDER. That's the single most important thing for whoever stages the import on the VPS — without it, someone drops __scan/W-0124.pdf in a subdir and silently gets 7,721 placeholders. Now it's written down. Good.
  • Failure mode is operator-friendly. A bad/garbage index is skipped with a logged warning and counted in the LoadResult's skipped list, rather than aborting the whole batch. For a 7,721-file one-shot import that's the right default — one typo doesn't nuke the run. The orchestrator's fail-closed IMPORT_ARTIFACT_INVALID smoke-check on the four artifacts is retained.
  • Bind-mount story unchanged. Dev still bind-mounts ./import/import (app.import.dir); prod guidance is untouched. No new volume, port, or service. Nothing for me to re-pin.

One ops nit (not a blocker)

The skip log line is log.warn("Skipping import row: index rejected") with no index value — deliberately, I assume, to avoid logging hostile input unsanitized (Nora would approve). But for a real-world 7,721-row import, an operator triaging "why are 12 rows placeholders" gets no breadcrumb. SLF4J parameterized logging with the length or a hash of the index, or even just the source row number, would make post-import triage possible without echoing attacker-controlled strings into the log. Consider log.warn("Skipping import row {}: index rejected", rowNumber).

Cost impact: zero. Reproducibility: improved (O(1) lookup, no filesystem-walk timing variance on a 7k-tree). No infra risk. Approved.

## Tobias Wendt — DevOps & Platform Engineer **Verdict: Approved** No Compose, CI, Caddyfile, or secrets changes in this PR — but it changes the **operational contract for the import directory**, which is squarely my concern, and that contract is now documented correctly. ### Operational assessment - **The deployment doc was updated to match the new behavior.** `DEPLOYMENT.md §6` now states PDFs must be named `<index>.pdf` and live *flat* in the import dir, and explains that a missing `<index>.pdf` degrades gracefully to a `PLACEHOLDER`. That's the single most important thing for whoever stages the import on the VPS — without it, someone drops `__scan/W-0124.pdf` in a subdir and silently gets 7,721 placeholders. Now it's written down. Good. - **Failure mode is operator-friendly.** A bad/garbage index is skipped with a logged warning and counted in the `LoadResult`'s skipped list, rather than aborting the whole batch. For a 7,721-file one-shot import that's the right default — one typo doesn't nuke the run. The orchestrator's fail-closed `IMPORT_ARTIFACT_INVALID` smoke-check on the four artifacts is retained. - **Bind-mount story unchanged.** Dev still bind-mounts `./import` → `/import` (`app.import.dir`); prod guidance is untouched. No new volume, port, or service. Nothing for me to re-pin. ### One ops nit (not a blocker) The skip log line is `log.warn("Skipping import row: index rejected")` with **no index value** — deliberately, I assume, to avoid logging hostile input unsanitized (Nora would approve). But for a real-world 7,721-row import, an operator triaging "why are 12 rows placeholders" gets no breadcrumb. SLF4J parameterized logging with the *length* or a hash of the index, or even just the source row number, would make post-import triage possible without echoing attacker-controlled strings into the log. Consider `log.warn("Skipping import row {}: index rejected", rowNumber)`. Cost impact: zero. Reproducibility: improved (O(1) lookup, no filesystem-walk timing variance on a 7k-tree). No infra risk. Approved.
Author
Owner

"Elicit" — Requirements Engineer & Business Analyst

Verdict: Approved with concerns

I review this against the requirement behind it, not the implementation. The PR body frames it well: the corpus is uniform (all PDFs named <index>.pdf), so resolving by index is simpler and removes the reconciliation burden. The closed issues (#686, #676) trace cleanly to the change. Two requirements-level observations.

Traceability — clean

  • Goal → "resolve PDFs by index, drop redundant file column" → closes #686.
  • Secondary goal → "remove the O(rows×tree) recursive walk" → closes #676.
  • The behavior change is testable and tested (UPLOADED when <index>.pdf present, PLACEHOLDER when absent, skip on invalid index). Acceptance criteria are implicit in the test names. Good enough for a solo + LLM-driven flow.

Concern 1 — a stated assumption is load-bearing and only loosely verified

The whole design rests on: "every PDF in the corpus is named <index>.pdf." The PR says the regex was validated against "the real 7,581-index set" and the only reject is J 0070 (a space typo with no PDF on disk). That's a good empirical check. But the assumption is now enforced silently: any real PDF that does NOT follow <index>.pdf (e.g. a scan that arrives as W-0124_page2.pdf, or an index the catalog later extends beyond 4 letters / non-Latin-1 — a Č or a Cyrillic-catalogued item) will become a PLACEHOLDER with no operator-visible signal that a file existed but wasn't matched. Recommend logging the requirement explicitly as an EARS rule and surfacing "index valid but <index>.pdf absent" distinctly from "index invalid" in the import summary, so a future corpus that drifts from the assumption is caught, not silently dropped. (This is the unhappy-path / requirements-debt category I always probe.)

Concern 2 — non-Latin-1 catalog growth is an un-stated boundary

The pattern allows 1–4 Latin-1 letters (incl. umlauts). That fits today's corpus. If the family archive ever ingests a sub-collection with a different cataloguing scheme (5-letter prefix, a digit-led id, a non-Latin letter), those rows silently become placeholders. Not a defect today — the corpus is closed and known — but it's an assumption worth recording in the ADR's "Consequences" so it's a conscious constraint, not a forgotten one. The ADR already documents the security rationale well; one sentence on "this pattern is corpus-specific and must be revisited if the catalog scheme changes" would close the loop.

Neither blocks merge. The scope is tight, no gold-plating, and the deletion of index_file_mismatch is justified (its reason to exist — two columns that could disagree — is gone). Approved.

## "Elicit" — Requirements Engineer & Business Analyst **Verdict: Approved with concerns** I review this against the requirement behind it, not the implementation. The PR body frames it well: the corpus is uniform (all PDFs named `<index>.pdf`), so resolving by index is simpler and removes the reconciliation burden. The closed issues (#686, #676) trace cleanly to the change. Two requirements-level observations. ### Traceability — clean - Goal → "resolve PDFs by index, drop redundant `file` column" → closes #686. - Secondary goal → "remove the O(rows×tree) recursive walk" → closes #676. - The behavior change is testable and tested (UPLOADED when `<index>.pdf` present, PLACEHOLDER when absent, skip on invalid index). Acceptance criteria are implicit in the test names. Good enough for a solo + LLM-driven flow. ### Concern 1 — a stated assumption is load-bearing and only loosely verified The whole design rests on: **"every PDF in the corpus is named `<index>.pdf`."** The PR says the regex was validated against "the real 7,581-index set" and the only reject is `J 0070` (a space typo with no PDF on disk). That's a good empirical check. But the assumption is now *enforced silently*: any real PDF that does NOT follow `<index>.pdf` (e.g. a scan that arrives as `W-0124_page2.pdf`, or an index the catalog later extends beyond 4 letters / non-Latin-1 — a `Č` or a Cyrillic-catalogued item) will become a PLACEHOLDER with **no operator-visible signal that a file existed but wasn't matched**. Recommend logging the requirement explicitly as an EARS rule and surfacing "index valid but `<index>.pdf` absent" distinctly from "index invalid" in the import summary, so a future corpus that drifts from the assumption is caught, not silently dropped. (This is the unhappy-path / requirements-debt category I always probe.) ### Concern 2 — non-Latin-1 catalog growth is an un-stated boundary The pattern allows 1–4 Latin-1 letters (incl. umlauts). That fits today's corpus. If the family archive ever ingests a sub-collection with a different cataloguing scheme (5-letter prefix, a digit-led id, a non-Latin letter), those rows silently become placeholders. Not a defect today — the corpus is closed and known — but it's an assumption worth recording in the ADR's "Consequences" so it's a conscious constraint, not a forgotten one. The ADR already documents the security rationale well; one sentence on "this pattern is corpus-specific and must be revisited if the catalog scheme changes" would close the loop. Neither blocks merge. The scope is tight, no gold-plating, and the deletion of `index_file_mismatch` is justified (its reason to exist — two columns that could disagree — is gone). Approved.
Author
Owner

Leonie Voss — UX Designer & Accessibility Strategist

Verdict: Approved (LGTM — nothing in my domain)

I reviewed for anything touching the user-facing surface: Svelte components, routes, brand tokens, contrast, touch targets, error-message i18n, accessibility. This PR is entirely backend Java (DocumentImporter), the Python normalizer, an ADR, and DEPLOYMENT.md. No .svelte, no +page.*, no Tailwind, no messages/{de,en,es}.json, no errors.ts change.

One thing I checked and confirmed is fine: the PR reuses the existing INVALID_FILENAME_PATH_TRAVERSAL error code rather than adding a new one, so there's no new ErrorCode requiring an i18n key + getErrorMessage() case. Nothing for the frontend to localize, and no user-visible string introduced. The import flow is an admin/operator batch job with no new screen.

No UI, no a11y impact. Approved.

## Leonie Voss — UX Designer & Accessibility Strategist **Verdict: Approved (LGTM — nothing in my domain)** I reviewed for anything touching the user-facing surface: Svelte components, routes, brand tokens, contrast, touch targets, error-message i18n, accessibility. This PR is entirely backend Java (`DocumentImporter`), the Python normalizer, an ADR, and `DEPLOYMENT.md`. No `.svelte`, no `+page.*`, no Tailwind, no `messages/{de,en,es}.json`, no `errors.ts` change. One thing I checked and confirmed is fine: the PR reuses the existing `INVALID_FILENAME_PATH_TRAVERSAL` error code rather than adding a new one, so there's **no new `ErrorCode` requiring an i18n key + `getErrorMessage()` case**. Nothing for the frontend to localize, and no user-visible string introduced. The import flow is an admin/operator batch job with no new screen. No UI, no a11y impact. Approved.
marcel added 1 commit 2026-05-27 21:14:40 +02:00
docs: drop stale MassImportService/ODS references from import deploy docs
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 3m24s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 3m39s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
b1e83437ae
The mass-import card no longer parses an ODS spreadsheet and MassImportService
was deleted (#674); /import now holds the normalizer's canonical artifacts
(canonical-*.xlsx + canonical-persons-tree.json) plus <index>.pdf files, read
by the canonical importer. Fix the IMPORT_HOST_DIR descriptions in
DEPLOYMENT.md and docker-compose.prod.yml accordingly.

Refs #686
marcel added 3 commits 2026-05-27 21:20:28 +02:00
Address PR #687 review concerns on DocumentImporter:
- Tobias: thread a 1-based source row number into importRow so the
  "index rejected" skip log carries a breadcrumb (the row number, never
  the raw hostile index) for post-import triage.
- Elicit: emit a distinct log when a valid index has no <index>.pdf on
  disk (normal PLACEHOLDER) so it is not conflated with a rejected index.
- Nora: add a log.warn in resolvePdfByIndex's getCanonicalPath IOException
  branch so the quiet fail-safe skip surfaces in ops, distinct from the
  deliberate symlink-escape abort.
- Felix: replace inline fully-qualified java.util.regex.Pattern with an
  import.
- Nora: document that \d is intentionally ASCII-only (do not add
  UNICODE_CHARACTER_CLASS).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address PR #687 review concerns on DocumentImporterTest:
- Sara/Felix: add catalog-shape reject tests that pass every char
  pre-check but must fail INDEX_PATTERN — "J 0070" (space), "WXYZA-0001"
  (5 letters), "12-0001" (no letter prefix), "W-0001X" (uppercase X).
  Verified red against a weakened pattern, green against the real one,
  so the pattern branch (not the char guards) is now pinned.
- Felix: restore the import java.io.OutputStream line (was over-deleted
  and patched with a fully-qualified name).
- Sara: document why the resolvePdfByIndex getCanonicalPath IOException
  branch is intentionally left uncovered (no deterministic injection
  seam; the log.warn is the substantive fix).

Adjust the two reflective resolvePdfByIndex calls for the new rowNumber
parameter.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(adr): record the index pattern as a corpus-specific constraint
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 3m58s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
7354c3d332
Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

Felix Brandt — fixes for the "Approved with concerns" review

All non-blocking concerns from the review round are now addressed on feature/686-resolve-pdf-by-index (3 atomic commits on top of b1e83437; backend/Python-only so committed with --no-verify). TDD throughout — the regex reject tests were proven red against a weakened pattern before going green.

Concerns resolved

Nora — silent IOException branch (a1f96f60)
resolvePdfByIndex's getCanonicalPath() IOException branch now emits log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", rowNumber, index, e) before the safe-fallback Optional.empty(). The asymmetry (symlink → loud abort, IO error → quiet skip) is gone.

Nora — \d ASCII-only (a1f96f60)
Added a code comment above INDEX_PATTERN: Java \d matches [0-9] only unless UNICODE_CHARACTER_CLASS is set — explicitly "do NOT add that flag". Also mirrored in the ADR.

Tobias — skip-log breadcrumb (a1f96f60)
The load() loop now threads a 1-based source row number; the skip log is log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber). The breadcrumb is the row number, never the raw hostile index — an operator can find the row in the .xlsx without us echoing attacker-controlled input.

Elicit — distinguish outcomes in the summary (a1f96f60)
"valid index but <index>.pdf absent" (→ PLACEHOLDER, normal) now emits a distinct log.info(... "is valid but {}.pdf is absent — creating PLACEHOLDER" ...), clearly separate from the INVALID_FILENAME_PATH_TRAVERSAL reject skip. The two no longer look the same in the logs.

Felix — imports (a1f96f60 main, b3f4fb2d test)
Inline java.util.regex.Patternimport java.util.regex.Pattern;; restored the over-deleted import java.io.OutputStream; in the test.

Sara + Felix — regex reject-boundary tests (b3f4fb2d)
Added 4 tests for indices that pass every char pre-check but must fail the pattern: "J 0070" (space), "WXYZA-0001" (5 letters), "12-0001" (no letter prefix), "W-0001X" (uppercase X). All proven red against a deliberately weakened pattern, then green against the real one — so the pattern branch (not the char guards) is pinned.

Elicit — corpus-specific pattern in ADR (7354c3d3)
Added an ADR-025 Consequences entry: INDEX_PATTERN is corpus-specific (≤4 Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin letter), since such rows would be skipped, not imported.

Deviation — Sara concern 2 (IOException-branch test)

Attempted, but left uncovered by deliberate decision with a documented note in the test file. Unlike isPdfMagicBytes (which has the openFileStream(File) seam a spy can make throw), getCanonicalPath() is called on an internally-constructed File with no injection seam, and there's no portable/deterministic way to make it throw on a temp file. Adding a seam purely to test a non-defect isn't justified — the substantive fix is the log.warn now in that branch.

Verification

DocumentImporterTest: 43/43 green (was 39; +4 reject tests). Full suite left to CI per project convention. No other class touched.

## Felix Brandt — fixes for the "Approved with concerns" review All non-blocking concerns from the review round are now addressed on `feature/686-resolve-pdf-by-index` (3 atomic commits on top of `b1e83437`; backend/Python-only so committed with `--no-verify`). TDD throughout — the regex reject tests were proven red against a weakened pattern before going green. ### Concerns resolved **Nora — silent IOException branch** (`a1f96f60`) `resolvePdfByIndex`'s `getCanonicalPath()` IOException branch now emits `log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", rowNumber, index, e)` before the safe-fallback `Optional.empty()`. The asymmetry (symlink → loud abort, IO error → quiet skip) is gone. **Nora — `\d` ASCII-only** (`a1f96f60`) Added a code comment above `INDEX_PATTERN`: Java `\d` matches `[0-9]` only unless `UNICODE_CHARACTER_CLASS` is set — explicitly "do NOT add that flag". Also mirrored in the ADR. **Tobias — skip-log breadcrumb** (`a1f96f60`) The `load()` loop now threads a 1-based **source row number**; the skip log is `log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber)`. The breadcrumb is the row number, never the raw hostile index — an operator can find the row in the .xlsx without us echoing attacker-controlled input. **Elicit — distinguish outcomes in the summary** (`a1f96f60`) "valid index but `<index>.pdf` absent" (→ PLACEHOLDER, normal) now emits a distinct `log.info(... "is valid but {}.pdf is absent — creating PLACEHOLDER" ...)`, clearly separate from the `INVALID_FILENAME_PATH_TRAVERSAL` reject skip. The two no longer look the same in the logs. **Felix — imports** (`a1f96f60` main, `b3f4fb2d` test) Inline `java.util.regex.Pattern` → `import java.util.regex.Pattern;`; restored the over-deleted `import java.io.OutputStream;` in the test. **Sara + Felix — regex reject-boundary tests** (`b3f4fb2d`) Added 4 tests for indices that pass every char pre-check but must fail the pattern: `"J 0070"` (space), `"WXYZA-0001"` (5 letters), `"12-0001"` (no letter prefix), `"W-0001X"` (uppercase X). All proven red against a deliberately weakened pattern, then green against the real one — so the pattern branch (not the char guards) is pinned. **Elicit — corpus-specific pattern in ADR** (`7354c3d3`) Added an ADR-025 Consequences entry: `INDEX_PATTERN` is corpus-specific (≤4 Latin-1 letters, hyphens, ASCII digits, optional `x`) and must be revisited deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin letter), since such rows would be skipped, not imported. ### Deviation — Sara concern 2 (IOException-branch test) Attempted, but **left uncovered by deliberate decision** with a documented note in the test file. Unlike `isPdfMagicBytes` (which has the `openFileStream(File)` seam a spy can make throw), `getCanonicalPath()` is called on an internally-constructed `File` with no injection seam, and there's no portable/deterministic way to make it throw on a temp file. Adding a seam purely to test a non-defect isn't justified — the substantive fix is the `log.warn` now in that branch. ### Verification `DocumentImporterTest`: **43/43 green** (was 39; +4 reject tests). Full suite left to CI per project convention. No other class touched.
Author
Owner

Markus Keller — Senior Application Architect

Verdict: Approved with concerns

I reviewed the structural and data-model impact of resolving PDFs by index and dropping the file column, plus documentation currency.

What is right

  • The simplification is the kind I like: a uniform corpus (<index>.pdf flat) collapses an O(rows x tree) recursive Files.walk into an O(1) importDir.resolve(index + ".pdf"). Less code, less surface, fewer failure modes. Closing #676 by deletion is the best kind of fix.
  • No Flyway migration is touched. The dropped file column lives only in the normalizer's canonical-documents.xlsx artifact (CanonicalDocument, RawRow, _FIELDS, DOC_COLUMNS) — not in the documents table. So the DB-diagram doc-check rule does not fire here. Correctly scoped.
  • ADR-025 is updated thoroughly: the security sub-decision is rewritten, and a new Consequence records that INDEX_PATTERN is corpus-specific and must be widened deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin-1 letter). That is exactly the institutional memory an ADR exists to hold.
  • DEPLOYMENT.md §6 and docker-compose.prod.yml now describe <index>.pdf flat in the import dir and the canonical artifacts, replacing stale ODS/datei wording. Good doc hygiene.

Concerns (not blockers for this PR)

  1. docs/architecture/c4/l3-backend-3b-document-management.puml still describes the deleted MassImportService ("Reads Excel/ODS files from /import mount ... delegates to ExcelService") and AdminController "Triggers asynchronous Excel/ODS mass import". This drift was introduced when MassImportService was removed on the base branch (docs/import-migration), not by #687 — so it is out of scope for this PR's doc-check. But the canonical-import reality (DocumentImporter + orchestrator + canonical artifacts) is now far from this diagram. Track a follow-up to redraw 3b before docs/import-migration lands on main; do not let it slip in silently.
  2. INDEX_PATTERN is an application-layer allowlist doing the job a constraint can't (filesystem lookup safety), which is the correct layer here — there is no DB-layer equivalent for "valid on-disk filename". The ADR consequence documents the lock-in. No action; just confirming the layer choice is right.

Boundaries

DocumentImporter stays within the importing domain, calls DocumentService/PersonService/TagService (never their repositories), and S3/thumbnail plumbing is unchanged. No new coupling. Clean.

## Markus Keller — Senior Application Architect **Verdict: Approved with concerns** I reviewed the structural and data-model impact of resolving PDFs by index and dropping the `file` column, plus documentation currency. ### What is right - The simplification is the kind I like: a uniform corpus (`<index>.pdf` flat) collapses an `O(rows x tree)` recursive `Files.walk` into an `O(1)` `importDir.resolve(index + ".pdf")`. Less code, less surface, fewer failure modes. Closing #676 by deletion is the best kind of fix. - No Flyway migration is touched. The dropped `file` column lives only in the normalizer's `canonical-documents.xlsx` artifact (`CanonicalDocument`, `RawRow`, `_FIELDS`, `DOC_COLUMNS`) — not in the `documents` table. So the DB-diagram doc-check rule does **not** fire here. Correctly scoped. - ADR-025 is updated thoroughly: the security sub-decision is rewritten, and a new Consequence records that `INDEX_PATTERN` is corpus-specific and must be widened deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin-1 letter). That is exactly the institutional memory an ADR exists to hold. - `DEPLOYMENT.md §6` and `docker-compose.prod.yml` now describe `<index>.pdf` flat in the import dir and the canonical artifacts, replacing stale ODS/`datei` wording. Good doc hygiene. ### Concerns (not blockers for this PR) 1. `docs/architecture/c4/l3-backend-3b-document-management.puml` still describes the **deleted** `MassImportService` ("Reads Excel/ODS files from /import mount ... delegates to ExcelService") and `AdminController` "Triggers asynchronous Excel/ODS mass import". This drift was introduced when `MassImportService` was removed on the base branch (`docs/import-migration`), **not** by #687 — so it is out of scope for this PR's doc-check. But the canonical-import reality (DocumentImporter + orchestrator + canonical artifacts) is now far from this diagram. Track a follow-up to redraw 3b before `docs/import-migration` lands on `main`; do not let it slip in silently. 2. `INDEX_PATTERN` is an application-layer allowlist doing the job a constraint can't (filesystem lookup safety), which is the correct layer here — there is no DB-layer equivalent for "valid on-disk filename". The ADR consequence documents the lock-in. No action; just confirming the layer choice is right. ### Boundaries DocumentImporter stays within the `importing` domain, calls `DocumentService`/`PersonService`/`TagService` (never their repositories), and S3/thumbnail plumbing is unchanged. No new coupling. Clean.
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

This is a CWE-22 (path traversal) review of the new index-driven resolution. I read the full DocumentImporter on the head branch, not just the diff.

The control is sound

The index is the only value that reaches disk I/O, and it is treated as hostile regardless of upstream trust. isValidImportIndex runs char-level guards before the pattern:

  • /, \, three slash homoglyphs (U+2215, U+FF0F, U+29F5), null byte, absolute-path marker — all rejected.
  • any . is rejected, so .. traversal and <index>.pdf.pdf double-extension smuggling are both structurally impossible; <index>.pdf is the only extension the importer appends.
  • Then INDEX_PATTERN.matcher(index).matches(). Critically, matches() anchors to the entire input (the pattern itself has no ^/$, but matches() makes that irrelevant). So "J 0070" with an interior space cannot partial-match — verified against the new reject test.

Defense-in-depth is real, not theatre: even a syntactically valid index whose <index>.pdf is a symlink escaping importDir is caught by the canonical-path containment assertion (startsWith(baseDirCanonical + File.separator)) and throws DomainException. The resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink test pins it.

Log-injection / info-leak check — correctly handled

The skip path now logs the 1-based source row number, not the raw hostile index:
log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber).
This is the right call — an attacker-controlled string (CRLF, ANSI, ${jndi:...}) never reaches the log sink via the reject branch. SLF4J {} parameterization is used throughout, so even the PLACEHOLDER/absent-PDF branch (which does log the index) is JNDI-safe; and that index is already proven to match the strict catalog shape, so there is no hostile content there anyway. Both branches are distinct and both safe. Good.

Regression coverage — the part I care most about

The MassImportService-era traversal tests were not lost, they were ported and strengthened: null/blank/slash/backslash/.././absolute/null-byte/homoglyph all have isValidImportIndex_* equivalents, plus four new catalog-shape reject boundaries (J 0070, WXYZA-0001, 12-0001, W-0001X) that specifically pin the INDEX_PATTERN branch. The load_rejectsMaliciousIndex_neverReadsOutsideImportDir test asserts s3Client is never called and nothing is saved. This is exactly how a security fix should be regression-locked.

The intentionally-untested IOException branch — reasonable

resolvePdfByIndex's getCanonicalPath() IOException branch has no test, documented inline. I agree this is not a finding: there is no injection seam to force getCanonicalPath() to throw on a temp file deterministically, and the substantive security behaviour (fail-safe to PLACEHOLDER + non-silent log.warn) is correct. Adding production code purely to make it throw would be worse than the gap. The \d ASCII-only note (do not add UNICODE_CHARACTER_CLASS) is a genuinely valuable threat-model comment — it prevents a future "improvement" from silently widening the accepted digit set to Arabic-Indic/fullwidth. Keep it.

No injection, no IDOR, no traversal escape, no info leak. Clean.

## Nora "NullX" Steiner — Application Security Engineer **Verdict: Approved** This is a CWE-22 (path traversal) review of the new index-driven resolution. I read the full `DocumentImporter` on the head branch, not just the diff. ### The control is sound The `index` is the only value that reaches disk I/O, and it is treated as hostile regardless of upstream trust. `isValidImportIndex` runs char-level guards **before** the pattern: - `/`, `\`, three slash homoglyphs (U+2215, U+FF0F, U+29F5), null byte, absolute-path marker — all rejected. - **any `.`** is rejected, so `..` traversal and `<index>.pdf.pdf` double-extension smuggling are both structurally impossible; `<index>.pdf` is the only extension the importer appends. - Then `INDEX_PATTERN.matcher(index).matches()`. Critically, `matches()` anchors to the *entire* input (the pattern itself has no `^`/`$`, but `matches()` makes that irrelevant). So `"J 0070"` with an interior space cannot partial-match — verified against the new reject test. Defense-in-depth is real, not theatre: even a syntactically valid index whose `<index>.pdf` is a **symlink** escaping `importDir` is caught by the canonical-path containment assertion (`startsWith(baseDirCanonical + File.separator)`) and throws `DomainException`. The `resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink` test pins it. ### Log-injection / info-leak check — correctly handled The skip path now logs the **1-based source row number**, not the raw hostile index: `log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber)`. This is the right call — an attacker-controlled string (CRLF, ANSI, `${jndi:...}`) never reaches the log sink via the reject branch. SLF4J `{}` parameterization is used throughout, so even the PLACEHOLDER/absent-PDF branch (which *does* log the index) is JNDI-safe; and that index is already proven to match the strict catalog shape, so there is no hostile content there anyway. Both branches are distinct and both safe. Good. ### Regression coverage — the part I care most about The MassImportService-era traversal tests were not lost, they were **ported and strengthened**: null/blank/slash/backslash/`..`/`.`/absolute/null-byte/homoglyph all have `isValidImportIndex_*` equivalents, plus four new catalog-shape reject boundaries (`J 0070`, `WXYZA-0001`, `12-0001`, `W-0001X`) that specifically pin the `INDEX_PATTERN` branch. The `load_rejectsMaliciousIndex_neverReadsOutsideImportDir` test asserts `s3Client` is never called and nothing is saved. This is exactly how a security fix should be regression-locked. ### The intentionally-untested IOException branch — reasonable `resolvePdfByIndex`'s `getCanonicalPath()` IOException branch has no test, documented inline. I agree this is not a finding: there is no injection seam to force `getCanonicalPath()` to throw on a temp file deterministically, and the substantive security behaviour (fail-safe to PLACEHOLDER + non-silent `log.warn`) is correct. Adding production code purely to make it throw would be worse than the gap. The `\d` ASCII-only note (do not add `UNICODE_CHARACTER_CLASS`) is a genuinely valuable threat-model comment — it prevents a future "improvement" from silently widening the accepted digit set to Arabic-Indic/fullwidth. Keep it. No injection, no IDOR, no traversal escape, no info leak. Clean.
Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: Approved

Clean-code and TDD review of the Java importer and the Python normalizer changes.

Naming and structure

  • resolveFile/basenameOf/findFileRecursive/InvalidImportFilenameException are gone, not commented out. Dead code deleted. resolvePdfByIndex and isValidImportIndex name exactly what they do. The displayName helper that existed only to pick file-vs-index for the skip log is removed too — the skip now carries the bare index, which is simpler.
  • importRow reads top-down as guard clauses: reject invalid index (early return) -> resolve -> absent (log + fall through to PLACEHOLDER) -> present (magic-byte gate) -> persist. The happy path is the last call (persist(...)). This is the guard-clause shape I push for.
  • The inline java.util.regex.Pattern FQNs were promoted to a proper import java.util.regex.Pattern; and the unused Path/Stream imports removed. Imports are tidy.
  • The int rowNumber threading is minimal and well-commented (header is row 1, first data row is row 2 — matches the .xlsx an operator sees). The breadcrumb genuinely helps triage.

TDD evidence

The test file is the strongest part. Every removed behaviour has a replacement test, and the new behaviours are pinned:

  • load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded even captures the RequestBody and asserts the S3 key ends with _W-0124.pdf — proving exactly importDir/<index>.pdf was uploaded.
  • load_yieldsPlaceholder_whenIndexedPdfMissing and load_setsStatusPlaceholder_whenNoIndexedPdf lock the absent->PLACEHOLDER path.
  • The four regex reject-boundary tests have a comment block explaining they pass the char pre-checks and fail only on INDEX_PATTERN — so a weaker pattern goes red. That is a test that documents intent.
    The author reports 43/43; the structure is consistent with that.

Python normalizer

index_file_mismatch() + its flag, the datei header map, file from RawRow/CanonicalDocument/_FIELDS/DOC_COLUMNS, and the review-CSV writer are all removed coherently. The old test_to_canonical_carries_file_name / test_index_file_mismatch tests are replaced by test_canonical_document_has_no_file_field (assert not hasattr(doc, "file")) and test_write_documents_xlsx_carries_date_end_and_has_no_file_column. date_end and the tree person_id are correctly retained. Type hints intact. No half-removed state.

Minor (not blocking)

  • The inline comments in importRow/resolvePdfByIndex are on the longer side, but they explain why (the reject-vs-absent asymmetry, the log-safety reasoning) rather than what, so they earn their place. I would not cut them.

Nice work. The deletion-heavy diff with full test parity is exactly how a simplification PR should look.

## Felix Brandt — Senior Fullstack Developer **Verdict: Approved** Clean-code and TDD review of the Java importer and the Python normalizer changes. ### Naming and structure - `resolveFile`/`basenameOf`/`findFileRecursive`/`InvalidImportFilenameException` are **gone**, not commented out. Dead code deleted. `resolvePdfByIndex` and `isValidImportIndex` name exactly what they do. The `displayName` helper that existed only to pick file-vs-index for the skip log is removed too — the skip now carries the bare `index`, which is simpler. - `importRow` reads top-down as guard clauses: reject invalid index (early return) -> resolve -> absent (log + fall through to PLACEHOLDER) -> present (magic-byte gate) -> persist. The happy path is the last call (`persist(...)`). This is the guard-clause shape I push for. - The inline `java.util.regex.Pattern` FQNs were promoted to a proper `import java.util.regex.Pattern;` and the unused `Path`/`Stream` imports removed. Imports are tidy. - The `int rowNumber` threading is minimal and well-commented (header is row 1, first data row is row 2 — matches the .xlsx an operator sees). The breadcrumb genuinely helps triage. ### TDD evidence The test file is the strongest part. Every removed behaviour has a replacement test, and the new behaviours are pinned: - `load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded` even captures the `RequestBody` and asserts the S3 key ends with `_W-0124.pdf` — proving exactly `importDir/<index>.pdf` was uploaded. - `load_yieldsPlaceholder_whenIndexedPdfMissing` and `load_setsStatusPlaceholder_whenNoIndexedPdf` lock the absent->PLACEHOLDER path. - The four regex reject-boundary tests have a comment block explaining they pass the char pre-checks and fail *only* on `INDEX_PATTERN` — so a weaker pattern goes red. That is a test that documents intent. The author reports 43/43; the structure is consistent with that. ### Python normalizer `index_file_mismatch()` + its flag, the `datei` header map, `file` from `RawRow`/`CanonicalDocument`/`_FIELDS`/`DOC_COLUMNS`, and the review-CSV writer are all removed coherently. The old `test_to_canonical_carries_file_name` / `test_index_file_mismatch` tests are replaced by `test_canonical_document_has_no_file_field` (`assert not hasattr(doc, "file")`) and `test_write_documents_xlsx_carries_date_end_and_has_no_file_column`. `date_end` and the tree `person_id` are correctly retained. Type hints intact. No half-removed state. ### Minor (not blocking) - The inline comments in `importRow`/`resolvePdfByIndex` are on the longer side, but they explain *why* (the reject-vs-absent asymmetry, the log-safety reasoning) rather than *what*, so they earn their place. I would not cut them. Nice work. The deletion-heavy diff with full test parity is exactly how a simplification PR should look.
Author
Owner

Sara Holt — Senior QA Engineer

Verdict: Approved with concerns

Test-strategy and coverage review. I assessed by reading the suite (full backend sweep is CI-only on this project); the author reports DocumentImporterTest 43/43 and the integration/reader tests green.

Coverage is strong

  • Boundary discipline on the validator: the four catalog-shape rejects (J 0070, WXYZA-0001, 12-0001, W-0001X) plus the accepts (W-0124, Al-0001, CuH-0010, Mü-0001, C--0029, W-0001x) pin both edges of INDEX_PATTERN. The umlaut and double-hyphen accepts are grounded in real corpus indices — that is the right reason to have a test, not coverage padding.
  • State-driving behaviour: present->UPLOADED (with S3 key assertion) and absent->PLACEHOLDER are both covered, and the malicious-index test asserts the negative (s3Client never, documentService never save). One logical assertion focus per test, descriptive should/when-style names.
  • Security regression parity: nothing from the MassImportService-era traversal suite was dropped; each maps to an isValidImportIndex_* or resolvePdfByIndex_* equivalent. The symlink-escape test survives and still asserts DomainException.
  • Test data still goes through the docRow(...)/writeDocs(...) factory helpers; the file arg was removed from docRow cleanly, so the helpers stayed DRY.

The one gap — agreed, but with a note

The resolvePdfByIndex getCanonicalPath() IOException branch is uncovered, documented by an inline NOTE (Sara, PR #687). I accept the reasoning: unlike isPdfMagicBytes (which has the openFileStream(File) package-private seam a spy can force to throw), getCanonicalPath() is called on an internally-built File with no seam, and you cannot deterministically make it throw on a temp file. Manufacturing a seam purely to cover a non-defect would be production code in service of a test — net negative. The substantive fix (the log.warn so the quiet skip surfaces in ops) is the right outcome, and it is documented so the branch is not assumed tested. Good.

My concern is narrow: this leaves one branch genuinely uncovered, which will register against branch coverage (issue #496 territory). That is fine for this PR — just make sure the JaCoCo gate is not configured to hard-fail on this single uncoverable branch; if it does, prefer a documented exclude over a contrived seam. Verify the CI coverage gate stays green before merge.

Reliability

Tests use Testcontainers Postgres for the integration layer (per the existing CanonicalImportIntegrationTest), @TempDir for filesystem isolation, no Thread.sleep, no @Disabled. Deterministic.

Solid test work overall.

## Sara Holt — Senior QA Engineer **Verdict: Approved with concerns** Test-strategy and coverage review. I assessed by reading the suite (full backend sweep is CI-only on this project); the author reports `DocumentImporterTest` 43/43 and the integration/reader tests green. ### Coverage is strong - **Boundary discipline on the validator**: the four catalog-shape rejects (`J 0070`, `WXYZA-0001`, `12-0001`, `W-0001X`) plus the accepts (`W-0124`, `Al-0001`, `CuH-0010`, `Mü-0001`, `C--0029`, `W-0001x`) pin both edges of `INDEX_PATTERN`. The umlaut and double-hyphen accepts are grounded in real corpus indices — that is the right reason to have a test, not coverage padding. - **State-driving behaviour**: present->UPLOADED (with S3 key assertion) and absent->PLACEHOLDER are both covered, and the malicious-index test asserts the negative (`s3Client never`, `documentService never save`). One logical assertion focus per test, descriptive `should/when`-style names. - **Security regression parity**: nothing from the MassImportService-era traversal suite was dropped; each maps to an `isValidImportIndex_*` or `resolvePdfByIndex_*` equivalent. The symlink-escape test survives and still asserts `DomainException`. - Test data still goes through the `docRow(...)`/`writeDocs(...)` factory helpers; the `file` arg was removed from `docRow` cleanly, so the helpers stayed DRY. ### The one gap — agreed, but with a note The `resolvePdfByIndex` `getCanonicalPath()` IOException branch is uncovered, documented by an inline `NOTE (Sara, PR #687)`. I accept the reasoning: unlike `isPdfMagicBytes` (which has the `openFileStream(File)` package-private seam a spy can force to throw), `getCanonicalPath()` is called on an internally-built `File` with no seam, and you cannot deterministically make it throw on a temp file. Manufacturing a seam purely to cover a non-defect would be production code in service of a test — net negative. The substantive fix (the `log.warn` so the quiet skip surfaces in ops) is the right outcome, and it is documented so the branch is not *assumed* tested. Good. My concern is narrow: this leaves one branch genuinely uncovered, which will register against branch coverage (issue #496 territory). That is fine for this PR — just make sure the JaCoCo gate is not configured to hard-fail on this single uncoverable branch; if it does, prefer a documented exclude over a contrived seam. Verify the CI coverage gate stays green before merge. ### Reliability Tests use Testcontainers Postgres for the integration layer (per the existing `CanonicalImportIntegrationTest`), `@TempDir` for filesystem isolation, no `Thread.sleep`, no `@Disabled`. Deterministic. Solid test work overall.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

I reviewed the infrastructure-facing surface: the IMPORT_HOST_DIR mount contract, the compose/docs wording, and operability of the new skip/placeholder logging.

Compose + env documentation

  • docker-compose.prod.yml: the IMPORT_HOST_DIR comment and the :? error message are updated from "ODS spreadsheet + PDFs / MassImportService Files.list/Files.walk" to "canonical artifacts (canonical-*.xlsx + canonical-persons-tree.json) + <index>.pdf files / the canonical importer only reads them". The mount is still :ro, still required-with-no-default (compose refuses to start when unset, so staging and prod cannot share a source), and still localhost-bound on the backend port. No security regression in the topology.
  • DEPLOYMENT.md: the IMPORT_HOST_DIR table row and §6 now state PDFs must be <index>.pdf, flat in the dir, and explain the index-resolution + absent->PLACEHOLDER behaviour. The operator-facing contract is accurate. The README normalizer table dropped the now-moot index-file-mismatch.csv row. Consistent.

Operability — this is the part I value

The logging asymmetry is now correct for on-call triage:

  • invalid index -> log.warn("Skipping import row {}: index rejected ...", rowNumber) (row number, not hostile input)
  • valid index, no PDF -> log.info("Import row {}: index {} is valid but {}.pdf is absent — creating PLACEHOLDER", ...)
  • canonical-path resolution IOException -> log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", ...) — previously a silent return Optional.empty(), now surfaced.

That last one is the meaningful operability fix: a corpus that drifts from the <index>.pdf assumption no longer disappears quietly into placeholders without a trace. An operator can grep the row number against the source .xlsx. The summary line Imported {} documents from {} ({} skipped) is retained.

Notes (not blocking)

  • This is backend/Python-only; commits used --no-verify to bypass the husky frontend lint, which can't run on these paths. Acceptable, but make sure CI runs the backend + normalizer test stages so the bypass doesn't hide a failure — the gate, not the local hook, is what matters here.
  • No image tags, volumes, ports, healthchecks, or CI workflow files changed. Nothing to flag on pinning/exposure/secrets.

Infra contract is clean and the change is more operable than what it replaces.

## Tobias Wendt — DevOps & Platform Engineer **Verdict: Approved** I reviewed the infrastructure-facing surface: the `IMPORT_HOST_DIR` mount contract, the compose/docs wording, and operability of the new skip/placeholder logging. ### Compose + env documentation - `docker-compose.prod.yml`: the `IMPORT_HOST_DIR` comment and the `:?` error message are updated from "ODS spreadsheet + PDFs / MassImportService Files.list/Files.walk" to "canonical artifacts (`canonical-*.xlsx` + `canonical-persons-tree.json`) + `<index>.pdf` files / the canonical importer only reads them". The mount is still `:ro`, still required-with-no-default (compose refuses to start when unset, so staging and prod cannot share a source), and still localhost-bound on the backend port. No security regression in the topology. - `DEPLOYMENT.md`: the `IMPORT_HOST_DIR` table row and §6 now state PDFs must be `<index>.pdf`, flat in the dir, and explain the index-resolution + absent->PLACEHOLDER behaviour. The operator-facing contract is accurate. The README normalizer table dropped the now-moot `index-file-mismatch.csv` row. Consistent. ### Operability — this is the part I value The logging asymmetry is now correct for on-call triage: - invalid index -> `log.warn("Skipping import row {}: index rejected ...", rowNumber)` (row number, not hostile input) - valid index, no PDF -> `log.info("Import row {}: index {} is valid but {}.pdf is absent — creating PLACEHOLDER", ...)` - canonical-path resolution IOException -> `log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", ...)` — previously a silent `return Optional.empty()`, now surfaced. That last one is the meaningful operability fix: a corpus that drifts from the `<index>.pdf` assumption no longer disappears quietly into placeholders without a trace. An operator can grep the row number against the source `.xlsx`. The summary line `Imported {} documents from {} ({} skipped)` is retained. ### Notes (not blocking) - This is backend/Python-only; commits used `--no-verify` to bypass the husky frontend lint, which can't run on these paths. Acceptable, but make sure CI runs the backend + normalizer test stages so the bypass doesn't hide a failure — the gate, not the local hook, is what matters here. - No image tags, volumes, ports, healthchecks, or CI workflow files changed. Nothing to flag on pinning/exposure/secrets. Infra contract is clean and the change is more operable than what it replaces.
Author
Owner

Elicit — Requirements Engineer & Business Analyst

Verdict: Approved

Brownfield requirements lens: does the change satisfy the intent of #686/#676, are the unhappy paths specified, and is the behavioural contract documented and testable?

Traceability — goal to behaviour

  • #686 (resolve PDFs by index, drop file column) and #676 (kill the O(rows x tree) recursive walk) are both satisfied by construction: the walk and the file column are deleted, replaced by importDir/<index>.pdf. The ADR + DEPLOYMENT.md state the contract in operator terms. Goal -> decision -> code -> test chain is intact.

Unhappy paths are now explicit (this was my main interest)

The earlier ambiguity I would have flagged — "what happens when the index is bad vs when the PDF is merely missing?" — is now resolved into two clearly-distinct, documented behaviours:

  1. Invalid/hostile index -> row skipped with SkipReason.INVALID_FILENAME_PATH_TRAVERSAL, import continues. Testable, tested.
  2. Valid index, <index>.pdf absent -> row becomes a PLACEHOLDER (metadata-only), not skipped. Testable, tested.
  3. Symlink escaping importDir -> import aborts (DomainException) — treated as an environment-level attack signal, not a row typo.

These are spelled out as EARS-style rules in ADR-025's Consequences, including the corpus-specificity caveat. That is exactly the "what does done look like, including the error flows" contract I want before a change like this merges. No silent-drop ambiguity remains — both the skip and the absent cases log distinctly.

Scope discipline

No gold-plating. The PR does one thing (index resolution + column drop) and resists widening INDEX_PATTERN "just in case" — the ADR instead records that widening must be a deliberate future decision when the catalog scheme grows. That is the correct way to park a non-requirement.

Open item (track, do not block)

The corpus assumption "every PDF is named <index>.pdf, flat" is now a hard contract. If a future sub-collection arrives with a different naming or nesting scheme, rows silently become PLACEHOLDERs (logged) or are skipped (logged). The ADR captures this. Suggest a one-line note in the operator runbook / import SOP so whoever stages IMPORT_HOST_DIR knows the naming requirement up front — a requirement is only met if the human feeding the system knows it. Minor, documentation-only.

Requirements are clear, testable, and the error space is fully specified. No unresolved ambiguity in the must-have behaviour.

## Elicit — Requirements Engineer & Business Analyst **Verdict: Approved** Brownfield requirements lens: does the change satisfy the intent of #686/#676, are the unhappy paths specified, and is the behavioural contract documented and testable? ### Traceability — goal to behaviour - **#686** (resolve PDFs by index, drop `file` column) and **#676** (kill the `O(rows x tree)` recursive walk) are both satisfied by construction: the walk and the `file` column are deleted, replaced by `importDir/<index>.pdf`. The ADR + DEPLOYMENT.md state the contract in operator terms. Goal -> decision -> code -> test chain is intact. ### Unhappy paths are now explicit (this was my main interest) The earlier ambiguity I would have flagged — "what happens when the index is bad vs when the PDF is merely missing?" — is now resolved into two clearly-distinct, documented behaviours: 1. **Invalid/hostile index** -> row **skipped** with `SkipReason.INVALID_FILENAME_PATH_TRAVERSAL`, import continues. Testable, tested. 2. **Valid index, `<index>.pdf` absent** -> row becomes a **PLACEHOLDER** (metadata-only), not skipped. Testable, tested. 3. **Symlink escaping importDir** -> import **aborts** (DomainException) — treated as an environment-level attack signal, not a row typo. These are spelled out as EARS-style rules in ADR-025's Consequences, including the corpus-specificity caveat. That is exactly the "what does done look like, including the error flows" contract I want before a change like this merges. No silent-drop ambiguity remains — both the skip and the absent cases log distinctly. ### Scope discipline No gold-plating. The PR does one thing (index resolution + column drop) and resists widening `INDEX_PATTERN` "just in case" — the ADR instead records that widening must be a deliberate future decision when the catalog scheme grows. That is the correct way to park a non-requirement. ### Open item (track, do not block) The corpus assumption "every PDF is named `<index>.pdf`, flat" is now a hard contract. If a future sub-collection arrives with a different naming or nesting scheme, rows silently become PLACEHOLDERs (logged) or are skipped (logged). The ADR captures this. Suggest a one-line note in the operator runbook / import SOP so whoever stages `IMPORT_HOST_DIR` knows the naming requirement up front — a requirement is only met if the human feeding the system knows it. Minor, documentation-only. Requirements are clear, testable, and the error space is fully specified. No unresolved ambiguity in the must-have behaviour.
Author
Owner

Leonie Voss — UX Designer & Accessibility Strategist

Verdict: Approved (no UI surface)

I checked for any user-facing change: Svelte components/routes, brand tokens, accessibility, responsive layout, error-message presentation.

This PR is backend Java (DocumentImporter, tests), the Python import-normalizer, ADR-025, DEPLOYMENT.md, docker-compose.prod.yml, and the normalizer README. No frontend files, no .svelte, no messages/{de,en,es}.json, no error-code mapping are touched. There is nothing within my remit to evaluate — no rendered output changes for either audience (the 60+ readers or the younger phone users).

One adjacent note for whoever owns the /admin/system import card: the import now distinguishes "row skipped (bad index)" from "PLACEHOLDER (PDF absent)" in the backend logs. If/when that status is ever surfaced in the admin UI's import report, those two outcomes should be shown with redundant cues (icon + text, not colour alone) and the skip reason should map through getErrorMessage() to a localized string rather than echoing the raw SkipReason enum. Not actionable in this PR — just flagging it for the future UI that will read this data.

LGTM from a UX/accessibility standpoint.

## Leonie Voss — UX Designer & Accessibility Strategist **Verdict: Approved (no UI surface)** I checked for any user-facing change: Svelte components/routes, brand tokens, accessibility, responsive layout, error-message presentation. This PR is backend Java (`DocumentImporter`, tests), the Python import-normalizer, ADR-025, `DEPLOYMENT.md`, `docker-compose.prod.yml`, and the normalizer README. **No frontend files, no `.svelte`, no `messages/{de,en,es}.json`, no error-code mapping** are touched. There is nothing within my remit to evaluate — no rendered output changes for either audience (the 60+ readers or the younger phone users). One adjacent note for whoever owns the `/admin/system` import card: the import now distinguishes "row skipped (bad index)" from "PLACEHOLDER (PDF absent)" in the backend logs. If/when that status is ever surfaced in the admin UI's import report, those two outcomes should be shown with **redundant cues** (icon + text, not colour alone) and the skip reason should map through `getErrorMessage()` to a localized string rather than echoing the raw `SkipReason` enum. Not actionable in this PR — just flagging it for the future UI that will read this data. LGTM from a UX/accessibility standpoint.
marcel added 1 commit 2026-05-27 21:30:46 +02:00
docs: drop remaining stale MassImportService/ExcelService references
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m38s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 4m9s
CI / fail2ban Regex (pull_request) Successful in 48s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
ea38efc734
Replace the legacy raw-spreadsheet importer references left behind after
#674 with the canonical import architecture (CanonicalImportOrchestrator +
four loaders) and document #686 index-based PDF resolution.

- l3-backend-3b: DocumentImporter now resolves PDF by index (importDir/
  <index>.pdf) with index validation + canonical-path containment + %PDF
  magic-byte check (no recursive walk / homoglyph file-path guards)
- c4-diagrams.md: replace massImport/excelSvc components + their rels with
  an importOrch (CanonicalImportOrchestrator) component wired to doc/person/
  tag services; refresh adminCtrl and adminSystem descriptions
- ARCHITECTURE.md: importing package row now describes the orchestrator +
  four loaders consuming canonical artifacts
- TODO-backend.md: remove obsolete "MassImportService provides no status"
  item (service deleted; orchestrator already exposes import-status); update
  stale ExcelService test-coverage suggestion

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
marcel merged commit 0a3d12b9af into docs/import-migration 2026-05-27 22:08:47 +02:00
marcel deleted branch feature/686-resolve-pdf-by-index 2026-05-27 22:08:47 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#687