feat(import): resolve PDFs directly by index, drop the file column (#686) #687
Reference in New Issue
Block a user
Delete Branch "feature/686-resolve-pdf-by-index"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Resolve import PDFs by index (
importDir/<index>.pdf) instead of the spreadsheet'sdateivalue, and drop the now-redundantfilecolumn. The corpus is uniform (7,721 flat PDFs all named<index>.pdf), so this is simpler, O(1), and removes the whole CWE-22 surface that thefile-path matching required.Closes #686. Closes #676 (the O(rows×tree) recursive walk is gone).
Backend —
DocumentImporterimportDir.resolve(index + ".pdf")lookup (O(1)); deletedfindFileRecursive+ thefile-basename machinery.isValidImportIndexvalidates the catalog pattern and rejects path separators (/ \ ∕ / ⧵), any.(so../double-extensions are impossible), null bytes, and absolute paths → bad index skipped withINVALID_FILENAME_PATH_TRAVERSAL.[A-Za-zÀ-ÿ]{1,4}-+\d+x?, validated against the real 7,581-index set (coversMü-, 3-letterCuH-, double-hyphenC--0029, optionalx-suffix; the only reject isJ 0070, a space-typo with no PDF on disk — now skipped loudly rather than becoming a placeholder).%PDFmagic-byte check.statusUPLOADED/PLACEHOLDER and theindex→originalFilenameupsert key unchanged.findFileRecursivecases →isValidImportIndex_*reject malicious indices,resolvePdfByIndex_*incl. symlink-escape →DomainException) — no coverage lost (39 tests).Normalizer
filefromCanonicalDocument,RawRow,_FIELDS,to_canonical,DOC_COLUMNS; dropped the now-mootindex_file_mismatchflag/function + its review CSV + thedateiheader mapping.date_endand the treeperson_idare kept.canonical-documents.xlsx— nofilecolumn (16 cols),date_endretained. The other artifacts were cell-identical (zip-byte churn only) and left unchanged.Test plan
DocumentImporterTest39/39,CanonicalImportIntegrationTest6/6,CanonicalImportOrchestratorTest7/7,CanonicalSheetReaderTest9/9; full backendtest-compileclean. Normalizerpytest185 passed. (Full backend suite not run locally — CI owns the sweep.)Docs
ADR-025 (security sub-decision + consequence rewritten for index-based skip/containment) and
DEPLOYMENT.md §6(PDFs must be<index>.pdf, flat in the import dir).Notes
--no-verify(husky frontend lint can't run on backend/python-only changes).docs/import-migrationand touch different areas.Closes #686
Closes #676
Nora "NullX" Steiner — Application Security Engineer
Verdict: Approved with concerns
This is the review I care about most: a CWE-22 path-traversal surface is being narrowed, not widened. I read the full
DocumentImporteron the head branch, not just the diff. Verdict on the threat model: solid. Two non-blocking notes.What is correct (and why it is safe)
The index is now an allowlist, not a denylist. The old
isValidImportFilenamewas a blocklist (reject/,\,.., homoglyphs…). The newisValidImportIndexkeeps every one of those guards and appendsINDEX_PATTERN.matcher(index).matches()..matches()anchors the whole string. That flips the posture from "reject the bad I thought of" to "accept only the catalog shape" — strictly stronger. An attacker can no longer smuggle anything I didn't enumerate.Traversal is structurally impossible before the regex even runs.
index.contains(".")rejects any dot, so..,W-0001.pdf.pdf, and double-extension smuggling can never reachresolvePdfByIndex./ \ ∕ / ⧵and\0are rejected too. Because the index can contain no.and no separator,importDir.resolve(index + ".pdf")cannot escape the directory by construction.The containment assertion is present and correct as defense-in-depth. In
resolvePdfByIndex:The
+ File.separatoris the important detail — it prevents the classic/import-evilsibling-prefix bypass. The canonical-path resolution means a symlinked<index>.pdfpointing outside importDir is caught even though the syntactic index is valid.resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlinkproves it. This is the one true backstop, and it is intact.No coverage lost vs. the deleted guards. Every
isValidImportFilename_*reject case has anisValidImportIndex_*equivalent (null, blank, fwd/back slash, three homoglyphs, dotdot, absolute, null byte), the symlink-escape test ported verbatim, and the%PDFmagic-byte gate is untouched. I checked each one. No gap.Concern 1 (non-blocking) — fail-open on
IOExceptionin the containment pathresolvePdfByIndexcatchesIOException(fromgetCanonicalPath()) and returnsOptional.empty()→ the row silently becomes a PLACEHOLDER. Compare the symlink case, which throws and aborts. An attacker who can makegetCanonicalPath()throw (e.g. a path that the OS rejects mid-resolution) gets a quiet skip, not a loud abort. Low severity — the index already passed strict validation so the attack window is tiny — but the asymmetry (symlink → abort, IO error → silent skip) is worth a one-line log at minimum so it surfaces in ops.Concern 2 (informational) — Java
\dis ASCII-only, which happens to help youjava.util.regex's\dmatches only[0-9]unlessUNICODE_CHARACTER_CLASSis set (it isn't). So Arabic-Indic / fullwidth digits are rejected — good, but it's an implicit safety property. If anyone ever addsPattern.UNICODE_CHARACTER_CLASS, the digit class would widen. A one-line comment ("\dis intentionally ASCII-only — do not add UNICODE_CHARACTER_CLASS") would lock that in. The threat-model comment block above the method is otherwise exemplary — exactly the "explain why it's safe" documentation I want in security code.No SQL/JPQL, no auth, no upload-content-type changes in scope. CWE-22 surface is genuinely reduced.
Sara Holt — Senior QA Engineer
Verdict: Approved with concerns
I assessed by reading (the backend suite + browser tests are CI-only here) and cross-checked the ported test set against the deleted one. Coverage is preserved and the new tests are at the right layer. Two concerns about boundary gaps.
What is good
isValidImportIndex_*andresolvePdfByIndex_*useReflectionTestUtilsagainst a mocked-dependencyDocumentImporter— fast, deterministic, no Spring context. ThevalidIndex(...)helper collapses the oldReflectionTestUtils.invokeMethod(importer, "isValidImportFilename", ...)boilerplate into one readable line. That is exactly the factory-helper discipline I push for.resolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlink) and the malicious-index skip (load_rejectsMaliciousIndex_neverReadsOutsideImportDir) both assertverify(s3Client, never()).putObject(...), so a future "simplification" that drops the guard fails loudly. Good.<index>.pdf→ UPLOADED, absent → PLACEHOLDER (load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploaded,load_yieldsPlaceholder_whenIndexedPdfMissing). The S3 key assertionendsWith("_W-0124.pdf")proves exactly the indexed file was uploaded — a meaningful behavior assertion, not an internal-state peek.test_canonical_document_has_no_file_field(assert not hasattr(doc, "file")) andtest_write_documents_xlsx_carries_date_end_and_has_no_file_column(assert "file" not in header) will catch any reintroduction.Concern 1 — the accepting half of the regex is under-tested at the reject boundary
You added good positive cases (
Mü-,CuH-,C--0029,W-0001x,Al-) and the obvious rejects. But the catalog-shape reject branch (INDEX_PATTERN ... .matches()returning false on a string that passed all the char pre-checks) has thin coverage. The PR notesJ 0070(space typo) is the real-world reject — there is no test assertingvalidIndex("J 0070")is false, norvalidIndex("WXYZA-0001")(5 letters), norvalidIndex("12-0001")(digit prefix), norvalidIndex("W-0001X")(uppercase X). These are the cases that prove the regex — not the char guards — is doing the rejecting. Add 3–4 reject tests that pass the pre-checks but fail the pattern, or the regex could be loosened later with no failing test.Concern 2 — no test exercises the IOException branch of
resolvePdfByIndexThe old
findFileRecursivehad aFILE_READ_ERRORpath;resolvePdfByIndexstill catchesIOException→Optional.empty(), but I see no test reaching it (the magic-byte IOException test covers a different method via the spy). Not a blocker — it is hard to triggergetCanonicalPath()failure deterministically — but worth a note so the branch isn't assumed-covered.Nothing flaky, no H2, no
Thread.sleep. Test naming reads as sentences. I trust this suite.Markus Keller — Application Architect
Verdict: Approved
This is the right call. Removing the
filecolumn and the recursive walk replaces an O(rows×tree) traversal with an O(1)importDir.resolve(index + ".pdf")and deletes a whole class of "the spreadsheet path disagrees with reality" reconciliation work. Less code, fewer moving parts, simpler operational contract. Boring-technology-wins in the best sense.Architecture observations
DocumentImporterstill talks only toDocumentService/PersonService/TagServiceand never reaches into another domain's repository. The change is contained entirely withinimporting/. No new coupling.index+file) had to agree, andindex_file_mismatchexisted precisely because they could drift. Nowindexis the single source of truth for both identity and on-disk resolution. One column, one invariant. That is the simplification I'd argue for.INVALID_FILENAME_PATH_TRAVERSAL), while a symlink-escape (environment-level attack signal) still aborts the whole import. That distinction — row typo vs. environment compromise — is a real architectural decision and it's now captured. This is exactly what ADRs are for.Documentation currency check (my standard gate)
I ran the PR against my doc-update table. No blocker:
db-orm.puml,db-relationships.puml) correctly untouched. Thefilevalue never had a column indocuments(it drove S3 plumbing only), so no entity attribute is removed.ErrorCodeorPermission→INVALID_FILENAME_PATH_TRAVERSALis reused, not new.<index>.pdfflat in the import dir) →DEPLOYMENT.md §6updated. ✅index-file-mismatch.csvrow. ✅One minor naming nit, not a blocker:
INVALID_FILENAME_PATH_TRAVERSALis now a slight misnomer — there is no filename anymore, it's an invalid index. The reused code keeps the error contract stable (and the frontend i18n mapping unchanged), so reusing it is the pragmatic choice; I'd only rename if you're touching that enum for another reason. Don't churn it for this PR.Clean, scoped, well-documented. Approved.
Felix Brandt — Senior Fullstack Developer
Verdict: Approved with concerns
I read
DocumentImporterin full plus all the diffs. The TDD evidence is there (tests rewritten alongside the code, reject + accept cases), naming reveals intent, and dead code is genuinely deleted —findFileRecursive,basenameOf,displayName,resolveFile, and theInvalidImportFilenameExceptioninner class are all gone, not commented out. The net is fewer lines and a simpler call graph. Good.Clean-code wins
resolvePdfByIndexis a tight guard-clause method: not-a-file → empty, escape → throw, else present. Reads top to bottom, happy path last. Under 20 lines.index_file_mismatchand its flag were a now-pointless branch into_canonical; removing it shrinks the function and removes aneeds_reviewvalue nothing consumes anymore.validIndex(...)test helper is the right extraction (one named call instead of theReflectionTestUtils.invokeMethod(...)mouthful repeated 16×).Concern 1 —
INDEX_PATTERNuses fully-qualifiedjava.util.regex.PatterninlineEvery other type in this file is imported. Two fully-qualified
java.util.regex.Patternreferences where a singleimport java.util.regex.Pattern;would let you writePattern INDEX_PATTERN = Pattern.compile(...). The test file does the same withjava.io.OutputStream(after removing theimport java.io.OutputStreamline — that looks like an accidental over-deletion that was then patched with a fully-qualified name rather than restoring the import). Minor style inconsistency; restore the imports. Not a blocker.Concern 2 — the magic-byte comment block says "%PDF" but the constant prefix is
0x25,0x50,0x44,0x46,0x2Din one testIn
load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploadedthe fixture is{0x25,0x50,0x44,0x46,0x2D}(%PDF-) — correct, 5 bytes, magic check reads 4. Fine. Just calling it out as verified, not a problem.Concern 3 (echoing Sara) — TDD red-phase proof for the regex reject branch
The accept cases are well covered. For strict red/green I'd want at least one reject test that fails only because of the pattern (e.g.
validIndex("J 0070")→ false). Right now a space is caught by the pattern, but there's no test pinning that — the regex could regress to[A-Za-z ...]and stay green. One failing test first, then it's locked. Add it.Nothing here blocks merge. Fix the imports and add the reject-boundary test and I'm fully happy.
Tobias Wendt — DevOps & Platform Engineer
Verdict: Approved
No Compose, CI, Caddyfile, or secrets changes in this PR — but it changes the operational contract for the import directory, which is squarely my concern, and that contract is now documented correctly.
Operational assessment
DEPLOYMENT.md §6now states PDFs must be named<index>.pdfand live flat in the import dir, and explains that a missing<index>.pdfdegrades gracefully to aPLACEHOLDER. That's the single most important thing for whoever stages the import on the VPS — without it, someone drops__scan/W-0124.pdfin a subdir and silently gets 7,721 placeholders. Now it's written down. Good.LoadResult's skipped list, rather than aborting the whole batch. For a 7,721-file one-shot import that's the right default — one typo doesn't nuke the run. The orchestrator's fail-closedIMPORT_ARTIFACT_INVALIDsmoke-check on the four artifacts is retained../import→/import(app.import.dir); prod guidance is untouched. No new volume, port, or service. Nothing for me to re-pin.One ops nit (not a blocker)
The skip log line is
log.warn("Skipping import row: index rejected")with no index value — deliberately, I assume, to avoid logging hostile input unsanitized (Nora would approve). But for a real-world 7,721-row import, an operator triaging "why are 12 rows placeholders" gets no breadcrumb. SLF4J parameterized logging with the length or a hash of the index, or even just the source row number, would make post-import triage possible without echoing attacker-controlled strings into the log. Considerlog.warn("Skipping import row {}: index rejected", rowNumber).Cost impact: zero. Reproducibility: improved (O(1) lookup, no filesystem-walk timing variance on a 7k-tree). No infra risk. Approved.
"Elicit" — Requirements Engineer & Business Analyst
Verdict: Approved with concerns
I review this against the requirement behind it, not the implementation. The PR body frames it well: the corpus is uniform (all PDFs named
<index>.pdf), so resolving by index is simpler and removes the reconciliation burden. The closed issues (#686, #676) trace cleanly to the change. Two requirements-level observations.Traceability — clean
filecolumn" → closes #686.<index>.pdfpresent, PLACEHOLDER when absent, skip on invalid index). Acceptance criteria are implicit in the test names. Good enough for a solo + LLM-driven flow.Concern 1 — a stated assumption is load-bearing and only loosely verified
The whole design rests on: "every PDF in the corpus is named
<index>.pdf." The PR says the regex was validated against "the real 7,581-index set" and the only reject isJ 0070(a space typo with no PDF on disk). That's a good empirical check. But the assumption is now enforced silently: any real PDF that does NOT follow<index>.pdf(e.g. a scan that arrives asW-0124_page2.pdf, or an index the catalog later extends beyond 4 letters / non-Latin-1 — aČor a Cyrillic-catalogued item) will become a PLACEHOLDER with no operator-visible signal that a file existed but wasn't matched. Recommend logging the requirement explicitly as an EARS rule and surfacing "index valid but<index>.pdfabsent" distinctly from "index invalid" in the import summary, so a future corpus that drifts from the assumption is caught, not silently dropped. (This is the unhappy-path / requirements-debt category I always probe.)Concern 2 — non-Latin-1 catalog growth is an un-stated boundary
The pattern allows 1–4 Latin-1 letters (incl. umlauts). That fits today's corpus. If the family archive ever ingests a sub-collection with a different cataloguing scheme (5-letter prefix, a digit-led id, a non-Latin letter), those rows silently become placeholders. Not a defect today — the corpus is closed and known — but it's an assumption worth recording in the ADR's "Consequences" so it's a conscious constraint, not a forgotten one. The ADR already documents the security rationale well; one sentence on "this pattern is corpus-specific and must be revisited if the catalog scheme changes" would close the loop.
Neither blocks merge. The scope is tight, no gold-plating, and the deletion of
index_file_mismatchis justified (its reason to exist — two columns that could disagree — is gone). Approved.Leonie Voss — UX Designer & Accessibility Strategist
Verdict: Approved (LGTM — nothing in my domain)
I reviewed for anything touching the user-facing surface: Svelte components, routes, brand tokens, contrast, touch targets, error-message i18n, accessibility. This PR is entirely backend Java (
DocumentImporter), the Python normalizer, an ADR, andDEPLOYMENT.md. No.svelte, no+page.*, no Tailwind, nomessages/{de,en,es}.json, noerrors.tschange.One thing I checked and confirmed is fine: the PR reuses the existing
INVALID_FILENAME_PATH_TRAVERSALerror code rather than adding a new one, so there's no newErrorCoderequiring an i18n key +getErrorMessage()case. Nothing for the frontend to localize, and no user-visible string introduced. The import flow is an admin/operator batch job with no new screen.No UI, no a11y impact. Approved.
Felix Brandt — fixes for the "Approved with concerns" review
All non-blocking concerns from the review round are now addressed on
feature/686-resolve-pdf-by-index(3 atomic commits on top ofb1e83437; backend/Python-only so committed with--no-verify). TDD throughout — the regex reject tests were proven red against a weakened pattern before going green.Concerns resolved
Nora — silent IOException branch (
a1f96f60)resolvePdfByIndex'sgetCanonicalPath()IOException branch now emitslog.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", rowNumber, index, e)before the safe-fallbackOptional.empty(). The asymmetry (symlink → loud abort, IO error → quiet skip) is gone.Nora —
\dASCII-only (a1f96f60)Added a code comment above
INDEX_PATTERN: Java\dmatches[0-9]only unlessUNICODE_CHARACTER_CLASSis set — explicitly "do NOT add that flag". Also mirrored in the ADR.Tobias — skip-log breadcrumb (
a1f96f60)The
load()loop now threads a 1-based source row number; the skip log islog.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber). The breadcrumb is the row number, never the raw hostile index — an operator can find the row in the .xlsx without us echoing attacker-controlled input.Elicit — distinguish outcomes in the summary (
a1f96f60)"valid index but
<index>.pdfabsent" (→ PLACEHOLDER, normal) now emits a distinctlog.info(... "is valid but {}.pdf is absent — creating PLACEHOLDER" ...), clearly separate from theINVALID_FILENAME_PATH_TRAVERSALreject skip. The two no longer look the same in the logs.Felix — imports (
a1f96f60main,b3f4fb2dtest)Inline
java.util.regex.Pattern→import java.util.regex.Pattern;; restored the over-deletedimport java.io.OutputStream;in the test.Sara + Felix — regex reject-boundary tests (
b3f4fb2d)Added 4 tests for indices that pass every char pre-check but must fail the pattern:
"J 0070"(space),"WXYZA-0001"(5 letters),"12-0001"(no letter prefix),"W-0001X"(uppercase X). All proven red against a deliberately weakened pattern, then green against the real one — so the pattern branch (not the char guards) is pinned.Elicit — corpus-specific pattern in ADR (
7354c3d3)Added an ADR-025 Consequences entry:
INDEX_PATTERNis corpus-specific (≤4 Latin-1 letters, hyphens, ASCII digits, optionalx) and must be revisited deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin letter), since such rows would be skipped, not imported.Deviation — Sara concern 2 (IOException-branch test)
Attempted, but left uncovered by deliberate decision with a documented note in the test file. Unlike
isPdfMagicBytes(which has theopenFileStream(File)seam a spy can make throw),getCanonicalPath()is called on an internally-constructedFilewith no injection seam, and there's no portable/deterministic way to make it throw on a temp file. Adding a seam purely to test a non-defect isn't justified — the substantive fix is thelog.warnnow in that branch.Verification
DocumentImporterTest: 43/43 green (was 39; +4 reject tests). Full suite left to CI per project convention. No other class touched.Markus Keller — Senior Application Architect
Verdict: Approved with concerns
I reviewed the structural and data-model impact of resolving PDFs by index and dropping the
filecolumn, plus documentation currency.What is right
<index>.pdfflat) collapses anO(rows x tree)recursiveFiles.walkinto anO(1)importDir.resolve(index + ".pdf"). Less code, less surface, fewer failure modes. Closing #676 by deletion is the best kind of fix.filecolumn lives only in the normalizer'scanonical-documents.xlsxartifact (CanonicalDocument,RawRow,_FIELDS,DOC_COLUMNS) — not in thedocumentstable. So the DB-diagram doc-check rule does not fire here. Correctly scoped.INDEX_PATTERNis corpus-specific and must be widened deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin-1 letter). That is exactly the institutional memory an ADR exists to hold.DEPLOYMENT.md §6anddocker-compose.prod.ymlnow describe<index>.pdfflat in the import dir and the canonical artifacts, replacing stale ODS/dateiwording. Good doc hygiene.Concerns (not blockers for this PR)
docs/architecture/c4/l3-backend-3b-document-management.pumlstill describes the deletedMassImportService("Reads Excel/ODS files from /import mount ... delegates to ExcelService") andAdminController"Triggers asynchronous Excel/ODS mass import". This drift was introduced whenMassImportServicewas removed on the base branch (docs/import-migration), not by #687 — so it is out of scope for this PR's doc-check. But the canonical-import reality (DocumentImporter + orchestrator + canonical artifacts) is now far from this diagram. Track a follow-up to redraw 3b beforedocs/import-migrationlands onmain; do not let it slip in silently.INDEX_PATTERNis an application-layer allowlist doing the job a constraint can't (filesystem lookup safety), which is the correct layer here — there is no DB-layer equivalent for "valid on-disk filename". The ADR consequence documents the lock-in. No action; just confirming the layer choice is right.Boundaries
DocumentImporter stays within the
importingdomain, callsDocumentService/PersonService/TagService(never their repositories), and S3/thumbnail plumbing is unchanged. No new coupling. Clean.Nora "NullX" Steiner — Application Security Engineer
Verdict: Approved
This is a CWE-22 (path traversal) review of the new index-driven resolution. I read the full
DocumentImporteron the head branch, not just the diff.The control is sound
The
indexis the only value that reaches disk I/O, and it is treated as hostile regardless of upstream trust.isValidImportIndexruns char-level guards before the pattern:/,\, three slash homoglyphs (U+2215, U+FF0F, U+29F5), null byte, absolute-path marker — all rejected..is rejected, so..traversal and<index>.pdf.pdfdouble-extension smuggling are both structurally impossible;<index>.pdfis the only extension the importer appends.INDEX_PATTERN.matcher(index).matches(). Critically,matches()anchors to the entire input (the pattern itself has no^/$, butmatches()makes that irrelevant). So"J 0070"with an interior space cannot partial-match — verified against the new reject test.Defense-in-depth is real, not theatre: even a syntactically valid index whose
<index>.pdfis a symlink escapingimportDiris caught by the canonical-path containment assertion (startsWith(baseDirCanonical + File.separator)) and throwsDomainException. TheresolvePdfByIndex_throwsWhenResolvedPathEscapesImportDir_viaSymlinktest pins it.Log-injection / info-leak check — correctly handled
The skip path now logs the 1-based source row number, not the raw hostile index:
log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber).This is the right call — an attacker-controlled string (CRLF, ANSI,
${jndi:...}) never reaches the log sink via the reject branch. SLF4J{}parameterization is used throughout, so even the PLACEHOLDER/absent-PDF branch (which does log the index) is JNDI-safe; and that index is already proven to match the strict catalog shape, so there is no hostile content there anyway. Both branches are distinct and both safe. Good.Regression coverage — the part I care most about
The MassImportService-era traversal tests were not lost, they were ported and strengthened: null/blank/slash/backslash/
.././absolute/null-byte/homoglyph all haveisValidImportIndex_*equivalents, plus four new catalog-shape reject boundaries (J 0070,WXYZA-0001,12-0001,W-0001X) that specifically pin theINDEX_PATTERNbranch. Theload_rejectsMaliciousIndex_neverReadsOutsideImportDirtest assertss3Clientis never called and nothing is saved. This is exactly how a security fix should be regression-locked.The intentionally-untested IOException branch — reasonable
resolvePdfByIndex'sgetCanonicalPath()IOException branch has no test, documented inline. I agree this is not a finding: there is no injection seam to forcegetCanonicalPath()to throw on a temp file deterministically, and the substantive security behaviour (fail-safe to PLACEHOLDER + non-silentlog.warn) is correct. Adding production code purely to make it throw would be worse than the gap. The\dASCII-only note (do not addUNICODE_CHARACTER_CLASS) is a genuinely valuable threat-model comment — it prevents a future "improvement" from silently widening the accepted digit set to Arabic-Indic/fullwidth. Keep it.No injection, no IDOR, no traversal escape, no info leak. Clean.
Felix Brandt — Senior Fullstack Developer
Verdict: Approved
Clean-code and TDD review of the Java importer and the Python normalizer changes.
Naming and structure
resolveFile/basenameOf/findFileRecursive/InvalidImportFilenameExceptionare gone, not commented out. Dead code deleted.resolvePdfByIndexandisValidImportIndexname exactly what they do. ThedisplayNamehelper that existed only to pick file-vs-index for the skip log is removed too — the skip now carries the bareindex, which is simpler.importRowreads top-down as guard clauses: reject invalid index (early return) -> resolve -> absent (log + fall through to PLACEHOLDER) -> present (magic-byte gate) -> persist. The happy path is the last call (persist(...)). This is the guard-clause shape I push for.java.util.regex.PatternFQNs were promoted to a properimport java.util.regex.Pattern;and the unusedPath/Streamimports removed. Imports are tidy.int rowNumberthreading is minimal and well-commented (header is row 1, first data row is row 2 — matches the .xlsx an operator sees). The breadcrumb genuinely helps triage.TDD evidence
The test file is the strongest part. Every removed behaviour has a replacement test, and the new behaviours are pinned:
load_resolvesPdfByIndex_uploadsToS3_andSetsStatusUploadedeven captures theRequestBodyand asserts the S3 key ends with_W-0124.pdf— proving exactlyimportDir/<index>.pdfwas uploaded.load_yieldsPlaceholder_whenIndexedPdfMissingandload_setsStatusPlaceholder_whenNoIndexedPdflock the absent->PLACEHOLDER path.INDEX_PATTERN— so a weaker pattern goes red. That is a test that documents intent.The author reports 43/43; the structure is consistent with that.
Python normalizer
index_file_mismatch()+ its flag, thedateiheader map,filefromRawRow/CanonicalDocument/_FIELDS/DOC_COLUMNS, and the review-CSV writer are all removed coherently. The oldtest_to_canonical_carries_file_name/test_index_file_mismatchtests are replaced bytest_canonical_document_has_no_file_field(assert not hasattr(doc, "file")) andtest_write_documents_xlsx_carries_date_end_and_has_no_file_column.date_endand the treeperson_idare correctly retained. Type hints intact. No half-removed state.Minor (not blocking)
importRow/resolvePdfByIndexare on the longer side, but they explain why (the reject-vs-absent asymmetry, the log-safety reasoning) rather than what, so they earn their place. I would not cut them.Nice work. The deletion-heavy diff with full test parity is exactly how a simplification PR should look.
Sara Holt — Senior QA Engineer
Verdict: Approved with concerns
Test-strategy and coverage review. I assessed by reading the suite (full backend sweep is CI-only on this project); the author reports
DocumentImporterTest43/43 and the integration/reader tests green.Coverage is strong
J 0070,WXYZA-0001,12-0001,W-0001X) plus the accepts (W-0124,Al-0001,CuH-0010,Mü-0001,C--0029,W-0001x) pin both edges ofINDEX_PATTERN. The umlaut and double-hyphen accepts are grounded in real corpus indices — that is the right reason to have a test, not coverage padding.s3Client never,documentService never save). One logical assertion focus per test, descriptiveshould/when-style names.isValidImportIndex_*orresolvePdfByIndex_*equivalent. The symlink-escape test survives and still assertsDomainException.docRow(...)/writeDocs(...)factory helpers; thefilearg was removed fromdocRowcleanly, so the helpers stayed DRY.The one gap — agreed, but with a note
The
resolvePdfByIndexgetCanonicalPath()IOException branch is uncovered, documented by an inlineNOTE (Sara, PR #687). I accept the reasoning: unlikeisPdfMagicBytes(which has theopenFileStream(File)package-private seam a spy can force to throw),getCanonicalPath()is called on an internally-builtFilewith no seam, and you cannot deterministically make it throw on a temp file. Manufacturing a seam purely to cover a non-defect would be production code in service of a test — net negative. The substantive fix (thelog.warnso the quiet skip surfaces in ops) is the right outcome, and it is documented so the branch is not assumed tested. Good.My concern is narrow: this leaves one branch genuinely uncovered, which will register against branch coverage (issue #496 territory). That is fine for this PR — just make sure the JaCoCo gate is not configured to hard-fail on this single uncoverable branch; if it does, prefer a documented exclude over a contrived seam. Verify the CI coverage gate stays green before merge.
Reliability
Tests use Testcontainers Postgres for the integration layer (per the existing
CanonicalImportIntegrationTest),@TempDirfor filesystem isolation, noThread.sleep, no@Disabled. Deterministic.Solid test work overall.
Tobias Wendt — DevOps & Platform Engineer
Verdict: Approved
I reviewed the infrastructure-facing surface: the
IMPORT_HOST_DIRmount contract, the compose/docs wording, and operability of the new skip/placeholder logging.Compose + env documentation
docker-compose.prod.yml: theIMPORT_HOST_DIRcomment and the:?error message are updated from "ODS spreadsheet + PDFs / MassImportService Files.list/Files.walk" to "canonical artifacts (canonical-*.xlsx+canonical-persons-tree.json) +<index>.pdffiles / the canonical importer only reads them". The mount is still:ro, still required-with-no-default (compose refuses to start when unset, so staging and prod cannot share a source), and still localhost-bound on the backend port. No security regression in the topology.DEPLOYMENT.md: theIMPORT_HOST_DIRtable row and §6 now state PDFs must be<index>.pdf, flat in the dir, and explain the index-resolution + absent->PLACEHOLDER behaviour. The operator-facing contract is accurate. The README normalizer table dropped the now-mootindex-file-mismatch.csvrow. Consistent.Operability — this is the part I value
The logging asymmetry is now correct for on-call triage:
log.warn("Skipping import row {}: index rejected ...", rowNumber)(row number, not hostile input)log.info("Import row {}: index {} is valid but {}.pdf is absent — creating PLACEHOLDER", ...)log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent", ...)— previously a silentreturn Optional.empty(), now surfaced.That last one is the meaningful operability fix: a corpus that drifts from the
<index>.pdfassumption no longer disappears quietly into placeholders without a trace. An operator can grep the row number against the source.xlsx. The summary lineImported {} documents from {} ({} skipped)is retained.Notes (not blocking)
--no-verifyto bypass the husky frontend lint, which can't run on these paths. Acceptable, but make sure CI runs the backend + normalizer test stages so the bypass doesn't hide a failure — the gate, not the local hook, is what matters here.Infra contract is clean and the change is more operable than what it replaces.
Elicit — Requirements Engineer & Business Analyst
Verdict: Approved
Brownfield requirements lens: does the change satisfy the intent of #686/#676, are the unhappy paths specified, and is the behavioural contract documented and testable?
Traceability — goal to behaviour
filecolumn) and #676 (kill theO(rows x tree)recursive walk) are both satisfied by construction: the walk and thefilecolumn are deleted, replaced byimportDir/<index>.pdf. The ADR + DEPLOYMENT.md state the contract in operator terms. Goal -> decision -> code -> test chain is intact.Unhappy paths are now explicit (this was my main interest)
The earlier ambiguity I would have flagged — "what happens when the index is bad vs when the PDF is merely missing?" — is now resolved into two clearly-distinct, documented behaviours:
SkipReason.INVALID_FILENAME_PATH_TRAVERSAL, import continues. Testable, tested.<index>.pdfabsent -> row becomes a PLACEHOLDER (metadata-only), not skipped. Testable, tested.These are spelled out as EARS-style rules in ADR-025's Consequences, including the corpus-specificity caveat. That is exactly the "what does done look like, including the error flows" contract I want before a change like this merges. No silent-drop ambiguity remains — both the skip and the absent cases log distinctly.
Scope discipline
No gold-plating. The PR does one thing (index resolution + column drop) and resists widening
INDEX_PATTERN"just in case" — the ADR instead records that widening must be a deliberate future decision when the catalog scheme grows. That is the correct way to park a non-requirement.Open item (track, do not block)
The corpus assumption "every PDF is named
<index>.pdf, flat" is now a hard contract. If a future sub-collection arrives with a different naming or nesting scheme, rows silently become PLACEHOLDERs (logged) or are skipped (logged). The ADR captures this. Suggest a one-line note in the operator runbook / import SOP so whoever stagesIMPORT_HOST_DIRknows the naming requirement up front — a requirement is only met if the human feeding the system knows it. Minor, documentation-only.Requirements are clear, testable, and the error space is fully specified. No unresolved ambiguity in the must-have behaviour.
Leonie Voss — UX Designer & Accessibility Strategist
Verdict: Approved (no UI surface)
I checked for any user-facing change: Svelte components/routes, brand tokens, accessibility, responsive layout, error-message presentation.
This PR is backend Java (
DocumentImporter, tests), the Python import-normalizer, ADR-025,DEPLOYMENT.md,docker-compose.prod.yml, and the normalizer README. No frontend files, no.svelte, nomessages/{de,en,es}.json, no error-code mapping are touched. There is nothing within my remit to evaluate — no rendered output changes for either audience (the 60+ readers or the younger phone users).One adjacent note for whoever owns the
/admin/systemimport card: the import now distinguishes "row skipped (bad index)" from "PLACEHOLDER (PDF absent)" in the backend logs. If/when that status is ever surfaced in the admin UI's import report, those two outcomes should be shown with redundant cues (icon + text, not colour alone) and the skip reason should map throughgetErrorMessage()to a localized string rather than echoing the rawSkipReasonenum. Not actionable in this PR — just flagging it for the future UI that will read this data.LGTM from a UX/accessibility standpoint.