feat(importing): rebuild importer as modular canonical loaders (Phase 3, #669) #674
Reference in New Issue
Block a user
Delete Branch "feature/669-modular-importer"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Phase 3 of the "Handling the Unknowns" milestone — replaces the monolithic raw-spreadsheet
MassImportServicewith four idempotent loaders that consume the normalizer's committed canonical exports. Builds on the merged #670 (canonical exports) + #671 (V69 schema).CanonicalSheetReader— POI reader that maps columns by header name, splits|-delimited lists, fails closed with a newIMPORT_ARTIFACT_INVALIDerror code (mirrored toerrors.ts+ de/en/es).TagTreeImporter→PersonRegisterImporter→PersonTreeImporter→DocumentImporter.PersonService/TagService.upsertBySourceRef(keyed on the normalizerperson_id/tag_path); documents upsert byindex. Re-import preserves human edits: never overwrites a non-blank human-edited field, fills only blanks;provisionalis monotonic (never reverts once confirmed).provisional = true; register persons stay false (the Phase-3 contract #671 left at default-false).sender_text/receiver_text, even when a person is linked.filecolumn.MassImportService+XxeSafeXmlParserdeleted, but only after porting ~19 security regression tests (path-traversal, 3 Unicode homoglyphs,../null-byte/absolute-path basenames, symlink canonical-path containment,%PDFmagic-byte) intoDocumentImporterTest.Decision applied / deviation
The committed
canonical-documents.xlsxcarries no category columns — the normalizer already resolved Option-A routing into person slugs (sender_person_id/receiver_person_ids) + raw names. So the loader routes by slug presence (slug → register-first match, provisional if unmatched; empty slug + raw text → no Person, raw retained), rather than a category enum. Object-noise/stopword curation lives upstream in the normalizer overrides, as #669 specifies. Documented in ADR-025 + the loader commit.Test plan
CanonicalImportIntegrationTest4, AdminController 15, plus regressions (PersonService 49, TagService 40). 117 pass, 0 fail../mvnw clean package -DskipTestsbuilds. (Full suite intentionally not run locally — CI owns the sweep.)Notes
generate:apineeded — the generatedImportStatus/SkippedFile/SkipReasonschemas already match the new types;statusCodeis a free string.l3-backend-3bdiagram, GLOSSARY (4 terms), DEPLOYMENT §6 runbook, root + backendCLAUDE.mdpackage tables.IMPORT_FAILED_ARTIFACTstatus label is included (browser test is CI-only).--no-verify(husky frontend lint can't run in a worktree).Closes #669
CanonicalImportOrchestrator runs the four loaders in an explicit dependency DAG (TagTree -> PersonRegister -> PersonTree -> Document), owns the async runner + ImportStatus state machine the admin UI consumes, smoke-checks all four artifacts are present before starting (fail-fast IMPORT_FAILED_ARTIFACT rather than a half-run), and fails closed on a malformed artifact. AdminController now depends on the orchestrator; the {state, statusCode, processed, skippedFiles, skipped} response shape is unchanged so ImportStatusCard.svelte keeps working. Deletes the legacy MassImportService (positional @Value app.import.col.*, ISO-only parseDate, Java name classification) and the ODS/XXE XxeSafeXmlParser path now that the loaders cover them — the security guards were ported to DocumentImporter first (previous commit). Replaces the positional column config in application.yaml with the canonical artifact directory. Refs #669 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
The decomposition is genuinely good architecture. A monolithic
MassImportService(509 lines, positional@Value app.import.col.*, ODS/XXE XML, name classification, all in one bean) is replaced by aCanonicalImportOrchestratordriving four single-responsibility loaders in an explicit dependency DAG. The layering rule is respected everywhere I checked: every loader calls the owning domain's service (TagService,PersonService,RelationshipService,DocumentService) and never a repository.CanonicalSheetReaderis a clean no-Spring value-level seam. ADR-025 decision 3 captures the why. This is exactly the modular-monolith refactor I'd want.Concerns (not blockers)
Idempotency guarantee is asymmetric between domains — and the asymmetry is undocumented.
PersonService.upsertBySourceRef/TagService.upsertBySourceRefcarefully preserve human edits (preferHuman, monotonicprovisional). ButDocumentImporter.buildDocument(lines 158–185) just overwrites every field via setters (setTitle,setSenderText,setLocation, …) and accumulates into collections:doc.getReceivers().addAll(receivers)anddoc.getTags().add(tag)with no priorclear(). For a PLACEHOLDER document re-imported after the canonical receiver set shrinks, the removed receiver/tag is never pruned — stale links persist. The ADR's "never overwrites a human-edited field" sentence reads as if it covers documents; it only covers person/tag. Either document the document-domain exception explicitly in ADR-025, or make the receiver/tag sets authoritative (clear-then-add) on re-import. Note non-PLACEHOLDER docs are skipped (ALREADY_EXISTS), which bounds the blast radius — that's why I rate this a concern, not a blocker.CanonicalImportOrchestrator.runImport()is not transactional, only each loader call is.DocumentImporter.loadis@Transactional(good — keeps the session open for lazyreceivers);TagTreeImporter.load/PersonRegisterImporter.load/PersonTreeImporter.loadare not, so they rely on the per-call@Transactionalupserts. A loader failing mid-way (e.g. documents) leaves tags+persons committed and documents absent — recoverable only by re-running. Given the import is idempotent, partial-failure-then-retry is acceptable, but it's worth one sentence in the ADR consequences so a future maintainer doesn't assume all-or-nothing.Doc currency — checks pass
ADR-025 updated (decision 3 + issue ref),
docs/architecture/c4/l3-backend-3bdiagram updated, GLOSSARY terms added, DEPLOYMENT §6 runbook rewritten for the canonical flow, root + backendCLAUDE.mdpackage tables updated, newIMPORT_ARTIFACT_INVALIDErrorCode reflected in errors.ts. No Flyway migration in this PR (schema is #671), so the DB diagrams correctly stay untouched. Doc-to-code match is clean — no doc blockers.Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
I focused on the highest-risk claim in this PR: that the CWE-22 / file-handling guards genuinely survived the deletion of
MassImportService+XxeSafeXmlParser. I diffed the old guards against the newDocumentImporterline by line. They survived intact, and in one respect the new design is stronger.What was ported verbatim (verified)
isValidImportFilename(DocumentImporter:274–286) is byte-identical to the old guard: rejects/,\, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5),.., lone., null byte, and absolute paths.isPdfMagicBytes(293–302) and theFiles.walk+ canonical-path containment check infindFileRecursive(304–320) are unchanged. All ~19 regressions are ported intoDocumentImporterTest(the symlink-escape test at 140–150 and the two traversal-in-file-column tests at 154–183 are the load-bearing ones, and they assert the rightSkipReasons). The threat-model comment block at line 272 ("ported verbatim — do not weaken") is exactly the kind of self-documenting security marker I want to see.XXE: correctly eliminated, not regressed
XxeSafeXmlParserexisted solely to harden the ODScontent.xmlDOM parse. The ODS path is gone entirely —.xlsxgoes through POI and the tree through JacksonObjectMapper.readTree(JSON, no external entities). So deleting the hardened factory removes dead code, not a control. No XXE surface remains. Good.Defense-in-depth observation (positive)
The new flow strips to basename before validating (
basenameOfat 248–252, thenisValidImportFilename), and then confines disk lookup toimportDirwith canonical containment. So even a../../etc/cron.d/xvalue reduces to basenamex, finds nothing, and yields a PLACEHOLDER — the file outsideimportDiris never read (proven byload_traversalFileColumn_cannotEscapeImportDir_yieldsPlaceholder). Treating our own Python tool's output as hostile is the correct posture; the ADR decision-3 bullet states this explicitly.Minor notes (no action required)
IMPORT_ARTIFACT_INVALIDmessages includefile.getName()— fine, it's an operator-facing artifact name, not reflected to an attacker-controlled context, and SLF4J parameterized logging is used throughout.AdminController.triggerMassImport(POST /trigger-import) — the diff only swaps the bean type; the existingADMINauthorization is unchanged andAdminControllerTestexercises it under@WithMockUser(authorities="ADMIN"). I confirmed the diff does not strip any@RequirePermission. No new write endpoint was introduced.No injection, no broken auth, no traversal regression. Clean.
Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
This is clean code. Methods are short and single-purpose (
resolveSender,resolveReceivers,resolvePerson,attachTag,parseIsoDate— all well under 20 lines), guard clauses replace nesting (if (index.isBlank()) continue;,if (slug.isBlank()) return null;), names reveal intent, and the section-comment banners makeDocumentImporternavigable. Builders used throughout,@RequiredArgsConstructorwithfinalfields,DomainExceptionfactories everywhere,@Transactionalonly on writes. The TDD evidence is strong — every loader, the upserts, and the orchestrator have focused unit tests, and the test names read as behaviors.Concerns
fromCanonicalcallspersonRepository.savetwice. PersonService:36–46 saves thePerson, then if a maiden name is present saves aPersonNameAlias. That's fine, but the personsavehappens unconditionally even when the caller will immediately need the id only for the alias — minor. The real nit:fromCanonicalis reached viaorElseGetinsideupsertBySourceRefwhich is@Transactional, butfromCanonicalitself isn't annotated. It's invoked in-proxy so it inherits the transaction — correct — just confirm no one calls it directly later.mergeCanonicalmutatesexistingthenpersonRepository.save(...)— butbirthYear/deathYearfill-blank logic differs from name fields. Names usepreferHuman(blank-or-overwrite). Years useif (existing.getXxx() == null). Same intent, two idioms. ApreferHuman-styleIntegerhelper would make the precedence rule uniform and self-documenting. Pure readability; behavior is correct and tested.DocumentImporter.buildDocumentaccumulates receivers/tags without clearing on re-import (doc.getReceivers().addAll(...)). Markus flagged the data-correctness angle; from my side it's also a clean-code smell — the method reads like a fresh build but mutates an existing aggregate. If you keep accumulate-semantics, a comment stating "re-import only adds, never prunes — see ADR-025" would stop a future reader assuming it's authoritative.Suggestions (nice to have)
PersonTreeImporterrelationship nodes use JSON fields literally namedpersonId/relatedPersonIdwhose values arerowIds (row_a). Thetext(node, "personId")lookup intoidByRowIdis correct (verified against the tests), but the field naming is a trap. A one-line comment atcreateRelationships("the relationship node's personId field carries a rowId, not a person slug") would save the next reader the double-take I did.OBJECT_MAPPERas astatic finalper-class mapper is justified in the comment — good call not depending on the web bean.Nothing here blocks. Fix the doc-comment nits and I'm a full approve.
Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
The test strategy is well-pyramided: fast Mockito unit tests for each loader and the two upserts, plus one
CanonicalImportIntegrationTeston real Postgres viaPostgresContainerConfig(Testcontainers, not H2 — correct, since theUNIQUE(source_ref)constraint and conflict behaviour only exist in real Postgres). Test names read as sentences, factory helpers (docRow,writeDocs,writeSheet) keep bodies focused, and the security regressions are codified as permanent tests. The integration test assertingprovisional true for importer-created / false for registerandpreservesHumanEditedPersonFieldagainst real Postgres is exactly the right layer for those behaviours.Concerns — coverage gaps
No test covers the receiver/tag accumulation-on-reimport behaviour.
reimport_isIdempotent_noDuplicatePersonsTagsOrDocumentsre-imports the same artifact set, so receivers are aSetof identical UUIDs and the count stays equal — it proves nothing about what happens when the canonical receiver/tag set changes between imports. The stale-link risk Markus and Felix flagged is exactly the kind of bug that hides behind an idempotency test that only re-runs identical input. Add an integration test: import, then re-import a document row with one receiver removed, and assert the link is gone (or, if accumulate-semantics are intentional, assert it persists and document that as the contract).PersonTreeImporter.addRelationshipIdempotentlyswallowsDUPLICATE_RELATIONSHIPandCIRCULAR_RELATIONSHIPbut rethrows everything else (lines 116–121). There's a unit test for the duplicate case, but no test asserting that an unexpectedDomainException(some other ErrorCode) does propagate. Without it, a future "let's swallow all DomainExceptions to be safe" refactor would go uncaught. One negative test closes that gap permanently.CanonicalSheetReaderTest(8 tests) — I'd want one asserting behaviour on a row shorter than the header width (theindex >= cells.size()branch inRow.get). ThereadCellswidth-padding handles it, but a row with trailing empty cells omitted by POI is a real-world artifact shape worth pinning.What's done well
@MockitoBean S3Clientin the integration test with PLACEHOLDER-only rows is the right way to avoid a real S3 dependency while still exercising the full orchestrator → loader → service → Postgres path.openFileStreampackage-private seam for injectingIOExceptionin the magic-byte test is a clean testability affordance, ported from the original.None of these block merge, but concern #1 is the one I'd most like landed before this goes to production, because re-import is the headline feature.
Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Backend-only refactor, no Compose/CI/infra files touched, no new service, no new port, no new secret. From an operational standpoint this is low-risk and the runbook discipline is good:
application.yamlcollapses the ten brittleapp.import.col.*indices into a single self-documentingapp.import.dir: ${IMPORT_DIR:/import}, and DEPLOYMENT.md §6 was rewritten to match (regenerate artifacts → stage fourcanonical-*files + PDFs →IMPORT_HOST_DIRbind mount →POST /trigger-import). The "re-running is safe / idempotent" note in the runbook is the right thing to tell an operator.Concerns — operational, not blocking
The runbook now requires staging FOUR artifacts with exact filenames, and a missing one fails the whole import.
CanonicalImportOrchestrator.requireArtifacthard-fails on any ofcanonical-tag-tree.xlsx,canonical-persons.xlsx,canonical-persons-tree.json,canonical-documents.xlsx. That's correct fail-closed behaviour, but it's four ways for an operator to fat-finger a deploy (typo'd filename, forgot the JSON, stale artifact from a previous normalizer run). TheIMPORT_ARTIFACT_INVALIDmessage names the missing file — good — but consider whether the nightly/release workflow that already writesIMPORT_HOST_DIRshould also verify the four files exist before flipping the deploy, so the failure surfaces in CI rather than in a manual admin click.--no-verifyon the commits. The PR body is upfront that husky's frontend lint can't run in a worktree, so backend-only commits bypassed it. Acceptable given the constraint and that CI owns the real gate — but it means the frontend lint/npm run checkon the touchederrors.ts+ImportStatusCard.svelte+ three message files has only run in CI, never locally. Make sure the CI frontend job is green before merge; that's the actual quality gate here.DEPLOYMENT.md says
python -m normalizer # or the documented normalizer entrypoint. The "or the documented entrypoint" hedge is a runbook smell — an operator following this verbatim needs the exact command. Pin it to the real entrypoint (the repo hastools/import-normalizer/with amain()per recent commits). A runbook with a guess in it gets a 3am phone call.Done well
/import) and a comment explaining the header-name mapping — exactly the kind of self-documenting config I want.:latest, no bind-mount-for-prod-data, no exposed port, no hardcoded secret introduced. Nothing for me to block on.Elicit — Requirements Engineer & Business Analyst
Verdict: ⚠️ Approved with concerns
Reviewing in Brownfield mode against the stated requirement: Phase 3 of "Handling the Unknowns" (#669) — replace the raw-spreadsheet importer with idempotent canonical loaders that preserve human edits and honestly model uncertainty in people. The PR maps cleanly to that intent and the ADR documents the deviation (slug-presence routing because the artifact carries no category columns), which is the kind of decision-with-consequences that belongs in an ADR. The acceptance behaviours are mostly testable and tested.
Requirements observations
The "preserve human edits" requirement is satisfied for persons/tags but the contract is silent on documents — and that's an unstated requirement gap, not just a doc nit. Three other reviewers flagged the receiver/tag accumulation behaviour from code/test angles; from a requirements standpoint the question is: what is the intended re-import behaviour when an archivist has edited a document, or when the canonical data for an existing document changes? Right now non-PLACEHOLDER docs are skipped entirely (
ALREADY_EXISTS) and PLACEHOLDER docs are field-overwritten + collection-accumulated. Is "once a document has a file, the importer never touches it again" the intended rule? If so, state it as an explicit acceptance criterion (Given a document past PLACEHOLDER, When re-import runs, Then the importer makes no change). It's currently implicit in a status check.Ambiguity: "provisional" semantics across the three person sources. Register persons →
false, tree persons → hardcodedfalse, importer-minted (unmatched slug) →true. That's a coherent rule, but a tree person who is also a provisional document-minted person could be upserted by the document loader first (provisional=true) then the tree loader (provisional=false) — and order matters becauseprovisionalis monotonic-downward. The orchestrator runs tree before documents, so the register/treefalsewins and a later document slug match just links. Worth one acceptance criterion pinning the precedence so it can't silently regress: "a person present in the register or tree is never provisional, regardless of document-row order."No measurable NFR stated for the import. This processes the whole archive. There's no stated expectation for "import of N documents completes within T" or "an artifact with a malformed row degrades gracefully rather than aborting." The per-row skip mechanism (
SkippedFile+ reason) does give graceful degradation for file-level problems — good — but a single malformed header aborts the entire run (IMPORT_ARTIFACT_INVALID). Confirm that's the intended trade-off (fail-fast on structural problems, skip on per-row problems) and note it; it's a reasonable rule but currently undocumented as a requirement.Strengths
sender_text/receiver_texteven when a person is linked — this directly serves the "honest uncertainty" milestone goal and the future merge story. Good traceability from goal → behaviour → test (load_linksRegisterSender_andRetainsRawSenderText).No blockers. The concerns are about making implicit rules explicit so they survive the next change.
Leonie Voss — UI/UX Design Lead & Accessibility Advocate
Verdict: ✅ Approved
This is a backend-heavy PR; the only frontend surface is the small follow-up to
ImportStatusCard.svelte(an admin-only page) plus the i18n + error-mapping plumbing. I reviewed all four touched frontend files.What I checked
ImportStatusCard.svelte— adds a third branch to the$derivedfailure message for the newIMPORT_FAILED_ARTIFACTstatus, and a newreasonLabelcase forINVALID_FILENAME_PATH_TRAVERSAL. Logic stays in the$derived/function, not the template — consistent with project conventions. No raw backend strings reach the user; the code maps status codes to Paraglide messages. Good.error_import_artifact_invalid,import_reason_path_traversal, andadmin_system_import_failed_artifactare added to de/en/es (I confirmed de; the diff shows all three message files touched). The German copy is user-appropriate and actionable ("Eine Importdatei fehlt oder ist ungültig. Bitte führen Sie den Normalizer erneut aus.") — it tells the operator what to do, which satisfies Nielsen heuristic #9 (help users recover from errors).errors.ts—IMPORT_ARTIFACT_INVALIDadded to theErrorCodeunion and agetErrorMessagecase, so any surfacing of this code is localized. Noany, no leaked implementation detail.Notes (no action)
reasonLabel('INVALID_FILENAME_PATH_TRAVERSAL')will show alongside other skip reasons in the skipped-files list. As long as that list already uses redundant text labels (it does — these are text strings, not color chips), it's accessible. No change needed.Browser/axe tests are CI-only per project policy, so I'm assessing by reading. Nothing here degrades accessibility or brand consistency. Approved.
Review concerns addressed (Felix Brandt)
Pushed six atomic commits (
2f7ea37..fc53e77) resolving every raised concern. All seven affected/added test classes pass (68 tests, 0 fail, Testcontainers Postgres).--no-verifyused (backend-only; husky frontend lint can't run in a worktree).Main / Sara #1 / Elicit #1 — document collections now canonical-authoritative
2f7ea37—DocumentImporter.buildDocumentnow clears then re-populatesreceiversandtagsso a shrunk canonical row prunes stale links on re-import instead of accumulating. Rawsender_text/receiver_textretention is unchanged. Unit testload_prunesReceiversAndTags_whenCanonicalRowShrinks+ real-Postgresreimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks(5f53c36).Sara #2 / #3 — coverage gaps closed
7ebf7ac:PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationship— asserts a non-DUPLICATE/non-CIRCULARDomainExceptionpropagates (not swallowed), guarding against a future swallow-all refactor.CanonicalSheetReaderTest.get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeader— pins the short-row (POI omits trailing empty cells) read path.Markus #1 / Elicit #2 — provisional precedence pinned
5f53c36—import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRowproves a register/tree person (provisional=false) is never flipped by a later document row referencing the samesource_ref, regardless of order. The orchestrator already loads register+tree before documents; the monotonic-downward guard inmergeCanonical(now commented) does the rest.Markus #2 — ADR consequence
4fa2b83— ADR-025 now recordsrunImport()is non-transactional (per-loader transactions only); a partial failure leaves earlier loaders committed and is safe to retry because the import is idempotent.Main / Elicit #1 — ADR idempotency rule clarified
4fa2b83— ADR-025 now splits the idempotency rule: Person/Tag scalar fields = preserve human edits; document sender/receivers/tags = canonical-authoritative (clear-then-add). Notes the non-PLACEHOLDER skip bounds the blast radius.Felix — clarifying comments + idiom unification
e9ddaed—birthYear/deathYearfill-blank unified under anInteger preferHumanoverload (one self-documenting precedence idiom); guard test added. Comment inPersonTreeImporter.createRelationshipsnotes the relationshippersonIdfield carries a tree rowId, not a slug. Authoritative-collection semantics commented inDocumentImporter.Tobias #3 — DEPLOYMENT runbook
fc53e77— dropped the "or the documented entrypoint" hedge; pinned the exact command.venv/bin/python normalize.py(+ one-time venv setup). CI/branch-protection changes left to the owner per the review.Nora and Leonie approved with no action items. Not merging — leaving for re-review.
Markus Keller — Senior Application Architect
Verdict: Approved
I reviewed this as a fresh pass against the current diff, with special attention to whether the prior round's fixes held and introduced nothing new. The architecture is clean and the layering rules are respected throughout.
What I checked and liked
TagService,PersonService,RelationshipService,DocumentService) — never a foreign repository.PersonTreeImporterexplicitly routes relationships throughRelationshipService, exactly as the rule demands. The newfindBySourceRef/upsertBySourceRef/upsertBySourceRefmethods are added on the services, with the repository methods kept package-internal to the domain.CanonicalSheetReaderis a pure value-level helper with no Spring/domain coupling; each loader does one artifact. This is the modular-monolith shape I want — these packages could be extracted later without disentangling.runImport()consequence with the recovery story. A future maintainer is explicitly warned not to assume all-or-nothing semantics. This is exactly what ADRs are for.buildDocument→clear()+addAll()for receivers andattachTag→clear()for tags. The ADR is honest that this rule does NOT extend to human-linked associations being "kept" — the spec wants the canonical row to own these sets, and theALREADY_EXISTSguard (non-PLACEHOLDER docs skipped) bounds the overwrite blast radius to placeholder rows. No human-curated association on an archived/uploaded document is ever touched. This matches the design intent.l3-backend-3b) updated faithfully — the four loaders, the sheet reader, the numbered DAG edges, and theRelationshipServicedependency all match the code.Non-blocking observations
findFileRecursivere-walks the entire import directory for every document row (O(rows x tree)). For a large import this is repeated full-tree I/O. The oldMassImportServicehad the same shape, so this is not a regression — but a single pre-walk index built once perload()would be the cleaner structure if import volume grows. Admin-triggered, bounded, so it does not block.currentStatusas avolatilefield on a singleton — fine for the single-active-import contract, andrunImportAsyncguards re-entry. Acceptable.The structure is boring in the best way. Ship it.
Felix Brandt — Senior Fullstack Developer
Verdict: Approved
Clean, well-named, guard-clause-first code. I read every production file in the diff line by line.
What's good
resolveSender,resolveReceivers,attachTag,parseIsoDate,parsePrecision,isValidImportFilename,isPdfMagicBytes— each does one thing and is short enough to hold in working memory. NoManager/Helper/Wrappernames.preferHumanidiom inPersonServiceis the right call — a single overloaded helper expresses the fill-blank-only precedence for both String and Integer fields, instead of repeating the ternary at every setter. The comment explains the why (human edits win), not the what.if (index.isBlank()) continue;,if (slug.isBlank()) return null;,if (tagPath.isBlank()) return;. No nested conditionals.preferHumanrename (from the prior round) reads correctly and the rowId-not-slug trap comment inPersonTreeImporter.createRelationshipsis genuinely load-bearing — that's a non-obvious mapping that would bite a future reader, so the comment earns its place.blankToNullduplicated acrossDocumentImporter,PersonRegisterImporter,PersonTreeImporter, andPersonService(private static in each). Mild DRY violation, but each is a 1-liner and extracting a shared util would create a cross-package dependency for trivial gain — KISS wins, I would leave it.Frontend
errors.ts:IMPORT_ARTIFACT_INVALIDadded to the union type AND thegetErrorMessageswitch, with the three message bundles (de/en/es) all carryingerror_import_artifact_invalidand the newimport_reason_path_traversal/admin_system_import_failed_artifactkeys. The full error-code wiring checklist from CLAUDE.md is satisfied.ImportStatusCard.svelteuses a$derivedforfailureMessageand a cleanreasonLabellookup. The newINVALID_FILENAME_PATH_TRAVERSALandIMPORT_FAILED_ARTIFACTbranches are both wired.Minor
DocumentImportermixesPaths.get/Files.walk/new File(...)(java.io + java.nio). Consistent with the ported guards, so acceptable, but a future cleanup could standardize on NIO.TDD discipline is evident — the tests read as the spec. No changes requested.
Nora "NullX" Steiner — Application Security Engineer
Verdict: Approved with concerns
I came in adversarial — a "modular rewrite that deletes the old hardened code path" is exactly where path-traversal and XXE regressions sneak back in. I'm satisfied the guards survived the migration intact. The threat-model comments are exemplary.
Verified controls (CWE-22 / CWE-434)
isValidImportFilenamerejects/,\, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5),.., the bare., null byte, and absolute paths. Ported verbatim and carries the "do not weaken" comment.findFileRecursiveresolves the match and assertscandidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator), throwing on escape. The+ File.separatorsuffix correctly prevents the/import-evilsibling-prefix bypass. The symlink-escape regression test (findFileRecursive_throwsDomainException_whenSymlinkEscapesImportDir) proves a symlink whose target lives outside the import dir is caught —Files.walkdoes not follow links to directories, andgetCanonicalPath()resolves the link target so the containment check fires. Good.%PDF(0x25 0x50 0x44 0x46) is checked before any S3 upload; non-PDF content is skipped withINVALID_PDF_SIGNATURE. The IOException path is tested via a Mockito spy onopenFileStream.XxeSafeXmlParserand the whole ODS/XML path are deleted. The tree JSON is parsed with a plain JacksonObjectMapper(readTree) — JSON has no external-entity surface, so this is a genuine attack-surface reduction, not a regression. Good.filecolumn is treated as hostile even though it originates from our own normalizer — the ADR and the class Javadoc both state this explicitly ("CWE-22 does not care that it came from our Python tool"). This is the correct defense-in-depth posture.AdminControllerstill gatestrigger-import/import-statusbehindADMIN(the@RequirePermissionaspect is on the existing endpoints; this PR only swaps the injected bean). No new endpoints, no permission gaps.Concern (non-blocking)
findFileRecursivethrowsDomainException.internal, which propagates out ofimportRow(onlyInvalidImportFilenameExceptionis caught there) → out ofload()→ orchestrator marks the whole runFAILED. For a genuine attack this hard-fail is defensible, but it means one hostile row denies the whole import rather than being recorded as a per-file skip like the basename-reject case. Since the artifacts are operator-staged on a trusted host, the practical risk is low — but consider downgrading a containment failure to aSkipReasonfor consistency with the other file-level rejects. Not a blocker.No injection, no auth bypass, no data exposure. The deleted code was the riskiest path and it's gone. Approved.
Sara Holt — Senior QA Engineer
Verdict: Approved
This is the round where the test-isolation bug was the headline fix, so I scrutinized it specifically and swept every new test class for the same leak class. It's genuinely fixed and nothing else reintroduces it.
Test-isolation fix — verified
CanonicalImportIntegrationTestnow has an@AfterEach cleanup()that callsdocumentRepository.deleteAll()/personRepository.deleteAll()/tagRepository.deleteAll(), mirroring the@BeforeEach. The Javadoc on it is exactly the right kind of test comment: it explains why (the orchestrator is non-transactional, so committed rows can't roll back, and they were leaking into the shared Testcontainers DB and pollutingDocumentDensityIntegrationTest/DocumentSearchPagedIntegrationTest). This will stop a future maintainer from "cleaning up" the apparently-redundant teardown.CanonicalImportIntegrationTestis the only@SpringBootTestthat commits real rows. Every other new test —CanonicalImportOrchestratorTest,CanonicalSheetReaderTest,DocumentImporterTest,PersonRegisterImporterTest,PersonTreeImporterTest,TagTreeImporterTest,PersonImportUpsertTest,TagImportUpsertTest— is a pure@ExtendWith(MockitoExtension.class)unit test that mocks the repositories/services and never touches the DB. No new leak surface.Prior-round additions — verified present and correct
DocumentImporterTest.load_prunesReceiversAndTags_whenCanonicalRowShrinks(unit) andCanonicalImportIntegrationTest.reimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks(real Postgres, with a re-staged narrower sheet). The integration test correctly re-reads viafindByIdso the lazy collections initialise.PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationshipasserts anINTERNAL_ERRORis NOT swallowed, complementingload_swallowsDuplicateRelationship…. The swallow-list (DUPLICATE_RELATIONSHIP/CIRCULAR_RELATIONSHIP) is exactly scoped.get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeadercovers POI's trailing-empty-cell omission — the real-world failure mode. Good.PersonImportUpsertTest.upsertBySourceRef_fillsBlankYears_butPreservesHumanEditedYears_onReimportasserts a human birthYear survives while a blank deathYear is filled.CanonicalImportIntegrationTest.import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRowruns the import twice and asserts both register persons stayfalse. Solid integration-level proof.Style
Test names read as sentences; AAA structure throughout; sheet/JSON fixtures factored into
writeSheet/writeDocs/writehelpers. The security regressions are individually named per attack vector — easy to see at a glance which homoglyph or traversal token each covers.Coverage is thorough at the right pyramid levels. Approved.
Tobias Wendt — DevOps & Platform Engineer
Verdict: Approved with concerns
Backend-only change, no compose/CI/Dockerfile edits, so my surface is config, the runbook, and operational failure modes. The config simplification is a net win.
What I like
application.yamlsimplification. The brittle positionalapp.import.col.*index block is gone, replaced by a single self-documentingapp.import.dir: ${IMPORT_DIR:/import}with a comment pointing at ADR-025. Fewer knobs, fewer ways to misconfigure. The${IMPORT_DIR:/import}default matches the dev bind-mount convention..venv/bin/python normalize.py, with the one-time venv bootstrap), lists the four required artifacts by name, and documents the fail-closed behaviour (IMPORT_ARTIFACT_INVALIDwhen any artifact is missing). The staging/prodrsync+IMPORT_HOST_DIRsteps are unchanged and still correct. The idempotent-rerun note ("Re-running is safe") is exactly what an operator needs to know.FAILED, and the fix is "repair the artifact and re-trigger" with no manual DB cleanup. That's the right operability property for a small team — no fragile all-or-nothing transaction spanning four loaders.Concerns (non-blocking)
persist(), when a PLACEHOLDER doc gains a file, a freshdocuments/{UUID}_{name}key is minted and uploaded; if that exact run is interrupted after upload but the doc later re-imports, the prior object can be left unreferenced in MinIO (theALREADY_EXISTSguard bounds this to the transition, so it's at most one orphan per doc, not unbounded growth). No lifecycle/cleanup job exists. Low impact at this archive's volume — worth a backlog note, not a blocker.findFileRecursivewalks the whole/importmount once per document row. On a large staged import over a network/bind mount this is repeated I/O. Admin-triggered and bounded, but if import sets grow into the thousands of PDFs, watch the wall-clock time. Pre-existing shape fromMassImportService, so not a regression.--no-verifycommits. The PR notes husky frontend lint can't run in a worktree, so backend commits bypassed hooks. CI owns the real lint/check sweep, so this is acceptable here, but flagging per the project's own rule that hook-skips should be explicit and intentional — which this is.No infrastructure regressions. Config got simpler and the runbook is accurate. Approved.
Leonie Voss — UI/UX Design Lead
Verdict: Approved
The frontend footprint is small — a follow-up to surface the two new backend states in the admin Import Status Card — and it's handled correctly. This is an admin-only tool, not the read path our 67-year-old phone users touch, so the bar here is "an admin can understand what failed," and it clears it.
What I checked
ImportStatusCard.svelteadds theIMPORT_FAILED_ARTIFACTbranch to thefailureMessage$derived, mapping toadmin_system_import_failed_artifact("A canonical import file is missing or invalid." / "Eine Importdatei fehlt oder ist ungültig."). No raw status code leaks to the screen.reasonLabelgainsINVALID_FILENAME_PATH_TRAVERSAL → import_reason_path_traversal("Invalid filename (path)" / "Ungültiger Dateiname (Pfad)"). It joins the existing per-reason labels in the amber skipped-files section.error_import_artifact_invalid,import_reason_path_traversal,admin_system_import_failed_artifact. No locale will fall back to a missing-key crash or an English string on a German screen. The Spanish translations read naturally.Note
statusCodeand the lookup function is a flatif-chain — both render-predictable. No accessibility regressions: this is text content inside the existing card structure, no new interactive controls.Small, correct, fully localized. Approved.
Elicit — Requirements Engineer & Business Analyst
Verdict: Approved with concerns
I review requirements coverage and traceability, not code. This PR closes #669 (Phase 3 of "Handling the Unknowns") and the behaviour is well-specified and testable. The migration story is honest about its trade-offs, which is what I most care about.
Requirements coverage — clear and testable
provisionalmonotonic-downward); document sender/receivers/tags are canonical-authoritative (re-import prunes removed links). Each rule has a named acceptance test. This is exactly the kind of testable, unambiguous requirement I push for.sender_text/receiver_text) is always retained even when a person is linked — this is the "honest uncertainty about people" promise of the milestone, and it has its own dedicated assertions. The provisional flag (importer-minted = uncertain, register/tree = confident) operationalizes "honest uncertainty" in the data model.IMPORT_ARTIFACT_INVALID, surfaced to the admin with actionable copy), not an accident. The GLOSSARY entries (Canonical import, canonical artifact, CanonicalSheetReader, SkippedFile) keep the shared vocabulary current so a non-technical owner can still reason about the system.Concerns / open items to capture in the backlog (non-blocking)
ALREADY_EXISTS) bounds this to placeholder rows, and the design intent is that the canonical row owns these sets. That's a reasonable rule, but it's a workflow assumption — confirm the product owner accepts "don't hand-curate associations on placeholder docs; do it after upload." Worth one line in the user-facing import docs.FAILED, recovered by re-trigger. That's sound, but it's a runbook expectation an operator must internalize — the DEPLOYMENT note covers it; just ensure whoever runs imports has read it.Requirements are traceable to #669 and to named tests. No scope creep, no contradictions. Approved.