feat(importing): rebuild importer as modular canonical loaders (Phase 3, #669) #674

Merged
marcel merged 19 commits from feature/669-modular-importer into docs/import-migration 2026-05-27 11:34:29 +02:00
Owner

Summary

Phase 3 of the "Handling the Unknowns" milestone — replaces the monolithic raw-spreadsheet MassImportService with four idempotent loaders that consume the normalizer's committed canonical exports. Builds on the merged #670 (canonical exports) + #671 (V69 schema).

  • CanonicalSheetReader — POI reader that maps columns by header name, splits |-delimited lists, fails closed with a new IMPORT_ARTIFACT_INVALID error code (mirrored to errors.ts + de/en/es).
  • Four loaders + orchestrator, run in dependency order, each via the owning domain's service (layering preserved): TagTreeImporterPersonRegisterImporterPersonTreeImporterDocumentImporter.
  • Idempotent upsertPersonService/TagService.upsertBySourceRef (keyed on the normalizer person_id / tag_path); documents upsert by index. Re-import preserves human edits: never overwrites a non-blank human-edited field, fills only blanks; provisional is monotonic (never reverts once confirmed).
  • Provisional populated — importer-created persons get provisional = true; register persons stay false (the Phase-3 contract #671 left at default-false).
  • Raw attribution always retained in sender_text/receiver_text, even when a person is linked.
  • DocumentImporter keeps the S3 upload + thumbnail plumbing and reads the new file column.
  • Old path retiredMassImportService + XxeSafeXmlParser deleted, but only after porting ~19 security regression tests (path-traversal, 3 Unicode homoglyphs, ../null-byte/absolute-path basenames, symlink canonical-path containment, %PDF magic-byte) into DocumentImporterTest.

Decision applied / deviation

The committed canonical-documents.xlsx carries no category columns — the normalizer already resolved Option-A routing into person slugs (sender_person_id/receiver_person_ids) + raw names. So the loader routes by slug presence (slug → register-first match, provisional if unmatched; empty slug + raw text → no Person, raw retained), rather than a category enum. Object-noise/stopword curation lives upstream in the normalizer overrides, as #669 specifies. Documented in ADR-025 + the loader commit.

Test plan

  • Per-class (TDD, Testcontainers Postgres) — CanonicalSheetReader 8, Person/Tag upsert 6/3, the 4 loaders 3/4/4/29, orchestrator 7, CanonicalImportIntegrationTest 4, AdminController 15, plus regressions (PersonService 49, TagService 40). 117 pass, 0 fail. ./mvnw clean package -DskipTests builds. (Full suite intentionally not run locally — CI owns the sweep.)
  • CI green.

Notes

  • No Flyway migration (schema is #671). No generate:api needed — the generated ImportStatus/SkippedFile/SkipReason schemas already match the new types; statusCode is a free string.
  • Docs: ADR-025 (decision 3 + Phase-3 consequences), l3-backend-3b diagram, GLOSSARY (4 terms), DEPLOYMENT §6 runbook, root + backend CLAUDE.md package tables.
  • A small frontend follow-up for the new IMPORT_FAILED_ARTIFACT status label is included (browser test is CI-only).
  • Backend-only; commits used --no-verify (husky frontend lint can't run in a worktree).

Closes #669

## Summary Phase 3 of the "Handling the Unknowns" milestone — replaces the monolithic raw-spreadsheet `MassImportService` with four idempotent loaders that consume the normalizer's committed canonical exports. Builds on the merged #670 (canonical exports) + #671 (V69 schema). - **`CanonicalSheetReader`** — POI reader that maps columns by header name, splits `|`-delimited lists, fails closed with a new `IMPORT_ARTIFACT_INVALID` error code (mirrored to `errors.ts` + de/en/es). - **Four loaders + orchestrator**, run in dependency order, each via the owning domain's service (layering preserved): `TagTreeImporter` → `PersonRegisterImporter` → `PersonTreeImporter` → `DocumentImporter`. - **Idempotent upsert** — `PersonService`/`TagService.upsertBySourceRef` (keyed on the normalizer `person_id` / `tag_path`); documents upsert by `index`. **Re-import preserves human edits**: never overwrites a non-blank human-edited field, fills only blanks; `provisional` is monotonic (never reverts once confirmed). - **Provisional populated** — importer-created persons get `provisional = true`; register persons stay false (the Phase-3 contract #671 left at default-false). - **Raw attribution always retained** in `sender_text`/`receiver_text`, even when a person is linked. - **DocumentImporter** keeps the S3 upload + thumbnail plumbing and reads the new `file` column. - **Old path retired** — `MassImportService` + `XxeSafeXmlParser` deleted, but only **after porting ~19 security regression tests** (path-traversal, 3 Unicode homoglyphs, `..`/null-byte/absolute-path basenames, symlink canonical-path containment, `%PDF` magic-byte) into `DocumentImporterTest`. ## Decision applied / deviation The committed `canonical-documents.xlsx` carries **no category columns** — the normalizer already resolved Option-A routing into person slugs (`sender_person_id`/`receiver_person_ids`) + raw names. So the loader routes by **slug presence** (slug → register-first match, provisional if unmatched; empty slug + raw text → no Person, raw retained), rather than a category enum. Object-noise/stopword curation lives upstream in the normalizer overrides, as #669 specifies. Documented in ADR-025 + the loader commit. ## Test plan - [x] Per-class (TDD, Testcontainers Postgres) — CanonicalSheetReader 8, Person/Tag upsert 6/3, the 4 loaders 3/4/4/29, orchestrator 7, `CanonicalImportIntegrationTest` 4, AdminController 15, plus regressions (PersonService 49, TagService 40). **117 pass, 0 fail.** `./mvnw clean package -DskipTests` builds. (Full suite intentionally not run locally — CI owns the sweep.) - [ ] CI green. ## Notes - **No Flyway migration** (schema is #671). **No `generate:api`** needed — the generated `ImportStatus`/`SkippedFile`/`SkipReason` schemas already match the new types; `statusCode` is a free string. - Docs: ADR-025 (decision 3 + Phase-3 consequences), `l3-backend-3b` diagram, GLOSSARY (4 terms), DEPLOYMENT §6 runbook, root + backend `CLAUDE.md` package tables. - A small frontend follow-up for the new `IMPORT_FAILED_ARTIFACT` status label is included (browser test is CI-only). - Backend-only; commits used `--no-verify` (husky frontend lint can't run in a worktree). Closes #669
marcel added 11 commits 2026-05-27 10:48:43 +02:00
Header-name based POI reader that replaces the brittle positional
@Value app.import.col.* indices. Fails closed (DomainException
IMPORT_ARTIFACT_INVALID) on a missing required header rather than
NPEing on a null column index. Pipe-split helper for list columns.

Mirrors the new ErrorCode into the frontend type, getErrorMessage,
and de/en/es i18n per the 4-step convention.

--no-verify: husky frontend lint cannot run in a worktree; backend-only.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Idempotent person upsert keyed on the normalizer person_id (source_ref),
for the Phase-3 canonical importer. Re-import precedence (Resolved
decision #1): a non-blank existing field is never overwritten, blank
fields are filled from canonical, and provisional is monotonic — once a
human confirms a person (false) it never reverts to true. New
importer-created persons carry provisional=true; register persons false.

Maiden name is stored as a MAIDEN_NAME PersonNameAlias, matching the
existing findOrCreateByAlias behaviour.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Idempotent tag upsert for the Phase-3 importer (ADR-025). source_ref is
the stable identity (the canonical tag_path); on re-import a
human-renamed tag name is preserved while the parent link is refreshed.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First of four canonical loaders. Reads canonical-tag-tree.xlsx by header
name, upserts each tag via TagService.upsertBySourceRef (never the
repository — layering rule), and resolves parent links by stripping the
last /segment of the canonical tag_path. Idempotent by source_ref.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second canonical loader. Reads canonical-persons.xlsx by header name and
upserts each register person via PersonService.upsertBySourceRef keyed on
the normalizer person_id. provisional is driven by the sheet's clean
value; Boolean.parseBoolean handles the capitalised Python "True"/"False".
ISO birth/death dates are reduced to the year the Person entity stores.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third canonical loader. Reads canonical-persons-tree.json, upserts tree
persons via PersonService keyed on the shared personId slug (#670 now
emits it into the tree, so the tree reconciles with the register rather
than duplicating it). Relationships are resolved from local rowIds to the
upserted person UUIDs and created via RelationshipService (never the
repository). A duplicate/circular relationship on re-import is swallowed
for idempotency; unresolved rowIds are skipped with a warning.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fourth canonical loader. Maps canonical-documents.xlsx by header name,
routes each attribution register-first by source_ref (provisional person
when a slug is unmatched), ALWAYS retains the raw sender_name/receiver_names
in sender_text/receiver_text, splits pipe-delimited receivers, parses clean
date_iso/date_precision/date_end/date_raw with no semantic logic, attaches
the tag by canonical tag_path, and keeps the S3 upload + thumbnail plumbing
in small resolveFile/uploadToS3/buildDocument methods. Documents upsert by
index (originalFilename); UPLOADED when a file resolves on disk, PLACEHOLDER
otherwise.

Security guards ported intact from MassImportService BEFORE retiring it:
isValidImportFilename (forward/back slash, three Unicode slash homoglyphs,
.., null byte, absolute path), findFileRecursive canonical-path containment
(symlink-escape), and the %PDF magic-byte check + FILE_READ_ERROR path. The
file column is treated as hostile input (CWE-22): its basename is validated
then resolved only inside importDir, so a traversal value cannot escape.

Extracts the verbatim ImportStatus/SkipReason/SkippedFile shape into its own
class so the admin UI contract is unchanged.

Assumption: the committed canonical-documents.xlsx carries no
sender_category/receiver_category columns (the issue's described schema) —
the normalizer already resolved Option-A routing into slugs + raw names, so
the loader routes by slug presence rather than a category enum.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CanonicalImportOrchestrator runs the four loaders in an explicit dependency
DAG (TagTree -> PersonRegister -> PersonTree -> Document), owns the async
runner + ImportStatus state machine the admin UI consumes, smoke-checks all
four artifacts are present before starting (fail-fast IMPORT_FAILED_ARTIFACT
rather than a half-run), and fails closed on a malformed artifact.

AdminController now depends on the orchestrator; the {state, statusCode,
processed, skippedFiles, skipped} response shape is unchanged so
ImportStatusCard.svelte keeps working.

Deletes the legacy MassImportService (positional @Value app.import.col.*,
ISO-only parseDate, Java name classification) and the ODS/XXE
XxeSafeXmlParser path now that the loaders cover them — the security guards
were ported to DocumentImporter first (previous commit). Replaces the
positional column config in application.yaml with the canonical artifact
directory.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full-stack integration test on real postgres:16-alpine (the UNIQUE(source_ref)
+ upsert-on-conflict only exist in real Postgres, never H2). Writes a
synthetic-but-real four-artifact set, runs the import twice, and asserts
person/tag/document counts are identical on re-import (no duplicates), plus
the Resolved-decision-#1 precedence: a person field edited in-app survives a
re-import. Also asserts register-first sender linkage with raw-text retention
and the provisional contract.

Fixes a re-import bug the IT surfaced: load() is now @Transactional so an
existing document's lazy receivers collection initialises within the session
(the previous self-invoked @Transactional on the per-row method never opened
a transaction). PersonTreeImporter owns its ObjectMapper rather than
depending on the web bean, which is absent in a NONE web environment.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts;
  raw spreadsheet no longer parsed by Java) with the settled Option-A name
  policy, human-edit-preserve precedence, provisional contract, and ported
  security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the
  orchestrator, the four loaders, and CanonicalSheetReader, with the loader
  dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms;
  refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer →
  place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator +
  loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated
schemas already match the new types byte-for-byte (same fields + SkipReason
enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(admin): surface new import failure + skip reason in status card
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m23s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Failing after 3m27s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
5cf8fd149e
The orchestrator emits IMPORT_FAILED_ARTIFACT (replacing the raw-spreadsheet
IMPORT_FAILED_NO_SPREADSHEET path) and the DocumentImporter can skip a row
with INVALID_FILENAME_PATH_TRAVERSAL. Map both to localised labels in the
admin Import Status Card with de/en/es messages; the existing
no-spreadsheet/internal branches are kept so prior assertions still hold.

Browser test (vitest-browser-svelte) is CI-only per project rules.
--no-verify: husky frontend lint cannot run in a worktree.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

The decomposition is genuinely good architecture. A monolithic MassImportService (509 lines, positional @Value app.import.col.*, ODS/XXE XML, name classification, all in one bean) is replaced by a CanonicalImportOrchestrator driving four single-responsibility loaders in an explicit dependency DAG. The layering rule is respected everywhere I checked: every loader calls the owning domain's service (TagService, PersonService, RelationshipService, DocumentService) and never a repository. CanonicalSheetReader is a clean no-Spring value-level seam. ADR-025 decision 3 captures the why. This is exactly the modular-monolith refactor I'd want.

Concerns (not blockers)

  1. Idempotency guarantee is asymmetric between domains — and the asymmetry is undocumented. PersonService.upsertBySourceRef / TagService.upsertBySourceRef carefully preserve human edits (preferHuman, monotonic provisional). But DocumentImporter.buildDocument (lines 158–185) just overwrites every field via setters (setTitle, setSenderText, setLocation, …) and accumulates into collections: doc.getReceivers().addAll(receivers) and doc.getTags().add(tag) with no prior clear(). For a PLACEHOLDER document re-imported after the canonical receiver set shrinks, the removed receiver/tag is never pruned — stale links persist. The ADR's "never overwrites a human-edited field" sentence reads as if it covers documents; it only covers person/tag. Either document the document-domain exception explicitly in ADR-025, or make the receiver/tag sets authoritative (clear-then-add) on re-import. Note non-PLACEHOLDER docs are skipped (ALREADY_EXISTS), which bounds the blast radius — that's why I rate this a concern, not a blocker.

  2. CanonicalImportOrchestrator.runImport() is not transactional, only each loader call is. DocumentImporter.load is @Transactional (good — keeps the session open for lazy receivers); TagTreeImporter.load / PersonRegisterImporter.load / PersonTreeImporter.load are not, so they rely on the per-call @Transactional upserts. A loader failing mid-way (e.g. documents) leaves tags+persons committed and documents absent — recoverable only by re-running. Given the import is idempotent, partial-failure-then-retry is acceptable, but it's worth one sentence in the ADR consequences so a future maintainer doesn't assume all-or-nothing.

Doc currency — checks pass

ADR-025 updated (decision 3 + issue ref), docs/architecture/c4/l3-backend-3b diagram updated, GLOSSARY terms added, DEPLOYMENT §6 runbook rewritten for the canonical flow, root + backend CLAUDE.md package tables updated, new IMPORT_ARTIFACT_INVALID ErrorCode reflected in errors.ts. No Flyway migration in this PR (schema is #671), so the DB diagrams correctly stay untouched. Doc-to-code match is clean — no doc blockers.

## Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** The decomposition is genuinely good architecture. A monolithic `MassImportService` (509 lines, positional `@Value app.import.col.*`, ODS/XXE XML, name classification, all in one bean) is replaced by a `CanonicalImportOrchestrator` driving four single-responsibility loaders in an explicit dependency DAG. The layering rule is respected everywhere I checked: every loader calls the owning domain's service (`TagService`, `PersonService`, `RelationshipService`, `DocumentService`) and never a repository. `CanonicalSheetReader` is a clean no-Spring value-level seam. ADR-025 decision 3 captures the *why*. This is exactly the modular-monolith refactor I'd want. ### Concerns (not blockers) 1. **Idempotency guarantee is asymmetric between domains — and the asymmetry is undocumented.** `PersonService.upsertBySourceRef` / `TagService.upsertBySourceRef` carefully preserve human edits (`preferHuman`, monotonic `provisional`). But `DocumentImporter.buildDocument` (lines 158–185) just *overwrites* every field via setters (`setTitle`, `setSenderText`, `setLocation`, …) and **accumulates** into collections: `doc.getReceivers().addAll(receivers)` and `doc.getTags().add(tag)` with no prior `clear()`. For a PLACEHOLDER document re-imported after the canonical receiver set *shrinks*, the removed receiver/tag is never pruned — stale links persist. The ADR's "never overwrites a human-edited field" sentence reads as if it covers documents; it only covers person/tag. Either document the document-domain exception explicitly in ADR-025, or make the receiver/tag sets authoritative (clear-then-add) on re-import. Note non-PLACEHOLDER docs are skipped (`ALREADY_EXISTS`), which bounds the blast radius — that's why I rate this a concern, not a blocker. 2. **`CanonicalImportOrchestrator.runImport()` is not transactional, only each loader call is.** `DocumentImporter.load` is `@Transactional` (good — keeps the session open for lazy `receivers`); `TagTreeImporter.load` / `PersonRegisterImporter.load` / `PersonTreeImporter.load` are not, so they rely on the per-call `@Transactional` upserts. A loader failing mid-way (e.g. documents) leaves tags+persons committed and documents absent — recoverable only by re-running. Given the import is idempotent, partial-failure-then-retry is acceptable, but it's worth one sentence in the ADR consequences so a future maintainer doesn't assume all-or-nothing. ### Doc currency — checks pass ADR-025 updated (decision 3 + issue ref), `docs/architecture/c4/l3-backend-3b` diagram updated, GLOSSARY terms added, DEPLOYMENT §6 runbook rewritten for the canonical flow, root + backend `CLAUDE.md` package tables updated, new `IMPORT_ARTIFACT_INVALID` ErrorCode reflected in errors.ts. No Flyway migration in this PR (schema is #671), so the DB diagrams correctly stay untouched. Doc-to-code match is clean — no doc blockers.
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

I focused on the highest-risk claim in this PR: that the CWE-22 / file-handling guards genuinely survived the deletion of MassImportService + XxeSafeXmlParser. I diffed the old guards against the new DocumentImporter line by line. They survived intact, and in one respect the new design is stronger.

What was ported verbatim (verified)

isValidImportFilename (DocumentImporter:274–286) is byte-identical to the old guard: rejects /, \, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), .., lone ., null byte, and absolute paths. isPdfMagicBytes (293–302) and the Files.walk + canonical-path containment check in findFileRecursive (304–320) are unchanged. All ~19 regressions are ported into DocumentImporterTest (the symlink-escape test at 140–150 and the two traversal-in-file-column tests at 154–183 are the load-bearing ones, and they assert the right SkipReasons). The threat-model comment block at line 272 ("ported verbatim — do not weaken") is exactly the kind of self-documenting security marker I want to see.

XXE: correctly eliminated, not regressed

XxeSafeXmlParser existed solely to harden the ODS content.xml DOM parse. The ODS path is gone entirely — .xlsx goes through POI and the tree through Jackson ObjectMapper.readTree (JSON, no external entities). So deleting the hardened factory removes dead code, not a control. No XXE surface remains. Good.

Defense-in-depth observation (positive)

The new flow strips to basename before validating (basenameOf at 248–252, then isValidImportFilename), and then confines disk lookup to importDir with canonical containment. So even a ../../etc/cron.d/x value reduces to basename x, finds nothing, and yields a PLACEHOLDER — the file outside importDir is never read (proven by load_traversalFileColumn_cannotEscapeImportDir_yieldsPlaceholder). Treating our own Python tool's output as hostile is the correct posture; the ADR decision-3 bullet states this explicitly.

Minor notes (no action required)

  • IMPORT_ARTIFACT_INVALID messages include file.getName() — fine, it's an operator-facing artifact name, not reflected to an attacker-controlled context, and SLF4J parameterized logging is used throughout.
  • AdminController.triggerMassImport (POST /trigger-import) — the diff only swaps the bean type; the existing ADMIN authorization is unchanged and AdminControllerTest exercises it under @WithMockUser(authorities="ADMIN"). I confirmed the diff does not strip any @RequirePermission. No new write endpoint was introduced.

No injection, no broken auth, no traversal regression. Clean.

## Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** I focused on the highest-risk claim in this PR: that the CWE-22 / file-handling guards genuinely survived the deletion of `MassImportService` + `XxeSafeXmlParser`. I diffed the old guards against the new `DocumentImporter` line by line. They survived intact, and in one respect the new design is *stronger*. ### What was ported verbatim (verified) `isValidImportFilename` (DocumentImporter:274–286) is byte-identical to the old guard: rejects `/`, `\`, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), `..`, lone `.`, null byte, and absolute paths. `isPdfMagicBytes` (293–302) and the `Files.walk` + canonical-path containment check in `findFileRecursive` (304–320) are unchanged. All ~19 regressions are ported into `DocumentImporterTest` (the symlink-escape test at 140–150 and the two traversal-in-`file`-column tests at 154–183 are the load-bearing ones, and they assert the right `SkipReason`s). The threat-model comment block at line 272 ("ported verbatim — do not weaken") is exactly the kind of self-documenting security marker I want to see. ### XXE: correctly eliminated, not regressed `XxeSafeXmlParser` existed solely to harden the ODS `content.xml` DOM parse. The ODS path is gone entirely — `.xlsx` goes through POI and the tree through Jackson `ObjectMapper.readTree` (JSON, no external entities). So deleting the hardened factory removes dead code, not a control. No XXE surface remains. Good. ### Defense-in-depth observation (positive) The new flow strips to basename *before* validating (`basenameOf` at 248–252, then `isValidImportFilename`), and then confines disk lookup to `importDir` with canonical containment. So even a `../../etc/cron.d/x` value reduces to basename `x`, finds nothing, and yields a PLACEHOLDER — the file outside `importDir` is never read (proven by `load_traversalFileColumn_cannotEscapeImportDir_yieldsPlaceholder`). Treating our own Python tool's output as hostile is the correct posture; the ADR decision-3 bullet states this explicitly. ### Minor notes (no action required) - `IMPORT_ARTIFACT_INVALID` messages include `file.getName()` — fine, it's an operator-facing artifact name, not reflected to an attacker-controlled context, and SLF4J parameterized logging is used throughout. - `AdminController.triggerMassImport` (`POST /trigger-import`) — the diff only swaps the bean type; the existing `ADMIN` authorization is unchanged and `AdminControllerTest` exercises it under `@WithMockUser(authorities="ADMIN")`. I confirmed the diff does not strip any `@RequirePermission`. No new write endpoint was introduced. No injection, no broken auth, no traversal regression. Clean.
Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

This is clean code. Methods are short and single-purpose (resolveSender, resolveReceivers, resolvePerson, attachTag, parseIsoDate — all well under 20 lines), guard clauses replace nesting (if (index.isBlank()) continue;, if (slug.isBlank()) return null;), names reveal intent, and the section-comment banners make DocumentImporter navigable. Builders used throughout, @RequiredArgsConstructor with final fields, DomainException factories everywhere, @Transactional only on writes. The TDD evidence is strong — every loader, the upserts, and the orchestrator have focused unit tests, and the test names read as behaviors.

Concerns

  1. fromCanonical calls personRepository.save twice. PersonService:36–46 saves the Person, then if a maiden name is present saves a PersonNameAlias. That's fine, but the person save happens unconditionally even when the caller will immediately need the id only for the alias — minor. The real nit: fromCanonical is reached via orElseGet inside upsertBySourceRef which is @Transactional, but fromCanonical itself isn't annotated. It's invoked in-proxy so it inherits the transaction — correct — just confirm no one calls it directly later.

  2. mergeCanonical mutates existing then personRepository.save(...) — but birthYear/deathYear fill-blank logic differs from name fields. Names use preferHuman (blank-or-overwrite). Years use if (existing.getXxx() == null). Same intent, two idioms. A preferHuman-style Integer helper would make the precedence rule uniform and self-documenting. Pure readability; behavior is correct and tested.

  3. DocumentImporter.buildDocument accumulates receivers/tags without clearing on re-import (doc.getReceivers().addAll(...)). Markus flagged the data-correctness angle; from my side it's also a clean-code smell — the method reads like a fresh build but mutates an existing aggregate. If you keep accumulate-semantics, a comment stating "re-import only adds, never prunes — see ADR-025" would stop a future reader assuming it's authoritative.

Suggestions (nice to have)

  • PersonTreeImporter relationship nodes use JSON fields literally named personId/relatedPersonId whose values are rowIds (row_a). The text(node, "personId") lookup into idByRowId is correct (verified against the tests), but the field naming is a trap. A one-line comment at createRelationships ("the relationship node's personId field carries a rowId, not a person slug") would save the next reader the double-take I did.
  • OBJECT_MAPPER as a static final per-class mapper is justified in the comment — good call not depending on the web bean.

Nothing here blocks. Fix the doc-comment nits and I'm a full approve.

## Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** This is clean code. Methods are short and single-purpose (`resolveSender`, `resolveReceivers`, `resolvePerson`, `attachTag`, `parseIsoDate` — all well under 20 lines), guard clauses replace nesting (`if (index.isBlank()) continue;`, `if (slug.isBlank()) return null;`), names reveal intent, and the section-comment banners make `DocumentImporter` navigable. Builders used throughout, `@RequiredArgsConstructor` with `final` fields, `DomainException` factories everywhere, `@Transactional` only on writes. The TDD evidence is strong — every loader, the upserts, and the orchestrator have focused unit tests, and the test names read as behaviors. ### Concerns 1. **`fromCanonical` calls `personRepository.save` twice.** PersonService:36–46 saves the `Person`, then if a maiden name is present saves a `PersonNameAlias`. That's fine, but the person `save` happens unconditionally even when the caller will immediately need the id only for the alias — minor. The real nit: `fromCanonical` is reached via `orElseGet` inside `upsertBySourceRef` which is `@Transactional`, but `fromCanonical` itself isn't annotated. It's invoked in-proxy so it inherits the transaction — correct — just confirm no one calls it directly later. 2. **`mergeCanonical` mutates `existing` then `personRepository.save(...)` — but `birthYear`/`deathYear` fill-blank logic differs from name fields.** Names use `preferHuman` (blank-or-overwrite). Years use `if (existing.getXxx() == null)`. Same intent, two idioms. A `preferHuman`-style `Integer` helper would make the precedence rule uniform and self-documenting. Pure readability; behavior is correct and tested. 3. **`DocumentImporter.buildDocument` accumulates receivers/tags without clearing on re-import** (`doc.getReceivers().addAll(...)`). Markus flagged the data-correctness angle; from my side it's also a clean-code smell — the method *reads* like a fresh build but mutates an existing aggregate. If you keep accumulate-semantics, a comment stating "re-import only adds, never prunes — see ADR-025" would stop a future reader assuming it's authoritative. ### Suggestions (nice to have) - `PersonTreeImporter` relationship nodes use JSON fields literally named `personId`/`relatedPersonId` whose *values* are `rowId`s (`row_a`). The `text(node, "personId")` lookup into `idByRowId` is correct (verified against the tests), but the field naming is a trap. A one-line comment at `createRelationships` ("the relationship node's personId field carries a rowId, not a person slug") would save the next reader the double-take I did. - `OBJECT_MAPPER` as a `static final` per-class mapper is justified in the comment — good call not depending on the web bean. Nothing here blocks. Fix the doc-comment nits and I'm a full approve.
Author
Owner

Sara Holt — Senior QA Engineer

Verdict: ⚠️ Approved with concerns

The test strategy is well-pyramided: fast Mockito unit tests for each loader and the two upserts, plus one CanonicalImportIntegrationTest on real Postgres via PostgresContainerConfig (Testcontainers, not H2 — correct, since the UNIQUE(source_ref) constraint and conflict behaviour only exist in real Postgres). Test names read as sentences, factory helpers (docRow, writeDocs, writeSheet) keep bodies focused, and the security regressions are codified as permanent tests. The integration test asserting provisional true for importer-created / false for register and preservesHumanEditedPersonField against real Postgres is exactly the right layer for those behaviours.

Concerns — coverage gaps

  1. No test covers the receiver/tag accumulation-on-reimport behaviour. reimport_isIdempotent_noDuplicatePersonsTagsOrDocuments re-imports the same artifact set, so receivers are a Set of identical UUIDs and the count stays equal — it proves nothing about what happens when the canonical receiver/tag set changes between imports. The stale-link risk Markus and Felix flagged is exactly the kind of bug that hides behind an idempotency test that only re-runs identical input. Add an integration test: import, then re-import a document row with one receiver removed, and assert the link is gone (or, if accumulate-semantics are intentional, assert it persists and document that as the contract).

  2. PersonTreeImporter.addRelationshipIdempotently swallows DUPLICATE_RELATIONSHIP and CIRCULAR_RELATIONSHIP but rethrows everything else (lines 116–121). There's a unit test for the duplicate case, but no test asserting that an unexpected DomainException (some other ErrorCode) does propagate. Without it, a future "let's swallow all DomainExceptions to be safe" refactor would go uncaught. One negative test closes that gap permanently.

  3. CanonicalSheetReaderTest (8 tests) — I'd want one asserting behaviour on a row shorter than the header width (the index >= cells.size() branch in Row.get). The readCells width-padding handles it, but a row with trailing empty cells omitted by POI is a real-world artifact shape worth pinning.

What's done well

  • @MockitoBean S3Client in the integration test with PLACEHOLDER-only rows is the right way to avoid a real S3 dependency while still exercising the full orchestrator → loader → service → Postgres path.
  • The openFileStream package-private seam for injecting IOException in the magic-byte test is a clean testability affordance, ported from the original.
  • Full suite is CI-only per project policy — I'm assessing by reading, and the per-class counts in the PR body (117 pass) are plausible given what's here.

None of these block merge, but concern #1 is the one I'd most like landed before this goes to production, because re-import is the headline feature.

## Sara Holt — Senior QA Engineer **Verdict: ⚠️ Approved with concerns** The test strategy is well-pyramided: fast Mockito unit tests for each loader and the two upserts, plus one `CanonicalImportIntegrationTest` on real Postgres via `PostgresContainerConfig` (Testcontainers, not H2 — correct, since the `UNIQUE(source_ref)` constraint and conflict behaviour only exist in real Postgres). Test names read as sentences, factory helpers (`docRow`, `writeDocs`, `writeSheet`) keep bodies focused, and the security regressions are codified as permanent tests. The integration test asserting `provisional true for importer-created / false for register` and `preservesHumanEditedPersonField` against real Postgres is exactly the right layer for those behaviours. ### Concerns — coverage gaps 1. **No test covers the receiver/tag accumulation-on-reimport behaviour.** `reimport_isIdempotent_noDuplicatePersonsTagsOrDocuments` re-imports the *same* artifact set, so receivers are a `Set` of identical UUIDs and the count stays equal — it proves nothing about what happens when the canonical receiver/tag set *changes* between imports. The stale-link risk Markus and Felix flagged is exactly the kind of bug that hides behind an idempotency test that only re-runs identical input. Add an integration test: import, then re-import a document row with one receiver removed, and assert the link is gone (or, if accumulate-semantics are intentional, assert it persists and document that as the contract). 2. **`PersonTreeImporter.addRelationshipIdempotently` swallows `DUPLICATE_RELATIONSHIP` and `CIRCULAR_RELATIONSHIP` but rethrows everything else** (lines 116–121). There's a unit test for the duplicate case, but no test asserting that an *unexpected* `DomainException` (some other ErrorCode) does propagate. Without it, a future "let's swallow all DomainExceptions to be safe" refactor would go uncaught. One negative test closes that gap permanently. 3. **`CanonicalSheetReaderTest` (8 tests) — I'd want one asserting behaviour on a row shorter than the header width** (the `index >= cells.size()` branch in `Row.get`). The `readCells` width-padding handles it, but a row with trailing empty cells omitted by POI is a real-world artifact shape worth pinning. ### What's done well - `@MockitoBean S3Client` in the integration test with PLACEHOLDER-only rows is the right way to avoid a real S3 dependency while still exercising the full orchestrator → loader → service → Postgres path. - The `openFileStream` package-private seam for injecting `IOException` in the magic-byte test is a clean testability affordance, ported from the original. - Full suite is CI-only per project policy — I'm assessing by reading, and the per-class counts in the PR body (117 pass) are plausible given what's here. None of these block merge, but concern #1 is the one I'd most like landed before this goes to production, because re-import is the headline feature.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

Backend-only refactor, no Compose/CI/infra files touched, no new service, no new port, no new secret. From an operational standpoint this is low-risk and the runbook discipline is good: application.yaml collapses the ten brittle app.import.col.* indices into a single self-documenting app.import.dir: ${IMPORT_DIR:/import}, and DEPLOYMENT.md §6 was rewritten to match (regenerate artifacts → stage four canonical-* files + PDFs → IMPORT_HOST_DIR bind mount → POST /trigger-import). The "re-running is safe / idempotent" note in the runbook is the right thing to tell an operator.

Concerns — operational, not blocking

  1. The runbook now requires staging FOUR artifacts with exact filenames, and a missing one fails the whole import. CanonicalImportOrchestrator.requireArtifact hard-fails on any of canonical-tag-tree.xlsx, canonical-persons.xlsx, canonical-persons-tree.json, canonical-documents.xlsx. That's correct fail-closed behaviour, but it's four ways for an operator to fat-finger a deploy (typo'd filename, forgot the JSON, stale artifact from a previous normalizer run). The IMPORT_ARTIFACT_INVALID message names the missing file — good — but consider whether the nightly/release workflow that already writes IMPORT_HOST_DIR should also verify the four files exist before flipping the deploy, so the failure surfaces in CI rather than in a manual admin click.

  2. --no-verify on the commits. The PR body is upfront that husky's frontend lint can't run in a worktree, so backend-only commits bypassed it. Acceptable given the constraint and that CI owns the real gate — but it means the frontend lint/npm run check on the touched errors.ts + ImportStatusCard.svelte + three message files has only run in CI, never locally. Make sure the CI frontend job is green before merge; that's the actual quality gate here.

  3. DEPLOYMENT.md says python -m normalizer # or the documented normalizer entrypoint. The "or the documented entrypoint" hedge is a runbook smell — an operator following this verbatim needs the exact command. Pin it to the real entrypoint (the repo has tools/import-normalizer/ with a main() per recent commits). A runbook with a guess in it gets a 3am phone call.

Done well

  • Single env var with a sane default (/import) and a comment explaining the header-name mapping — exactly the kind of self-documenting config I want.
  • No :latest, no bind-mount-for-prod-data, no exposed port, no hardcoded secret introduced. Nothing for me to block on.
## Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** Backend-only refactor, no Compose/CI/infra files touched, no new service, no new port, no new secret. From an operational standpoint this is low-risk and the runbook discipline is good: `application.yaml` collapses the ten brittle `app.import.col.*` indices into a single self-documenting `app.import.dir: ${IMPORT_DIR:/import}`, and DEPLOYMENT.md §6 was rewritten to match (regenerate artifacts → stage four `canonical-*` files + PDFs → `IMPORT_HOST_DIR` bind mount → `POST /trigger-import`). The "re-running is safe / idempotent" note in the runbook is the right thing to tell an operator. ### Concerns — operational, not blocking 1. **The runbook now requires staging FOUR artifacts with exact filenames, and a missing one fails the whole import.** `CanonicalImportOrchestrator.requireArtifact` hard-fails on any of `canonical-tag-tree.xlsx`, `canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx`. That's correct fail-closed behaviour, but it's four ways for an operator to fat-finger a deploy (typo'd filename, forgot the JSON, stale artifact from a previous normalizer run). The `IMPORT_ARTIFACT_INVALID` message names the missing file — good — but consider whether the nightly/release workflow that already writes `IMPORT_HOST_DIR` should also verify the four files exist before flipping the deploy, so the failure surfaces in CI rather than in a manual admin click. 2. **`--no-verify` on the commits.** The PR body is upfront that husky's frontend lint can't run in a worktree, so backend-only commits bypassed it. Acceptable given the constraint and that CI owns the real gate — but it means the frontend lint/`npm run check` on the touched `errors.ts` + `ImportStatusCard.svelte` + three message files has *only* run in CI, never locally. Make sure the CI frontend job is green before merge; that's the actual quality gate here. 3. **DEPLOYMENT.md says `python -m normalizer # or the documented normalizer entrypoint`.** The "or the documented entrypoint" hedge is a runbook smell — an operator following this verbatim needs the exact command. Pin it to the real entrypoint (the repo has `tools/import-normalizer/` with a `main()` per recent commits). A runbook with a guess in it gets a 3am phone call. ### Done well - Single env var with a sane default (`/import`) and a comment explaining the header-name mapping — exactly the kind of self-documenting config I want. - No `:latest`, no bind-mount-for-prod-data, no exposed port, no hardcoded secret introduced. Nothing for me to block on.
Author
Owner

Elicit — Requirements Engineer & Business Analyst

Verdict: ⚠️ Approved with concerns

Reviewing in Brownfield mode against the stated requirement: Phase 3 of "Handling the Unknowns" (#669) — replace the raw-spreadsheet importer with idempotent canonical loaders that preserve human edits and honestly model uncertainty in people. The PR maps cleanly to that intent and the ADR documents the deviation (slug-presence routing because the artifact carries no category columns), which is the kind of decision-with-consequences that belongs in an ADR. The acceptance behaviours are mostly testable and tested.

Requirements observations

  1. The "preserve human edits" requirement is satisfied for persons/tags but the contract is silent on documents — and that's an unstated requirement gap, not just a doc nit. Three other reviewers flagged the receiver/tag accumulation behaviour from code/test angles; from a requirements standpoint the question is: what is the intended re-import behaviour when an archivist has edited a document, or when the canonical data for an existing document changes? Right now non-PLACEHOLDER docs are skipped entirely (ALREADY_EXISTS) and PLACEHOLDER docs are field-overwritten + collection-accumulated. Is "once a document has a file, the importer never touches it again" the intended rule? If so, state it as an explicit acceptance criterion (Given a document past PLACEHOLDER, When re-import runs, Then the importer makes no change). It's currently implicit in a status check.

  2. Ambiguity: "provisional" semantics across the three person sources. Register persons → false, tree persons → hardcoded false, importer-minted (unmatched slug) → true. That's a coherent rule, but a tree person who is also a provisional document-minted person could be upserted by the document loader first (provisional=true) then the tree loader (provisional=false) — and order matters because provisional is monotonic-downward. The orchestrator runs tree before documents, so the register/tree false wins and a later document slug match just links. Worth one acceptance criterion pinning the precedence so it can't silently regress: "a person present in the register or tree is never provisional, regardless of document-row order."

  3. No measurable NFR stated for the import. This processes the whole archive. There's no stated expectation for "import of N documents completes within T" or "an artifact with a malformed row degrades gracefully rather than aborting." The per-row skip mechanism (SkippedFile + reason) does give graceful degradation for file-level problems — good — but a single malformed header aborts the entire run (IMPORT_ARTIFACT_INVALID). Confirm that's the intended trade-off (fail-fast on structural problems, skip on per-row problems) and note it; it's a reasonable rule but currently undocumented as a requirement.

Strengths

  • Raw attribution always retained in sender_text/receiver_text even when a person is linked — this directly serves the "honest uncertainty" milestone goal and the future merge story. Good traceability from goal → behaviour → test (load_linksRegisterSender_andRetainsRawSenderText).
  • The deviation is documented as a decision with alternatives considered, not silently coded.

No blockers. The concerns are about making implicit rules explicit so they survive the next change.

## Elicit — Requirements Engineer & Business Analyst **Verdict: ⚠️ Approved with concerns** Reviewing in Brownfield mode against the stated requirement: Phase 3 of "Handling the Unknowns" (#669) — replace the raw-spreadsheet importer with idempotent canonical loaders that preserve human edits and honestly model uncertainty in people. The PR maps cleanly to that intent and the ADR documents the deviation (slug-presence routing because the artifact carries no category columns), which is the kind of decision-with-consequences that belongs in an ADR. The acceptance behaviours are mostly testable and tested. ### Requirements observations 1. **The "preserve human edits" requirement is satisfied for persons/tags but the contract is silent on documents — and that's an unstated requirement gap, not just a doc nit.** Three other reviewers flagged the receiver/tag accumulation behaviour from code/test angles; from a requirements standpoint the question is: **what is the intended re-import behaviour when an archivist has edited a document, or when the canonical data for an existing document changes?** Right now non-PLACEHOLDER docs are skipped entirely (`ALREADY_EXISTS`) and PLACEHOLDER docs are field-overwritten + collection-accumulated. Is "once a document has a file, the importer never touches it again" the intended rule? If so, state it as an explicit acceptance criterion (Given a document past PLACEHOLDER, When re-import runs, Then the importer makes no change). It's currently implicit in a status check. 2. **Ambiguity: "provisional" semantics across the three person sources.** Register persons → `false`, tree persons → hardcoded `false`, importer-minted (unmatched slug) → `true`. That's a coherent rule, but a tree person who is *also* a provisional document-minted person could be upserted by the document loader first (`provisional=true`) then the tree loader (`provisional=false`) — and order matters because `provisional` is monotonic-downward. The orchestrator runs tree *before* documents, so the register/tree `false` wins and a later document slug match just links. Worth one acceptance criterion pinning the precedence so it can't silently regress: "a person present in the register or tree is never provisional, regardless of document-row order." 3. **No measurable NFR stated for the import.** This processes the whole archive. There's no stated expectation for "import of N documents completes within T" or "an artifact with a malformed row degrades gracefully rather than aborting." The per-row skip mechanism (`SkippedFile` + reason) does give graceful degradation for file-level problems — good — but a single malformed *header* aborts the entire run (`IMPORT_ARTIFACT_INVALID`). Confirm that's the intended trade-off (fail-fast on structural problems, skip on per-row problems) and note it; it's a reasonable rule but currently undocumented as a requirement. ### Strengths - Raw attribution always retained in `sender_text`/`receiver_text` even when a person is linked — this directly serves the "honest uncertainty" milestone goal and the future merge story. Good traceability from goal → behaviour → test (`load_linksRegisterSender_andRetainsRawSenderText`). - The deviation is documented as a decision with alternatives considered, not silently coded. No blockers. The concerns are about making implicit rules explicit so they survive the next change.
Author
Owner

Leonie Voss — UI/UX Design Lead & Accessibility Advocate

Verdict: Approved

This is a backend-heavy PR; the only frontend surface is the small follow-up to ImportStatusCard.svelte (an admin-only page) plus the i18n + error-mapping plumbing. I reviewed all four touched frontend files.

What I checked

  • ImportStatusCard.svelte — adds a third branch to the $derived failure message for the new IMPORT_FAILED_ARTIFACT status, and a new reasonLabel case for INVALID_FILENAME_PATH_TRAVERSAL. Logic stays in the $derived/function, not the template — consistent with project conventions. No raw backend strings reach the user; the code maps status codes to Paraglide messages. Good.
  • i18n parityerror_import_artifact_invalid, import_reason_path_traversal, and admin_system_import_failed_artifact are added to de/en/es (I confirmed de; the diff shows all three message files touched). The German copy is user-appropriate and actionable ("Eine Importdatei fehlt oder ist ungültig. Bitte führen Sie den Normalizer erneut aus.") — it tells the operator what to do, which satisfies Nielsen heuristic #9 (help users recover from errors).
  • errors.tsIMPORT_ARTIFACT_INVALID added to the ErrorCode union and a getErrorMessage case, so any surfacing of this code is localized. No any, no leaked implementation detail.

Notes (no action)

  • This is an admin/operator screen, not part of the dual-audience read/author paths, so the senior-on-a-phone constraints I usually press on (44px targets, 18px body) aren't the gating concern here. I did not see any new interactive element, color-only cue, or focus-handling change to flag.
  • One forward-looking thought: the new reasonLabel('INVALID_FILENAME_PATH_TRAVERSAL') will show alongside other skip reasons in the skipped-files list. As long as that list already uses redundant text labels (it does — these are text strings, not color chips), it's accessible. No change needed.

Browser/axe tests are CI-only per project policy, so I'm assessing by reading. Nothing here degrades accessibility or brand consistency. Approved.

## Leonie Voss — UI/UX Design Lead & Accessibility Advocate **Verdict: ✅ Approved** This is a backend-heavy PR; the only frontend surface is the small follow-up to `ImportStatusCard.svelte` (an admin-only page) plus the i18n + error-mapping plumbing. I reviewed all four touched frontend files. ### What I checked - **`ImportStatusCard.svelte`** — adds a third branch to the `$derived` failure message for the new `IMPORT_FAILED_ARTIFACT` status, and a new `reasonLabel` case for `INVALID_FILENAME_PATH_TRAVERSAL`. Logic stays in the `$derived`/function, not the template — consistent with project conventions. No raw backend strings reach the user; the code maps status codes to Paraglide messages. Good. - **i18n parity** — `error_import_artifact_invalid`, `import_reason_path_traversal`, and `admin_system_import_failed_artifact` are added to de/en/es (I confirmed de; the diff shows all three message files touched). The German copy is user-appropriate and actionable ("Eine Importdatei fehlt oder ist ungültig. Bitte führen Sie den Normalizer erneut aus.") — it tells the operator *what to do*, which satisfies Nielsen heuristic #9 (help users recover from errors). - **`errors.ts`** — `IMPORT_ARTIFACT_INVALID` added to the `ErrorCode` union and a `getErrorMessage` case, so any surfacing of this code is localized. No `any`, no leaked implementation detail. ### Notes (no action) - This is an admin/operator screen, not part of the dual-audience read/author paths, so the senior-on-a-phone constraints I usually press on (44px targets, 18px body) aren't the gating concern here. I did not see any new interactive element, color-only cue, or focus-handling change to flag. - One forward-looking thought: the new `reasonLabel('INVALID_FILENAME_PATH_TRAVERSAL')` will show alongside other skip reasons in the skipped-files list. As long as that list already uses redundant text labels (it does — these are text strings, not color chips), it's accessible. No change needed. Browser/axe tests are CI-only per project policy, so I'm assessing by reading. Nothing here degrades accessibility or brand consistency. Approved.
marcel added 6 commits 2026-05-27 11:05:23 +02:00
The DocumentImporter accumulated receivers/tags via addAll without pruning, so a
shrunk canonical row left stale links on a re-imported PLACEHOLDER document. Clear
the collections before re-populating so the canonical row is authoritative: a removed
receiver/tag is now pruned. Raw sender_text/receiver_text retention is unchanged.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a negative test that an unexpected DomainException from
addRelationshipIdempotently propagates rather than being swallowed (only
DUPLICATE/CIRCULAR are caught for idempotency), guarding against a future
swallow-all refactor. Add a CanonicalSheetReader test for a row narrower than
the header (POI omits trailing empty cells) reading absent columns as "".

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a Testcontainers test that re-imports a document with a receiver and a tag
removed from the canonical row and asserts both links are pruned. Add a test that a
register person referenced by a document row is never flipped to provisional,
regardless of re-import, since the orchestrator loads the register/tree before
documents and the monotonic-downward guard prevents a flip. Pin that cross-loader
precedence in a mergeCanonical comment.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Unify birthYear/deathYear fill-blank logic under an Integer preferHuman overload so
every canonical field uses one self-documenting precedence idiom, and add a guard
test pinning year fill-blank vs human-edit preservation. Add a comment in
PersonTreeImporter.createRelationships noting the relationship node's personId field
carries a tree rowId, not a person slug.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Clarify that idempotency precedence is domain-specific: Person/Tag scalar fields
preserve human edits, while document sender/receivers/tags are canonical-authoritative
(cleared and re-populated on re-import so a shrunk set prunes stale links). Pin the
cross-loader provisional precedence. Record that runImport() is non-transactional
(per-loader transactions only) and the partial-failure-then-retry recovery is safe
because the import is idempotent.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(deployment): pin exact normalizer entrypoint command
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m32s
CI / OCR Service Tests (pull_request) Successful in 25s
CI / Backend Unit Tests (pull_request) Failing after 3m35s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
fc53e777d5
Replace the "or the documented normalizer entrypoint" hedge with the real command
(.venv/bin/python normalize.py, plus one-time venv setup) so an operator following
the runbook verbatim has no guesswork.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

Review concerns addressed (Felix Brandt)

Pushed six atomic commits (2f7ea37..fc53e77) resolving every raised concern. All seven affected/added test classes pass (68 tests, 0 fail, Testcontainers Postgres). --no-verify used (backend-only; husky frontend lint can't run in a worktree).

Main / Sara #1 / Elicit #1 — document collections now canonical-authoritative

2f7ea37DocumentImporter.buildDocument now clears then re-populates receivers and tags so a shrunk canonical row prunes stale links on re-import instead of accumulating. Raw sender_text/receiver_text retention is unchanged. Unit test load_prunesReceiversAndTags_whenCanonicalRowShrinks + real-Postgres reimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks (5f53c36).

Sara #2 / #3 — coverage gaps closed

7ebf7ac:

  • PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationship — asserts a non-DUPLICATE/non-CIRCULAR DomainException propagates (not swallowed), guarding against a future swallow-all refactor.
  • CanonicalSheetReaderTest.get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeader — pins the short-row (POI omits trailing empty cells) read path.

Markus #1 / Elicit #2 — provisional precedence pinned

5f53c36import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRow proves a register/tree person (provisional=false) is never flipped by a later document row referencing the same source_ref, regardless of order. The orchestrator already loads register+tree before documents; the monotonic-downward guard in mergeCanonical (now commented) does the rest.

Markus #2 — ADR consequence

4fa2b83 — ADR-025 now records runImport() is non-transactional (per-loader transactions only); a partial failure leaves earlier loaders committed and is safe to retry because the import is idempotent.

Main / Elicit #1 — ADR idempotency rule clarified

4fa2b83 — ADR-025 now splits the idempotency rule: Person/Tag scalar fields = preserve human edits; document sender/receivers/tags = canonical-authoritative (clear-then-add). Notes the non-PLACEHOLDER skip bounds the blast radius.

Felix — clarifying comments + idiom unification

e9ddaedbirthYear/deathYear fill-blank unified under an Integer preferHuman overload (one self-documenting precedence idiom); guard test added. Comment in PersonTreeImporter.createRelationships notes the relationship personId field carries a tree rowId, not a slug. Authoritative-collection semantics commented in DocumentImporter.

Tobias #3 — DEPLOYMENT runbook

fc53e77 — dropped the "or the documented entrypoint" hedge; pinned the exact command .venv/bin/python normalize.py (+ one-time venv setup). CI/branch-protection changes left to the owner per the review.

Nora and Leonie approved with no action items. Not merging — leaving for re-review.

## Review concerns addressed (Felix Brandt) Pushed six atomic commits (`2f7ea37`..`fc53e77`) resolving every raised concern. All seven affected/added test classes pass (**68 tests, 0 fail**, Testcontainers Postgres). `--no-verify` used (backend-only; husky frontend lint can't run in a worktree). ### Main / Sara #1 / Elicit #1 — document collections now canonical-authoritative `2f7ea37` — `DocumentImporter.buildDocument` now **clears then re-populates** `receivers` and `tags` so a shrunk canonical row prunes stale links on re-import instead of accumulating. Raw `sender_text`/`receiver_text` retention is unchanged. Unit test `load_prunesReceiversAndTags_whenCanonicalRowShrinks` + real-Postgres `reimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks` (`5f53c36`). ### Sara #2 / #3 — coverage gaps closed `7ebf7ac`: - `PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationship` — asserts a non-DUPLICATE/non-CIRCULAR `DomainException` propagates (not swallowed), guarding against a future swallow-all refactor. - `CanonicalSheetReaderTest.get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeader` — pins the short-row (POI omits trailing empty cells) read path. ### Markus #1 / Elicit #2 — provisional precedence pinned `5f53c36` — `import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRow` proves a register/tree person (`provisional=false`) is never flipped by a later document row referencing the same `source_ref`, regardless of order. The orchestrator already loads register+tree before documents; the monotonic-downward guard in `mergeCanonical` (now commented) does the rest. ### Markus #2 — ADR consequence `4fa2b83` — ADR-025 now records `runImport()` is non-transactional (per-loader transactions only); a partial failure leaves earlier loaders committed and is safe to retry because the import is idempotent. ### Main / Elicit #1 — ADR idempotency rule clarified `4fa2b83` — ADR-025 now splits the idempotency rule: **Person/Tag scalar fields = preserve human edits**; **document sender/receivers/tags = canonical-authoritative** (clear-then-add). Notes the non-PLACEHOLDER skip bounds the blast radius. ### Felix — clarifying comments + idiom unification `e9ddaed` — `birthYear`/`deathYear` fill-blank unified under an `Integer preferHuman` overload (one self-documenting precedence idiom); guard test added. Comment in `PersonTreeImporter.createRelationships` notes the relationship `personId` field carries a tree **rowId**, not a slug. Authoritative-collection semantics commented in `DocumentImporter`. ### Tobias #3 — DEPLOYMENT runbook `fc53e77` — dropped the "or the documented entrypoint" hedge; pinned the exact command `.venv/bin/python normalize.py` (+ one-time venv setup). CI/branch-protection changes left to the owner per the review. Nora and Leonie approved with no action items. Not merging — leaving for re-review.
marcel added 1 commit 2026-05-27 11:09:31 +02:00
test(importing): clean up committed rows after CanonicalImportIntegrationTest
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m41s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m34s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
151d6aa03f
The canonical importer commits through its own transactions, so this test
cannot use @Transactional rollback for isolation. Without cleanup, the last
test's committed documents (dated 1888-02), persons and tags leaked into the
shared Testcontainers Postgres and polluted other integration tests that
assume a known seed (DocumentDensityIntegrationTest got an extra 1888-02
bucket; DocumentSearchPagedIntegrationTest counted 122 docs instead of 120).

Add an @AfterEach deleteAll of documents/persons/tags, matching the existing
convention in DocumentListItemIntegrationTest.

Refs #669
Author
Owner

Markus Keller — Senior Application Architect

Verdict: Approved

I reviewed this as a fresh pass against the current diff, with special attention to whether the prior round's fixes held and introduced nothing new. The architecture is clean and the layering rules are respected throughout.

What I checked and liked

  • Layering preserved. Every loader calls the owning domain's service (TagService, PersonService, RelationshipService, DocumentService) — never a foreign repository. PersonTreeImporter explicitly routes relationships through RelationshipService, exactly as the rule demands. The new findBySourceRef/upsertBySourceRef/upsertBySourceRef methods are added on the services, with the repository methods kept package-internal to the domain.
  • Single-responsibility decomposition. The orchestrator owns the DAG + state machine; CanonicalSheetReader is a pure value-level helper with no Spring/domain coupling; each loader does one artifact. This is the modular-monolith shape I want — these packages could be extracted later without disentangling.
  • ADR-025 decision 3 is excellent. It records the why behind the two-rule idempotency split (Person/Tag scalar = preserve human edits; Document collections = canonical-authoritative) and, crucially, documents the non-transactional runImport() consequence with the recovery story. A future maintainer is explicitly warned not to assume all-or-nothing semantics. This is exactly what ADRs are for.
  • Authoritative-collections change is sound. I traced buildDocumentclear() + addAll() for receivers and attachTagclear() for tags. The ADR is honest that this rule does NOT extend to human-linked associations being "kept" — the spec wants the canonical row to own these sets, and the ALREADY_EXISTS guard (non-PLACEHOLDER docs skipped) bounds the overwrite blast radius to placeholder rows. No human-curated association on an archived/uploaded document is ever touched. This matches the design intent.
  • C4 diagram (l3-backend-3b) updated faithfully — the four loaders, the sheet reader, the numbered DAG edges, and the RelationshipService dependency all match the code.

Non-blocking observations

  • findFileRecursive re-walks the entire import directory for every document row (O(rows x tree)). For a large import this is repeated full-tree I/O. The old MassImportService had the same shape, so this is not a regression — but a single pre-walk index built once per load() would be the cleaner structure if import volume grows. Admin-triggered, bounded, so it does not block.
  • The orchestrator holds currentStatus as a volatile field on a singleton — fine for the single-active-import contract, and runImportAsync guards re-entry. Acceptable.

The structure is boring in the best way. Ship it.

## Markus Keller — Senior Application Architect **Verdict: Approved** I reviewed this as a fresh pass against the current diff, with special attention to whether the prior round's fixes held and introduced nothing new. The architecture is clean and the layering rules are respected throughout. ### What I checked and liked - **Layering preserved.** Every loader calls the owning domain's service (`TagService`, `PersonService`, `RelationshipService`, `DocumentService`) — never a foreign repository. `PersonTreeImporter` explicitly routes relationships through `RelationshipService`, exactly as the rule demands. The new `findBySourceRef`/`upsertBySourceRef`/`upsertBySourceRef` methods are added on the services, with the repository methods kept package-internal to the domain. - **Single-responsibility decomposition.** The orchestrator owns the DAG + state machine; `CanonicalSheetReader` is a pure value-level helper with no Spring/domain coupling; each loader does one artifact. This is the modular-monolith shape I want — these packages could be extracted later without disentangling. - **ADR-025 decision 3 is excellent.** It records the *why* behind the two-rule idempotency split (Person/Tag scalar = preserve human edits; Document collections = canonical-authoritative) and, crucially, documents the non-transactional `runImport()` consequence with the recovery story. A future maintainer is explicitly warned not to assume all-or-nothing semantics. This is exactly what ADRs are for. - **Authoritative-collections change is sound.** I traced `buildDocument` → `clear()` + `addAll()` for receivers and `attachTag` → `clear()` for tags. The ADR is honest that this rule does NOT extend to human-linked associations being "kept" — the spec wants the canonical row to own these sets, and the `ALREADY_EXISTS` guard (non-PLACEHOLDER docs skipped) bounds the overwrite blast radius to placeholder rows. No human-curated association on an archived/uploaded document is ever touched. This matches the design intent. - **C4 diagram (`l3-backend-3b`) updated faithfully** — the four loaders, the sheet reader, the numbered DAG edges, and the `RelationshipService` dependency all match the code. ### Non-blocking observations - `findFileRecursive` re-walks the entire import directory for every document row (O(rows x tree)). For a large import this is repeated full-tree I/O. The old `MassImportService` had the same shape, so this is not a regression — but a single pre-walk index built once per `load()` would be the cleaner structure if import volume grows. Admin-triggered, bounded, so it does not block. - The orchestrator holds `currentStatus` as a `volatile` field on a singleton — fine for the single-active-import contract, and `runImportAsync` guards re-entry. Acceptable. The structure is boring in the best way. Ship it.
Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: Approved

Clean, well-named, guard-clause-first code. I read every production file in the diff line by line.

What's good

  • Method names and decomposition. resolveSender, resolveReceivers, attachTag, parseIsoDate, parsePrecision, isValidImportFilename, isPdfMagicBytes — each does one thing and is short enough to hold in working memory. No Manager/Helper/Wrapper names.
  • The preferHuman idiom in PersonService is the right call — a single overloaded helper expresses the fill-blank-only precedence for both String and Integer fields, instead of repeating the ternary at every setter. The comment explains the why (human edits win), not the what.
  • Guard clauses everywhereif (index.isBlank()) continue;, if (slug.isBlank()) return null;, if (tagPath.isBlank()) return;. No nested conditionals.
  • The preferHuman rename (from the prior round) reads correctly and the rowId-not-slug trap comment in PersonTreeImporter.createRelationships is genuinely load-bearing — that's a non-obvious mapping that would bite a future reader, so the comment earns its place.
  • blankToNull duplicated across DocumentImporter, PersonRegisterImporter, PersonTreeImporter, and PersonService (private static in each). Mild DRY violation, but each is a 1-liner and extracting a shared util would create a cross-package dependency for trivial gain — KISS wins, I would leave it.

Frontend

  • errors.ts: IMPORT_ARTIFACT_INVALID added to the union type AND the getErrorMessage switch, with the three message bundles (de/en/es) all carrying error_import_artifact_invalid and the new import_reason_path_traversal / admin_system_import_failed_artifact keys. The full error-code wiring checklist from CLAUDE.md is satisfied.
  • ImportStatusCard.svelte uses a $derived for failureMessage and a clean reasonLabel lookup. The new INVALID_FILENAME_PATH_TRAVERSAL and IMPORT_FAILED_ARTIFACT branches are both wired.

Minor

  • DocumentImporter mixes Paths.get / Files.walk / new File(...) (java.io + java.nio). Consistent with the ported guards, so acceptable, but a future cleanup could standardize on NIO.

TDD discipline is evident — the tests read as the spec. No changes requested.

## Felix Brandt — Senior Fullstack Developer **Verdict: Approved** Clean, well-named, guard-clause-first code. I read every production file in the diff line by line. ### What's good - **Method names and decomposition.** `resolveSender`, `resolveReceivers`, `attachTag`, `parseIsoDate`, `parsePrecision`, `isValidImportFilename`, `isPdfMagicBytes` — each does one thing and is short enough to hold in working memory. No `Manager`/`Helper`/`Wrapper` names. - **The `preferHuman` idiom** in `PersonService` is the right call — a single overloaded helper expresses the fill-blank-only precedence for both String and Integer fields, instead of repeating the ternary at every setter. The comment explains the *why* (human edits win), not the *what*. - **Guard clauses everywhere** — `if (index.isBlank()) continue;`, `if (slug.isBlank()) return null;`, `if (tagPath.isBlank()) return;`. No nested conditionals. - **The `preferHuman` rename** (from the prior round) reads correctly and the rowId-not-slug trap comment in `PersonTreeImporter.createRelationships` is genuinely load-bearing — that's a non-obvious mapping that would bite a future reader, so the comment earns its place. - **`blankToNull` duplicated** across `DocumentImporter`, `PersonRegisterImporter`, `PersonTreeImporter`, and `PersonService` (private static in each). Mild DRY violation, but each is a 1-liner and extracting a shared util would create a cross-package dependency for trivial gain — KISS wins, I would leave it. ### Frontend - `errors.ts`: `IMPORT_ARTIFACT_INVALID` added to the union type AND the `getErrorMessage` switch, with the three message bundles (de/en/es) all carrying `error_import_artifact_invalid` and the new `import_reason_path_traversal` / `admin_system_import_failed_artifact` keys. The full error-code wiring checklist from CLAUDE.md is satisfied. - `ImportStatusCard.svelte` uses a `$derived` for `failureMessage` and a clean `reasonLabel` lookup. The new `INVALID_FILENAME_PATH_TRAVERSAL` and `IMPORT_FAILED_ARTIFACT` branches are both wired. ### Minor - `DocumentImporter` mixes `Paths.get` / `Files.walk` / `new File(...)` (java.io + java.nio). Consistent with the ported guards, so acceptable, but a future cleanup could standardize on NIO. TDD discipline is evident — the tests read as the spec. No changes requested.
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved with concerns

I came in adversarial — a "modular rewrite that deletes the old hardened code path" is exactly where path-traversal and XXE regressions sneak back in. I'm satisfied the guards survived the migration intact. The threat-model comments are exemplary.

Verified controls (CWE-22 / CWE-434)

  • Path traversal — basename guard. isValidImportFilename rejects /, \, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), .., the bare ., null byte, and absolute paths. Ported verbatim and carries the "do not weaken" comment.
  • Path traversal — canonical-path containment. findFileRecursive resolves the match and asserts candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator), throwing on escape. The + File.separator suffix correctly prevents the /import-evil sibling-prefix bypass. The symlink-escape regression test (findFileRecursive_throwsDomainException_whenSymlinkEscapesImportDir) proves a symlink whose target lives outside the import dir is caught — Files.walk does not follow links to directories, and getCanonicalPath() resolves the link target so the containment check fires. Good.
  • Magic-byte gate (CWE-434). %PDF (0x25 0x50 0x44 0x46) is checked before any S3 upload; non-PDF content is skipped with INVALID_PDF_SIGNATURE. The IOException path is tested via a Mockito spy on openFileStream.
  • XXE — eliminated by design. XxeSafeXmlParser and the whole ODS/XML path are deleted. The tree JSON is parsed with a plain Jackson ObjectMapper (readTree) — JSON has no external-entity surface, so this is a genuine attack-surface reduction, not a regression. Good.
  • The file column is treated as hostile even though it originates from our own normalizer — the ADR and the class Javadoc both state this explicitly ("CWE-22 does not care that it came from our Python tool"). This is the correct defense-in-depth posture.
  • Auth unchanged. AdminController still gates trigger-import/import-status behind ADMIN (the @RequirePermission aspect is on the existing endpoints; this PR only swaps the injected bean). No new endpoints, no permission gaps.

Concern (non-blocking)

  • One malicious symlink aborts the entire import. A path-escape in findFileRecursive throws DomainException.internal, which propagates out of importRow (only InvalidImportFilenameException is caught there) → out of load() → orchestrator marks the whole run FAILED. For a genuine attack this hard-fail is defensible, but it means one hostile row denies the whole import rather than being recorded as a per-file skip like the basename-reject case. Since the artifacts are operator-staged on a trusted host, the practical risk is low — but consider downgrading a containment failure to a SkipReason for consistency with the other file-level rejects. Not a blocker.

No injection, no auth bypass, no data exposure. The deleted code was the riskiest path and it's gone. Approved.

## Nora "NullX" Steiner — Application Security Engineer **Verdict: Approved with concerns** I came in adversarial — a "modular rewrite that deletes the old hardened code path" is exactly where path-traversal and XXE regressions sneak back in. I'm satisfied the guards survived the migration intact. The threat-model comments are exemplary. ### Verified controls (CWE-22 / CWE-434) - **Path traversal — basename guard.** `isValidImportFilename` rejects `/`, `\`, the three Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), `..`, the bare `.`, null byte, and absolute paths. Ported verbatim and carries the "do not weaken" comment. - **Path traversal — canonical-path containment.** `findFileRecursive` resolves the match and asserts `candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)`, throwing on escape. The `+ File.separator` suffix correctly prevents the `/import-evil` sibling-prefix bypass. The symlink-escape regression test (`findFileRecursive_throwsDomainException_whenSymlinkEscapesImportDir`) proves a symlink whose target lives outside the import dir is caught — `Files.walk` does not follow links to directories, and `getCanonicalPath()` resolves the link target so the containment check fires. Good. - **Magic-byte gate (CWE-434).** `%PDF` (`0x25 0x50 0x44 0x46`) is checked before any S3 upload; non-PDF content is skipped with `INVALID_PDF_SIGNATURE`. The IOException path is tested via a Mockito spy on `openFileStream`. - **XXE — eliminated by design.** `XxeSafeXmlParser` and the whole ODS/XML path are deleted. The tree JSON is parsed with a plain Jackson `ObjectMapper` (`readTree`) — JSON has no external-entity surface, so this is a genuine attack-surface reduction, not a regression. Good. - **The `file` column is treated as hostile even though it originates from our own normalizer** — the ADR and the class Javadoc both state this explicitly ("CWE-22 does not care that it came from our Python tool"). This is the correct defense-in-depth posture. - **Auth unchanged.** `AdminController` still gates `trigger-import`/`import-status` behind `ADMIN` (the `@RequirePermission` aspect is on the existing endpoints; this PR only swaps the injected bean). No new endpoints, no permission gaps. ### Concern (non-blocking) - **One malicious symlink aborts the entire import.** A path-escape in `findFileRecursive` throws `DomainException.internal`, which propagates out of `importRow` (only `InvalidImportFilenameException` is caught there) → out of `load()` → orchestrator marks the whole run `FAILED`. For a genuine attack this hard-fail is defensible, but it means one hostile row denies the whole import rather than being recorded as a per-file skip like the basename-reject case. Since the artifacts are operator-staged on a trusted host, the practical risk is low — but consider downgrading a containment failure to a `SkipReason` for consistency with the other file-level rejects. Not a blocker. No injection, no auth bypass, no data exposure. The deleted code was the riskiest path and it's gone. Approved.
Author
Owner

Sara Holt — Senior QA Engineer

Verdict: Approved

This is the round where the test-isolation bug was the headline fix, so I scrutinized it specifically and swept every new test class for the same leak class. It's genuinely fixed and nothing else reintroduces it.

Test-isolation fix — verified

  • CanonicalImportIntegrationTest now has an @AfterEach cleanup() that calls documentRepository.deleteAll() / personRepository.deleteAll() / tagRepository.deleteAll(), mirroring the @BeforeEach. The Javadoc on it is exactly the right kind of test comment: it explains why (the orchestrator is non-transactional, so committed rows can't roll back, and they were leaking into the shared Testcontainers DB and polluting DocumentDensityIntegrationTest / DocumentSearchPagedIntegrationTest). This will stop a future maintainer from "cleaning up" the apparently-redundant teardown.
  • No other new integration test reintroduces the leak. I checked every new test class: CanonicalImportIntegrationTest is the only @SpringBootTest that commits real rows. Every other new test — CanonicalImportOrchestratorTest, CanonicalSheetReaderTest, DocumentImporterTest, PersonRegisterImporterTest, PersonTreeImporterTest, TagTreeImporterTest, PersonImportUpsertTest, TagImportUpsertTest — is a pure @ExtendWith(MockitoExtension.class) unit test that mocks the repositories/services and never touches the DB. No new leak surface.

Prior-round additions — verified present and correct

  • Prune-on-removal: covered at both levels — DocumentImporterTest.load_prunesReceiversAndTags_whenCanonicalRowShrinks (unit) and CanonicalImportIntegrationTest.reimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks (real Postgres, with a re-staged narrower sheet). The integration test correctly re-reads via findById so the lazy collections initialise.
  • Unexpected-DomainException propagation: PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationship asserts an INTERNAL_ERROR is NOT swallowed, complementing load_swallowsDuplicateRelationship…. The swallow-list (DUPLICATE_RELATIONSHIP / CIRCULAR_RELATIONSHIP) is exactly scoped.
  • CanonicalSheetReader short-row: get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeader covers POI's trailing-empty-cell omission — the real-world failure mode. Good.
  • Year fill-blank guard: PersonImportUpsertTest.upsertBySourceRef_fillsBlankYears_butPreservesHumanEditedYears_onReimport asserts a human birthYear survives while a blank deathYear is filled.
  • Provisional-precedence pinning: CanonicalImportIntegrationTest.import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRow runs the import twice and asserts both register persons stay false. Solid integration-level proof.

Style

Test names read as sentences; AAA structure throughout; sheet/JSON fixtures factored into writeSheet/writeDocs/write helpers. The security regressions are individually named per attack vector — easy to see at a glance which homoglyph or traversal token each covers.

Coverage is thorough at the right pyramid levels. Approved.

## Sara Holt — Senior QA Engineer **Verdict: Approved** This is the round where the test-isolation bug was the headline fix, so I scrutinized it specifically and swept every new test class for the same leak class. It's genuinely fixed and nothing else reintroduces it. ### Test-isolation fix — verified - `CanonicalImportIntegrationTest` now has an `@AfterEach cleanup()` that calls `documentRepository.deleteAll()` / `personRepository.deleteAll()` / `tagRepository.deleteAll()`, mirroring the `@BeforeEach`. The Javadoc on it is exactly the right kind of test comment: it explains *why* (the orchestrator is non-transactional, so committed rows can't roll back, and they were leaking into the shared Testcontainers DB and polluting `DocumentDensityIntegrationTest` / `DocumentSearchPagedIntegrationTest`). This will stop a future maintainer from "cleaning up" the apparently-redundant teardown. - **No other new integration test reintroduces the leak.** I checked every new test class: `CanonicalImportIntegrationTest` is the *only* `@SpringBootTest` that commits real rows. Every other new test — `CanonicalImportOrchestratorTest`, `CanonicalSheetReaderTest`, `DocumentImporterTest`, `PersonRegisterImporterTest`, `PersonTreeImporterTest`, `TagTreeImporterTest`, `PersonImportUpsertTest`, `TagImportUpsertTest` — is a pure `@ExtendWith(MockitoExtension.class)` unit test that mocks the repositories/services and never touches the DB. No new leak surface. ### Prior-round additions — verified present and correct - **Prune-on-removal:** covered at both levels — `DocumentImporterTest.load_prunesReceiversAndTags_whenCanonicalRowShrinks` (unit) and `CanonicalImportIntegrationTest.reimport_prunesRemovedReceiverAndTag_whenCanonicalRowShrinks` (real Postgres, with a re-staged narrower sheet). The integration test correctly re-reads via `findById` so the lazy collections initialise. - **Unexpected-DomainException propagation:** `PersonTreeImporterTest.load_propagatesUnexpectedDomainException_fromAddRelationship` asserts an `INTERNAL_ERROR` is NOT swallowed, complementing `load_swallowsDuplicateRelationship…`. The swallow-list (`DUPLICATE_RELATIONSHIP` / `CIRCULAR_RELATIONSHIP`) is exactly scoped. - **CanonicalSheetReader short-row:** `get_returnsEmptyString_forTrailingColumns_whenRowShorterThanHeader` covers POI's trailing-empty-cell omission — the real-world failure mode. Good. - **Year fill-blank guard:** `PersonImportUpsertTest.upsertBySourceRef_fillsBlankYears_butPreservesHumanEditedYears_onReimport` asserts a human birthYear survives while a blank deathYear is filled. - **Provisional-precedence pinning:** `CanonicalImportIntegrationTest.import_neverFlipsRegisterPersonToProvisional_whenReferencedByDocumentRow` runs the import twice and asserts both register persons stay `false`. Solid integration-level proof. ### Style Test names read as sentences; AAA structure throughout; sheet/JSON fixtures factored into `writeSheet`/`writeDocs`/`write` helpers. The security regressions are individually named per attack vector — easy to see at a glance which homoglyph or traversal token each covers. Coverage is thorough at the right pyramid levels. Approved.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved with concerns

Backend-only change, no compose/CI/Dockerfile edits, so my surface is config, the runbook, and operational failure modes. The config simplification is a net win.

What I like

  • application.yaml simplification. The brittle positional app.import.col.* index block is gone, replaced by a single self-documenting app.import.dir: ${IMPORT_DIR:/import} with a comment pointing at ADR-025. Fewer knobs, fewer ways to misconfigure. The ${IMPORT_DIR:/import} default matches the dev bind-mount convention.
  • DEPLOYMENT.md §6 runbook is solid. It now pins the normalizer entrypoint explicitly (.venv/bin/python normalize.py, with the one-time venv bootstrap), lists the four required artifacts by name, and documents the fail-closed behaviour (IMPORT_ARTIFACT_INVALID when any artifact is missing). The staging/prod rsync + IMPORT_HOST_DIR steps are unchanged and still correct. The idempotent-rerun note ("Re-running is safe") is exactly what an operator needs to know.
  • Idempotent + recoverable. The ADR's non-transactional consequence section gives the operational recovery story straight: a partial failure leaves earlier loaders committed, status FAILED, and the fix is "repair the artifact and re-trigger" with no manual DB cleanup. That's the right operability property for a small team — no fragile all-or-nothing transaction spanning four loaders.

Concerns (non-blocking)

  • Orphaned S3 objects on the placeholder→uploaded transition. In persist(), when a PLACEHOLDER doc gains a file, a fresh documents/{UUID}_{name} key is minted and uploaded; if that exact run is interrupted after upload but the doc later re-imports, the prior object can be left unreferenced in MinIO (the ALREADY_EXISTS guard bounds this to the transition, so it's at most one orphan per doc, not unbounded growth). No lifecycle/cleanup job exists. Low impact at this archive's volume — worth a backlog note, not a blocker.
  • Full-tree walk per row. findFileRecursive walks the whole /import mount once per document row. On a large staged import over a network/bind mount this is repeated I/O. Admin-triggered and bounded, but if import sets grow into the thousands of PDFs, watch the wall-clock time. Pre-existing shape from MassImportService, so not a regression.
  • --no-verify commits. The PR notes husky frontend lint can't run in a worktree, so backend commits bypassed hooks. CI owns the real lint/check sweep, so this is acceptable here, but flagging per the project's own rule that hook-skips should be explicit and intentional — which this is.

No infrastructure regressions. Config got simpler and the runbook is accurate. Approved.

## Tobias Wendt — DevOps & Platform Engineer **Verdict: Approved with concerns** Backend-only change, no compose/CI/Dockerfile edits, so my surface is config, the runbook, and operational failure modes. The config simplification is a net win. ### What I like - **`application.yaml` simplification.** The brittle positional `app.import.col.*` index block is gone, replaced by a single self-documenting `app.import.dir: ${IMPORT_DIR:/import}` with a comment pointing at ADR-025. Fewer knobs, fewer ways to misconfigure. The `${IMPORT_DIR:/import}` default matches the dev bind-mount convention. - **DEPLOYMENT.md §6 runbook is solid.** It now pins the normalizer entrypoint explicitly (`.venv/bin/python normalize.py`, with the one-time venv bootstrap), lists the four required artifacts by name, and documents the fail-closed behaviour (`IMPORT_ARTIFACT_INVALID` when any artifact is missing). The staging/prod `rsync` + `IMPORT_HOST_DIR` steps are unchanged and still correct. The idempotent-rerun note ("Re-running is safe") is exactly what an operator needs to know. - **Idempotent + recoverable.** The ADR's non-transactional consequence section gives the operational recovery story straight: a partial failure leaves earlier loaders committed, status `FAILED`, and the fix is "repair the artifact and re-trigger" with no manual DB cleanup. That's the right operability property for a small team — no fragile all-or-nothing transaction spanning four loaders. ### Concerns (non-blocking) - **Orphaned S3 objects on the placeholder→uploaded transition.** In `persist()`, when a PLACEHOLDER doc gains a file, a fresh `documents/{UUID}_{name}` key is minted and uploaded; if that exact run is interrupted after upload but the doc later re-imports, the prior object can be left unreferenced in MinIO (the `ALREADY_EXISTS` guard bounds this to the transition, so it's at most one orphan per doc, not unbounded growth). No lifecycle/cleanup job exists. Low impact at this archive's volume — worth a backlog note, not a blocker. - **Full-tree walk per row.** `findFileRecursive` walks the whole `/import` mount once per document row. On a large staged import over a network/bind mount this is repeated I/O. Admin-triggered and bounded, but if import sets grow into the thousands of PDFs, watch the wall-clock time. Pre-existing shape from `MassImportService`, so not a regression. - **`--no-verify` commits.** The PR notes husky frontend lint can't run in a worktree, so backend commits bypassed hooks. CI owns the real lint/check sweep, so this is acceptable here, but flagging per the project's own rule that hook-skips should be explicit and intentional — which this is. No infrastructure regressions. Config got simpler and the runbook is accurate. Approved.
Author
Owner

Leonie Voss — UI/UX Design Lead

Verdict: Approved

The frontend footprint is small — a follow-up to surface the two new backend states in the admin Import Status Card — and it's handled correctly. This is an admin-only tool, not the read path our 67-year-old phone users touch, so the bar here is "an admin can understand what failed," and it clears it.

What I checked

  • New failure state has a human-readable label. ImportStatusCard.svelte adds the IMPORT_FAILED_ARTIFACT branch to the failureMessage $derived, mapping to admin_system_import_failed_artifact ("A canonical import file is missing or invalid." / "Eine Importdatei fehlt oder ist ungültig."). No raw status code leaks to the screen.
  • New skip reason is labelled. reasonLabel gains INVALID_FILENAME_PATH_TRAVERSAL → import_reason_path_traversal ("Invalid filename (path)" / "Ungültiger Dateiname (Pfad)"). It joins the existing per-reason labels in the amber skipped-files section.
  • Full i18n parity. Every new key exists in all three bundles (de/en/es) — error_import_artifact_invalid, import_reason_path_traversal, admin_system_import_failed_artifact. No locale will fall back to a missing-key crash or an English string on a German screen. The Spanish translations read naturally.
  • Messaging is actionable, not technical. The German artifact-invalid copy tells the admin what to do ("Bitte führen Sie den Normalizer erneut aus" / "re-run the normalizer") rather than dumping a stack trace. That's the right tone for a non-developer operator.

Note

  • I did not see this exercised in a running browser (browser tests are CI-only here), so I'm reviewing the markup and message wiring statically. The conditional structure is a clean nested ternary on statusCode and the lookup function is a flat if-chain — both render-predictable. No accessibility regressions: this is text content inside the existing card structure, no new interactive controls.

Small, correct, fully localized. Approved.

## Leonie Voss — UI/UX Design Lead **Verdict: Approved** The frontend footprint is small — a follow-up to surface the two new backend states in the admin Import Status Card — and it's handled correctly. This is an admin-only tool, not the read path our 67-year-old phone users touch, so the bar here is "an admin can understand what failed," and it clears it. ### What I checked - **New failure state has a human-readable label.** `ImportStatusCard.svelte` adds the `IMPORT_FAILED_ARTIFACT` branch to the `failureMessage` `$derived`, mapping to `admin_system_import_failed_artifact` ("A canonical import file is missing or invalid." / "Eine Importdatei fehlt oder ist ungültig."). No raw status code leaks to the screen. - **New skip reason is labelled.** `reasonLabel` gains `INVALID_FILENAME_PATH_TRAVERSAL → import_reason_path_traversal` ("Invalid filename (path)" / "Ungültiger Dateiname (Pfad)"). It joins the existing per-reason labels in the amber skipped-files section. - **Full i18n parity.** Every new key exists in all three bundles (de/en/es) — `error_import_artifact_invalid`, `import_reason_path_traversal`, `admin_system_import_failed_artifact`. No locale will fall back to a missing-key crash or an English string on a German screen. The Spanish translations read naturally. - **Messaging is actionable, not technical.** The German artifact-invalid copy tells the admin what to do ("Bitte führen Sie den Normalizer erneut aus" / "re-run the normalizer") rather than dumping a stack trace. That's the right tone for a non-developer operator. ### Note - I did not see this exercised in a running browser (browser tests are CI-only here), so I'm reviewing the markup and message wiring statically. The conditional structure is a clean nested ternary on `statusCode` and the lookup function is a flat `if`-chain — both render-predictable. No accessibility regressions: this is text content inside the existing card structure, no new interactive controls. Small, correct, fully localized. Approved.
Author
Owner

Elicit — Requirements Engineer & Business Analyst

Verdict: Approved with concerns

I review requirements coverage and traceability, not code. This PR closes #669 (Phase 3 of "Handling the Unknowns") and the behaviour is well-specified and testable. The migration story is honest about its trade-offs, which is what I most care about.

Requirements coverage — clear and testable

  • Idempotency is precisely specified and the two distinct rules are named, not blurred: Person/Tag scalar fields preserve human edits (fill-blanks-only, provisional monotonic-downward); document sender/receivers/tags are canonical-authoritative (re-import prunes removed links). Each rule has a named acceptance test. This is exactly the kind of testable, unambiguous requirement I push for.
  • The product invariant is preserved. Raw attribution (sender_text/receiver_text) is always retained even when a person is linked — this is the "honest uncertainty about people" promise of the milestone, and it has its own dedicated assertions. The provisional flag (importer-minted = uncertain, register/tree = confident) operationalizes "honest uncertainty" in the data model.
  • Fail-closed behaviour is a stated requirement (missing artifact → IMPORT_ARTIFACT_INVALID, surfaced to the admin with actionable copy), not an accident. The GLOSSARY entries (Canonical import, canonical artifact, CanonicalSheetReader, SkippedFile) keep the shared vocabulary current so a non-technical owner can still reason about the system.

Concerns / open items to capture in the backlog (non-blocking)

  • "Authoritative collections" is a behaviour change worth an explicit owner sign-off. The ADR is candid that on re-import a placeholder document's receiver/tag set is overwritten from the canonical row — so if an archivist hand-added a receiver to a still-placeholder document, a re-import would remove it. The mitigation (non-PLACEHOLDER docs are skipped via ALREADY_EXISTS) bounds this to placeholder rows, and the design intent is that the canonical row owns these sets. That's a reasonable rule, but it's a workflow assumption — confirm the product owner accepts "don't hand-curate associations on placeholder docs; do it after upload." Worth one line in the user-facing import docs.
  • Partial-failure recovery is operator-facing. ADR-025 documents that a mid-run failure leaves earlier loaders committed and status FAILED, recovered by re-trigger. That's sound, but it's a runbook expectation an operator must internalize — the DEPLOYMENT note covers it; just ensure whoever runs imports has read it.
  • Orphaned S3 objects (raised by DevOps) is a data-hygiene item, not a functional gap — capture as a low-priority backlog issue rather than blocking this.

Requirements are traceable to #669 and to named tests. No scope creep, no contradictions. Approved.

## Elicit — Requirements Engineer & Business Analyst **Verdict: Approved with concerns** I review requirements coverage and traceability, not code. This PR closes #669 (Phase 3 of "Handling the Unknowns") and the behaviour is well-specified and testable. The migration story is honest about its trade-offs, which is what I most care about. ### Requirements coverage — clear and testable - **Idempotency is precisely specified** and the two distinct rules are named, not blurred: Person/Tag scalar fields preserve human edits (fill-blanks-only, `provisional` monotonic-downward); document sender/receivers/tags are canonical-authoritative (re-import prunes removed links). Each rule has a named acceptance test. This is exactly the kind of testable, unambiguous requirement I push for. - **The product invariant is preserved.** Raw attribution (`sender_text`/`receiver_text`) is always retained even when a person is linked — this is the "honest uncertainty about people" promise of the milestone, and it has its own dedicated assertions. The provisional flag (importer-minted = uncertain, register/tree = confident) operationalizes "honest uncertainty" in the data model. - **Fail-closed behaviour is a stated requirement** (missing artifact → `IMPORT_ARTIFACT_INVALID`, surfaced to the admin with actionable copy), not an accident. The GLOSSARY entries (Canonical import, canonical artifact, CanonicalSheetReader, SkippedFile) keep the shared vocabulary current so a non-technical owner can still reason about the system. ### Concerns / open items to capture in the backlog (non-blocking) - **"Authoritative collections" is a behaviour change worth an explicit owner sign-off.** The ADR is candid that on re-import a placeholder document's receiver/tag set is overwritten from the canonical row — so if an archivist hand-added a receiver to a *still-placeholder* document, a re-import would remove it. The mitigation (non-PLACEHOLDER docs are skipped via `ALREADY_EXISTS`) bounds this to placeholder rows, and the design intent is that the canonical row owns these sets. That's a reasonable rule, but it's a workflow assumption — confirm the product owner accepts "don't hand-curate associations on placeholder docs; do it after upload." Worth one line in the user-facing import docs. - **Partial-failure recovery is operator-facing.** ADR-025 documents that a mid-run failure leaves earlier loaders committed and status `FAILED`, recovered by re-trigger. That's sound, but it's a runbook expectation an operator must internalize — the DEPLOYMENT note covers it; just ensure whoever runs imports has read it. - **Orphaned S3 objects** (raised by DevOps) is a data-hygiene item, not a functional gap — capture as a low-priority backlog issue rather than blocking this. Requirements are traceable to #669 and to named tests. No scope creep, no contradictions. Approved.
marcel added 1 commit 2026-05-27 11:20:42 +02:00
docs: record owner decisions on re-import authority and path-escape
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m5s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
e4a154406e
- DEPLOYMENT §6: clarify re-import keeps person/tag scalar human edits but
  re-applies document sender/receivers/tags from the canonical export
  (canonical-authoritative), per owner sign-off.
- ADR-025: path-escape/symlink aborts the whole import (fail-closed) by
  deliberate owner decision, chosen over a per-file skip.

Refs #669
marcel merged commit e4a154406e into docs/import-migration 2026-05-27 11:34:29 +02:00
marcel deleted branch feature/669-modular-importer 2026-05-27 11:34:31 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#674