As the archive owner I want the importer rebuilt as modular loaders over the normalizer's canonical exports, so dates/people/tags import correctly and re-runs are idempotent #669

marcel · 2026-05-26T21:07:04+02:00

Context — Phase 3 of the import rebuild

The legacy backend/.../importing/MassImportService.java (509 lines) reads the raw original spreadsheet via Apache POI by positional column index (@Value app.import.col.sender:3, col.date:7, …) and re-derives everything in Java — parseDate is ISO-only (MassImportService.java:468), name resolution goes through PersonService.findOrCreateByAlias, tags via TagService.findOrCreate. The normalizer's clean output is never touched. The semantic transformation has already been proven in the Python normalizer (tools/import-normalizer/); this issue makes the backend consume that output and retires the raw path.

This is Phase 3 of a three-phase rebuild:

Phase 1 — #670 (normalizer exports complete): emits the canonical artifacts, including the new file column on the document sheet and a stable person_id inside canonical-persons-tree.json so the tree reconciles with the register.
Phase 2 — #671 (schema): the single Flyway migration adding source_ref (persons + tag), provisional (persons), and the date precision / attribution columns (sender_text, receiver_text, date_precision) on documents.
Phase 3 — this issue: four idempotent loaders behind an orchestrator that consume those artifacts and that schema.

This issue adds NO Flyway migration — all schema lives in #671. It depends on #670 and #671 and compiles only after both land (DocumentImporter references DatePrecision, source_ref, sender_text/receiver_text, which do not exist in the codebase until #671). Sequence: #670 + #671 → #669.

Canonical artifacts consumed (produced by Phase 1):

| Artifact | Shape (key columns) | Loader |
|---|---|---|
| `canonical-tag-tree.xlsx` | `tag_path, parent_name, tag_name` | TagTreeImporter |
| `canonical-persons.xlsx` | `person_id, last_name, first_name, maiden_name, …, provisional` | PersonRegisterImporter |
| `canonical-persons-tree.json` | `{persons:[{rowId, person_id, firstName, lastName, …}], relationships:[…]}` | PersonTreeImporter |
| `canonical-documents.xlsx` | `index, box, folder, sender_person_id, sender_name, sender_category, receiver_person_ids, receiver_names, file, date_iso, date_raw, date_precision, tags, summary, source_row, needs_review` | DocumentImporter |

Module layout

CanonicalImportOrchestrator              # runs loaders in dependency order, reports ImportStatus
  1. TagTreeImporter        → TagService                                  (independent)
  2. PersonRegisterImporter → PersonService          (upsert persons by source_ref + provisional)
  3. PersonTreeImporter     → PersonService / relationship service        (needs persons)
  4. DocumentImporter       → DocumentService + FileService/S3            (needs persons + tags)

Ordering is a real dependency DAG (documents → persons + tags; tree → persons), not a preference. Encode it in the orchestrator as an explicit, named ordering rather than leaving it implicit in call order. Each loader calls the owning domain's service, never a repository (layering rule); in particular, PersonTreeImporter must call the relationship service, never PersonRelationshipRepository. Keep the existing async runner, the ImportStatus state machine (IDLE/RUNNING/DONE/FAILED), and the SkippedFile shape verbatim: admin/system/ImportStatusCard.svelte consumes {state, statusCode, processed, skippedFiles, skipped} via generated types, so changing any of them breaks the admin UI. Wrap the orchestrator inside the existing async runner.
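
One way to make the ordering explicit and named is to let each step declare what it needs and derive the run order from that. The sketch below is only an illustration: `ImportStep`, `dependsOn()` and `ImportOrder` are hypothetical names that do not exist in the codebase, and the dependencies are the ones stated in this issue.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical step enum: each loader names the loaders that must run before it. */
enum ImportStep {
    TAG_TREE,
    PERSON_REGISTER,
    PERSON_TREE,
    DOCUMENTS;

    /** Dependencies as stated in this issue: tree needs persons; documents need persons + tags. */
    Set<ImportStep> dependsOn() {
        switch (this) {
            case PERSON_TREE:
                return EnumSet.of(PERSON_REGISTER);
            case DOCUMENTS:
                return EnumSet.of(TAG_TREE, PERSON_REGISTER);
            default:
                return EnumSet.noneOf(ImportStep.class);
        }
    }
}

final class ImportOrder {
    private ImportOrder() {}

    /** Returns the steps in an order where every step follows all of its dependencies. */
    static List<ImportStep> resolve() {
        List<ImportStep> ordered = new ArrayList<>();
        Set<ImportStep> done = EnumSet.noneOf(ImportStep.class);
        while (ordered.size() < ImportStep.values().length) {
            boolean progressed = false;
            for (ImportStep step : ImportStep.values()) {
                if (!done.contains(step) && done.containsAll(step.dependsOn())) {
                    ordered.add(step);
                    done.add(step);
                    progressed = true;
                }
            }
            if (!progressed) {
                // No step became runnable in a full pass: the declared dependencies form a cycle.
                throw new IllegalStateException("Cyclic loader dependencies");
            }
        }
        return ordered;
    }
}
```

With the dependencies above, `ImportOrder.resolve()` yields tags, register persons, person tree, documents. A unit test can assert the relative positions, so that a later reordering of the enum constants cannot silently break the DAG.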

What changes (and what does NOT)

Semantic transformation stays in the Python normalizer. Java no longer parses German dates or classifies names. Java still parses the spreadsheets structurally — opens each canonical .xlsx/.json, maps by header name (replacing the brittle positional @Value indices), splits the pipe-|-delimited list columns, and converts clean values: LocalDate.parse(date_iso), DatePrecision.valueOf(date_precision) (from #671), Boolean.parseBoolean(provisional).
Java still owns S3 + files. File lookup on disk, upload to the bucket, and ThumbnailAsyncRunner dispatch stay in DocumentImporter.
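
The clean-value conversions named above need no semantic logic, as the following minimal sketch suggests. `CleanValues` is a hypothetical helper, the `DatePrecision` enum here is a stand-in for the one from #671 (only `MONTH` is confirmed by this issue; `DAY` and `YEAR` are assumptions), and treating a blank `date_iso` as "no date" is also an assumption.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Stand-in for the DatePrecision enum from #671; only MONTH appears in this issue. */
enum DatePrecision { DAY, MONTH, YEAR }

final class CleanValues {
    private CleanValues() {}

    /** Parses an ISO date; a blank cell is treated as "no date" (an assumption). */
    static Optional<LocalDate> date(String dateIso) {
        if (dateIso == null || dateIso.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(LocalDate.parse(dateIso.trim()));
    }

    /** Maps the precision label onto the enum; an unknown label throws IllegalArgumentException. */
    static DatePrecision precision(String label) {
        return DatePrecision.valueOf(label.trim());
    }

    /** Boolean.parseBoolean ignores case, so the normalizer's capitalized "True" parses as true. */
    static boolean flag(String cell) {
        return cell != null && Boolean.parseBoolean(cell.trim());
    }
}
```

Note that `Boolean.parseBoolean` returns false for anything other than a case-insensitive "true", so a format change to `1`/`0` would read as false instead of failing. That is why an explicit test on the `"True"` spelling is worthwhile.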

File-level breakdown

Backend — new importing sub-structure

CanonicalImportOrchestrator — replaces MassImportService's monolithic processRows; keeps runImportAsync, ImportStatus, SkippedFile. Smoke-checks all four expected artifacts are present before starting; fails fast with ImportStatus.FAILED rather than a half-run that loads tags but no documents.
TagTreeImporter, PersonRegisterImporter, PersonTreeImporter, DocumentImporter — one class each.
CanonicalSheetReader — a value-level POI helper (no Spring, no domain knowledge): workbook in, header-name → column index map + |-split helper, typed rows out. The seam that replaces positional @Value app.import.col.*. Throws a DomainException.badRequest on a missing required header (never NPE on a null index).
DocumentImporter keeps file/S3/thumbnail plumbing in small ≤20-line methods: resolveFile(), uploadToS3(), buildDocument().
Delete the positional @Value app.import.col.* indices, the ISO-only parseDate, the Java name-classification path, and the raw-spreadsheet / ODS path (XxeSafeXmlParser, NoSpreadsheetException) once loaders cover them.
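
The header-name seam can be illustrated without POI. The sketch below is hypothetical: `HeaderIndex` is an invented name, the header row is passed in as plain strings, and `IllegalArgumentException` stands in for `DomainException.badRequest`. Returning an empty list for an empty list cell is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderIndex {
    private final Map<String, Integer> columns = new LinkedHashMap<>();

    /** Builds the header-name to column-index map from the first row of a sheet. */
    HeaderIndex(List<String> headerRow) {
        for (int i = 0; i < headerRow.size(); i++) {
            String name = headerRow.get(i);
            if (name != null && !name.isBlank()) {
                columns.putIfAbsent(name.trim(), i);
            }
        }
    }

    /** Returns the column index, or throws instead of handing back a null index. */
    int require(String header) {
        Integer index = columns.get(header);
        if (index == null) {
            // Stand-in for DomainException.badRequest in the real reader.
            throw new IllegalArgumentException("Missing required header: " + header);
        }
        return index;
    }

    /** Splits a pipe-delimited list cell; an empty cell yields an empty list (assumption). */
    static List<String> splitPipe(String cell) {
        List<String> parts = new ArrayList<>();
        if (cell == null || cell.isBlank()) {
            return parts;
        }
        for (String part : cell.split("\\|")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```

A loader would then read `row.get(header.require("file"))` rather than a positional index, so reordering columns in the normalizer output does not change behavior.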

Error handling — new loaders use DomainException.internal/badRequest (not raw RuntimeException), likely a new ErrorCode IMPORT_ARTIFACT_INVALID (4-step change: ErrorCode.java + errors.ts + getErrorMessage() case + i18n keys in messages/{de,en,es}.json). Fail closed on a malformed artifact (throw, set FAILED); skip-and-continue is only for an individual bad file via the existing SkippedFile mechanism. Log artifact filenames with parameterized SLF4J, never concatenation.
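
The difference between failing closed on an artifact and skipping a single file can be sketched as follows. Every type here is a hypothetical stand-in: `ArtifactInvalidException` for `DomainException` with the proposed error code, `SkippedEntry` for the existing `SkippedFile` (whose real shape is not shown in this issue), and the `..` check for the full filename validation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for DomainException with the proposed IMPORT_ARTIFACT_INVALID code. */
final class ArtifactInvalidException extends RuntimeException {
    ArtifactInvalidException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in; the real SkippedFile shape is not shown in this issue. */
record SkippedEntry(String filename, String reason) {}

final class FailClosedRun {
    final List<String> imported = new ArrayList<>();
    final List<SkippedEntry> skipped = new ArrayList<>();

    /** Throws on a malformed artifact; records and skips an individual bad file. */
    void run(List<String> headerRow, List<String> fileCells) {
        if (!headerRow.contains("file")) {
            // Artifact-level problem: abort before any row is touched so the caller can set FAILED.
            throw new ArtifactInvalidException("canonical-documents.xlsx lacks the 'file' header");
        }
        for (String file : fileCells) {
            if (file.contains("..")) {
                // File-level problem: note it and continue with the next row.
                skipped.add(new SkippedEntry(file, "path traversal"));
                continue;
            }
            imported.add(file);
        }
    }
}
```

The header check runs before the loop on purpose: a malformed artifact then leaves `imported` empty, which matches the "no partial load" acceptance criterion.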

Docs (blockers) — docs/architecture/db/ diagrams reflect #671's columns (owned there, not here); new backend classes in importing/ → the l3-backend-* diagram; new terms → docs/GLOSSARY.md; an ADR ("importer consumes the normalizer's canonical artifacts; the raw spreadsheet is no longer parsed by Java" — next clean number 025); and docs/DEPLOYMENT.md gains the import prerequisite step (run normalizer → place artifacts → trigger import).

API — npm run generate:api after any model/endpoint touch.

Idempotency & re-import

Every loader is idempotent: persons and tags upsert by source_ref (the normalizer person_id / tag_path, unique+indexed per #671), documents upsert by index — never blind insert. Re-running the import after re-running the normalizer never duplicates persons, tags, or documents. The UNIQUE(source_ref) constraint (in #671) makes the upsert atomic at the DB layer.
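
The "never blind insert" property can be pictured with an in-memory stand-in. This is a sketch only: `PersonRow` and `InMemoryPersonStore` are hypothetical, and the map key plays the role that `UNIQUE(source_ref)` plays in Postgres. It overwrites unconditionally, which is the simplest form of upsert and not the final precedence rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, minimal person row keyed by the normalizer's person_id. */
record PersonRow(String sourceRef, String lastName, String firstName) {}

final class InMemoryPersonStore {
    private final Map<String, PersonRow> bySourceRef = new LinkedHashMap<>();

    /** Insert-or-update keyed by source_ref; returns true only when a new row was created. */
    boolean upsert(PersonRow row) {
        return bySourceRef.put(row.sourceRef(), row) == null;
    }

    int count() {
        return bySourceRef.size();
    }

    PersonRow find(String sourceRef) {
        return bySourceRef.get(sourceRef);
    }
}
```

Running the same upsert twice leaves `count()` unchanged, which is the assertion the per-loader idempotency tests need in one form or another.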

"Idempotent" is under-specified on its own (review finding): upsert must define a precedence rule — see Resolved decisions #1. The acceptance test for re-import cannot be written until that rule is chosen (upsert-overwrite vs upsert-preserve are different code and different assertions).

Identity reconciliation

canonical-persons.xlsx keys persons by the slug person_id; canonical-persons-tree.json historically keyed only by rowId with no person_id, so the tree loader had nothing to join on. Phase 1 #670 now emits person_id into the tree JSON. PersonTreeImporter joins the tree to register persons via that person_id. The slug must be computed by one shared Python function across both code paths, or the join silently fails (review finding — verify in #670).

Name-routing policy (folded from the now-deleted #665)

DocumentImporter routes each sender/receiver cell by the normalizer's category, retaining the raw cell text in all cases:

| Category | Routing |
|---|---|
| `single_token` / resolved | Link to a register person (register-first match by `source_ref`); if unmatched, a provisional single-token person (see fallback below) |
| `collective` | `GROUP` person |
| `institution` | `INSTITUTION` person |
| `ambiguous_pair` (e.g. `"Ella Anita"`) | Split into two persons, both attached |
| `prose` / `?` / noise | No person + keep raw text only |

ALWAYS retain the raw cell in sender_text / receiver_text, even when a person is linked. This is the load-bearing invariant behind the merge story: no per-document split exists, so PersonService.mergePersons + POST /{id}/merge is the only cleanup path. Test it explicitly: for a matched sender, sender is set AND sender_text equals the raw cell.
Resolver signature must change (review finding): the current PersonService.findOrCreateByAlias returns a single @Nullable Person. Replace with a small value type, e.g. record AttributionResult(List<Person> persons, String rawText), where persons is empty (prose/noise/?), one (single/collective/institution), or two (pair). The pair-split method belongs on PersonService, not the importer.
Fallback lastName (review finding): Person.lastName is @Column(nullable = false). A new provisional single-token person still needs a non-null lastName — register-first matching dodges this for matched names, but define and test the fallback for a genuinely new single token (empty string vs token-as-lastName) or the insert throws.
Receivers are a Set, but receiver_text is a single column: always populate it with the raw cell (per the always-retain rule), even when every receiver resolved to a person. Test the "always retain even when linked" rule for receivers as well.
Frontend display is escape-on-render: sender_text/receiver_text carry arbitrary cell content (?, Geschirr, markup-like prose) — render with plain {value} interpolation, never {@html} (stored-XSS guard, low severity).
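
The routing table and the `AttributionResult` shape could fit together roughly as below. This is a sketch under stated assumptions: the `Person` record and its `type` strings are simplified stand-ins, the register-first match is left out, the category strings are taken from the table above, and falling back to raw text for a malformed pair is an assumption.

```java
import java.util.List;

/** Simplified stand-in for the Person entity; the "PERSON" type label is an assumption. */
record Person(String displayName, String type) {}

/** Value type proposed in this issue: zero, one, or two persons plus the raw cell. */
record AttributionResult(List<Person> persons, String rawText) {}

final class AttributionRouter {
    private AttributionRouter() {}

    /** Routes one sender/receiver cell by category; the raw cell is retained in every arm. */
    static AttributionResult route(String category, String rawCell) {
        switch (category) {
            case "single_token":
                // The real resolver would try a register match by source_ref first.
                return new AttributionResult(List.of(new Person(rawCell, "PERSON")), rawCell);
            case "collective":
                return new AttributionResult(List.of(new Person(rawCell, "GROUP")), rawCell);
            case "institution":
                return new AttributionResult(List.of(new Person(rawCell, "INSTITUTION")), rawCell);
            case "ambiguous_pair": {
                String[] names = rawCell.trim().split("\\s+");
                if (names.length != 2) {
                    // Not actually a pair: keep raw text only (assumption).
                    return new AttributionResult(List.of(), rawCell);
                }
                return new AttributionResult(
                        List.of(new Person(names[0], "PERSON"), new Person(names[1], "PERSON")),
                        rawCell);
            }
            default:
                // prose, "?", noise: no person, raw text only.
                return new AttributionResult(List.of(), rawCell);
        }
    }
}
```

Because `rawText` is part of the result in every arm, the importer can write `sender_text` / `receiver_text` from one place and the always-retain invariant does not depend on which arm was taken.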

Security — port guards before deleting the old importer

The rewrite drops ~64 MassImportServiceTest methods, including path-traversal and PDF-magic-byte guards (review finding — these MUST be ported, not lost). Today:

isValidImportFilename (MassImportService.java:336-351) rejects /, \, Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), .., null bytes, absolute paths.
findFileRecursive (:499-504) re-validates via canonical-path containment.
isPdfMagicBytes checks the %PDF signature before upload.

All three move into DocumentImporter intact, with their tests ported as security regression tests before the old method is deleted (should_reject_path_traversal_in_file_column, should_reject_unicode_slash_homoglyph, should_reject_absolute_path). The file value now arrives via the canonical file column — treat it as hostile input regardless of upstream-trust (CWE-22 does not care the value came from "our" Python tool). Defense in depth: validate the string with isValidImportFilename, then keep canonical-path containment on the resolved real path. Confirm POI 5.5.0 rejects external entities (POI disables DTDs by default — verify, don't assume). The orchestrator entry point must remain reachable only through AdminController @RequirePermission(Permission.ADMIN) — add no second un-annotated path.
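
For orientation, the three guards could look roughly like this. The sketch is not the existing implementation: the method bodies are reconstructed from the descriptions above, `FileGuards` and `isContained` are hypothetical names, and the real methods in MassImportService.java remain the reference to port.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class FileGuards {
    /** Separators, the slash homoglyphs listed above (U+2215, U+FF0F, U+29F5), and the null byte. */
    private static final char[] FORBIDDEN = {'/', '\\', '\u2215', '\uFF0F', '\u29F5', '\0'};

    private FileGuards() {}

    /** String-level check; rejecting separators also rejects absolute paths such as /etc/passwd. */
    static boolean isValidImportFilename(String name) {
        if (name == null || name.isBlank() || name.contains("..")) {
            return false;
        }
        for (char c : FORBIDDEN) {
            if (name.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Second line of defense: the resolved real path must stay under the import root. */
    static boolean isContained(Path importRoot, Path candidate) {
        try {
            return candidate.toRealPath().startsWith(importRoot.toRealPath());
        } catch (IOException e) {
            // Fail closed when either path cannot be resolved.
            return false;
        }
    }

    /** Checks the %PDF signature at the start of the file content. */
    static boolean isPdfMagicBytes(byte[] head) {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (head == null || head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The string check and the containment check are deliberately independent: the first rejects hostile input early, and the second still holds if a symlink or an overlooked character gets past it.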

Acceptance criteria (Gherkin)

Feature: Modular canonical importer

  Scenario: Loaders run in dependency order
    Given the four canonical artifacts are present
    When the import runs
    Then tags and persons are loaded before documents
    And person relationships are loaded after persons

  Scenario: Documents resolve people by stable id
    Given a canonical document row with sender_person_id "degruyter-clara"
    And a person with source_ref "degruyter-clara" was loaded from canonical-persons.xlsx
    When the document is imported
    Then the document's sender is that person
    And no new person is created

  Scenario: Raw attribution is always retained
    Given a document row whose sender resolves to a register person
    When the document is imported
    Then the sender person is linked
    And sender_text equals the raw cell value

  Scenario: Ambiguous pair splits into two persons
    Given a receiver cell "Ella Anita" categorized ambiguous_pair
    When the document is imported
    Then exactly two persons are attached as receivers
    And the total person count increases by the expected amount only

  Scenario: Prose / noise / "?" creates no person
    Given a sender cell categorized prose, noise, or "?"
    When the document is imported
    Then no person is created for that cell
    And sender_text retains the raw cell value

  Scenario: Collective and institution routing
    Given a cell categorized collective
    Then a GROUP person is linked
    And a cell categorized institution links an INSTITUTION person

  Scenario: File path comes from the sheet
    Given the canonical document sheet carries a "file" column
    When a row with a present file is imported
    Then the named file is uploaded to S3 and status is UPLOADED
    And a row with an empty file yields status PLACEHOLDER

  Scenario: Re-import is idempotent
    Given a full import has completed
    When the same canonical artifacts are imported again
    Then no duplicate persons, tags, or documents are created
    And existing rows are updated in place by source_ref / index
    # precedence (overwrite vs preserve a human-edited field) per Resolved decision #1

  Scenario: Path traversal in the file column is rejected
    Given a document row whose file column is "../../etc/cron.d/x"
    When the import runs
    Then the file is rejected and not uploaded

  Scenario: Malformed artifact fails closed
    Given a required header is missing from an artifact
    When the import runs
    Then ImportStatus is FAILED with a clear ErrorCode
    And no partial load occurs

  Scenario: Clean values parse without semantic logic in Java
    Given date_iso "1916-06-01" and date_precision "MONTH"
    When the document is imported
    Then documentDate is 1916-06-01 and precision is MONTH
    And Java performs no German-date or name-classification parsing

  Scenario: Provisional flag is populated by the importer
    Given the importer auto-creates a Person for an unresolved/provisional attribution
    Then that Person's provisional is true
    And persons loaded from the register remain provisional = false
    And the value surfaces on PersonSummaryDTO

Implementation plan (TDD, red first per behavior)

1. CanonicalSheetReader first red/green cycle: header present, header missing (throw DomainException.badRequest), |-split of "a|b|c", empty cell → "", single value. No DB.
2. Four loaders, one test class each (@ExtendWith(MockitoExtension.class), owning service mocked), each idempotent (upsert by source_ref/index), each via the owning service. Named idempotency unit per loader: should_update_person_in_place_when_source_ref_exists. Add a provisional == "True" test (the normalizer writes capitalized Python bools) so a future format change fails loudly.
3. Name-routing: one failing test per category row (single resolved, single new-provisional, collective→GROUP, institution→INSTITUTION, pair-split→two, prose→none+raw, noise→none, ?→none), plus the "always retain raw even when linked" invariant.
4. Port the path-traversal / homoglyph / absolute-path / PDF-magic-byte tests into DocumentImporterTest before deleting the old methods.
5. CanonicalImportOrchestrator wiring named ordering + ImportStatus; strip positional config + parseDate + Java name logic + raw/ODS path.
6. Integration test (@SpringBootTest + Testcontainers postgres:16-alpine, never H2, because the UNIQUE(source_ref) + upsert conflict only exist in real Postgres): run all four artifacts, snapshot persons/tag/documents counts, run again, assert counts identical AND assert the precedence decision (mutate a field in-app, re-import, assert survival per Decision #1). Use Awaitility on ImportStatus, never Thread.sleep.
7. npm run generate:api; update l3-backend-* diagram + GLOSSARY.md + ADR 025 + DEPLOYMENT.md runbook step.

Mind the branch JaCoCo gate — currently 0.77 (77%), ratcheting toward 80% (see pom.xml / #496) — every routing arm and error path needs an explicit test.

Resolved decisions (settled 2026-05-27)

1. Re-import precedence = preserve human edits. Upsert by source_ref/index, but never overwrite a field a human changed in-app (merges, confirmations, manual date/name corrections). Track human-touched fields so a canonical re-import only fills/updates fields the human has not edited. (Raised by: issue, Sara, Elicit)
2. Name policy = Option A. Prose descriptions and literal "?" → create NO Person; keep the original cell verbatim in sender_text/receiver_text. Pristine register, no triage worklist. (Raised by: #665 author, Elicit)
3. Object-noise (resolved by extension of Option A; owner may override). Geschirr/Bierbecher/Steuerbescheid are categorized single_token (exactly like a real first name Clara), so Option A alone won't drop them. Resolution by extension of A: maintain a small curated override/stopword list in the normalizer's overrides/ marking known non-person tokens → treated as raw text, no Person. Deterministic, testable, light upkeep; the owner can extend the list. (Raised by: Elicit)
4. Relational labels (resolved by extension of Option A; owner may override). Schwester Hanni (×41), Tante Tüten (×11), 73 occurrences total: treated like single_token under A, with a best-effort register match by source_ref; if only a bare relation label with no resolvable name, keep as raw text rather than minting a Person. (Raised by: Elicit)
5. Artifact delivery = commit the canonical files. Commit out/canonical-documents.xlsx, canonical-persons.xlsx, canonical-tag-tree.xlsx, and canonical-persons-tree.json to the repo and update .gitignore (it currently excludes out/ except the tree JSON). The loader reads them from the repo; they are regenerated when the normalizer changes (Phase 1 #670). (Raised by: Tobias, Markus)
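
The preserve-human-edits rule from the first decision can be expressed as a field-level merge. The sketch below is a guess at the shape, not a settled design: how human-touched fields are tracked is not specified in this issue, so a plain set of field names is assumed, and `PreservingMerge` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class PreservingMerge {
    private PreservingMerge() {}

    /**
     * Returns the field values after a re-import: canonical values are applied,
     * except for fields a human has edited in-app, which keep their stored value.
     */
    static Map<String, String> merge(Map<String, String> stored,
                                     Map<String, String> canonical,
                                     Set<String> humanEditedFields) {
        Map<String, String> result = new HashMap<>(stored);
        for (Map.Entry<String, String> entry : canonical.entrySet()) {
            if (!humanEditedFields.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

For example, if a human corrected `lastName` in-app and the normalizer later changes both `lastName` and `firstName`, the merge keeps the human's `lastName` and takes the canonical `firstName`. That is the behavior the integration test (mutate, re-import, assert survival) would check.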

Out of scope

The schema migration and column definitions — Phase 2 #671 (source_ref, provisional, date_precision, sender_text, receiver_text).
Normalizer export changes (file column, tree person_id) — Phase 1 #670.
The date-precision rendering/formatter and the directory/timeline UI (#667/#668).
Full provisional-person visual treatment and the directory filter (dependent UI issue).
Briefwechsel — dead feature being removed; not a surface here.

Dependencies

Blocked-by #670 (Phase 1 — normalizer exports complete: file column, tree person_id).
Blocked-by #671 (Phase 2 — schema: source_ref, provisional, precision/attribution columns). This issue references DatePrecision, source_ref, sender_text/receiver_text and compiles only after #671.
Merge order: #670 + #671 first, then #669.
#667 / #668 consume the resulting clean data downstream; unaffected by this restructure.

- Normalizer export changes (`file` column, tree `person_id`) — **Phase 1 #670**. - The date-precision **rendering**/formatter and the directory/timeline UI (#667/#668). - Full provisional-person visual treatment and the directory filter (dependent UI issue). - **Briefwechsel** — dead feature being removed; not a surface here. ## Dependencies - **Blocked-by #670** (Phase 1 — normalizer exports complete: `file` column, tree `person_id`). - **Blocked-by #671** (Phase 2 — schema: `source_ref`, `provisional`, precision/attribution columns). This issue references `DatePrecision`, `source_ref`, `sender_text`/`receiver_text` and **compiles only after #671**. - Merge order: #670 + #671 first, then #669. - #667 / #668 consume the resulting clean data downstream; unaffected by this restructure.
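
Illustrative sketch for the cell-level behaviors named in the implementation plan (the `|`-split, the capitalized `"True"` check for `provisional`) and in the path-traversal scenario. Everything here is an assumption made for illustration: the class `CanonicalCells`, its method names, the choice to return an empty list for an empty cell, the choice to throw on an unexpected boolean literal, and the idea of a single import base directory. It is not the existing `CanonicalSheetReader` or `DocumentImporter` code, and it assumes Java 11 or later.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of the codebase. */
final class CanonicalCells {

    private CanonicalCells() {}

    /** Null or blank cells normalize to the empty string. */
    static String text(String raw) {
        return raw == null ? "" : raw.strip();
    }

    /** Splits a pipe-delimited cell. Assumption: an empty cell yields an empty list. */
    static List<String> splitPipe(String raw) {
        String cell = text(raw);
        if (cell.isEmpty()) {
            return List.of();
        }
        return Arrays.stream(cell.split("\\|"))
                .map(String::strip)
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }

    /**
     * Accepts only the capitalized Python literals "True" and "False".
     * Assumption: any other spelling is treated as a format change and rejected.
     */
    static boolean parsePythonBool(String raw) {
        String cell = text(raw);
        if (cell.equals("True")) {
            return true;
        }
        if (cell.equals("False")) {
            return false;
        }
        throw new IllegalArgumentException("Unexpected boolean cell: " + raw);
    }

    /**
     * Returns true only if the file cell is a relative path that still lies
     * inside baseDir after normalization. Empty cells return false here; the
     * caller is expected to handle them separately (PLACEHOLDER rows).
     */
    static boolean isInsideBase(Path baseDir, String fileCell) {
        String cell = text(fileCell);
        if (cell.isEmpty()) {
            return false;
        }
        try {
            Path candidate = Path.of(cell);
            if (candidate.isAbsolute()) {
                return false;
            }
            Path base = baseDir.toAbsolutePath().normalize();
            return base.resolve(candidate).normalize().startsWith(base);
        } catch (InvalidPathException e) {
            // Characters the file system cannot represent are rejected outright.
            return false;
        }
    }
}
```

Resolving against a normalized absolute base and then checking `startsWith` is what rejects a value such as `../../etc/cron.d/x`: the `..` segments are collapsed before the comparison. The sketch covers only the traversal and absolute-path cases; the homoglyph and PDF-magic-byte guards mentioned in step 4 are not shown.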

marcel added this to the Handling the Unknowns — honest uncertainty in dates & people milestone 2026-05-26 21:07:04 +02:00

marcel added the P0-critical feature needs-discussion labels 2026-05-26 21:07:10 +02:00

~~marcel referenced this issue 2026-05-26 21:14:21 +02:00~~

As a reader I want undated and imprecisely-dated letters to be honestly labelled in browse views so I always understand a document's date position #668

As the archive owner I want the importer rebuilt as modular loaders over the normalizer's canonical exports, so dates/people/tags import correctly and re-runs are idempotent #669

Context — Phase 3 of the import rebuild

Module layout

What changes (and what does NOT)

File-level breakdown

Idempotency & re-import

Identity reconciliation

Name-routing policy (folded from the now-deleted #665)

Security — port guards before deleting the old importer

Acceptance criteria (Gherkin)

Implementation plan (TDD, red first per behavior)

Resolved decisions (settled 2026-05-27)

Out of scope

Dependencies

Markus Keller — Senior Application Architect

Observations

Recommendations

Open Decisions

Felix Brandt — Senior Fullstack Developer

Observations

Recommendations

Open Decisions (none)

Nora "NullX" Steiner — Application Security Engineer

Observations

Recommendations

Open Decisions (none)

Sara Holt — Senior QA Engineer

Observations

Recommendations

Open Decisions

Elicit — Requirements Engineer & Business Analyst

Observations

Ambiguities / contradictions I must surface

Recommendations

Open Decisions

Tobias Wendt — DevOps & Platform Engineer

Observations

Recommendations

Open Decisions

Leonie Voss — UX Designer & Accessibility Strategist

Observations

Recommendations

Open Decisions (none)

Decision Queue — Action Required

Data / Requirements

Infrastructure