As the archive owner I want the importer rebuilt as modular loaders over the normalizer's canonical exports, so dates/people/tags import correctly and re-runs are idempotent #669

Closed
opened 2026-05-26 21:07:04 +02:00 by marcel · 8 comments
Owner

Context — Phase 3 of the import rebuild

The legacy backend/.../importing/MassImportService.java (509 lines) reads the raw original spreadsheet via Apache POI by positional column index (@Value app.import.col.sender:3, col.date:7, …) and re-derives everything in Java — parseDate is ISO-only (MassImportService.java:468), name resolution goes through PersonService.findOrCreateByAlias, tags via TagService.findOrCreate. The normalizer's clean output is never touched. The semantic transformation has already been proven in the Python normalizer (tools/import-normalizer/); this issue makes the backend consume that output and retires the raw path.

This is Phase 3 of a three-phase rebuild:

  • Phase 1 — #670 (normalizer exports complete): emits the canonical artifacts, including the new file column on the document sheet and a stable person_id inside canonical-persons-tree.json so the tree reconciles with the register.
  • Phase 2 — #671 (schema): the single Flyway migration adding source_ref (persons + tag), provisional (persons), and the date precision / attribution columns (sender_text, receiver_text, date_precision) on documents.
  • Phase 3 — this issue: four idempotent loaders behind an orchestrator that consume those artifacts and that schema.

This issue adds NO Flyway migration — all schema lives in #671. It depends on #670 and #671 and compiles only after both land (DocumentImporter references DatePrecision, source_ref, sender_text/receiver_text, which do not exist in the codebase until #671). Sequence: #670 + #671#669.

Canonical artifacts consumed (produced by Phase 1):

Artifact Shape (key columns) Loader
canonical-tag-tree.xlsx tag_path, parent_name, tag_name TagTreeImporter
canonical-persons.xlsx person_id, last_name, first_name, maiden_name, …, provisional PersonRegisterImporter
canonical-persons-tree.json {persons:[{rowId, person_id, firstName, lastName, …}], relationships:[…]} PersonTreeImporter
canonical-documents.xlsx index, box, folder, sender_person_id, sender_name, sender_category, receiver_person_ids, receiver_names, file, date_iso, date_raw, date_precision, tags, summary, source_row, needs_review DocumentImporter

Module layout

CanonicalImportOrchestrator              # runs loaders in dependency order, reports ImportStatus
  1. TagTreeImporter        → TagService                                  (independent)
  2. PersonRegisterImporter → PersonService          (upsert persons by source_ref + provisional)
  3. PersonTreeImporter     → PersonService / relationship service        (needs persons)
  4. DocumentImporter       → DocumentService + FileService/S3            (needs persons + tags)

Ordering is a real dependency DAG (documents → persons + tags; tree → persons), not a preference — encode it explicitly and named in the orchestrator, not implicit in call order. Each loader calls the owning domain's service, never a repository (layering rule). PersonTreeImporter must call the relationship service, never PersonRelationshipRepository. Keep the existing async runner + ImportStatus state machine (IDLE/RUNNING/DONE/FAILED) and SkippedFile shape verbatimadmin/system/ImportStatusCard.svelte consumes {state, statusCode, processed, skippedFiles, skipped} via generated types; changing it breaks the admin UI. Wrap the orchestrator inside the existing async runner.

What changes (and what does NOT)

  • Semantic transformation stays in the Python normalizer. Java no longer parses German dates or classifies names. Java still parses the spreadsheets structurally — opens each canonical .xlsx/.json, maps by header name (replacing the brittle positional @Value indices), splits the pipe-|-delimited list columns, and converts clean values: LocalDate.parse(date_iso), DatePrecision.valueOf(date_precision) (from #671), Boolean.parseBoolean(provisional).
  • Java still owns S3 + files. File lookup on disk, upload to the bucket, and ThumbnailAsyncRunner dispatch stay in DocumentImporter.

File-level breakdown

Backend — new importing sub-structure

  • CanonicalImportOrchestrator — replaces MassImportService's monolithic processRows; keeps runImportAsync, ImportStatus, SkippedFile. Smoke-checks all four expected artifacts are present before starting; fails fast with ImportStatus.FAILED rather than a half-run that loads tags but no documents.
  • TagTreeImporter, PersonRegisterImporter, PersonTreeImporter, DocumentImporter — one class each.
  • CanonicalSheetReader — a value-level POI helper (no Spring, no domain knowledge): workbook in, header-name → column index map + |-split helper, typed rows out. The seam that replaces positional @Value app.import.col.*. Throws a DomainException.badRequest on a missing required header (never NPE on a null index).
  • DocumentImporter keeps file/S3/thumbnail plumbing in small ≤20-line methods: resolveFile(), uploadToS3(), buildDocument().
  • Delete the positional @Value app.import.col.* indices, the ISO-only parseDate, the Java name-classification path, and the raw-spreadsheet / ODS path (XxeSafeXmlParser, NoSpreadsheetException) once loaders cover them.

Error handling — new loaders use DomainException.internal/badRequest (not raw RuntimeException), likely a new ErrorCode IMPORT_ARTIFACT_INVALID (4-step change: ErrorCode.java + errors.ts + getErrorMessage() case + i18n keys in messages/{de,en,es}.json). Fail closed on a malformed artifact (throw, set FAILED); skip-and-continue is only for an individual bad file via the existing SkippedFile mechanism. Log artifact filenames with parameterized SLF4J, never concatenation.

Docs (blockers)docs/architecture/db/ diagrams reflect #671's columns (owned there, not here); new backend classes in importing/ → the l3-backend-* diagram; new terms → docs/GLOSSARY.md; an ADR ("importer consumes the normalizer's canonical artifacts; the raw spreadsheet is no longer parsed by Java" — next clean number 025); and docs/DEPLOYMENT.md gains the import prerequisite step (run normalizer → place artifacts → trigger import).

APInpm run generate:api after any model/endpoint touch.

Idempotency & re-import

Every loader is idempotent: persons and tags upsert by source_ref (the normalizer person_id / tag_path, unique+indexed per #671), documents upsert by index — never blind insert. Re-running the import after re-running the normalizer never duplicates persons, tags, or documents. The UNIQUE(source_ref) constraint (in #671) makes the upsert atomic at the DB layer.

"Idempotent" is under-specified on its own (review finding): upsert must define a precedence rule — see Resolved decisions #1. The acceptance test for re-import cannot be written until that rule is chosen (upsert-overwrite vs upsert-preserve are different code and different assertions).

Identity reconciliation

canonical-persons.xlsx keys persons by the slug person_id; canonical-persons-tree.json historically keyed only by rowId with no person_id, so the tree loader had nothing to join on. Phase 1 #670 now emits person_id into the tree JSON. PersonTreeImporter joins the tree to register persons via that person_id. The slug must be computed by one shared Python function across both code paths, or the join silently fails (review finding — verify in #670).

Name-routing policy (folded from the now-deleted #665)

DocumentImporter routes each sender/receiver cell by the normalizer's category, retaining the raw cell text in all cases:

Category Routing
single_token / resolved Link to a register person (register-first match by source_ref); if unmatched, a provisional single-token person — see fallback below
collective GROUP person
institution INSTITUTION person
ambiguous_pair (e.g. "Ella Anita") Split into two persons, both attached
prose / ? / noise No person + keep raw text only
  • ALWAYS retain the raw cell in sender_text / receiver_text, even when a person is linked. This is the load-bearing invariant behind the merge story (no per-document split exists; PersonService.mergePersons + POST /{id}/merge is the only cleanup path) — test it explicitly: matched sender → both sender set AND sender_text == raw cell.
  • Resolver signature must change (review finding): the current PersonService.findOrCreateByAlias returns a single @Nullable Person. Replace with a small value type, e.g. record AttributionResult(List<Person> persons, String rawText), where persons is empty (prose/noise/?), one (single/collective/institution), or two (pair). The pair-split method belongs on PersonService, not the importer.
  • Fallback lastName (review finding): Person.lastName is @Column(nullable = false). A new provisional single-token person still needs a non-null lastName — register-first matching dodges this for matched names, but define and test the fallback for a genuinely new single token (empty string vs token-as-lastName) or the insert throws.
  • Receivers are a Setreceiver_text is a single column; populate it always (per the always-retain rule), even when persons resolved. Test the "always retain even when linked" rule explicitly.
  • Frontend display is escape-on-render: sender_text/receiver_text carry arbitrary cell content (?, Geschirr, markup-like prose) — render with plain {value} interpolation, never {@html} (stored-XSS guard, low severity).

Security — port guards before deleting the old importer

The rewrite drops ~64 MassImportServiceTest methods, including path-traversal and PDF-magic-byte guards (review finding — these MUST be ported, not lost). Today:

  • isValidImportFilename (MassImportService.java:336-351) rejects /, \, Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), .., null bytes, absolute paths.
  • findFileRecursive (:499-504) re-validates via canonical-path containment.
  • isPdfMagicBytes checks the %PDF signature before upload.

All three move into DocumentImporter intact, with their tests ported as security regression tests before the old method is deleted (should_reject_path_traversal_in_file_column, should_reject_unicode_slash_homoglyph, should_reject_absolute_path). The file value now arrives via the canonical file column — treat it as hostile input regardless of upstream-trust (CWE-22 does not care the value came from "our" Python tool). Defense in depth: validate the string with isValidImportFilename, then keep canonical-path containment on the resolved real path. Confirm POI 5.5.0 rejects external entities (POI disables DTDs by default — verify, don't assume). The orchestrator entry point must remain reachable only through AdminController @RequirePermission(Permission.ADMIN) — add no second un-annotated path.

Acceptance criteria (Gherkin)

Feature: Modular canonical importer

  Scenario: Loaders run in dependency order
    Given the four canonical artifacts are present
    When the import runs
    Then tags and persons are loaded before documents
    And person relationships are loaded after persons

  Scenario: Documents resolve people by stable id
    Given a canonical document row with sender_person_id "degruyter-clara"
    And a person with source_ref "degruyter-clara" was loaded from canonical-persons.xlsx
    When the document is imported
    Then the document's sender is that person
    And no new person is created

  Scenario: Raw attribution is always retained
    Given a document row whose sender resolves to a register person
    When the document is imported
    Then the sender person is linked
    And sender_text equals the raw cell value

  Scenario: Ambiguous pair splits into two persons
    Given a receiver cell "Ella Anita" categorized ambiguous_pair
    When the document is imported
    Then exactly two persons are attached as receivers
    And the total person count increases by the expected amount only

  Scenario: Prose / noise / "?" creates no person
    Given a sender cell categorized prose, noise, or "?"
    When the document is imported
    Then no person is created for that cell
    And sender_text retains the raw cell value

  Scenario: Collective and institution routing
    Given a cell categorized collective
    Then a GROUP person is linked
    And a cell categorized institution links an INSTITUTION person

  Scenario: File path comes from the sheet
    Given the canonical document sheet carries a "file" column
    When a row with a present file is imported
    Then the named file is uploaded to S3 and status is UPLOADED
    And a row with an empty file yields status PLACEHOLDER

  Scenario: Re-import is idempotent
    Given a full import has completed
    When the same canonical artifacts are imported again
    Then no duplicate persons, tags, or documents are created
    And existing rows are updated in place by source_ref / index
    # precedence (overwrite vs preserve a human-edited field) per Resolved decision #1

  Scenario: Path traversal in the file column is rejected
    Given a document row whose file column is "../../etc/cron.d/x"
    When the import runs
    Then the file is rejected and not uploaded

  Scenario: Malformed artifact fails closed
    Given a required header is missing from an artifact
    When the import runs
    Then ImportStatus is FAILED with a clear ErrorCode
    And no partial load occurs

  Scenario: Clean values parse without semantic logic in Java
    Given date_iso "1916-06-01" and date_precision "MONTH"
    When the document is imported
    Then documentDate is 1916-06-01 and precision is MONTH
    And Java performs no German-date or name-classification parsing

  Scenario: Provisional flag is populated by the importer
    Given the importer auto-creates a Person for an unresolved/provisional attribution
    Then that Person's provisional is true
    And persons loaded from the register remain provisional = false
    And the value surfaces on PersonSummaryDTO

Implementation plan (TDD, red first per behavior)

  1. CanonicalSheetReader first red/green cycle: header present, header missing (throw DomainException.badRequest), |-split of "a|b|c", empty cell → "", single value. No DB.
  2. Four loaders, one test class each (@ExtendWith(MockitoExtension.class), owning service mocked), each idempotent (upsert by source_ref/index), each via the owning service. Named idempotency unit per loader: should_update_person_in_place_when_source_ref_exists. Add a provisional == "True" test (the normalizer writes capitalized Python bools) so a future format change fails loudly.
  3. Name-routing: one failing test per category row (single resolved, single new-provisional, collective→GROUP, institution→INSTITUTION, pair-split→two, prose→none+raw, noise→none, ?→none), plus the "always retain raw even when linked" invariant.
  4. Port the path-traversal / homoglyph / absolute-path / PDF-magic-byte tests into DocumentImporterTest before deleting the old methods.
  5. CanonicalImportOrchestrator wiring named ordering + ImportStatus; strip positional config + parseDate + Java name logic + raw/ODS path.
  6. Integration test (@SpringBootTest + Testcontainers postgres:16-alpine, never H2 — the UNIQUE(source_ref) + upsert conflict only exist in real Postgres): run all four artifacts, snapshot persons/tag/documents counts, run again, assert counts identical AND assert the precedence decision (mutate a field in-app, re-import, assert survival per Decision #1). Use Awaitility on ImportStatus, never Thread.sleep.
  7. npm run generate:api; update l3-backend-* diagram + GLOSSARY.md + ADR 025 + DEPLOYMENT.md runbook step.

Mind the branch JaCoCo gate — currently 0.77 (77%), ratcheting toward 80% (see pom.xml / #496) — every routing arm and error path needs an explicit test.

Resolved decisions (settled 2026-05-27)

  1. Re-import precedence = preserve human edits. Upsert by source_ref/index, but never overwrite a field a human changed in-app (merges, confirmations, manual date/name corrections). Track human-touched fields so a canonical re-import only fills/updates fields the human has not edited. (Raised by: issue, Sara, Elicit)
  2. Name policy = Option A. Prose descriptions and literal "?" → create NO Person; keep the original cell verbatim in sender_text/receiver_text. Pristine register, no triage worklist. (Raised by: #665 author, Elicit)
  3. Object-noise (resolved by extension of Option A — owner may override). Geschirr/Bierbecher/Steuerbescheid are categorized single_token (exactly like a real first name Clara), so Option A alone won't drop them. Resolution by extension of A: maintain a small curated override/stopword list in the normalizer's overrides/ marking known non-person tokens → treated as raw text, no Person. Deterministic, testable, light upkeep; the owner can extend the list. (Raised by: Elicit)
  4. Relational labels (resolved by extension of Option A — owner may override). Schwester Hanni (×41), Tante Tüten (×11), 73 occurrences total — treated like single_token under A: best-effort register match by source_ref; if only a bare relation label with no resolvable name, keep as raw text rather than minting a Person. (Raised by: Elicit)
  5. Artifact delivery = commit the canonical files. Commit out/canonical-documents.xlsx, canonical-persons.xlsx, canonical-tag-tree.xlsx, and canonical-persons-tree.json to the repo and update .gitignore (it currently excludes out/ except the tree JSON). The loader reads them from the repo; they are regenerated when the normalizer changes (Phase 1 #670). (Raised by: Tobias, Markus)

Out of scope

  • The schema migration and column definitions — Phase 2 #671 (source_ref, provisional, date_precision, sender_text, receiver_text).
  • Normalizer export changes (file column, tree person_id) — Phase 1 #670.
  • The date-precision rendering/formatter and the directory/timeline UI (#667/#668).
  • Full provisional-person visual treatment and the directory filter (dependent UI issue).
  • Briefwechsel — dead feature being removed; not a surface here.

Dependencies

  • Blocked-by #670 (Phase 1 — normalizer exports complete: file column, tree person_id).
  • Blocked-by #671 (Phase 2 — schema: source_ref, provisional, precision/attribution columns). This issue references DatePrecision, source_ref, sender_text/receiver_text and compiles only after #671.
  • Merge order: #670 + #671 first, then #669.
  • #667 / #668 consume the resulting clean data downstream; unaffected by this restructure.
## Context — Phase 3 of the import rebuild The legacy `backend/.../importing/MassImportService.java` (509 lines) reads the **raw** original spreadsheet via Apache POI **by positional column index** (`@Value app.import.col.sender:3`, `col.date:7`, …) and re-derives everything in Java — `parseDate` is ISO-only (`MassImportService.java:468`), name resolution goes through `PersonService.findOrCreateByAlias`, tags via `TagService.findOrCreate`. The normalizer's clean output is never touched. The semantic transformation has already been proven in the Python normalizer (`tools/import-normalizer/`); this issue makes the backend consume that output and retires the raw path. This is **Phase 3** of a three-phase rebuild: - **Phase 1 — #670** (normalizer exports complete): emits the canonical artifacts, including the new `file` column on the document sheet and a stable `person_id` inside `canonical-persons-tree.json` so the tree reconciles with the register. - **Phase 2 — #671** (schema): the single Flyway migration adding `source_ref` (persons + tag), `provisional` (persons), and the date precision / attribution columns (`sender_text`, `receiver_text`, `date_precision`) on `documents`. - **Phase 3 — this issue**: four idempotent loaders behind an orchestrator that consume those artifacts and that schema. **This issue adds NO Flyway migration** — all schema lives in #671. It **depends on #670 and #671** and **compiles only after both** land (`DocumentImporter` references `DatePrecision`, `source_ref`, `sender_text`/`receiver_text`, which do not exist in the codebase until #671). Sequence: #670 + #671 → #669. Canonical artifacts consumed (produced by Phase 1): | Artifact | Shape (key columns) | Loader | |---|---|---| | `canonical-tag-tree.xlsx` | `tag_path, parent_name, tag_name` | **TagTreeImporter** | | `canonical-persons.xlsx` | `person_id, last_name, first_name, maiden_name, …, provisional` | **PersonRegisterImporter** | | `canonical-persons-tree.json` | `{persons:[{rowId, person_id, firstName, lastName, …}], relationships:[…]}` | **PersonTreeImporter** | | `canonical-documents.xlsx` | `index, box, folder, sender_person_id, sender_name, sender_category, receiver_person_ids, receiver_names, file, date_iso, date_raw, date_precision, tags, summary, source_row, needs_review` | **DocumentImporter** | ## Module layout ``` CanonicalImportOrchestrator # runs loaders in dependency order, reports ImportStatus 1. TagTreeImporter → TagService (independent) 2. PersonRegisterImporter → PersonService (upsert persons by source_ref + provisional) 3. PersonTreeImporter → PersonService / relationship service (needs persons) 4. DocumentImporter → DocumentService + FileService/S3 (needs persons + tags) ``` Ordering is a real dependency DAG (documents → persons + tags; tree → persons), not a preference — encode it **explicitly and named** in the orchestrator, not implicit in call order. **Each loader calls the owning domain's service, never a repository** (layering rule). `PersonTreeImporter` must call the relationship **service**, never `PersonRelationshipRepository`. Keep the existing async runner + `ImportStatus` state machine (`IDLE/RUNNING/DONE/FAILED`) and `SkippedFile` shape **verbatim** — `admin/system/ImportStatusCard.svelte` consumes `{state, statusCode, processed, skippedFiles, skipped}` via generated types; changing it breaks the admin UI. Wrap the orchestrator inside the existing async runner. ## What changes (and what does NOT) - **Semantic transformation stays in the Python normalizer.** Java no longer parses German dates or classifies names. Java still parses the spreadsheets *structurally* — opens each canonical `.xlsx`/`.json`, maps **by header name** (replacing the brittle positional `@Value` indices), splits the **pipe-`|`-delimited** list columns, and converts clean values: `LocalDate.parse(date_iso)`, `DatePrecision.valueOf(date_precision)` (from #671), `Boolean.parseBoolean(provisional)`. - **Java still owns S3 + files.** File lookup on disk, upload to the bucket, and `ThumbnailAsyncRunner` dispatch stay in `DocumentImporter`. ## File-level breakdown **Backend — new `importing` sub-structure** - `CanonicalImportOrchestrator` — replaces `MassImportService`'s monolithic `processRows`; keeps `runImportAsync`, `ImportStatus`, `SkippedFile`. Smoke-checks all four expected artifacts are present before starting; fails fast with `ImportStatus.FAILED` rather than a half-run that loads tags but no documents. - `TagTreeImporter`, `PersonRegisterImporter`, `PersonTreeImporter`, `DocumentImporter` — one class each. - `CanonicalSheetReader` — a value-level POI helper (no Spring, no domain knowledge): workbook in, header-name → column index map + `|`-split helper, typed rows out. The seam that replaces positional `@Value app.import.col.*`. Throws a `DomainException.badRequest` on a missing required header (never NPE on a null index). - `DocumentImporter` keeps file/S3/thumbnail plumbing in small ≤20-line methods: `resolveFile()`, `uploadToS3()`, `buildDocument()`. - **Delete** the positional `@Value app.import.col.*` indices, the ISO-only `parseDate`, the Java name-classification path, and the raw-spreadsheet / ODS path (`XxeSafeXmlParser`, `NoSpreadsheetException`) once loaders cover them. **Error handling** — new loaders use `DomainException.internal/badRequest` (not raw `RuntimeException`), likely a new `ErrorCode IMPORT_ARTIFACT_INVALID` (4-step change: `ErrorCode.java` + `errors.ts` + `getErrorMessage()` case + i18n keys in `messages/{de,en,es}.json`). Fail **closed** on a malformed *artifact* (throw, set `FAILED`); skip-and-continue is only for an individual bad *file* via the existing `SkippedFile` mechanism. Log artifact filenames with parameterized SLF4J, never concatenation. **Docs (blockers)** — `docs/architecture/db/` diagrams reflect #671's columns (owned there, not here); new backend classes in `importing/` → the `l3-backend-*` diagram; new terms → `docs/GLOSSARY.md`; an **ADR** ("importer consumes the normalizer's canonical artifacts; the raw spreadsheet is no longer parsed by Java" — next clean number `025`); and `docs/DEPLOYMENT.md` gains the import prerequisite step (run normalizer → place artifacts → trigger import). **API** — `npm run generate:api` after any model/endpoint touch. ## Idempotency & re-import Every loader is **idempotent**: persons and tags upsert by `source_ref` (the normalizer `person_id` / `tag_path`, unique+indexed per #671), documents upsert by `index` — never blind insert. Re-running the import after re-running the normalizer never duplicates persons, tags, or documents. The `UNIQUE(source_ref)` constraint (in #671) makes the upsert atomic at the DB layer. **"Idempotent" is under-specified on its own** (review finding): upsert must define a precedence rule — see Resolved decisions #1. The acceptance test for re-import cannot be written until that rule is chosen (upsert-overwrite vs upsert-preserve are different code and different assertions). ## Identity reconciliation `canonical-persons.xlsx` keys persons by the slug `person_id`; `canonical-persons-tree.json` historically keyed only by `rowId` with **no `person_id`**, so the tree loader had nothing to join on. **Phase 1 #670 now emits `person_id` into the tree JSON.** `PersonTreeImporter` joins the tree to register persons via that `person_id`. The slug must be computed by **one shared Python function** across both code paths, or the join silently fails (review finding — verify in #670). ## Name-routing policy (folded from the now-deleted #665) `DocumentImporter` routes each sender/receiver cell by the normalizer's category, retaining the raw cell text in all cases: | Category | Routing | |---|---| | `single_token` / resolved | Link to a register person (**register-first match by `source_ref`**); if unmatched, a provisional single-token person — see fallback below | | `collective` | `GROUP` person | | `institution` | `INSTITUTION` person | | `ambiguous_pair` (e.g. `"Ella Anita"`) | **Split into two persons**, both attached | | `prose` / `?` / noise | **No person** + keep raw text only | - **ALWAYS retain the raw cell** in `sender_text` / `receiver_text`, even when a person is linked. This is the load-bearing invariant behind the merge story (no per-document split exists; `PersonService.mergePersons` + `POST /{id}/merge` is the only cleanup path) — test it explicitly: matched sender → both `sender` set AND `sender_text` == raw cell. - **Resolver signature must change** (review finding): the current `PersonService.findOrCreateByAlias` returns a single `@Nullable Person`. Replace with a small value type, e.g. `record AttributionResult(List<Person> persons, String rawText)`, where `persons` is **empty** (prose/noise/`?`), **one** (single/collective/institution), or **two** (pair). The pair-split method belongs on `PersonService`, not the importer. - **Fallback `lastName`** (review finding): `Person.lastName` is `@Column(nullable = false)`. A new provisional single-token person still needs a non-null `lastName` — register-first matching dodges this for matched names, but define and test the fallback for a genuinely new single token (empty string vs token-as-lastName) or the insert throws. - **Receivers are a `Set`** — `receiver_text` is a single column; populate it always (per the always-retain rule), even when persons resolved. Test the "always retain even when linked" rule explicitly. - **Frontend display is escape-on-render**: `sender_text`/`receiver_text` carry arbitrary cell content (`?`, `Geschirr`, markup-like prose) — render with plain `{value}` interpolation, never `{@html}` (stored-XSS guard, low severity). ## Security — port guards before deleting the old importer The rewrite drops ~64 `MassImportServiceTest` methods, including **path-traversal** and **PDF-magic-byte** guards (review finding — these MUST be ported, not lost). Today: - `isValidImportFilename` (`MassImportService.java:336-351`) rejects `/`, `\`, Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), `..`, null bytes, absolute paths. - `findFileRecursive` (`:499-504`) re-validates via canonical-path containment. - `isPdfMagicBytes` checks the `%PDF` signature before upload. **All three move into `DocumentImporter` intact, with their tests ported as security regression tests *before* the old method is deleted** (`should_reject_path_traversal_in_file_column`, `should_reject_unicode_slash_homoglyph`, `should_reject_absolute_path`). The `file` value now arrives via the canonical `file` column — treat it as hostile input regardless of upstream-trust (CWE-22 does not care the value came from "our" Python tool). Defense in depth: validate the string with `isValidImportFilename`, then keep canonical-path containment on the resolved real path. Confirm POI 5.5.0 rejects external entities (POI disables DTDs by default — verify, don't assume). The orchestrator entry point must remain reachable only through `AdminController` `@RequirePermission(Permission.ADMIN)` — add no second un-annotated path. ## Acceptance criteria (Gherkin) ```gherkin Feature: Modular canonical importer Scenario: Loaders run in dependency order Given the four canonical artifacts are present When the import runs Then tags and persons are loaded before documents And person relationships are loaded after persons Scenario: Documents resolve people by stable id Given a canonical document row with sender_person_id "degruyter-clara" And a person with source_ref "degruyter-clara" was loaded from canonical-persons.xlsx When the document is imported Then the document's sender is that person And no new person is created Scenario: Raw attribution is always retained Given a document row whose sender resolves to a register person When the document is imported Then the sender person is linked And sender_text equals the raw cell value Scenario: Ambiguous pair splits into two persons Given a receiver cell "Ella Anita" categorized ambiguous_pair When the document is imported Then exactly two persons are attached as receivers And the total person count increases by the expected amount only Scenario: Prose / noise / "?" creates no person Given a sender cell categorized prose, noise, or "?" When the document is imported Then no person is created for that cell And sender_text retains the raw cell value Scenario: Collective and institution routing Given a cell categorized collective Then a GROUP person is linked And a cell categorized institution links an INSTITUTION person Scenario: File path comes from the sheet Given the canonical document sheet carries a "file" column When a row with a present file is imported Then the named file is uploaded to S3 and status is UPLOADED And a row with an empty file yields status PLACEHOLDER Scenario: Re-import is idempotent Given a full import has completed When the same canonical artifacts are imported again Then no duplicate persons, tags, or documents are created And existing rows are updated in place by source_ref / index # precedence (overwrite vs preserve a human-edited field) per Resolved decision #1 Scenario: Path traversal in the file column is rejected Given a document row whose file column is "../../etc/cron.d/x" When the import runs Then the file is rejected and not uploaded Scenario: Malformed artifact fails closed Given a required header is missing from an artifact When the import runs Then ImportStatus is FAILED with a clear ErrorCode And no partial load occurs Scenario: Clean values parse without semantic logic in Java Given date_iso "1916-06-01" and date_precision "MONTH" When the document is imported Then documentDate is 1916-06-01 and precision is MONTH And Java performs no German-date or name-classification parsing Scenario: Provisional flag is populated by the importer Given the importer auto-creates a Person for an unresolved/provisional attribution Then that Person's provisional is true And persons loaded from the register remain provisional = false And the value surfaces on PersonSummaryDTO ``` ## Implementation plan (TDD, red first per behavior) 1. `CanonicalSheetReader` first red/green cycle: header present, header missing (throw `DomainException.badRequest`), `|`-split of `"a|b|c"`, empty cell → "", single value. No DB. 2. Four loaders, one test class each (`@ExtendWith(MockitoExtension.class)`, owning service mocked), each idempotent (upsert by `source_ref`/`index`), each via the owning service. Named idempotency unit per loader: `should_update_person_in_place_when_source_ref_exists`. Add a `provisional == "True"` test (the normalizer writes capitalized Python bools) so a future format change fails loudly. 3. Name-routing: one failing test per category row (single resolved, single new-provisional, collective→GROUP, institution→INSTITUTION, pair-split→two, prose→none+raw, noise→none, `?`→none), plus the "always retain raw even when linked" invariant. 4. Port the path-traversal / homoglyph / absolute-path / PDF-magic-byte tests into `DocumentImporterTest` **before** deleting the old methods. 5. `CanonicalImportOrchestrator` wiring named ordering + `ImportStatus`; strip positional config + `parseDate` + Java name logic + raw/ODS path. 6. Integration test (`@SpringBootTest` + **Testcontainers `postgres:16-alpine`**, never H2 — the `UNIQUE(source_ref)` + upsert conflict only exist in real Postgres): run all four artifacts, snapshot `persons`/`tag`/`documents` counts, run again, assert counts identical AND assert the precedence decision (mutate a field in-app, re-import, assert survival per Decision #1). Use **Awaitility** on `ImportStatus`, never `Thread.sleep`. 7. `npm run generate:api`; update `l3-backend-*` diagram + `GLOSSARY.md` + ADR 025 + `DEPLOYMENT.md` runbook step. Mind the **branch JaCoCo gate** — currently **0.77 (77%)**, ratcheting toward 80% (see `pom.xml` / #496) — every routing arm and error path needs an explicit test. ## Resolved decisions (settled 2026-05-27) 1. **Re-import precedence = preserve human edits.** Upsert by `source_ref`/`index`, but never overwrite a field a human changed in-app (merges, confirmations, manual date/name corrections). Track human-touched fields so a canonical re-import only fills/updates fields the human has not edited. *(Raised by: issue, Sara, Elicit)* 2. **Name policy = Option A.** Prose descriptions and literal "?" → create NO Person; keep the original cell verbatim in `sender_text`/`receiver_text`. Pristine register, no triage worklist. *(Raised by: #665 author, Elicit)* 3. **Object-noise (resolved by extension of Option A — owner may override).** `Geschirr`/`Bierbecher`/`Steuerbescheid` are categorized `single_token` (exactly like a real first name `Clara`), so Option A alone won't drop them. Resolution by extension of A: maintain a **small curated override/stopword list in the normalizer's `overrides/`** marking known non-person tokens → treated as raw text, no Person. Deterministic, testable, light upkeep; the owner can extend the list. *(Raised by: Elicit)* 4. **Relational labels (resolved by extension of Option A — owner may override).** `Schwester Hanni` (×41), `Tante Tüten` (×11), 73 occurrences total — treated like `single_token` under A: best-effort register match by `source_ref`; if only a bare relation label with no resolvable name, keep as raw text rather than minting a Person. *(Raised by: Elicit)* 5. **Artifact delivery = commit the canonical files.** Commit `out/canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`, and `canonical-persons-tree.json` to the repo and update `.gitignore` (it currently excludes `out/` except the tree JSON). The loader reads them from the repo; they are regenerated when the normalizer changes (Phase 1 #670). *(Raised by: Tobias, Markus)* ## Out of scope - The schema migration and column definitions — **Phase 2 #671** (`source_ref`, `provisional`, `date_precision`, `sender_text`, `receiver_text`). - Normalizer export changes (`file` column, tree `person_id`) — **Phase 1 #670**. - The date-precision **rendering**/formatter and the directory/timeline UI (#667/#668). - Full provisional-person visual treatment and the directory filter (dependent UI issue). - **Briefwechsel** — dead feature being removed; not a surface here. ## Dependencies - **Blocked-by #670** (Phase 1 — normalizer exports complete: `file` column, tree `person_id`). - **Blocked-by #671** (Phase 2 — schema: `source_ref`, `provisional`, precision/attribution columns). This issue references `DatePrecision`, `source_ref`, `sender_text`/`receiver_text` and **compiles only after #671**. - Merge order: #670 + #671 first, then #669. - #667 / #668 consume the resulting clean data downstream; unaffected by this restructure.
marcel added this to the Handling the Unknowns — honest uncertainty in dates & people milestone 2026-05-26 21:07:04 +02:00
marcel added the P0-criticalfeatureneeds-discussion labels 2026-05-26 21:07:10 +02:00
Author
Owner

Markus Keller — Senior Application Architect

Observations

  • The dependency DAG is real and correctly named: documents need persons + tags; tree needs persons. I verified RelationshipService already exists at person/relationship/RelationshipService.java, so PersonTreeImporter calling the service (never PersonRelationshipRepository) is fully satisfiable today — good, no new abstraction needed.
  • The layering instruction is sound. Today MassImportService already goes through DocumentService/PersonService/TagService (no repository reach-in), so the four loaders just preserve that. Keep it.
  • Idempotency via UNIQUE(source_ref) + DB-layer upsert is exactly the right call — push integrity to Postgres, not a Java existsBy check (which has a TOCTOU race under the async runner). This belongs in #671 and the issue correctly scopes it there.
  • The CanonicalSheetReader "value-level POI helper, no Spring, no domain knowledge" is a clean seam. It is the one place positional @Value app.import.col.* brittleness dies. Endorse.
  • ADR numbering: highest committed is 024. Note there is already a duplicate 022 in docs/adr/ (022-csrf-... and 022-eager-to-lazy-...). 025 is still the next clean number, so the issue is right — but whoever writes it should not repeat the collision.

Recommendations

  • Make the orchestrator's ordering a typed, named structure (e.g. an explicit ordered list of named loader steps), not call-site ordering in a method body. A new reader must see the DAG by reading one declaration. This is the difference between "documented" and "discoverable."
  • The smoke-check ("all four artifacts present, fail fast with FAILED") is an architectural guard — make it a single named method assertAllArtifactsPresent() returning before any loader runs. Fail-closed-on-missing-artifact is the correct posture.
  • New importing/ classes belong in docs/architecture/c4/l3-backend-3g-supporting.puml (that is where the importing domain currently sits — there is no dedicated import diagram). Add the orchestrator + 4 loaders + CanonicalSheetReader there. This is a doc blocker, not optional.
  • The "delete the raw/ODS path + XxeSafeXmlParser" step removes the only XML-parsing surface in the importer — that simplification is worth an explicit line in ADR 025 ("Java no longer parses the raw spreadsheet; XXE surface eliminated").
  • Keep MassImportService.ImportStatus/SkippedFile/State as the public contract verbatim; the orchestrator wraps them. I confirmed admin/system/ImportStatusCard.svelte reads state, statusCode, processed, skipped — changing the record breaks the admin UI silently through generated types.

Open Decisions

  • Artifact delivery mechanism (Decision #5). This is a genuine topology choice and I lean hard to option (d): operator drops the four files into the existing app.import.dir mount and triggers via the admin endpoint — zero new infrastructure, reuses what exists. Committing a regenerated 740K canonical-documents.xlsx to git (option a) bloats history for a file that changes every normalizer run; reject it. The only thing that would move me off (d) is a concrete requirement that the normalizer runs in CI on a schedule with no operator in the loop — which I don't see stated. Confirm who runs the normalizer and how often.
## Markus Keller — Senior Application Architect ### Observations - The dependency DAG is real and correctly named: documents need persons + tags; tree needs persons. I verified `RelationshipService` already exists at `person/relationship/RelationshipService.java`, so `PersonTreeImporter` calling the **service** (never `PersonRelationshipRepository`) is fully satisfiable today — good, no new abstraction needed. - The layering instruction is sound. Today `MassImportService` already goes through `DocumentService`/`PersonService`/`TagService` (no repository reach-in), so the four loaders just preserve that. Keep it. - Idempotency via `UNIQUE(source_ref)` + DB-layer upsert is exactly the right call — push integrity to Postgres, not a Java `existsBy` check (which has a TOCTOU race under the async runner). This belongs in #671 and the issue correctly scopes it there. - The `CanonicalSheetReader` "value-level POI helper, no Spring, no domain knowledge" is a clean seam. It is the one place positional `@Value app.import.col.*` brittleness dies. Endorse. - ADR numbering: highest committed is `024`. Note there is already a **duplicate `022`** in `docs/adr/` (`022-csrf-...` and `022-eager-to-lazy-...`). `025` is still the next clean number, so the issue is right — but whoever writes it should not repeat the collision. ### Recommendations - Make the orchestrator's ordering a typed, named structure (e.g. an explicit ordered list of named loader steps), not call-site ordering in a method body. A new reader must see the DAG by reading one declaration. This is the difference between "documented" and "discoverable." - The smoke-check ("all four artifacts present, fail fast with `FAILED`") is an architectural guard — make it a single named method `assertAllArtifactsPresent()` returning before any loader runs. Fail-closed-on-missing-artifact is the correct posture. - New `importing/` classes belong in `docs/architecture/c4/l3-backend-3g-supporting.puml` (that is where the importing domain currently sits — there is no dedicated import diagram). Add the orchestrator + 4 loaders + `CanonicalSheetReader` there. This is a **doc blocker**, not optional. - The "delete the raw/ODS path + `XxeSafeXmlParser`" step removes the only XML-parsing surface in the importer — that simplification is worth an explicit line in ADR 025 ("Java no longer parses the raw spreadsheet; XXE surface eliminated"). - Keep `MassImportService.ImportStatus`/`SkippedFile`/`State` as the public contract verbatim; the orchestrator wraps them. I confirmed `admin/system/ImportStatusCard.svelte` reads `state`, `statusCode`, `processed`, `skipped` — changing the record breaks the admin UI silently through generated types. ### Open Decisions - **Artifact delivery mechanism (Decision #5).** This is a genuine topology choice and I lean hard to option (d): operator drops the four files into the existing `app.import.dir` mount and triggers via the admin endpoint — zero new infrastructure, reuses what exists. Committing a regenerated 740K `canonical-documents.xlsx` to git (option a) bloats history for a file that changes every normalizer run; reject it. The only thing that would move me off (d) is a concrete requirement that the normalizer runs in CI on a schedule with no operator in the loop — which I don't see stated. Confirm who runs the normalizer and how often.
Author
Owner

Felix Brandt — Senior Fullstack Developer

Observations

  • The TDD plan is the right shape: CanonicalSheetReader first (no DB), then four loaders with the owning service mocked under @ExtendWith(MockitoExtension.class), then routing, then the integration test on Testcontainers. Red-first per behavior is explicit. I endorse the sequence.
  • The resolver signature change is correct and necessary. I confirmed PersonService.findOrCreateByAlias (PersonService.java:85) is @Nullable and returns a single Person — it cannot express the pair-split ("Ella Anita" → two persons) or the "no person but keep raw text" cases. A record AttributionResult(List<Person> persons, String rawText) with persons of size 0/1/2 is the clean fit. The pair-split method belongs on PersonService, not the importer — agreed.
  • The lastName fallback warning is real. findOrCreateByAlias today dodges the non-null lastName constraint by using the alias as the lastName for INSTITUTION/GROUP (PersonService.java:94) and PersonNameParser.split for the rest. A genuinely new provisional single-token person needs the same explicit fallback or the insert throws DataIntegrityViolationException. Define it (token-as-lastName is the least surprising) and test it.
  • The "always retain raw cell even when linked" invariant is the load-bearing one. I confirmed there is no per-document split — mergePersons (PersonService.java:179) is the only cleanup path — so sender_text/receiver_text is the audit trail. Test matched sender → both sender set AND sender_text == raw cell explicitly.

Recommendations

  • Each loader's helpers stay ≤20 lines and do one thing: DocumentImporter with resolveFile(), uploadToS3(), buildDocument() is the right decomposition — the current importSingleDocument (MassImportService.java:375-458) is an 80-line method doing validate+transform+upload+persist; do not port that shape.
  • Use guard clauses for the routing categories, not a nested switch with side effects. Each category arm should be a named branch that produces an AttributionResult.
  • Use DomainException.internal/badRequest and the new IMPORT_ARTIFACT_INVALID code — never raw RuntimeException. Note the old code already violates this: readOds throws new RuntimeException("Ungültige ODS-Datei...") (MassImportService.java:211) and NoSpreadsheetException extends RuntimeException. Those die with the raw path, good.
  • Add the provisional == "True" test (capitalized Python bool) — Boolean.parseBoolean("True") returns true in Java, but a silent format drift to "yes"/"1" would parse to false with no error. Pin it with a test so it fails loudly.
  • Parameterized SLF4J everywhere: log.warn("Skipping artifact {}", name) not concatenation. The old code is mostly clean here already; keep it.
  • 4-step ErrorCode change is mandatory and easy to half-do: ErrorCode.java + errors.ts + getErrorMessage() case + i18n keys in messages/{de,en,es}.json. All four or the frontend shows the generic fallback.

Open Decisions (none)

  • The implementation choices above all have clear winners; the only genuinely open item is re-import precedence (Decision #1), which is a product call, not a code call — deferring to Sara/Elicit.
## Felix Brandt — Senior Fullstack Developer ### Observations - The TDD plan is the right shape: `CanonicalSheetReader` first (no DB), then four loaders with the owning service mocked under `@ExtendWith(MockitoExtension.class)`, then routing, then the integration test on Testcontainers. Red-first per behavior is explicit. I endorse the sequence. - The resolver signature change is correct and necessary. I confirmed `PersonService.findOrCreateByAlias` (`PersonService.java:85`) is `@Nullable` and returns a **single** `Person` — it cannot express the pair-split (`"Ella Anita"` → two persons) or the "no person but keep raw text" cases. A `record AttributionResult(List<Person> persons, String rawText)` with persons of size 0/1/2 is the clean fit. The pair-split method belongs on `PersonService`, not the importer — agreed. - The `lastName` fallback warning is real. `findOrCreateByAlias` today dodges the non-null `lastName` constraint by using the alias **as** the lastName for INSTITUTION/GROUP (`PersonService.java:94`) and `PersonNameParser.split` for the rest. A genuinely new provisional single-token person needs the same explicit fallback or the insert throws `DataIntegrityViolationException`. Define it (token-as-lastName is the least surprising) and test it. - The "always retain raw cell even when linked" invariant is the load-bearing one. I confirmed there is no per-document split — `mergePersons` (`PersonService.java:179`) is the only cleanup path — so `sender_text`/`receiver_text` is the audit trail. Test `matched sender → both sender set AND sender_text == raw cell` explicitly. ### Recommendations - Each loader's helpers stay ≤20 lines and do one thing: `DocumentImporter` with `resolveFile()`, `uploadToS3()`, `buildDocument()` is the right decomposition — the current `importSingleDocument` (`MassImportService.java:375-458`) is an 80-line method doing validate+transform+upload+persist; do not port that shape. - Use guard clauses for the routing categories, not a nested switch with side effects. Each category arm should be a named branch that produces an `AttributionResult`. - Use `DomainException.internal/badRequest` and the new `IMPORT_ARTIFACT_INVALID` code — never raw `RuntimeException`. Note the old code already violates this: `readOds` throws `new RuntimeException("Ungültige ODS-Datei...")` (`MassImportService.java:211`) and `NoSpreadsheetException extends RuntimeException`. Those die with the raw path, good. - Add the `provisional == "True"` test (capitalized Python bool) — `Boolean.parseBoolean("True")` returns `true` in Java, but a silent format drift to `"yes"`/`"1"` would parse to `false` with no error. Pin it with a test so it fails loudly. - Parameterized SLF4J everywhere: `log.warn("Skipping artifact {}", name)` not concatenation. The old code is mostly clean here already; keep it. - 4-step `ErrorCode` change is mandatory and easy to half-do: `ErrorCode.java` + `errors.ts` + `getErrorMessage()` case + i18n keys in `messages/{de,en,es}.json`. All four or the frontend shows the generic fallback. ### Open Decisions _(none)_ - The implementation choices above all have clear winners; the only genuinely open item is re-import precedence (Decision #1), which is a product call, not a code call — deferring to Sara/Elicit.
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Observations

  • The "port the guards before deleting the old method" instruction is exactly right and is the single most important security line in this issue. I read the three controls in MassImportService:
    • isValidImportFilename (:336-351) — rejects /, \, U+2215, U+FF0F, U+29F5 homoglyphs, .., ., null byte, and absolute paths. Solid.
    • findFileRecursive (:492-508) — canonical-path containment check (candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)). This is the real CWE-22 backstop.
    • isPdfMagicBytes (:358-367) — %PDF signature check before upload.
  • The file column now arriving from "our" Python normalizer does not lower the trust boundary. CWE-22 does not care about provenance — the canonical .xlsx is a file on a mounted volume an operator can hand-edit or an attacker with write access can poison. Treat the file value as hostile input. The issue states this correctly; hold the line in review.
  • Auth surface is correct today: AdminController is annotated @RequirePermission(Permission.ADMIN) at the class level (AdminController.java:20) and exposes POST /trigger-import + GET /import-status. The orchestrator must remain reachable only through this path — adding a second un-annotated entry point (e.g. a startup @PostConstruct auto-import, or a new controller) would be a privilege-bypass.

Recommendations

  • Port all three guards into DocumentImporter with their tests first (red), then delete the originals. Concrete regression test names: should_reject_path_traversal_in_file_column, should_reject_unicode_slash_homoglyph (assert all three: U+2215/U+FF0F/U+29F5), should_reject_absolute_path, should_reject_null_byte, should_reject_non_pdf_magic_bytes. The old suite has 64 @Test methods — do not let that count silently drop; diff it.
  • Defense in depth: validate the string with isValidImportFilename and keep the canonical-path containment on the resolved real path. Two layers; a bypass in one is caught by the other.
  • POI 5.5.0 / XXE: the issue says "POI disables DTDs by default — verify, don't assume." Correct posture. But note the bigger win — deleting readOds/XxeSafeXmlParser removes the hand-rolled DocumentBuilderFactory XML path entirely (MassImportService.java:206-252). The canonical .xlsx is read via POI WorkbookFactory; confirm POI's XMLReader factory has disallow-doctype-decl / external-entity features off, and add one test asserting an entity-laden workbook does not resolve.
  • Stored-XSS (low severity): sender_text/receiver_text now carry arbitrary cell content (?, Geschirr, markup-like prose). The frontend rendering issue is downstream, but flag here for traceability: render with {value} interpolation, never {@html}. Add a data fixture with <img src=x onerror=...> in a sender cell to prove it round-trips as inert text.
  • Log artifact filenames with parameterized SLF4J (log.warn("Rejected file {}", name)) — never concatenation. A poisoned filename containing log-forging characters should not be concatenated into a log line.

Open Decisions (none)

  • No genuine security tradeoff requiring product input. The guards-before-delete sequencing is non-negotiable, not a decision — if the ported tests are not green before deletion, the PR does not merge.
## Nora "NullX" Steiner — Application Security Engineer ### Observations - The "port the guards before deleting the old method" instruction is exactly right and is the single most important security line in this issue. I read the three controls in `MassImportService`: - `isValidImportFilename` (`:336-351`) — rejects `/`, `\`, U+2215, U+FF0F, U+29F5 homoglyphs, `..`, `.`, null byte, and absolute paths. Solid. - `findFileRecursive` (`:492-508`) — canonical-path containment check (`candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)`). This is the real CWE-22 backstop. - `isPdfMagicBytes` (`:358-367`) — `%PDF` signature check before upload. - The `file` column now arriving from "our" Python normalizer does **not** lower the trust boundary. CWE-22 does not care about provenance — the canonical `.xlsx` is a file on a mounted volume an operator can hand-edit or an attacker with write access can poison. Treat the `file` value as hostile input. The issue states this correctly; hold the line in review. - Auth surface is correct today: `AdminController` is annotated `@RequirePermission(Permission.ADMIN)` at the class level (`AdminController.java:20`) and exposes `POST /trigger-import` + `GET /import-status`. The orchestrator must remain reachable **only** through this path — adding a second un-annotated entry point (e.g. a startup `@PostConstruct` auto-import, or a new controller) would be a privilege-bypass. ### Recommendations - Port all three guards into `DocumentImporter` **with their tests first** (red), then delete the originals. Concrete regression test names: `should_reject_path_traversal_in_file_column`, `should_reject_unicode_slash_homoglyph` (assert all three: U+2215/U+FF0F/U+29F5), `should_reject_absolute_path`, `should_reject_null_byte`, `should_reject_non_pdf_magic_bytes`. The old suite has 64 `@Test` methods — do not let that count silently drop; diff it. - Defense in depth: validate the string with `isValidImportFilename` **and** keep the canonical-path containment on the resolved real path. Two layers; a bypass in one is caught by the other. - POI 5.5.0 / XXE: the issue says "POI disables DTDs by default — verify, don't assume." Correct posture. But note the bigger win — deleting `readOds`/`XxeSafeXmlParser` removes the hand-rolled `DocumentBuilderFactory` XML path entirely (`MassImportService.java:206-252`). The canonical `.xlsx` is read via POI `WorkbookFactory`; confirm POI's `XMLReader` factory has `disallow-doctype-decl` / external-entity features off, and add one test asserting an entity-laden workbook does not resolve. - Stored-XSS (low severity): `sender_text`/`receiver_text` now carry arbitrary cell content (`?`, `Geschirr`, markup-like prose). The frontend rendering issue is downstream, but flag here for traceability: render with `{value}` interpolation, never `{@html}`. Add a `data` fixture with `<img src=x onerror=...>` in a sender cell to prove it round-trips as inert text. - Log artifact filenames with parameterized SLF4J (`log.warn("Rejected file {}", name)`) — never concatenation. A poisoned filename containing log-forging characters should not be concatenated into a log line. ### Open Decisions _(none)_ - No genuine security tradeoff requiring product input. The guards-before-delete sequencing is non-negotiable, not a decision — if the ported tests are not green before deletion, the PR does not merge.
Author
Owner

Sara Holt — Senior QA Engineer

Observations

  • The Gherkin is unusually complete — 11 scenarios covering ordering, id-resolution, raw retention, pair-split, prose/noise, collective/institution, file-vs-placeholder, idempotency, path traversal, fail-closed, and the no-Java-semantics rule. That maps almost 1:1 to a test plan. Good.
  • The integration-test prescription is correct on the points that matter: Testcontainers postgres:16-alpine (never H2 — UNIQUE(source_ref) + the upsert conflict only exist in real Postgres), and Awaitility on ImportStatus (never Thread.sleep) because the orchestrator runs under @Async. I endorse both as gates.
  • The idempotency acceptance test is blocked until re-import precedence (Decision #1) is chosen. The issue itself flags this. Upsert-overwrite and upsert-preserve are different assertions: overwrite asserts an in-app edit is gone after re-import; preserve asserts it survives. You cannot write "Then existing rows are updated in place" without picking one.

Recommendations

  • Coverage gate correction (factual). The issue and backend/CLAUDE.md both say "88% branch JaCoCo gate." The actual gate in backend/pom.xml:321 is BRANCH COVEREDRATIO minimum 0.77 (ratcheted down per #496, with a comment saying so). Test the implementation against the real 0.77 gate; do not over-invest chasing a phantom 88%. Worth fixing the stale "88%" in both docs while you are in here.
  • Per-layer test plan:
    • Unit (Mockito, <10s): CanonicalSheetReader (header present / missing-throws / |-split / empty→"" / single value); one test class per loader with the owning service mocked; one named idempotency unit per loader (should_update_person_in_place_when_source_ref_exists); one routing test per category arm (8 arms) + the always-retain-raw invariant.
    • Integration (Testcontainers, <2min): the full four-artifact run, snapshot persons/tag/documents counts, re-run, assert counts identical, then mutate one field in-app + re-import + assert per Decision #1.
    • Security regression (ported): the path-traversal / homoglyph / absolute-path / PDF-magic-byte tests must land before deletion. Diff the @Test count — the old suite has 64; a silent drop is a coverage regression.
  • Add the missing edge fixtures the Gherkin implies but doesn't enumerate: empty receiver_person_ids with non-empty receiver_names; a file column pointing at a present-but-non-PDF file (→ skipped, not failed); a document index collision on re-import (→ update, not duplicate); an artifact present but with zero data rows (→ DONE with processed=0, not FAILED).
  • "Malformed artifact fails closed" needs a sharp boundary: a missing header fails the whole import (FAILED), but a single bad file only adds a SkippedFile. Test both so the distinction can't erode — they are different failure modes and the issue is explicit that skip-and-continue is file-level only.

Open Decisions

  • Re-import precedence (Decision #1) is a testability blocker, not just a design choice. I cannot author the idempotency acceptance test until overwrite-vs-preserve is decided. Restating for the queue with the QA framing: this blocks one of the eleven acceptance scenarios from being written at all.
## Sara Holt — Senior QA Engineer ### Observations - The Gherkin is unusually complete — 11 scenarios covering ordering, id-resolution, raw retention, pair-split, prose/noise, collective/institution, file-vs-placeholder, idempotency, path traversal, fail-closed, and the no-Java-semantics rule. That maps almost 1:1 to a test plan. Good. - The integration-test prescription is correct on the points that matter: Testcontainers `postgres:16-alpine` (never H2 — `UNIQUE(source_ref)` + the upsert conflict only exist in real Postgres), and Awaitility on `ImportStatus` (never `Thread.sleep`) because the orchestrator runs under `@Async`. I endorse both as gates. - The idempotency acceptance test is **blocked** until re-import precedence (Decision #1) is chosen. The issue itself flags this. Upsert-overwrite and upsert-preserve are different assertions: overwrite asserts an in-app edit is gone after re-import; preserve asserts it survives. You cannot write "Then existing rows are updated in place" without picking one. ### Recommendations - **Coverage gate correction (factual).** The issue and `backend/CLAUDE.md` both say "88% branch JaCoCo gate." The actual gate in `backend/pom.xml:321` is `BRANCH COVEREDRATIO minimum 0.77` (ratcheted down per #496, with a comment saying so). Test the implementation against the real 0.77 gate; do not over-invest chasing a phantom 88%. Worth fixing the stale "88%" in both docs while you are in here. - Per-layer test plan: - **Unit (Mockito, <10s):** `CanonicalSheetReader` (header present / missing-throws / `|`-split / empty→"" / single value); one test class per loader with the owning service mocked; one named idempotency unit per loader (`should_update_person_in_place_when_source_ref_exists`); one routing test per category arm (8 arms) + the always-retain-raw invariant. - **Integration (Testcontainers, <2min):** the full four-artifact run, snapshot `persons`/`tag`/`documents` counts, re-run, assert counts identical, then mutate one field in-app + re-import + assert per Decision #1. - **Security regression (ported):** the path-traversal / homoglyph / absolute-path / PDF-magic-byte tests must land **before** deletion. Diff the `@Test` count — the old suite has 64; a silent drop is a coverage regression. - Add the missing edge fixtures the Gherkin implies but doesn't enumerate: empty `receiver_person_ids` with non-empty `receiver_names`; a `file` column pointing at a present-but-non-PDF file (→ skipped, not failed); a document `index` collision on re-import (→ update, not duplicate); an artifact present but with **zero data rows** (→ DONE with processed=0, not FAILED). - "Malformed artifact fails closed" needs a sharp boundary: a missing **header** fails the whole import (`FAILED`), but a single bad **file** only adds a `SkippedFile`. Test both so the distinction can't erode — they are different failure modes and the issue is explicit that skip-and-continue is file-level only. ### Open Decisions - **Re-import precedence (Decision #1) is a testability blocker, not just a design choice.** I cannot author the idempotency acceptance test until overwrite-vs-preserve is decided. Restating for the queue with the QA framing: this blocks one of the eleven acceptance scenarios from being written at all.
Author
Owner

Elicit — Requirements Engineer & Business Analyst

I am in Brownfield mode. This is a well-specified issue — INVEST-compliant, 11 Gherkin scenarios, explicit dependencies and out-of-scope. My job is to hunt the remaining ambiguities and contradictions before they become rework.

Observations

  • Testable, Estimable, Valuable — all pass. The story has clear acceptance criteria and a stated benefit ("dates/people/tags import correctly and re-runs are idempotent").
  • Independent — fails by design: it cannot compile until #670 and #671 land. That is acknowledged and the merge order is stated. Acceptable, but it means this issue cannot be picked up in isolation — flag for sequencing.
  • The four enumerated Open Decisions are genuine tradeoffs, correctly flagged with needs-discussion. Two of them (#1 precedence, #3 object-noise) are acceptance-criteria blockers, not just nice-to-haves — they change which scenarios this issue can claim as done.

Ambiguities / contradictions I must surface

  • "Idempotent" is under-specified (Decision #1). "Upsert by source_ref" does not say what happens to a human-edited field on re-import. The deciding question is sequencing: does the operator re-run the normalizer before humans edit in-app (overwrite is safe) or after (preserve is mandatory)? This is unresolved and blocks the idempotency AC.
  • "Object-noise → no person" is not satisfiable from categories (Decision #3). Geschirr / Bierbecher are single_token to the normalizer — identical to Clara. So the AC "prose/noise/? creates no person" silently does not catch object-noise; object-noise will become a provisional single-token person unless a demotion mechanism exists. Either add a deterministic override list in the normalizer overrides/ (recommended — testable, light upkeep), or explicitly drop the object-noise claim from this issue's AC. Right now the issue implies a guarantee it cannot honor.
  • relational has no acceptance criterion (Decision #4). Schwester Hanni (×41), Tante Tüten (×11) — 73 occurrences — have no Gherkin scenario and no defined "confident match" threshold. Either add a scenario (conservative: exact register-alias match else fall through to raw text) or mark relational explicitly out of scope for this issue. As written it is a silent gap.
  • Hidden assumption in Decision #2 (Option A vs B). The recommended Option A (no person, keep raw text) assumes the owner will never want a triage worklist of unresolved senders. If the owner does want a "who is this?" cleanup queue, Option B (UNKNOWN provisional per string) is required — but that re-introduces the noise this rebuild exists to remove. This needs the owner's explicit confirmation, not a developer's default.

Recommendations

  • Resolve Decisions #1 and #3 before implementation starts — they are the two that gate acceptance criteria. #2 and #4 can be resolved in parallel but must be locked before the routing rule is written.
  • For #4, default to "out of scope, tracked as follow-up" unless the owner says relational matching is needed now. Scope discipline beats gold-plating.
  • NFR check for this issue: Observability — the import is operator-triggered and async; confirm the ImportStatus statusCode + skippedFiles give the operator enough to diagnose a failed run without reading server logs. Data retention — re-import precedence (#1) is effectively a data-retention policy for human corrections; treat it as such.

Open Decisions

  • #1 Re-import precedence (overwrite vs preserve) — blocks the idempotency AC. (also raised by issue, Sara)
  • #2 Name policy A vs B for prose/noise/? — needs owner confirmation, not a default. (raised by #665 author)
  • #3 Object-noise demotion (override list vs defer vs accept-as-provisional) — changes which AC this issue can claim.
  • #4 relational threshold or out-of-scope — no scenario exists; decide before locking routing.
## Elicit — Requirements Engineer & Business Analyst I am in Brownfield mode. This is a well-specified issue — INVEST-compliant, 11 Gherkin scenarios, explicit dependencies and out-of-scope. My job is to hunt the remaining ambiguities and contradictions before they become rework. ### Observations - **Testable, Estimable, Valuable** — all pass. The story has clear acceptance criteria and a stated benefit ("dates/people/tags import correctly and re-runs are idempotent"). - **Independent** — fails by design: it cannot compile until #670 and #671 land. That is acknowledged and the merge order is stated. Acceptable, but it means this issue cannot be picked up in isolation — flag for sequencing. - The four enumerated Open Decisions are genuine tradeoffs, correctly flagged with `needs-discussion`. Two of them (#1 precedence, #3 object-noise) are **acceptance-criteria blockers**, not just nice-to-haves — they change which scenarios this issue can claim as done. ### Ambiguities / contradictions I must surface - **"Idempotent" is under-specified (Decision #1).** "Upsert by source_ref" does not say what happens to a human-edited field on re-import. The deciding question is sequencing: does the operator re-run the normalizer *before* humans edit in-app (overwrite is safe) or *after* (preserve is mandatory)? This is unresolved and blocks the idempotency AC. - **"Object-noise → no person" is not satisfiable from categories (Decision #3).** `Geschirr` / `Bierbecher` are `single_token` to the normalizer — identical to `Clara`. So the AC "prose/noise/`?` creates no person" silently does **not** catch object-noise; object-noise will become a provisional single-token person unless a demotion mechanism exists. Either add a deterministic override list in the normalizer `overrides/` (recommended — testable, light upkeep), or explicitly drop the object-noise claim from this issue's AC. Right now the issue implies a guarantee it cannot honor. - **`relational` has no acceptance criterion (Decision #4).** `Schwester Hanni` (×41), `Tante Tüten` (×11) — 73 occurrences — have no Gherkin scenario and no defined "confident match" threshold. Either add a scenario (conservative: exact register-alias match else fall through to raw text) or mark `relational` explicitly **out of scope** for this issue. As written it is a silent gap. - **Hidden assumption in Decision #2 (Option A vs B).** The recommended Option A (no person, keep raw text) assumes the owner will never want a triage worklist of unresolved senders. If the owner *does* want a "who is this?" cleanup queue, Option B (UNKNOWN provisional per string) is required — but that re-introduces the noise this rebuild exists to remove. This needs the owner's explicit confirmation, not a developer's default. ### Recommendations - Resolve Decisions #1 and #3 **before** implementation starts — they are the two that gate acceptance criteria. #2 and #4 can be resolved in parallel but must be locked before the routing rule is written. - For #4, default to "out of scope, tracked as follow-up" unless the owner says relational matching is needed now. Scope discipline beats gold-plating. - NFR check for this issue: **Observability** — the import is operator-triggered and async; confirm the `ImportStatus` `statusCode` + `skippedFiles` give the operator enough to diagnose a failed run without reading server logs. **Data retention** — re-import precedence (#1) is effectively a data-retention policy for human corrections; treat it as such. ### Open Decisions - **#1 Re-import precedence** (overwrite vs preserve) — blocks the idempotency AC. *(also raised by issue, Sara)* - **#2 Name policy A vs B** for prose/noise/`?` — needs owner confirmation, not a default. *(raised by #665 author)* - **#3 Object-noise demotion** (override list vs defer vs accept-as-provisional) — changes which AC this issue can claim. - **#4 `relational` threshold or out-of-scope** — no scenario exists; decide before locking routing.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Observations

  • The async runner + ImportStatus state machine (IDLE/RUNNING/DONE/FAILED) is being kept verbatim — good, that is the operator's only window into a long-running, fire-and-forget job. I confirmed the operator path is POST /api/admin/trigger-importGET /api/admin/import-status polled by ImportStatusCard.svelte. No new infra needed to run this.
  • The smoke-check (all four artifacts present → else FAILED fast) is operationally the right behavior. A half-run that loads tags but no documents would leave the operator with a silently-inconsistent DB and no clear signal. Fail-closed is correct.
  • The integration test mandates Testcontainers postgres:16-alpine — that matches what already runs in CI for this stack. No new CI service required; this slots into the existing integration job.

Recommendations

  • Artifact delivery: reuse app.import.dir, add zero infra (this is Decision #5, and I lean (d)). The four canonical files drop into the existing /import mount and the operator triggers the import exactly as today. Concretely:
    • Do not commit out/*.xlsx to git (option a). canonical-documents.xlsx is 740K and regenerated every normalizer run; it is currently gitignored for good reason. Committing it bloats history and creates merge churn on a binary.
    • CI-generated build artifact (option b) only makes sense if the normalizer runs in CI on a schedule — which requires a Python runner step and an artifact-passing stage we do not have. Adds a failure mode for no stated benefit.
    • Admin UI upload (option c) adds a multi-file upload endpoint + validation + storage wiring — real surface for a four-times-a-year operation. Reject unless the operator genuinely cannot reach the mount.
  • Document the runbook in docs/DEPLOYMENT.md as the issue requires: (1) run the normalizer, (2) place the four artifacts in app.import.dir, (3) trigger import via admin, (4) watch ImportStatus. This is a doc blocker — an import with an undocumented prerequisite sequence is an incident waiting to happen.
  • Add the smoke-check artifact filenames to the failure statusCode/message so the operator sees which artifact was missing without grepping logs. "Import failed: canonical-documents.xlsx not found in /import" beats "IMPORT_FAILED_INTERNAL."
  • One operational note on retiring the raw path: ImportStatusCard.svelte:14 branches on statusCode === 'IMPORT_FAILED_NO_SPREADSHEET'. When the ODS/raw path and NoSpreadsheetException are deleted, that statusCode disappears — make sure the frontend branch is updated or repurposed (e.g. to IMPORT_ARTIFACT_INVALID) so the operator still gets a meaningful "artifacts missing" message instead of a generic failure.

Open Decisions

  • Artifact delivery mechanism (#5). My recommendation is (d) drop-in mount + existing trigger — zero new infra, ~23 EUR/month bill unchanged. The only thing that flips this to (b) CI artifact is a requirement that the normalizer runs unattended on a schedule. That depends on who runs the normalizer and how often — needs the owner's answer. (also raised by Markus)
## Tobias Wendt — DevOps & Platform Engineer ### Observations - The async runner + `ImportStatus` state machine (`IDLE/RUNNING/DONE/FAILED`) is being kept verbatim — good, that is the operator's only window into a long-running, fire-and-forget job. I confirmed the operator path is `POST /api/admin/trigger-import` → `GET /api/admin/import-status` polled by `ImportStatusCard.svelte`. No new infra needed to run this. - The smoke-check (all four artifacts present → else `FAILED` fast) is operationally the right behavior. A half-run that loads tags but no documents would leave the operator with a silently-inconsistent DB and no clear signal. Fail-closed is correct. - The integration test mandates Testcontainers `postgres:16-alpine` — that matches what already runs in CI for this stack. No new CI service required; this slots into the existing integration job. ### Recommendations - **Artifact delivery: reuse `app.import.dir`, add zero infra (this is Decision #5, and I lean (d)).** The four canonical files drop into the existing `/import` mount and the operator triggers the import exactly as today. Concretely: - **Do not** commit `out/*.xlsx` to git (option a). `canonical-documents.xlsx` is 740K and regenerated every normalizer run; it is currently gitignored for good reason. Committing it bloats history and creates merge churn on a binary. - CI-generated build artifact (option b) only makes sense if the normalizer runs in CI on a schedule — which requires a Python runner step and an artifact-passing stage we do not have. Adds a failure mode for no stated benefit. - Admin UI upload (option c) adds a multi-file upload endpoint + validation + storage wiring — real surface for a four-times-a-year operation. Reject unless the operator genuinely cannot reach the mount. - Document the runbook in `docs/DEPLOYMENT.md` as the issue requires: (1) run the normalizer, (2) place the four artifacts in `app.import.dir`, (3) trigger import via admin, (4) watch `ImportStatus`. This is a **doc blocker** — an import with an undocumented prerequisite sequence is an incident waiting to happen. - Add the smoke-check artifact filenames to the failure `statusCode`/message so the operator sees *which* artifact was missing without grepping logs. "Import failed: canonical-documents.xlsx not found in /import" beats "IMPORT_FAILED_INTERNAL." - One operational note on retiring the raw path: `ImportStatusCard.svelte:14` branches on `statusCode === 'IMPORT_FAILED_NO_SPREADSHEET'`. When the ODS/raw path and `NoSpreadsheetException` are deleted, that statusCode disappears — make sure the frontend branch is updated or repurposed (e.g. to `IMPORT_ARTIFACT_INVALID`) so the operator still gets a meaningful "artifacts missing" message instead of a generic failure. ### Open Decisions - **Artifact delivery mechanism (#5).** My recommendation is (d) drop-in mount + existing trigger — zero new infra, ~23 EUR/month bill unchanged. The only thing that flips this to (b) CI artifact is a requirement that the normalizer runs unattended on a schedule. That depends on **who runs the normalizer and how often** — needs the owner's answer. *(also raised by Markus)*
Author
Owner

Leonie Voss — UX Designer & Accessibility Strategist

Observations

  • This is a backend/data-pipeline issue with no new UI surface. The issue is explicit that the ImportStatus/SkippedFile shape must stay verbatim because admin/system/ImportStatusCard.svelte consumes it via generated types — so the existing admin import card is the only UI touchpoint, and it is intentionally untouched. I verified that card reads state, statusCode, processed, and skipped.
  • The full provisional-person visual treatment and the directory filter are explicitly out of scope (dependent UI issue), and date-precision rendering is #667/#668. Correctly deferred — no UI gold-plating here.

Recommendations

  • The one UX-adjacent item that lives in this issue is the escape-on-render rule for sender_text/receiver_text. Those columns now carry arbitrary, messy cell content (?, Geschirr, prose, markup-like fragments). When the dependent UI issue renders them, it must use plain {value} interpolation — never {@html}. This is both a stored-XSS guard (Nora's point) and a correctness point: a literal ? or < must show as typed, not be interpreted. Worth a one-line note carried forward into the consuming UI issue so it isn't lost.
  • If the operator-facing failure message changes (e.g. the new IMPORT_ARTIFACT_INVALID code), make sure the admin card surfaces a human-readable, localized message via getErrorMessage() rather than a raw code — the admin is still a person who needs to know which artifact is missing and what to do. Pair any new failure state with the i18n keys the issue already calls for in messages/{de,en,es}.json.

Open Decisions (none)

  • No UI decisions for this issue. I checked the only touchpoint (the admin import card) and the deferred-UI boundaries; both are sound. The escape-on-render note is a concrete recommendation to carry into the consuming UI issue, not a decision needing input.
## Leonie Voss — UX Designer & Accessibility Strategist ### Observations - This is a backend/data-pipeline issue with no new UI surface. The issue is explicit that the `ImportStatus`/`SkippedFile` shape must stay verbatim **because** `admin/system/ImportStatusCard.svelte` consumes it via generated types — so the existing admin import card is the only UI touchpoint, and it is intentionally untouched. I verified that card reads `state`, `statusCode`, `processed`, and `skipped`. - The full provisional-person visual treatment and the directory filter are explicitly out of scope (dependent UI issue), and date-precision rendering is #667/#668. Correctly deferred — no UI gold-plating here. ### Recommendations - The one UX-adjacent item that lives in this issue is the **escape-on-render** rule for `sender_text`/`receiver_text`. Those columns now carry arbitrary, messy cell content (`?`, `Geschirr`, prose, markup-like fragments). When the dependent UI issue renders them, it must use plain `{value}` interpolation — never `{@html}`. This is both a stored-XSS guard (Nora's point) and a correctness point: a literal `?` or `<` must show as typed, not be interpreted. Worth a one-line note carried forward into the consuming UI issue so it isn't lost. - If the operator-facing failure message changes (e.g. the new `IMPORT_ARTIFACT_INVALID` code), make sure the admin card surfaces a human-readable, localized message via `getErrorMessage()` rather than a raw code — the admin is still a person who needs to know *which* artifact is missing and *what to do*. Pair any new failure state with the i18n keys the issue already calls for in `messages/{de,en,es}.json`. ### Open Decisions _(none)_ - No UI decisions for this issue. I checked the only touchpoint (the admin import card) and the deferred-UI boundaries; both are sound. The escape-on-render note is a concrete recommendation to carry into the consuming UI issue, not a decision needing input.
Author
Owner

Decision Queue — Action Required

5 decisions need your input before implementation starts. Two of them (#1, #3) gate acceptance criteria — the relevant tests cannot be written until they are resolved.

Data / Requirements

  • #1 — Re-import precedence: overwrite vs preserve human edits. "Upsert by source_ref" is under-specified. Overwrite = normalizer is source of truth, simplest, but every in-app correction is destroyed on re-import. Preserve = human edits win, but re-import can no longer fix a wrongly-corrected field and you need per-field provenance (much more code). Deciding question: does the operator re-run the normalizer before humans edit in-app (overwrite is safe) or after (preserve mandatory)? Blocks the idempotency acceptance test — it cannot be written until this is chosen. (Raised by: issue, Sara, Elicit)
  • #3 — Object-noise demotion. Geschirr/Bierbecher are single_token to the normalizer — indistinguishable from Clara. So "object-noise → no person" is not satisfiable from categories alone. Options: (a) deterministic manual demotion list in the normalizer overrides/ (recommended — testable, light upkeep); (b) defer to a heuristic follow-up and drop the object-noise AC from this issue; (c) accept object-noise becomes a provisional person cleaned manually (contradicts the no-junk goal). Changes which AC this issue can claim. (Raised by: Elicit)
  • #2 — Name policy A vs B for prose / noise / ?. A (recommended): create no person, keep the raw cell in sender_text/receiver_text — pristine register, no triage worklist. B: create an UNKNOWN provisional person per string — fully reversible, gives a "who is this?" worklist, but re-introduces the noise this rebuild removes. Needs your explicit confirmation, not a developer default, because it depends on whether you want a cleanup worklist. (Raised by: #665 author, Elicit)
  • #4relational threshold, or mark out-of-scope. Schwester Hanni (×41), Tante Tüten (×11), 73 occurrences total — no Gherkin scenario and no defined "confident match" rule. Either add a scenario (conservative: exact register-alias match else fall through to raw text) or explicitly mark relational out of scope for this issue and track as a follow-up. Default recommendation: out-of-scope unless you need it now. (Raised by: Elicit)

Infrastructure

  • #5 — Artifact delivery mechanism at import time. (a) commit out/*.xlsx to git — rejected by both Markus and Tobias (740K binary, regenerated every run, currently gitignored, bloats history); (b) generate in CI as a build artifact — only justified if the normalizer runs unattended on a schedule; (c) admin uploads via UI — adds upload endpoint + validation surface for a rare operation; (d) operator drops the four files into the existing app.import.dir mount and triggers via the admin endpoint — zero new infra, reuses what exists (Markus's and Tobias's lean). Deciding question: who runs the normalizer, and how often? (Raised by: Markus, Tobias)
## Decision Queue — Action Required _5 decisions need your input before implementation starts. Two of them (#1, #3) gate acceptance criteria — the relevant tests cannot be written until they are resolved._ ### Data / Requirements - **#1 — Re-import precedence: overwrite vs preserve human edits.** "Upsert by `source_ref`" is under-specified. *Overwrite* = normalizer is source of truth, simplest, but every in-app correction is destroyed on re-import. *Preserve* = human edits win, but re-import can no longer fix a wrongly-corrected field and you need per-field provenance (much more code). Deciding question: does the operator re-run the normalizer **before** humans edit in-app (overwrite is safe) or **after** (preserve mandatory)? **Blocks the idempotency acceptance test — it cannot be written until this is chosen.** _(Raised by: issue, Sara, Elicit)_ - **#3 — Object-noise demotion.** `Geschirr`/`Bierbecher` are `single_token` to the normalizer — indistinguishable from `Clara`. So "object-noise → no person" is **not satisfiable from categories alone**. Options: (a) deterministic manual demotion list in the normalizer `overrides/` (recommended — testable, light upkeep); (b) defer to a heuristic follow-up and drop the object-noise AC from this issue; (c) accept object-noise becomes a provisional person cleaned manually (contradicts the no-junk goal). **Changes which AC this issue can claim.** _(Raised by: Elicit)_ - **#2 — Name policy A vs B for prose / noise / `?`.** **A (recommended):** create no person, keep the raw cell in `sender_text`/`receiver_text` — pristine register, no triage worklist. **B:** create an `UNKNOWN` provisional person per string — fully reversible, gives a "who is this?" worklist, but re-introduces the noise this rebuild removes. Needs your explicit confirmation, not a developer default, because it depends on whether you want a cleanup worklist. _(Raised by: #665 author, Elicit)_ - **#4 — `relational` threshold, or mark out-of-scope.** `Schwester Hanni` (×41), `Tante Tüten` (×11), 73 occurrences total — no Gherkin scenario and no defined "confident match" rule. Either add a scenario (conservative: exact register-alias match else fall through to raw text) or explicitly mark `relational` **out of scope** for this issue and track as a follow-up. Default recommendation: out-of-scope unless you need it now. _(Raised by: Elicit)_ ### Infrastructure - **#5 — Artifact delivery mechanism at import time.** (a) commit `out/*.xlsx` to git — rejected by both Markus and Tobias (740K binary, regenerated every run, currently gitignored, bloats history); (b) generate in CI as a build artifact — only justified if the normalizer runs unattended on a schedule; (c) admin uploads via UI — adds upload endpoint + validation surface for a rare operation; (d) **operator drops the four files into the existing `app.import.dir` mount and triggers via the admin endpoint — zero new infra, reuses what exists (Markus's and Tobias's lean).** Deciding question: **who runs the normalizer, and how often?** _(Raised by: Markus, Tobias)_
marcel removed the needs-discussion label 2026-05-27 07:49:56 +02:00
Sign in to join this conversation.
No Label P0-critical feature
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#669