As the archive owner I want the importer rebuilt as modular loaders over the normalizer's canonical exports, so dates/people/tags import correctly and re-runs are idempotent #669
Context — Phase 3 of the import rebuild
The legacy `backend/.../importing/MassImportService.java` (509 lines) reads the raw original spreadsheet via Apache POI by positional column index (`@Value app.import.col.sender:3`, `col.date:7`, …) and re-derives everything in Java: `parseDate` is ISO-only (`MassImportService.java:468`), name resolution goes through `PersonService.findOrCreateByAlias`, and tags go through `TagService.findOrCreate`. The normalizer's clean output is never touched. The semantic transformation has already been proven in the Python normalizer (`tools/import-normalizer/`); this issue makes the backend consume that output and retires the raw path.

This is Phase 3 of a three-phase rebuild:

- Phase 1 (#670): the normalizer gains a `file` column on the document sheet and a stable `person_id` inside `canonical-persons-tree.json` so the tree reconciles with the register.
- Phase 2 (#671): the schema gains `source_ref` (persons + tag), `provisional` (persons), and the date precision / attribution columns (`sender_text`, `receiver_text`, `date_precision`) on `documents`.

This issue adds NO Flyway migration; all schema lives in #671. It depends on #670 and #671 and compiles only after both land (`DocumentImporter` references `DatePrecision`, `source_ref`, `sender_text`/`receiver_text`, which do not exist in the codebase until #671). Sequence: #670 + #671 → #669.

Canonical artifacts consumed (produced by Phase 1):

- `canonical-tag-tree.xlsx`: `tag_path, parent_name, tag_name`
- `canonical-persons.xlsx`: `person_id, last_name, first_name, maiden_name, …, provisional`
- `canonical-persons-tree.json`: `{persons:[{rowId, person_id, firstName, lastName, …}], relationships:[…]}`
- `canonical-documents.xlsx`: `index, box, folder, sender_person_id, sender_name, sender_category, receiver_person_ids, receiver_names, file, date_iso, date_raw, date_precision, tags, summary, source_row, needs_review`

Module layout
Ordering is a real dependency DAG (documents → persons + tags; tree → persons), not a preference: encode it explicitly and by name in the orchestrator, not implicitly in call order. Each loader calls the owning domain's service, never a repository (layering rule). `PersonTreeImporter` must call the relationship service, never `PersonRelationshipRepository`. Keep the existing async runner, the `ImportStatus` state machine (`IDLE`/`RUNNING`/`DONE`/`FAILED`) and the `SkippedFile` shape verbatim: `admin/system/ImportStatusCard.svelte` consumes `{state, statusCode, processed, skippedFiles, skipped}` via generated types, so changing it breaks the admin UI. Wrap the orchestrator inside the existing async runner.

What changes (and what does NOT)
[…] `.xlsx`/`.json`, maps by header name (replacing the brittle positional `@Value` indices), splits the pipe-`|`-delimited list columns, and converts clean values: `LocalDate.parse(date_iso)`, `DatePrecision.valueOf(date_precision)` (from #671), `Boolean.parseBoolean(provisional)`.

[…] `ThumbnailAsyncRunner` dispatch stay in `DocumentImporter`.

File-level breakdown
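As an illustration of the header-name mapping and value conversions listed under "What changes", here is a minimal sketch. It is not the project's `CanonicalSheetReader`: the class `CanonicalRowSketch`, its method names, the sample row values, and the use of plain `List<String>` rows instead of POI cells are all assumptions, and `IllegalArgumentException` stands in for `DomainException.badRequest`.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; a real reader would fill headerRow and row from POI cells.
final class CanonicalRowSketch {

    /** Maps each header name to its column index, so lookups no longer depend on position. */
    static Map<String, Integer> headerIndex(List<String> headerRow) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < headerRow.size(); i++) {
            index.put(headerRow.get(i).trim(), i);
        }
        return index;
    }

    /** Returns the cell under a required header; throws instead of hitting a null index. */
    static String cell(Map<String, Integer> index, List<String> row, String header) {
        Integer col = index.get(header);
        if (col == null) {
            // Stand-in for DomainException.badRequest(...) named in the issue.
            throw new IllegalArgumentException("Missing required header: " + header);
        }
        return col < row.size() ? row.get(col) : "";
    }

    /** Splits a pipe-delimited list cell; an empty cell yields an empty list (assumed behavior). */
    static List<String> splitPipe(String cell) {
        List<String> parts = new ArrayList<>();
        if (cell == null || cell.isBlank()) {
            return parts;
        }
        for (String part : cell.split("\\|")) {
            if (!part.isBlank()) {
                parts.add(part.trim());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sample values are invented for the example.
        List<String> header = List.of("index", "date_iso", "receiver_person_ids", "provisional");
        List<String> row = List.of("17", "1923-04-05", "p-a|p-b", "True");
        Map<String, Integer> idx = headerIndex(header);

        LocalDate date = LocalDate.parse(cell(idx, row, "date_iso"));
        List<String> receivers = splitPipe(cell(idx, row, "receiver_person_ids"));
        boolean provisional = Boolean.parseBoolean(cell(idx, row, "provisional"));
        System.out.println(date + " " + receivers + " " + provisional);
    }
}
```

Looking cells up by header name means a reordered column in the artifact does not silently shift values, and a missing header surfaces as an explicit error rather than a null index.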
Backend: new `importing` sub-structure

- `CanonicalImportOrchestrator`: replaces `MassImportService`'s monolithic `processRows`; keeps `runImportAsync`, `ImportStatus`, `SkippedFile`. Smoke-checks that all four expected artifacts are present before starting; fails fast with `ImportStatus.FAILED` rather than a half-run that loads tags but no documents.
- `TagTreeImporter`, `PersonRegisterImporter`, `PersonTreeImporter`, `DocumentImporter`: one class each.
- `CanonicalSheetReader`: a value-level POI helper (no Spring, no domain knowledge). Workbook in, header-name → column index map + `|`-split helper, typed rows out. This is the seam that replaces positional `@Value app.import.col.*`. Throws a `DomainException.badRequest` on a missing required header (never an NPE on a null index).
- `DocumentImporter` keeps file/S3/thumbnail plumbing in small ≤20-line methods: `resolveFile()`, `uploadToS3()`, `buildDocument()`.
- Removed: the `@Value app.import.col.*` indices, the ISO-only `parseDate`, the Java name-classification path, and the raw-spreadsheet / ODS path (`XxeSafeXmlParser`, `NoSpreadsheetException`) once loaders cover them.

Error handling: new loaders use `DomainException.internal`/`badRequest` (not raw `RuntimeException`), likely with a new `ErrorCode IMPORT_ARTIFACT_INVALID` (4-step change: `ErrorCode.java` + `errors.ts` + `getErrorMessage()` case + i18n keys in `messages/{de,en,es}.json`). Fail closed on a malformed artifact (throw, set `FAILED`); skip-and-continue is only for an individual bad file via the existing `SkippedFile` mechanism. Log artifact filenames with parameterized SLF4J, never concatenation.

Docs (blockers): `docs/architecture/db/` diagrams reflect #671's columns (owned there, not here); new backend classes in `importing/` → the `l3-backend-*` diagram; new terms → `docs/GLOSSARY.md`; an ADR ("importer consumes the normalizer's canonical artifacts; the raw spreadsheet is no longer parsed by Java", next clean number `025`); and `docs/DEPLOYMENT.md` gains the import prerequisite step (run normalizer → place artifacts → trigger import).

API: run `npm run generate:api` after any model/endpoint touch.

Idempotency & re-import
Every loader is idempotent: persons and tags upsert by `source_ref` (the normalizer `person_id`/`tag_path`, unique+indexed per #671), documents upsert by `index`, never blind insert. Re-running the import after re-running the normalizer never duplicates persons, tags, or documents. The `UNIQUE(source_ref)` constraint (in #671) makes the upsert atomic at the DB layer.

"Idempotent" is under-specified on its own (review finding): upsert must define a precedence rule, see Resolved decisions #1. The acceptance test for re-import cannot be written until that rule is chosen (upsert-overwrite vs upsert-preserve are different code and different assertions).
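To make the difference between the two precedence rules concrete, the following in-memory sketch shows upsert-preserve keyed by `source_ref`. Everything in it is hypothetical (`UpsertSketch`, `PersonRow`, the `humanTouched` set, the field names); the real upsert would run against Postgres through the owning service and rely on the `UNIQUE(source_ref)` constraint.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; it only illustrates the precedence rule.
final class UpsertSketch {

    /** Stand-in for a person row keyed by source_ref. */
    static final class PersonRow {
        final String sourceRef;
        final Map<String, String> fields = new HashMap<>();
        final Set<String> humanTouched = new HashSet<>();

        PersonRow(String sourceRef) {
            this.sourceRef = sourceRef;
        }
    }

    private final Map<String, PersonRow> bySourceRef = new HashMap<>();

    /** Upsert-preserve: insert when absent, otherwise update only fields no human has edited. */
    PersonRow upsert(String sourceRef, Map<String, String> canonical) {
        PersonRow row = bySourceRef.computeIfAbsent(sourceRef, PersonRow::new);
        for (Map.Entry<String, String> e : canonical.entrySet()) {
            if (!row.humanTouched.contains(e.getKey())) {
                row.fields.put(e.getKey(), e.getValue());
            }
        }
        return row;
    }

    /** Simulates an in-app correction and records the field as human-touched. */
    void editInApp(String sourceRef, String field, String value) {
        PersonRow row = bySourceRef.get(sourceRef);
        row.fields.put(field, value);
        row.humanTouched.add(field);
    }

    int count() {
        return bySourceRef.size();
    }

    public static void main(String[] args) {
        UpsertSketch store = new UpsertSketch();
        store.upsert("clara-example", Map.of("firstName", "Clara"));
        store.editInApp("clara-example", "firstName", "Klara");
        // Re-import: the row count stays at one and the human edit survives.
        store.upsert("clara-example", Map.of("firstName", "Clara", "lastName", "Example"));
        System.out.println(store.count());
    }
}
```

Under upsert-overwrite the `humanTouched` check would simply be dropped, which is why the two rules lead to different acceptance assertions.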
Identity reconciliation
`canonical-persons.xlsx` keys persons by the slug `person_id`; `canonical-persons-tree.json` was historically keyed only by `rowId` with no `person_id`, so the tree loader had nothing to join on. Phase 1 #670 now emits `person_id` into the tree JSON. `PersonTreeImporter` joins the tree to register persons via that `person_id`. The slug must be computed by one shared Python function across both code paths, or the join silently fails (review finding; verify in #670).

Name-routing policy (folded from the now-deleted #665)

`DocumentImporter` routes each sender/receiver cell by the normalizer's category, retaining the raw cell text in all cases:

- `single_token` / resolved: register match (by `source_ref`); if unmatched, a provisional single-token person (see fallback below)
- `collective`: `GROUP` person
- `institution`: `INSTITUTION` person
- `ambiguous_pair` (e.g. `"Ella Anita"`): two persons (pair split)
- prose / `?` / noise: no person; raw text only

Further rules:

- The raw cell always goes into `sender_text`/`receiver_text`, even when a person is linked. This is the load-bearing invariant behind the merge story (no per-document split exists; `PersonService.mergePersons` + `POST /{id}/merge` is the only cleanup path). Test it explicitly: matched sender → both `sender` set AND `sender_text` == raw cell.
- `PersonService.findOrCreateByAlias` returns a single `@Nullable Person`. Replace it with a small value type, e.g. `record AttributionResult(List<Person> persons, String rawText)`, where `persons` is empty (prose/noise/`?`), one (single/collective/institution), or two (pair). The pair-split method belongs on `PersonService`, not the importer.
- Non-null `lastName` (review finding): `Person.lastName` is `@Column(nullable = false)`. A new provisional single-token person still needs a non-null `lastName`. Register-first matching dodges this for matched names, but define and test the fallback for a genuinely new single token (empty string vs token-as-lastName) or the insert throws.
- […] `Set`: `receiver_text` is a single column; populate it always (per the always-retain rule), even when persons resolved. Test the "always retain even when linked" rule explicitly.
- `sender_text`/`receiver_text` carry arbitrary cell content (`?`, `Geschirr`, markup-like prose): render with plain `{value}` interpolation, never `{@html}` (stored-XSS guard, low severity).

Security - port guards before deleting the old importer
The rewrite drops ~64 `MassImportServiceTest` methods, including path-traversal and PDF-magic-byte guards (review finding: these MUST be ported, not lost). Today:

- `isValidImportFilename` (`MassImportService.java:336-351`) rejects `/`, `\`, Unicode slash homoglyphs (U+2215, U+FF0F, U+29F5), `..`, null bytes, absolute paths.
- `findFileRecursive` (`:499-504`) re-validates via canonical-path containment.
- `isPdfMagicBytes` checks the `%PDF` signature before upload.

All three move into `DocumentImporter` intact, with their tests ported as security regression tests before the old method is deleted (`should_reject_path_traversal_in_file_column`, `should_reject_unicode_slash_homoglyph`, `should_reject_absolute_path`). The `file` value now arrives via the canonical `file` column; treat it as hostile input regardless of upstream trust (CWE-22 does not care that the value came from "our" Python tool). Defense in depth: validate the string with `isValidImportFilename`, then keep canonical-path containment on the resolved real path. Confirm POI 5.5.0 rejects external entities (POI disables DTDs by default, but verify, don't assume). The orchestrator entry point must remain reachable only through `AdminController` `@RequirePermission(Permission.ADMIN)`; add no second un-annotated path.

Acceptance criteria (Gherkin)
Implementation plan (TDD, red first per behavior)
1. `CanonicalSheetReader` first red/green cycle: header present, header missing (throw `DomainException.badRequest`), `|`-split of `"a|b|c"`, empty cell → "", single value. No DB.
2. Four loaders, one test class each (`@ExtendWith(MockitoExtension.class)`, owning service mocked), each idempotent (upsert by `source_ref`/`index`), each via the owning service. Named idempotency unit per loader: `should_update_person_in_place_when_source_ref_exists`. Add a `provisional == "True"` test (the normalizer writes capitalized Python bools) so a future format change fails loudly.
3. Routing: one test per category arm (… `?` → none), plus the "always retain raw even when linked" invariant.
4. Port the security guards and their tests into `DocumentImporterTest` before deleting the old methods.
5. `CanonicalImportOrchestrator` wiring named ordering + `ImportStatus`; strip positional config + `parseDate` + Java name logic + raw/ODS path.
6. Integration test (`@SpringBootTest` + Testcontainers `postgres:16-alpine`, never H2, because the `UNIQUE(source_ref)` + upsert conflict only exist in real Postgres): run all four artifacts, snapshot `persons`/`tag`/`documents` counts, run again, assert counts identical AND assert the precedence decision (mutate a field in-app, re-import, assert survival per Decision #1). Use Awaitility on `ImportStatus`, never `Thread.sleep`.
7. `npm run generate:api`; update the `l3-backend-*` diagram + `GLOSSARY.md` + ADR 025 + `DEPLOYMENT.md` runbook step.

Mind the branch JaCoCo gate, currently 0.77 (77%) and ratcheting toward 80% (see `pom.xml` / #496): every routing arm and error path needs an explicit test.

Resolved decisions (settled 2026-05-27)
1. Re-import precedence: upsert by `source_ref`/`index`, but never overwrite a field a human changed in-app (merges, confirmations, manual date/name corrections). Track human-touched fields so a canonical re-import only fills/updates fields the human has not edited. (Raised by: issue, Sara, Elicit)
2. Unresolvable cells (prose / noise / `?`): Option A, create no person and keep the raw cell in `sender_text`/`receiver_text`. Pristine register, no triage worklist. (Raised by: #665 author, Elicit)
3. Object-noise: `Geschirr`/`Bierbecher`/`Steuerbescheid` are categorized `single_token` (exactly like a real first name `Clara`), so Option A alone won't drop them. Resolution by extension of A: maintain a small curated override/stopword list in the normalizer's `overrides/` marking known non-person tokens → treated as raw text, no Person. Deterministic, testable, light upkeep; the owner can extend the list. (Raised by: Elicit)
4. `relational` cells: `Schwester Hanni` (×41), `Tante Tüten` (×11), 73 occurrences total, are treated like `single_token` under A: best-effort register match by `source_ref`; if there is only a bare relation label with no resolvable name, keep it as raw text rather than minting a Person. (Raised by: Elicit)
5. Artifact delivery: commit `out/canonical-documents.xlsx`, `canonical-persons.xlsx`, `canonical-tag-tree.xlsx`, and `canonical-persons-tree.json` to the repo and update `.gitignore` (it currently excludes `out/` except the tree JSON). The loader reads them from the repo; they are regenerated when the normalizer changes (Phase 1 #670). (Raised by: Tobias, Markus)

Out of scope
- Schema changes (`source_ref`, `provisional`, `date_precision`, `sender_text`, `receiver_text`): #671.
- Normalizer changes (`file` column, tree `person_id`): Phase 1 #670.

Dependencies

- #670 (`file` column, tree `person_id`).
- #671 (`source_ref`, `provisional`, precision/attribution columns). This issue references `DatePrecision`, `source_ref`, `sender_text`/`receiver_text` and compiles only after #671.

marcel referenced this issue 2026-05-26 21:14:21 +02:00
Markus Keller — Senior Application Architect
Observations
- `RelationshipService` already exists at `person/relationship/RelationshipService.java`, so `PersonTreeImporter` calling the service (never `PersonRelationshipRepository`) is fully satisfiable today. Good, no new abstraction needed.
- `MassImportService` already goes through `DocumentService`/`PersonService`/`TagService` (no repository reach-in), so the four loaders just preserve that. Keep it.
- `UNIQUE(source_ref)` + DB-layer upsert is exactly the right call: push integrity to Postgres, not a Java `existsBy` check (which has a TOCTOU race under the async runner). This belongs in #671 and the issue correctly scopes it there.
- `CanonicalSheetReader` as a "value-level POI helper, no Spring, no domain knowledge" is a clean seam. It is the one place where positional `@Value app.import.col.*` brittleness dies. Endorse.
- The highest existing ADR number is `024`. Note there is already a duplicate `022` in `docs/adr/` (`022-csrf-...` and `022-eager-to-lazy-...`). `025` is still the next clean number, so the issue is right, but whoever writes it should not repeat the collision.

Recommendations
- The artifact smoke-check ("fails fast with `FAILED`") is an architectural guard: make it a single named method `assertAllArtifactsPresent()` returning before any loader runs. Fail-closed-on-missing-artifact is the correct posture.
- The new `importing/` classes belong in `docs/architecture/c4/l3-backend-3g-supporting.puml` (that is where the importing domain currently sits; there is no dedicated import diagram). Add the orchestrator + 4 loaders + `CanonicalSheetReader` there. This is a doc blocker, not optional.
- The step that retires `XxeSafeXmlParser` removes the only XML-parsing surface in the importer. That simplification is worth an explicit line in ADR 025 ("Java no longer parses the raw spreadsheet; XXE surface eliminated").
- Keep `MassImportService.ImportStatus`/`SkippedFile`/`State` as the public contract verbatim; the orchestrator wraps them. I confirmed `admin/system/ImportStatusCard.svelte` reads `state`, `statusCode`, `processed`, `skipped`; changing the record breaks the admin UI silently through generated types.

Open Decisions
- Artifact delivery: I lean (d), where the operator drops the four files into the existing `app.import.dir` mount and triggers via the admin endpoint. Zero new infrastructure, reuses what exists. Committing a regenerated 740K `canonical-documents.xlsx` to git (option a) bloats history for a file that changes every normalizer run; reject it. The only thing that would move me off (d) is a concrete requirement that the normalizer runs in CI on a schedule with no operator in the loop, which I don't see stated. Confirm who runs the normalizer and how often.

Felix Brandt - Senior Fullstack Developer
Observations
- `CanonicalSheetReader` first (no DB), then four loaders with the owning service mocked under `@ExtendWith(MockitoExtension.class)`, then routing, then the integration test on Testcontainers. Red-first per behavior is explicit. I endorse the sequence.
- `PersonService.findOrCreateByAlias` (`PersonService.java:85`) is `@Nullable` and returns a single `Person`, so it cannot express the pair-split (`"Ella Anita"` → two persons) or the "no person but keep raw text" cases. A `record AttributionResult(List<Person> persons, String rawText)` with persons of size 0/1/2 is the clean fit. The pair-split method belongs on `PersonService`, not the importer. Agreed.
- The `lastName` fallback warning is real. `findOrCreateByAlias` today dodges the non-null `lastName` constraint by using the alias as the lastName for INSTITUTION/GROUP (`PersonService.java:94`) and `PersonNameParser.split` for the rest. A genuinely new provisional single-token person needs the same explicit fallback or the insert throws `DataIntegrityViolationException`. Define it (token-as-lastName is the least surprising) and test it.
- `mergePersons` (`PersonService.java:179`) is the only cleanup path, so `sender_text`/`receiver_text` is the audit trail. Test `matched sender → both sender set AND sender_text == raw cell` explicitly.

Recommendations
- `DocumentImporter` with `resolveFile()`, `uploadToS3()`, `buildDocument()` is the right decomposition. The current `importSingleDocument` (`MassImportService.java:375-458`) is an 80-line method doing validate+transform+upload+persist; do not port that shape.
- […] `AttributionResult`.
- Use `DomainException.internal`/`badRequest` and the new `IMPORT_ARTIFACT_INVALID` code, never raw `RuntimeException`. Note the old code already violates this: `readOds` throws `new RuntimeException("Ungültige ODS-Datei...")` (`MassImportService.java:211`) and `NoSpreadsheetException extends RuntimeException`. Those die with the raw path, good.
- Add the `provisional == "True"` test (capitalized Python bool): `Boolean.parseBoolean("True")` returns `true` in Java, but a silent format drift to `"yes"`/`"1"` would parse to `false` with no error. Pin it with a test so it fails loudly.
- Log with parameterized SLF4J: `log.warn("Skipping artifact {}", name)`, not concatenation. The old code is mostly clean here already; keep it.
- The 4-step `ErrorCode` change is mandatory and easy to half-do: `ErrorCode.java` + `errors.ts` + `getErrorMessage()` case + i18n keys in `messages/{de,en,es}.json`. All four, or the frontend shows the generic fallback.

Open Decisions (none)
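The `AttributionResult` value type discussed in this comment could look roughly like the sketch below. The nested `Person` record is a stand-in for the real entity, and the factory names `none`, `single`, and `pair` are illustrative assumptions, not existing API.

```java
import java.util.List;

// Hypothetical sketch of the value type; not code from the repository.
final class AttributionSketch {

    /** Minimal stand-in for the real Person entity (assumed shape). */
    record Person(String displayName) {}

    /** Zero, one, or two linked persons plus the raw cell text, which is always retained. */
    record AttributionResult(List<Person> persons, String rawText) {

        AttributionResult {
            persons = List.copyOf(persons); // immutable defensive copy
            if (persons.size() > 2) {
                throw new IllegalArgumentException("at most two persons per cell");
            }
            if (rawText == null) {
                throw new IllegalArgumentException("raw text must always be retained");
            }
        }

        /** Prose, noise, or "?": no person, raw text only. */
        static AttributionResult none(String rawText) {
            return new AttributionResult(List.of(), rawText);
        }

        /** Single, collective, or institution: exactly one person. */
        static AttributionResult single(Person person, String rawText) {
            return new AttributionResult(List.of(person), rawText);
        }

        /** Ambiguous pair such as "Ella Anita": two persons. */
        static AttributionResult pair(Person first, Person second, String rawText) {
            return new AttributionResult(List.of(first, second), rawText);
        }
    }

    public static void main(String[] args) {
        AttributionResult result =
                AttributionResult.pair(new Person("Ella"), new Person("Anita"), "Ella Anita");
        System.out.println(result.persons().size() + " persons, raw text: " + result.rawText());
    }
}
```

Carrying `rawText` in the same value as the persons makes the "always retain even when linked" invariant hard to forget at the call site, since a result cannot be constructed without it.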
Nora "NullX" Steiner — Application Security Engineer
Observations
- In `MassImportService`:
  - `isValidImportFilename` (`:336-351`) rejects `/`, `\`, U+2215, U+FF0F, U+29F5 homoglyphs, `..`, `.`, null byte, and absolute paths. Solid.
  - `findFileRecursive` (`:492-508`) performs the canonical-path containment check (`candidate.getCanonicalPath().startsWith(baseDirCanonical + File.separator)`). This is the real CWE-22 backstop.
  - `isPdfMagicBytes` (`:358-367`) checks the `%PDF` signature before upload.
- The `file` column now arriving from "our" Python normalizer does not lower the trust boundary. CWE-22 does not care about provenance: the canonical `.xlsx` is a file on a mounted volume that an operator can hand-edit or an attacker with write access can poison. Treat the `file` value as hostile input. The issue states this correctly; hold the line in review.
- `AdminController` is annotated `@RequirePermission(Permission.ADMIN)` at the class level (`AdminController.java:20`) and exposes `POST /trigger-import` + `GET /import-status`. The orchestrator must remain reachable only through this path. Adding a second un-annotated entry point (e.g. a startup `@PostConstruct` auto-import, or a new controller) would be a privilege bypass.

Recommendations
- Port all three guards into `DocumentImporter` with their tests first (red), then delete the originals. Concrete regression test names: `should_reject_path_traversal_in_file_column`, `should_reject_unicode_slash_homoglyph` (assert all three: U+2215/U+FF0F/U+29F5), `should_reject_absolute_path`, `should_reject_null_byte`, `should_reject_non_pdf_magic_bytes`. The old suite has 64 `@Test` methods; do not let that count silently drop. Diff it.
- Defense in depth: validate the string with `isValidImportFilename` and keep the canonical-path containment on the resolved real path. Two layers; a bypass in one is caught by the other.
- Deleting `readOds`/`XxeSafeXmlParser` removes the hand-rolled `DocumentBuilderFactory` XML path entirely (`MassImportService.java:206-252`). The canonical `.xlsx` is read via POI `WorkbookFactory`; confirm that POI's XML reader configuration has `disallow-doctype-decl` enabled and external-entity features disabled, and add one test asserting that an entity-laden workbook does not resolve.
- `sender_text`/`receiver_text` now carry arbitrary cell content (`?`, `Geschirr`, markup-like prose). The frontend rendering issue is downstream, but flag it here for traceability: render with `{value}` interpolation, never `{@html}`. Add a data fixture with `<img src=x onerror=...>` in a sender cell to prove it round-trips as inert text.
- Log rejected filenames with parameterized SLF4J (`log.warn("Rejected file {}", name)`), never concatenation. A poisoned filename containing log-forging characters should not be concatenated into a log line.

Open Decisions (none)
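The two validation layers and the magic-byte check described in this comment can be sketched as follows. This is not the code from `MassImportService`: it is an independent illustration that uses `java.nio.file.Path` where the original is described as using `getCanonicalPath()` string containment, and the class name `ImportFileGuardSketch` and method `isInsideBaseDir` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical sketch; it mirrors the rules listed above, not the actual implementation.
final class ImportFileGuardSketch {

    // Slash, backslash, and the slash look-alikes U+2215, U+FF0F, U+29F5.
    private static final char[] FORBIDDEN = {'/', '\\', '\u2215', '\uFF0F', '\u29F5'};

    /** Layer 1: reject suspicious names before touching the file system. */
    static boolean isValidImportFilename(String name) {
        if (name == null || name.isBlank()) {
            return false;
        }
        if (name.equals(".") || name.contains("..")) {
            return false;
        }
        if (name.indexOf('\0') >= 0) {
            return false; // embedded null byte
        }
        for (char c : FORBIDDEN) {
            if (name.indexOf(c) >= 0) {
                return false;
            }
        }
        return !Path.of(name).isAbsolute();
    }

    /** Layer 2: the resolved real path must sit strictly inside the import directory. */
    static boolean isInsideBaseDir(Path baseDir, Path candidate) throws IOException {
        Path base = baseDir.toRealPath();   // resolves symlinks; throws if the path is missing
        Path real = candidate.toRealPath();
        return real.startsWith(base) && !real.equals(base);
    }

    /** Checks that the first bytes carry the "%PDF" signature. */
    static boolean isPdfMagicBytes(byte[] head) {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (head == null || head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidImportFilename("brief-001.pdf"));
        System.out.println(isValidImportFilename("../secret.pdf"));
        System.out.println(isPdfMagicBytes("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

`Path.startsWith` compares whole path components, so a sibling directory such as `/import-evil` does not pass as being inside `/import`, which is the same concern the `File.separator` suffix addresses in the string-based check.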
Sara Holt — Senior QA Engineer
Observations
- Testcontainers `postgres:16-alpine` (never H2, because `UNIQUE(source_ref)` + the upsert conflict only exist in real Postgres), and Awaitility on `ImportStatus` (never `Thread.sleep`) because the orchestrator runs under `@Async`. I endorse both as gates.

Recommendations

- […] `backend/CLAUDE.md` both say "88% branch JaCoCo gate." The actual gate in `backend/pom.xml:321` is `BRANCH COVEREDRATIO minimum 0.77` (ratcheted down per #496, with a comment saying so). Test the implementation against the real 0.77 gate; do not over-invest chasing a phantom 88%. It is worth fixing the stale "88%" in both docs while you are in here.
- Unit level: `CanonicalSheetReader` (header present / missing-throws / `|`-split / empty→"" / single value); one test class per loader with the owning service mocked; one named idempotency unit per loader (`should_update_person_in_place_when_source_ref_exists`); one routing test per category arm (8 arms) + the always-retain-raw invariant.
- Integration level: run all four artifacts, snapshot `persons`/`tag`/`documents` counts, re-run, assert counts identical, then mutate one field in-app + re-import + assert per Decision #1.
- Diff the `@Test` count: the old suite has 64; a silent drop is a coverage regression.
- […] `receiver_person_ids` with non-empty `receiver_names`; a `file` column pointing at a present-but-non-PDF file (→ skipped, not failed); a document `index` collision on re-import (→ update, not duplicate); an artifact present but with zero data rows (→ DONE with processed=0, not FAILED).
- A malformed artifact fails the whole run (`FAILED`), but a single bad file only adds a `SkippedFile`. Test both so the distinction can't erode: they are different failure modes and the issue is explicit that skip-and-continue is file-level only.

Open Decisions
Elicit — Requirements Engineer & Business Analyst
I am in Brownfield mode. This is a well-specified issue — INVEST-compliant, 11 Gherkin scenarios, explicit dependencies and out-of-scope. My job is to hunt the remaining ambiguities and contradictions before they become rework.
Observations
- […] `needs-discussion`. Two of them (#1 precedence, #3 object-noise) are acceptance-criteria blockers, not just nice-to-haves: they change which scenarios this issue can claim as done.

Ambiguities / contradictions I must surface

- `Geschirr`/`Bierbecher` are `single_token` to the normalizer, identical to `Clara`. So the AC "prose/noise/`?` creates no person" silently does not catch object-noise; object-noise will become a provisional single-token person unless a demotion mechanism exists. Either add a deterministic override list in the normalizer `overrides/` (recommended: testable, light upkeep), or explicitly drop the object-noise claim from this issue's AC. Right now the issue implies a guarantee it cannot honor.
- `relational` has no acceptance criterion (Decision #4). `Schwester Hanni` (×41), `Tante Tüten` (×11), 73 occurrences in total, have no Gherkin scenario and no defined "confident match" threshold. Either add a scenario (conservative: exact register-alias match, else fall through to raw text) or mark `relational` explicitly out of scope for this issue. As written it is a silent gap.

Recommendations

- […] `ImportStatus` `statusCode` + `skippedFiles` give the operator enough to diagnose a failed run without reading server logs. Data retention: re-import precedence (#1) is effectively a data-retention policy for human corrections; treat it as such.

Open Decisions

- […] `?`: needs owner confirmation, not a default. (raised by #665 author)
- `relational` threshold or out-of-scope: no scenario exists; decide before locking routing.

Tobias Wendt - DevOps & Platform Engineer
Observations
- The `ImportStatus` state machine (`IDLE`/`RUNNING`/`DONE`/`FAILED`) is being kept verbatim. Good: that is the operator's only window into a long-running, fire-and-forget job. I confirmed the operator path is `POST /api/admin/trigger-import` → `GET /api/admin/import-status` polled by `ImportStatusCard.svelte`. No new infra is needed to run this.
- The artifact smoke-check (fail `FAILED` fast) is operationally the right behavior. A half-run that loads tags but no documents would leave the operator with a silently inconsistent DB and no clear signal. Fail-closed is correct.
- The integration test uses Testcontainers `postgres:16-alpine`, which matches what already runs in CI for this stack. No new CI service is required; this slots into the existing integration job.

Recommendations

- Reuse `app.import.dir`, add zero infra (this is Decision #5, and I lean (d)). The four canonical files drop into the existing `/import` mount and the operator triggers the import exactly as today. Concretely:
  - Do not commit `out/*.xlsx` to git (option a). `canonical-documents.xlsx` is 740K and regenerated every normalizer run; it is currently gitignored for good reason. Committing it bloats history and creates merge churn on a binary.
  - Document the runbook in `docs/DEPLOYMENT.md` as the issue requires: (1) run the normalizer, (2) place the four artifacts in `app.import.dir`, (3) trigger import via admin, (4) watch `ImportStatus`. This is a doc blocker: an import with an undocumented prerequisite sequence is an incident waiting to happen.
- Give the smoke-check failure a specific `statusCode`/message so the operator sees which artifact was missing without grepping logs. "Import failed: canonical-documents.xlsx not found in /import" beats "IMPORT_FAILED_INTERNAL."
- `ImportStatusCard.svelte:14` branches on `statusCode === 'IMPORT_FAILED_NO_SPREADSHEET'`. When the ODS/raw path and `NoSpreadsheetException` are deleted, that statusCode disappears. Make sure the frontend branch is updated or repurposed (e.g. to `IMPORT_ARTIFACT_INVALID`) so the operator still gets a meaningful "artifacts missing" message instead of a generic failure.

Open Decisions
Leonie Voss — UX Designer & Accessibility Strategist
Observations
- The `ImportStatus`/`SkippedFile` shape must stay verbatim because `admin/system/ImportStatusCard.svelte` consumes it via generated types, so the existing admin import card is the only UI touchpoint, and it is intentionally untouched. I verified that the card reads `state`, `statusCode`, `processed`, and `skipped`.

Recommendations

- […] `sender_text`/`receiver_text`. Those columns now carry arbitrary, messy cell content (`?`, `Geschirr`, prose, markup-like fragments). When the dependent UI issue renders them, it must use plain `{value}` interpolation, never `{@html}`. This is both a stored-XSS guard (Nora's point) and a correctness point: a literal `?` or `<` must show as typed, not be interpreted. It is worth a one-line note carried forward into the consuming UI issue so it isn't lost.
- If a new failure state is introduced (e.g. the `IMPORT_ARTIFACT_INVALID` code), make sure the admin card surfaces a human-readable, localized message via `getErrorMessage()` rather than a raw code. The admin is still a person who needs to know which artifact is missing and what to do. Pair any new failure state with the i18n keys the issue already calls for in `messages/{de,en,es}.json`.

Open Decisions (none)
Decision Queue — Action Required
5 decisions need your input before implementation starts. Two of them (#1, #3) gate acceptance criteria — the relevant tests cannot be written until they are resolved.
Data / Requirements
- Re-import precedence: "upsert by `source_ref`" is under-specified. Overwrite = normalizer is source of truth, simplest, but every in-app correction is destroyed on re-import. Preserve = human edits win, but re-import can no longer fix a wrongly-corrected field and you need per-field provenance (much more code). Deciding question: does the operator re-run the normalizer before humans edit in-app (overwrite is safe) or after (preserve mandatory)? Blocks the idempotency acceptance test, which cannot be written until this is chosen. (Raised by: issue, Sara, Elicit)
- Object-noise: `Geschirr`/`Bierbecher` are `single_token` to the normalizer, indistinguishable from `Clara`. So "object-noise → no person" is not satisfiable from categories alone. Options: (a) deterministic manual demotion list in the normalizer `overrides/` (recommended: testable, light upkeep); (b) defer to a heuristic follow-up and drop the object-noise AC from this issue; (c) accept that object-noise becomes a provisional person cleaned manually (contradicts the no-junk goal). Changes which AC this issue can claim. (Raised by: Elicit)
- Unresolvable cells (prose / noise / `?`). A (recommended): create no person, keep the raw cell in `sender_text`/`receiver_text`, giving a pristine register and no triage worklist. B: create an `UNKNOWN` provisional person per string, which is fully reversible and gives a "who is this?" worklist, but re-introduces the noise this rebuild removes. Needs your explicit confirmation, not a developer default, because it depends on whether you want a cleanup worklist. (Raised by: #665 author, Elicit)
- Define a `relational` threshold, or mark it out-of-scope. `Schwester Hanni` (×41), `Tante Tüten` (×11), 73 occurrences total: no Gherkin scenario and no defined "confident match" rule. Either add a scenario (conservative: exact register-alias match, else fall through to raw text) or explicitly mark `relational` out of scope for this issue and track it as a follow-up. Default recommendation: out-of-scope unless you need it now. (Raised by: Elicit)

Infrastructure

- Artifact delivery: (a) commit `out/*.xlsx` to git, rejected by both Markus and Tobias (740K binary, regenerated every run, currently gitignored, bloats history); (b) generate in CI as a build artifact, only justified if the normalizer runs unattended on a schedule; (c) admin uploads via UI, which adds an upload endpoint + validation surface for a rare operation; (d) operator drops the four files into the existing `app.import.dir` mount and triggers via the admin endpoint, with zero new infra and reusing what exists (Markus's and Tobias's lean). Deciding question: who runs the normalizer, and how often? (Raised by: Markus, Tobias)