docs(importing): document the canonical importer rebuild

- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts; raw spreadsheet no longer parsed by Java) with the settled Option-A name policy, human-edit-preserve precedence, provisional contract, and ported security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the orchestrator, the four loaders, and CanonicalSheetReader, with the loader dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms; refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer → place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator + loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed; the ImportStatus/SkippedFile generated schemas already match the new types byte-for-byte (same fields + SkipReason enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -2,7 +2,7 @@

**Date:** 2026-05-27
**Status:** Accepted
**Issue:** #671
**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
**Milestone:** Handling the Unknowns — honest uncertainty in dates & people

---
@@ -56,6 +56,44 @@ The importer reads the Python normalizer's canonical output
output strings are persisted as-is. The same applies to `source_ref`, which carries the
normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.

### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)

The legacy `MassImportService` read the *raw* original spreadsheet by positional column
index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing,
name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.

The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
an explicit dependency DAG (`TagTreeImporter` → `PersonRegisterImporter` →
`PersonTreeImporter` → `DocumentImporter`) that **consume the committed canonical artifacts**
(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
loader calls the **owning domain's service**, never a repository (layering rule); the tree
loader uses `RelationshipService`, never the relationship repository.
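To make the header-name mapping and the fail-closed behaviour concrete, here is a minimal Java sketch. It is not the real `CanonicalSheetReader`: the class `HeaderIndexedSheet`, the exception type, the constructor signature, and the sample cell values are all hypothetical. Only the column names `index`, `file`, and `sender_person_id` and the `IMPORT_ARTIFACT_INVALID` code come from this document. Because every lookup goes through the header map, reordering columns in an artifact does not change what a loader reads, and a missing required header rejects the sheet instead of silently reading the wrong column.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for whatever the importer raises with IMPORT_ARTIFACT_INVALID.
class ImportArtifactInvalidException extends RuntimeException {
    ImportArtifactInvalidException(String detail) {
        super("IMPORT_ARTIFACT_INVALID: " + detail);
    }
}

// Illustrative header-indexed view of one sheet; not the project's CanonicalSheetReader.
class HeaderIndexedSheet {
    private final Map<String, Integer> columnByHeader = new HashMap<>();
    private final List<List<String>> rows;

    HeaderIndexedSheet(List<String> headerRow,
                       List<List<String>> rows,
                       List<String> requiredHeaders) {
        for (int i = 0; i < headerRow.size(); i++) {
            columnByHeader.put(headerRow.get(i).trim(), i);
        }
        // Fail closed: reject the whole sheet rather than guess a column position.
        for (String required : requiredHeaders) {
            if (!columnByHeader.containsKey(required)) {
                throw new ImportArtifactInvalidException("missing header '" + required + "'");
            }
        }
        this.rows = rows;
    }

    // Returns the cell under the named header, or "" if the row is shorter than the header row.
    String cell(int rowIndex, String header) {
        Integer column = columnByHeader.get(header);
        if (column == null) {
            throw new ImportArtifactInvalidException("unknown header '" + header + "'");
        }
        List<String> row = rows.get(rowIndex);
        return column < row.size() ? row.get(column) : "";
    }
}
```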

Settled sub-decisions:

- **Idempotency precedence = preserve human edits.** Persons/tags upsert by `source_ref`,
documents by `index`. On re-import, a non-blank field a human changed in-app is never
overwritten (blank fields are filled from canonical), and `provisional` is monotonic: once
a human confirms a person (`false`), it never reverts to `true`. Verified against real
Postgres in `CanonicalImportIntegrationTest`.
- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet
carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in
`sender_name` / `receiver_names`. The importer routes register-first by `source_ref`
(provisional `Person` when a slug is unmatched), and **always retains the raw cell** in
`sender_text` / `receiver_text` even when a person is linked; this is the load-bearing
invariant behind the merge story. A row with no slug but raw text (prose / `?` /
object-noise) links no person and keeps only the raw text.
- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
left at default-`false`.
- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
hostile (CWE-22 does not care that it came from our tool): its basename is validated
(`isValidImportFilename` rejects slash/backslash, three Unicode slash homoglyphs, `..`, null
byte, and absolute paths) and resolved only inside the import dir with canonical-path
containment, so a traversal value can never escape. The `%PDF` magic-byte check gates upload.
These guards and their tests were ported from `MassImportService` **before** it was deleted.
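The idempotency precedence in the first sub-decision can be sketched as a pair of small merge rules. This is an illustration under stated assumptions, not the project's code: `PersonState`, its `displayName` and `notes` fields, and the `PreserveHumanEdits` helper are hypothetical names. The sketch also does not track who wrote a value; it simply protects every non-blank stored value, which is one conservative way to guarantee that a human edit is never overwritten. The `provisional` rule is a logical AND, so a stored `false` can never flip back to `true` on re-import.

```java
// Hypothetical, trimmed-down person state; the real entity has more fields.
final class PersonState {
    final String displayName;
    final String notes;
    final boolean provisional;

    PersonState(String displayName, String notes, boolean provisional) {
        this.displayName = displayName;
        this.notes = notes;
        this.provisional = provisional;
    }
}

// Illustrative merge rules for re-import; names are assumptions, not the project's API.
class PreserveHumanEdits {
    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // A non-blank stored value wins; a blank one is filled from the canonical artifact.
    static String mergeField(String stored, String canonical) {
        return isBlank(stored) ? canonical : stored;
    }

    // Monotonic flag: once stored as false (confirmed), it can never become true again.
    static boolean mergeProvisional(boolean stored, boolean incoming) {
        return stored && incoming;
    }

    // Applies the field rule and the flag rule to produce the state to persist.
    static PersonState merge(PersonState stored, PersonState canonical) {
        return new PersonState(
                mergeField(stored.displayName, canonical.displayName),
                mergeField(stored.notes, canonical.notes),
                mergeProvisional(stored.provisional, canonical.provisional));
    }
}
```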
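The security guards in the last sub-decision can be illustrated with a short Java sketch. Only the name `isValidImportFilename`, the list of rejected patterns, the `%PDF` check, and the `INVALID_FILENAME_PATH_TRAVERSAL` reason come from this document. Everything else is an assumption: the class `ImportFileGuards`, the helper names `resolveInside` and `startsWithPdfMagic`, the use of an exception to signal rejection, and in particular the choice of the three homoglyphs, since the text does not say which ones the real check covers. Note that rejecting any `..` substring is deliberately stricter than rejecting only a `..` path segment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Illustrative guards; not the project's implementation.
class ImportFileGuards {
    // ASCII separators plus an ASSUMED homoglyph set:
    // U+2044 FRACTION SLASH, U+2215 DIVISION SLASH, U+FF0F FULLWIDTH SOLIDUS.
    private static final char[] SLASH_LIKE = {'/', '\\', '\u2044', '\u2215', '\uFF0F'};
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Accepts only a plain basename: no separators, no "..", no null byte, not absolute.
    static boolean isValidImportFilename(String name) {
        if (name == null || name.trim().isEmpty()) return false;
        if (name.contains("..") || name.indexOf('\0') >= 0) return false;
        for (char forbidden : SLASH_LIKE) {
            if (name.indexOf(forbidden) >= 0) return false;
        }
        try {
            return !Paths.get(name).isAbsolute();
        } catch (InvalidPathException e) {
            return false; // unparseable on this platform: reject rather than guess
        }
    }

    // Resolves a validated name under importDir and re-checks containment on the real path.
    static Path resolveInside(Path importDir, String name) throws IOException {
        if (!isValidImportFilename(name)) {
            throw new IllegalArgumentException("INVALID_FILENAME_PATH_TRAVERSAL: " + name);
        }
        Path root = importDir.toRealPath();
        Path candidate = root.resolve(name).normalize();
        if (Files.exists(candidate)) {
            candidate = candidate.toRealPath(); // follows symlinks, so a link cannot escape
        }
        if (!candidate.startsWith(root)) {
            throw new IllegalArgumentException("INVALID_FILENAME_PATH_TRAVERSAL: " + name);
        }
        return candidate;
    }

    // True only if the file begins with the four ASCII bytes "%PDF".
    static boolean startsWithPdfMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length);
            return read == PDF_MAGIC.length && Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```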

---

## Consequences