From 21c85ff0818a5e337dce61bf49197d79c7039b9b Mon Sep 17 00:00:00 2001
From: Marcel
Date: Wed, 27 May 2026 10:44:45 +0200
Subject: [PATCH] docs(importing): document the canonical importer rebuild
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ADR-025: add decision 3 (four idempotent loaders over canonical
  artifacts; raw spreadsheet no longer parsed by Java) with the settled
  Option-A name policy, human-edit-preserve precedence, provisional
  contract, and ported security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the
  orchestrator, the four loaders, and CanonicalSheetReader, with the
  loader dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader
  terms; refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason,
  index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer →
  place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the
  orchestrator + loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated
schemas already match the new types byte-for-byte (same fields +
SkipReason enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7
---
 CLAUDE.md | 2 +-
 backend/CLAUDE.md | 2 +-
 docs/DEPLOYMENT.md | 27 +++++++++++--
 docs/GLOSSARY.md | 8 +++-
 ...-and-single-migration-schema-foundation.md | 40 ++++++++++++++++++-
 .../c4/l3-backend-3b-document-management.puml | 36 ++++++++++++-----
 6 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 10a3c368..b3a5b189 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -87,7 +87,7 @@ backend/src/main/java/org/raddatz/familienarchiv/
 ├── exception/ DomainException, ErrorCode, GlobalExceptionHandler
 ├── filestorage/ FileService (S3/MinIO)
 ├── geschichte/ Geschichte (story) domain
-├── importing/ MassImportService
+├── importing/ CanonicalImportOrchestrator + four loaders (TagTree/PersonRegister/PersonTree/Document) + CanonicalSheetReader
 ├── notification/ Notification domain + SseEmitterRegistry
 ├── ocr/ OCR domain — OcrService, OcrBatchService, training
 ├── person/ Person domain
diff --git a/backend/CLAUDE.md b/backend/CLAUDE.md
index 249221cc..b96d242a 100644
--- a/backend/CLAUDE.md
+++ b/backend/CLAUDE.md
@@ -34,7 +34,7 @@ src/main/java/org/raddatz/familienarchiv/
 ├── exception/ # DomainException, ErrorCode, GlobalExceptionHandler
 ├── filestorage/ # FileService (S3/MinIO)
 ├── geschichte/ # Geschichte (story) domain
-├── importing/ # MassImportService
+├── importing/ # CanonicalImportOrchestrator + 4 loaders + CanonicalSheetReader
 ├── notification/ # Notification domain + SseEmitterRegistry
 ├── ocr/ # OCR domain — OcrService, OcrBatchService, training
 ├── person/ # Person domain — Person, PersonService, PersonController
diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index c6560a0a..3102d135 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -559,20 +559,39 @@ bash scripts/download-kraken-models.sh
 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.
-### Trigger a mass import (Excel/ODS)
+### Trigger a canonical import
-**Dev:** drop the ODS spreadsheet + PDFs into `./import/` at the repo root — the dev compose bind-mounts it to `/import` automatically.
+The importer no longer parses the raw spreadsheet. It consumes the **canonical artifacts**
+produced by the normalizer (`tools/import-normalizer/`) — `canonical-tag-tree.xlsx`,
+`canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx` — which
+are committed under `tools/import-normalizer/out/`. The semantic transformation
+(German-date parsing, name classification) lives entirely in the normalizer; the backend
+maps the clean columns by header name. See [ADR-025](adr/025-canonical-import-and-single-migration-schema-foundation.md).
+
+**Prerequisite — regenerate the artifacts when the source data changes:**
+
+```bash
+cd tools/import-normalizer
+python -m normalizer # or the documented normalizer entrypoint
+# writes the four canonical artifacts into ./out/
+```
+
+**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/`
+at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
+The orchestrator smoke-checks that all four artifacts are present before starting and fails
+closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
 **Staging/production:**
-1. Pre-stage the payload on the host. Convention: `/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
+1. Pre-stage the four canonical artifacts + PDFs on the host. Convention:
+   `/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
    ```bash
    rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
    ```
 2. Make sure `IMPORT_HOST_DIR=` is set in `.env.staging` / `.env.production` (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
 3. Redeploy the stack so the bind mount picks up — or, if the mount is already in place, skip to step 4.
 4. Call `POST /api/admin/trigger-import` (requires `ADMIN` permission), or click the "Import starten" button on `/admin/system`.
-5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs.
+5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs. Re-running is safe: the import is idempotent (upsert by `source_ref` / document `index`) and never overwrites a human-edited field.
 ---
diff --git a/docs/GLOSSARY.md b/docs/GLOSSARY.md
index 1fefb7af..074f2fe1 100644
--- a/docs/GLOSSARY.md
+++ b/docs/GLOSSARY.md
@@ -64,9 +64,13 @@ _See also [Annotation](#annotation-documentannotation)._
 - `REVIEWED`: a reviewer has approved the transcription.
 - `ARCHIVED`: the document is finalized and read-only.
-**Mass import** — an asynchronous batch process (`MassImportService`) that reads an Excel or ODS file and creates `Person`s, `Tag`s, and `PLACEHOLDER` `Document`s in one shot. Only one import can run at a time (`IMPORT_ALREADY_RUNNING` error if attempted concurrently).
+**Canonical import** — an asynchronous batch process (`CanonicalImportOrchestrator`) that consumes the normalizer's committed canonical artifacts and creates `Tag`s, `Person`s (register + tree), family relationships, and `Document`s. Four idempotent loaders run in a fixed dependency order — `TagTreeImporter` → `PersonRegisterImporter` → `PersonTreeImporter` → `DocumentImporter` — each calling the owning domain's service. Re-running it never duplicates rows (upsert by `source_ref` / document `index`) and never overwrites a human-edited field. Only one import can run at a time (`IMPORT_ALREADY_RUNNING` error if attempted concurrently); a missing or malformed artifact fails closed (`IMPORT_ARTIFACT_INVALID`). Replaced the legacy raw-spreadsheet `MassImportService` (see ADR-025).
+
-**SkippedFile** (`MassImportService.SkippedFile`) — a file that was presented for import but not processed, recorded with a `filename` and a `reason` code. Possible reasons: `INVALID_PDF_SIGNATURE` (magic-byte validation failed), `S3_UPLOAD_FAILED` (file upload to MinIO/S3 threw an exception), `FILE_READ_ERROR` (the file could not be opened for reading), or `ALREADY_EXISTS` (a document with the same filename already exists in the archive with a status other than `PLACEHOLDER`).
+**canonical artifact** — one of the four files the normalizer (`tools/import-normalizer/`) emits and commits to `tools/import-normalizer/out/`: `canonical-tag-tree.xlsx`, `canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx`. They are the contract the backend importer reads (mapped by header name); the semantic transformation (German-date parsing, name classification) lives only in the normalizer, never in Java.
+
+**CanonicalSheetReader** — the value-level POI helper that opens a canonical `.xlsx`, maps the header row to column indices by name (replacing the brittle positional column config), splits pipe-delimited list columns, and throws `IMPORT_ARTIFACT_INVALID` on a missing required header rather than NPE-ing on a null index.
+
+**SkippedFile** (`ImportStatus.SkippedFile`) — a file that was presented for import but not processed, recorded with a `filename` and a `reason` code. Possible reasons: `INVALID_FILENAME_PATH_TRAVERSAL` (the file-column basename failed the path-traversal guard), `INVALID_PDF_SIGNATURE` (magic-byte validation failed), `S3_UPLOAD_FAILED` (file upload to MinIO/S3 threw an exception), `FILE_READ_ERROR` (the file could not be opened for reading), or `ALREADY_EXISTS` (a document with the same `index` already exists in the archive with a status other than `PLACEHOLDER`).
 **skipped count** — the total number of `SkippedFile` entries accumulated during a single import run (`ImportStatus.skipped()`). Shown in the amber warning section of the Import Status Card in the admin UI; a value of zero suppresses the section entirely.
diff --git a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
index 0feb670b..8cfd897b 100644
--- a/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
+++ b/docs/adr/025-canonical-import-and-single-migration-schema-foundation.md
@@ -2,7 +2,7 @@
 **Date:** 2026-05-27
 **Status:** Accepted
-**Issue:** #671
+**Issue:** #671 (schema, decisions 1–2); #669 (importer architecture, decision 3)
 **Milestone:** Handling the Unknowns — honest uncertainty in dates & people
 ---
@@ -56,6 +56,44 @@ The importer reads the Python normalizer's canonical output
 output strings are persisted as-is. The same applies to `source_ref`, which carries the
 normalizer's `person_id` / canonical `tag_path` unchanged as the re-import idempotency key.
+### 3. The importer is four idempotent loaders over the canonical artifacts; Java no longer parses the raw spreadsheet (Phase 3, #669)
+
+The legacy `MassImportService` read the *raw* original spreadsheet by positional column
+index (`@Value app.import.col.*`) and re-derived everything in Java (ISO-only date parsing,
+name classification via `findOrCreateByAlias`, an ODS/XXE XML path). It is **deleted**.
+
+The rebuild is a `CanonicalImportOrchestrator` driving four single-responsibility loaders in
+an explicit dependency DAG — `TagTreeImporter` → `PersonRegisterImporter` →
+`PersonTreeImporter` → `DocumentImporter` — that **consume the committed canonical artifacts**
+(`tools/import-normalizer/out/`). A shared `CanonicalSheetReader` maps columns **by header
+name** (not by index) and fails closed (`IMPORT_ARTIFACT_INVALID`) on a missing header. Each
+loader calls the **owning domain's service**, never a repository (layering rule); the tree
+loader uses `RelationshipService`, never the relationship repository.
+
+Settled sub-decisions:
+
+- **Idempotency precedence = preserve human edits.** Persons/tags upsert by `source_ref`,
+  documents by `index`. On re-import a non-blank field a human changed in-app is never
+  overwritten (blank fields are filled from canonical), and `provisional` is monotonic — once
+  a human confirms a person (`false`) it never reverts to `true`. Verified against real
+  Postgres in `CanonicalImportIntegrationTest`.
+- **Name policy = Option A.** The normalizer resolved attribution upstream: the document sheet
+  carries the resolved slug in `sender_person_id` / `receiver_person_ids` and the raw cell in
+  `sender_name` / `receiver_names`. The importer routes register-first by `source_ref`
+  (provisional `Person` when a slug is unmatched), and **always retains the raw cell** in
+  `sender_text` / `receiver_text` even when a person is linked — the load-bearing invariant
+  behind the merge story. A row with no slug but raw text (prose / `?` / object-noise) links
+  no person and keeps only the raw text.
+- **`provisional` is now populated.** Importer-minted persons are `provisional = true`;
+  register and tree persons stay `false`. This is the Phase-3 contract the schema (decision 1)
+  left at default-`false`.
+- **Security guards are defense-in-depth, not upstream-trust.** The `file` column is treated as
+  hostile (CWE-22 does not care it came from our tool): its basename is validated
+  (`isValidImportFilename` — slash/backslash, three Unicode slash homoglyphs, `..`, null byte,
+  absolute path) and resolved only inside the import dir with canonical-path containment, so a
+  traversal value can never escape. The `%PDF` magic-byte check gates upload. These guards and
+  their tests were ported from `MassImportService` **before** it was deleted.
+
 ---
 ## Consequences
diff --git a/docs/architecture/c4/l3-backend-3b-document-management.puml b/docs/architecture/c4/l3-backend-3b-document-management.puml
index a15eb00b..89d4a68b 100644
--- a/docs/architecture/c4/l3-backend-3b-document-management.puml
+++ b/docs/architecture/c4/l3-backend-3b-document-management.puml
@@ -1,7 +1,7 @@
 @startuml
 !include
-title Component Diagram: API Backend — Document Management & Import
+title Component Diagram: API Backend — Document Management & Canonical Import
 Container(frontend, "Web Frontend", "SvelteKit")
 ContainerDb(db, "PostgreSQL", "PostgreSQL 16")
@@ -9,30 +9,48 @@ ContainerDb(minio, "Object Storage", "MinIO (S3-compatible)")
 System_Boundary(backend, "API Backend (Spring Boot)") {
     Component(docCtrl, "DocumentController", "Spring MVC — /api/documents", "CRUD for documents: search, get by ID, update metadata, upload/download file, conversation thread, batch metadata updates, and per-month density aggregation for the timeline filter widget.")
-    Component(adminCtrl, "AdminController", "Spring MVC — /api/admin", "Triggers asynchronous Excel/ODS mass import (requires ADMIN permission). Reports import state (IDLE/RUNNING/DONE/FAILED).")
+    Component(adminCtrl, "AdminController", "Spring MVC — /api/admin", "Triggers the asynchronous canonical import (requires ADMIN permission). Reports import state (IDLE/RUNNING/DONE/FAILED).")
     Component(docSvc, "DocumentService", "Spring Service", "Core document business logic: store, update, search. Resolves persons and tags, delegates file I/O to FileService, builds dynamic JPA Specifications, and integrates with audit logging.")
     Component(fileSvc, "FileService", "Spring Service", "Wraps AWS SDK v2 S3Client. Uploads files with UUID-keyed paths, computes SHA-256 hash, downloads with content-type detection, and generates presigned URLs for OCR access.")
-    Component(massImport, "MassImportService", "Spring Service — @Async", "Reads Excel/ODS files from /import mount. Tracks import state (IDLE/RUNNING/DONE/FAILED) and delegates to ExcelService. Returns immediately; processing runs asynchronously.")
-    Component(excelSvc, "ExcelService", "Spring Service", "Parses Excel/ODS workbooks (Apache POI). Column indices configurable via application.properties. Creates/updates document records per row.")
+    Component(importOrch, "CanonicalImportOrchestrator", "Spring Service — @Async", "Runs the four canonical loaders in an explicit dependency DAG (TagTree → PersonRegister → PersonTree → Document). Smoke-checks all four artifacts before starting, owns the IDLE/RUNNING/DONE/FAILED state machine, fails closed on a malformed artifact.")
+    Component(tagTreeLoader, "TagTreeImporter", "Spring Component", "Upserts the tag hierarchy from canonical-tag-tree.xlsx via TagService (by canonical tag_path).")
+    Component(personRegLoader, "PersonRegisterImporter", "Spring Component", "Upserts register persons from canonical-persons.xlsx via PersonService (by normalizer person_id).")
+    Component(personTreeLoader, "PersonTreeImporter", "Spring Component", "Upserts tree persons + relationships from canonical-persons-tree.json via PersonService and RelationshipService.")
+    Component(docLoader, "DocumentImporter", "Spring Component", "Loads canonical-documents.xlsx: routes attribution register-first (raw cell always retained in sender_text/receiver_text), parses clean dates, keeps the S3 upload + thumbnail plumbing, and ports the path-traversal / homoglyph / absolute-path / %PDF magic-byte security guards.")
+    Component(sheetReader, "CanonicalSheetReader", "POI helper", "Maps a canonical .xlsx by header name (no positional indices), splits pipe-delimited list columns, fails closed (IMPORT_ARTIFACT_INVALID) on a missing required header.")
     Component(minioConf, "MinioConfig", "Spring @Configuration", "Creates the S3Client and S3Presigner beans with path-style access for MinIO. Validates MinIO connectivity on startup.")
     Component(docRepo, "DocumentRepository", "Spring Data JPA", "Queries documents with Specification-based dynamic search, bidirectional conversation thread queries, full-text search with ranking and match highlighting, and transcription pipeline queue projections.")
     Component(docSpec, "DocumentSpecifications", "JPA Criteria API", "Factory for composable predicates: hasText (full-text), hasSender, hasReceiver, isBetween (date range), hasTags (subquery AND/OR logic).")
 }
-Component(personSvc, "PersonService", "Spring Service", "See diagram 3e. Called by DocumentService to resolve sender / receiver persons by ID.")
-Component(tagSvc, "TagService", "Spring Service", "See diagram 3d. Called by DocumentService to find or create tags by name.")
+Component(personSvc, "PersonService", "Spring Service", "See diagram 3e. Resolves sender / receiver persons by ID; upserts persons by source_ref for the importer.")
+Component(tagSvc, "TagService", "Spring Service", "See diagram 3d. Finds or creates tags by name; upserts tags by source_ref for the importer.")
+Component(relSvc, "RelationshipService", "Spring Service", "See diagram 3e. Creates family relationships from the person tree during import.")
 Rel(frontend, docCtrl, "Document requests", "HTTP / JSON")
 Rel(frontend, adminCtrl, "Trigger import", "HTTP / JSON")
 Rel(docCtrl, docSvc, "Delegates to")
-Rel(adminCtrl, massImport, "Triggers")
+Rel(adminCtrl, importOrch, "Triggers")
 Rel(docSvc, fileSvc, "Upload / download files")
 Rel(docSvc, docRepo, "Reads / writes documents")
 Rel(docSvc, docSpec, "Builds search predicates")
 Rel(docSvc, personSvc, "Resolves sender / receivers")
 Rel(docSvc, tagSvc, "Finds or creates tags")
-Rel(massImport, excelSvc, "Parses Excel/ODS file")
-Rel(excelSvc, docSvc, "Creates / updates documents")
+Rel(importOrch, tagTreeLoader, "1. Loads tags")
+Rel(importOrch, personRegLoader, "2. Loads register persons")
+Rel(importOrch, personTreeLoader, "3. Loads tree persons + relationships")
+Rel(importOrch, docLoader, "4. Loads documents")
+Rel(tagTreeLoader, sheetReader, "Reads canonical .xlsx")
+Rel(personRegLoader, sheetReader, "Reads canonical .xlsx")
+Rel(docLoader, sheetReader, "Reads canonical .xlsx")
+Rel(tagTreeLoader, tagSvc, "Upserts tags by source_ref")
+Rel(personRegLoader, personSvc, "Upserts persons by source_ref")
+Rel(personTreeLoader, personSvc, "Upserts persons by source_ref")
+Rel(personTreeLoader, relSvc, "Creates relationships")
+Rel(docLoader, docSvc, "Upserts documents by index")
+Rel(docLoader, personSvc, "Register-first match / provisional person")
+Rel(docLoader, tagSvc, "Attaches tag by source_ref")
+Rel(docLoader, fileSvc, "Uploads resolved file")
 Rel(minioConf, fileSvc, "Provides S3Client and S3Presigner beans")
 Rel(fileSvc, minio, "PUT / GET / presigned URL objects", "S3 API / HTTP")
 Rel(docRepo, db, "SQL queries", "JDBC")
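
Reviewer note on the `file`-column guard from ADR-025 decision 3: the check list (slash/backslash, slash homoglyphs, `..`, null byte, absolute path, then containment inside the import dir) could look roughly like the sketch below. This is not the project's actual `isValidImportFilename`. The class name `ImportFilenameGuard`, the helper `resolveInsideImportDir`, and the three specific homoglyph code points are assumptions made for illustration, since the patch does not spell them out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical sketch of the path-traversal guard; names are illustrative only.
final class ImportFilenameGuard {

    // ASSUMPTION: the ADR says "three Unicode slash homoglyphs" without naming them.
    // These are DIVISION SLASH, FRACTION SLASH and FULLWIDTH SOLIDUS.
    private static final String SLASH_HOMOGLYPHS = "\u2215\u2044\uFF0F";

    private ImportFilenameGuard() {
    }

    // Accepts only a plain basename; anything that could address another directory is rejected.
    static boolean isValidImportFilename(String name) {
        if (name == null || name.isBlank()) {
            return false;
        }
        if (name.indexOf('\0') >= 0) {
            return false; // null byte
        }
        if (name.indexOf('/') >= 0 || name.indexOf('\\') >= 0) {
            return false; // separators, which also covers POSIX and UNC absolute paths
        }
        for (int i = 0; i < name.length(); i++) {
            if (SLASH_HOMOGLYPHS.indexOf(name.charAt(i)) >= 0) {
                return false; // look-alike separators
            }
        }
        if (name.contains("..")) {
            return false; // deliberately conservative: also rejects names like "a..b.pdf"
        }
        // Windows drive prefix such as "C:" (the remaining absolute-path form)
        return !(name.length() >= 2 && name.charAt(1) == ':' && Character.isLetter(name.charAt(0)));
    }

    // Second layer: resolve under the import dir and verify the result is still contained in it.
    static Path resolveInsideImportDir(Path importDir, String name) {
        if (!isValidImportFilename(name)) {
            throw new IllegalArgumentException("rejected import filename");
        }
        try {
            Path base = importDir.toRealPath(); // resolves symlinks; the directory must exist
            Path candidate = base.resolve(name).normalize();
            if (!candidate.startsWith(base)) {
                throw new IllegalArgumentException("path escapes import dir");
            }
            return candidate;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A loader would presumably call `resolveInsideImportDir(importDir, fileCell)` before any read and record a `SkippedFile` with `INVALID_FILENAME_PATH_TRAVERSAL` when it throws. Keeping the basename check and the containment check as separate layers matches the defense-in-depth wording in the ADR: the second check still holds if the first one misses a pattern.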