docs(importing): document the canonical importer rebuild

- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts;
  raw spreadsheet no longer parsed by Java) with the settled Option-A name
  policy, human-edit-preserve precedence, provisional contract, and ported
  security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the
  orchestrator, the four loaders, and CanonicalSheetReader, with the loader
  dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms;
  refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer →
  place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator +
  loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated
schemas already match the new types byte-for-byte (same fields + SkipReason
enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-27 10:44:45 +02:00
parent 9cc682cf72
commit 21c85ff081
6 changed files with 97 additions and 18 deletions

View File

@@ -559,20 +559,39 @@ bash scripts/download-kraken-models.sh
> Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.
### Trigger a mass import (Excel/ODS)
### Trigger a canonical import
**Dev:** drop the ODS spreadsheet + PDFs into `./import/` at the repo root — the dev compose bind-mounts it to `/import` automatically.
The importer no longer parses the raw spreadsheet. It consumes the **canonical artifacts**
produced by the normalizer (`tools/import-normalizer/`) — `canonical-tag-tree.xlsx`,
`canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx` — which
are committed under `tools/import-normalizer/out/`. The semantic transformation
(German-date parsing, name classification) lives entirely in the normalizer; the backend
maps the clean columns by header name. See [ADR-025](adr/025-canonical-import-and-single-migration-schema-foundation.md).
**Prerequisite — regenerate the artifacts when the source data changes:**
```bash
cd tools/import-normalizer
python -m normalizer # or the documented normalizer entrypoint
# writes the four canonical artifacts into ./out/
```
**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/`
at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
The orchestrator smoke-checks that all four artifacts are present before starting and fails
closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
**Staging/production:**
1. Pre-stage the payload on the host. Convention: `/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
1. Pre-stage the four canonical artifacts + PDFs on the host. Convention:
`/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
```bash
rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
```
2. Make sure `IMPORT_HOST_DIR=<host-path>` is set in `.env.staging` / `.env.production` (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
3. Redeploy the stack so the bind mount picks up — or, if the mount is already in place, skip to step 4.
4. Call `POST /api/admin/trigger-import` (requires `ADMIN` permission), or click the "Import starten" button on `/admin/system`.
5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs.
5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs. Re-running is safe: the import is idempotent (upsert by `source_ref` / document `index`) and never overwrites a human-edited field.
---