docs(importing): document the canonical importer rebuild

- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts; raw spreadsheet no longer parsed by Java) with the settled Option-A name policy, human-edit-preserve precedence, provisional contract, and ported security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the orchestrator, the four loaders, and CanonicalSheetReader, with the loader dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms; refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer → place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator + loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated schemas already match the new types byte-for-byte (same fields + SkipReason enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -559,20 +559,39 @@ bash scripts/download-kraken-models.sh
 
 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.
 
-### Trigger a mass import (Excel/ODS)
+### Trigger a canonical import
 
-**Dev:** drop the ODS spreadsheet + PDFs into `./import/` at the repo root — the dev compose bind-mounts it to `/import` automatically.
+The importer no longer parses the raw spreadsheet. It consumes the **canonical artifacts**
+produced by the normalizer (`tools/import-normalizer/`) — `canonical-tag-tree.xlsx`,
+`canonical-persons.xlsx`, `canonical-persons-tree.json`, `canonical-documents.xlsx` — which
+are committed under `tools/import-normalizer/out/`. The semantic transformation
+(German-date parsing, name classification) lives entirely in the normalizer; the backend
+maps the clean columns by header name. See [ADR-025](adr/025-canonical-import-and-single-migration-schema-foundation.md).
+
+**Prerequisite — regenerate the artifacts when the source data changes:**
+
+```bash
+cd tools/import-normalizer
+python -m normalizer # or the documented normalizer entrypoint
+# writes the four canonical artifacts into ./out/
+```
+
+**Dev:** place all four canonical artifacts **plus** the referenced PDFs into `./import/`
+at the repo root (the dev compose bind-mounts it to `/import`, which is `app.import.dir`).
+The orchestrator smoke-checks that all four artifacts are present before starting and fails
+closed (`IMPORT_ARTIFACT_INVALID`) if any is missing.
 
 **Staging/production:**
 
-1. Pre-stage the payload on the host. Convention: `/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
+1. Pre-stage the four canonical artifacts + PDFs on the host. Convention:
+`/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
 ```bash
 rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
 ```
 2. Make sure `IMPORT_HOST_DIR=<host-path>` is set in `.env.staging` / `.env.production` (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
 3. Redeploy the stack so the bind mount picks up — or, if the mount is already in place, skip to step 4.
 4. Call `POST /api/admin/trigger-import` (requires `ADMIN` permission), or click the "Import starten" button on `/admin/system`.
-5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs.
+5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs. Re-running is safe: the import is idempotent (upsert by `source_ref` / document `index`) and never overwrites a human-edited field.
 
 ---
 
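
The runbook in this hunk says the orchestrator fails closed with `IMPORT_ARTIFACT_INVALID` when any of the four canonical artifacts is missing. An operator could catch that earlier with a local pre-flight check before triggering the import. The following is a minimal Python sketch of such a check, assuming only the four filenames listed in the hunk. The helper name `missing_artifacts` is hypothetical and not part of the repository, and the backend's own check may cover more than presence.

```python
from pathlib import Path

# The four canonical artifact filenames named in the runbook.
CANONICAL_ARTIFACTS = (
    "canonical-tag-tree.xlsx",
    "canonical-persons.xlsx",
    "canonical-persons-tree.json",
    "canonical-documents.xlsx",
)


def missing_artifacts(import_dir):
    """Return the canonical artifact names that are not regular files in import_dir.

    Hypothetical operator-side helper: it checks presence only and does not
    validate the contents of any artifact.
    """
    root = Path(import_dir)
    return [name for name in CANONICAL_ARTIFACTS if not (root / name).is_file()]
```

Calling `missing_artifacts("./import")` before step 4 returns an empty list when the payload is complete, and the names of the absent files otherwise.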
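
Step 5 of the new runbook states that a re-run is idempotent (upsert by `source_ref`) and never overwrites a human-edited field. The sketch below illustrates that precedence rule in Python with plain dictionaries. It is not the Java loader code: the commit does not say how human edits are tracked, so the per-record set of edited field names is an assumption, and `upsert`, `store`, and `human_edited` are hypothetical names.

```python
def upsert(store, source_ref, incoming, human_edited):
    """Upsert incoming fields into store[source_ref], skipping human-edited fields.

    store:        dict mapping source_ref -> dict of field values (stand-in for a table)
    incoming:     dict of field values read from a canonical artifact
    human_edited: dict mapping source_ref -> set of field names a person has edited
                  (assumed tracking mechanism, not taken from the commit)
    """
    record = store.setdefault(source_ref, {})  # insert on first sight of the key
    protected = human_edited.get(source_ref, set())
    for field, value in incoming.items():
        if field not in protected:  # human edits take precedence over the import
            record[field] = value
    return record
```

Applying the same `incoming` payload twice leaves the record unchanged after the first call, which is the property that makes a re-run safe.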