Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6, so this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md

New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md

New `.devcontainer/README.md` with all configuration, usage, and limitations. `.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md

New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md

Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md

- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md

- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md

- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

## What this service owns

- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API at port 8000 (internal Docker network — no external port)

## What this service does NOT own

- Job lifecycle: tracked in the backend's `ocr/` domain
- MinIO storage: the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage: results are streamed back to the backend, which writes them to PostgreSQL

## API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
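
The token guard on `/train` and `/segtrain` can be illustrated with a small standalone check. This is a minimal sketch, not the code in `main.py`: the function name `is_training_request_authorized` is hypothetical, and rejecting every request when the token is empty is an assumption of this sketch rather than confirmed behaviour of the service.

```python
import hmac
import os
from typing import Mapping, Optional


def is_training_request_authorized(
    headers: Mapping[str, str],
    expected_token: Optional[str] = None,
) -> bool:
    """Return True only if the X-Training-Token header matches the configured token.

    Hypothetical helper for illustration; the real check lives in the service itself.
    """
    if expected_token is None:
        expected_token = os.environ.get("TRAINING_TOKEN", "")

    # Assumption of this sketch: an unset or empty token authorizes nobody.
    if not expected_token:
        return False

    # HTTP header names are case-insensitive, so compare them lowercased.
    supplied = ""
    for name, value in headers.items():
        if name.lower() == "x-training-token":
            supplied = value
            break

    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_token.encode("utf-8")
    )
```

Passing `expected_token` explicitly keeps the check easy to test; leaving it out falls back to the `TRAINING_TOKEN` environment variable.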

## Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | - | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | - | SSRF protection: comma-separated list of allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | - | - | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | - | - | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
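
To make the `ALLOWED_PDF_HOSTS` rule concrete, the following sketch shows one way a host allowlist can be parsed and applied to an incoming PDF URL. The helper names `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical and the logic is an assumption about how such a check might look; the actual validation in `main.py` may differ.

```python
from typing import FrozenSet
from urllib.parse import urlsplit

# Default taken from the table above.
DEFAULT_ALLOWED_PDF_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str) -> FrozenSet[str]:
    """Split a comma-separated host list into a normalized set (hypothetical helper)."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    # A wildcard would defeat the SSRF protection, so refuse it outright.
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_allowed_pdf_url(url: str, allowed_hosts: FrozenSet[str]) -> bool:
    """Return True if the URL is http(s) and its host is on the allowlist."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased, without port or userinfo
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ("http", "https"):
        return False
    return host is not None and host in allowed_hosts
```

Comparing `hostname` rather than the raw URL string matters: a URL such as `http://minio@evil.example/doc.pdf` contains the text `minio` but resolves to `evil.example`, and the sketch rejects it.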

## Key files

| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
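
The "download if missing" step performed at startup can be sketched as below. This is an illustration only, not the contents of `ensure_blla_model.py`: the function `ensure_model` and its `download` callback are hypothetical, and the download source is deliberately left to the caller because it is not stated here.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(model_path: Path, download: Callable[[Path], None]) -> bool:
    """Fetch the model only if it is missing. Returns True if a download happened.

    `download` is a hypothetical callback that writes the model to the given path.
    """
    # A non-empty file is treated as an existing model (assumption of this sketch).
    if model_path.is_file() and model_path.stat().st_size > 0:
        return False

    model_path.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file in the same directory, then rename it into place,
    # so an interrupted download never leaves a truncated model at model_path.
    fd, tmp_name = tempfile.mkstemp(dir=model_path.parent, suffix=".part")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        download(tmp_path)
        os.replace(tmp_path, model_path)
    finally:
        # Clean up the partial file if the download or rename failed.
        if tmp_path.exists():
            tmp_path.unlink()
    return True
```

A startup script could call this with `Path(os.environ.get("BLLA_MODEL_PATH", "/app/models/blla.mlmodel"))` before the server starts; a second call is then a no-op.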

## Backend counterpart

`backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md`