ocr-service
Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.
What this service owns
- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API at port 8000 (internal Docker network — no external port)
What this service does NOT own
- Job lifecycle: tracked in the backend's `ocr/domain`
- MinIO storage: the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL
API endpoints
| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
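
As an illustration of how the `X-Training-Token` guard on `/train` and `/segtrain` could behave, here is a minimal sketch of the comparison step. It is not taken from `main.py`; the function name `training_token_ok` is hypothetical, and treating an empty `TRAINING_TOKEN` as "deny everything" is an assumption based on the warning not to leave the token empty in production.

```python
import hmac
from typing import Optional


def training_token_ok(header_value: Optional[str], expected: Optional[str]) -> bool:
    """Return True only when a non-empty expected token matches the header.

    `header_value` stands for the X-Training-Token request header and
    `expected` for the TRAINING_TOKEN environment variable (both assumed).
    """
    # An unset or empty token must never authorize a request.
    if not expected or not header_value:
        return False
    # compare_digest runs in constant time for equal-length inputs,
    # which avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        header_value.encode("utf-8"), expected.encode("utf-8")
    )
```

In a FastAPI app a check like this would typically sit in a dependency that reads the header and raises an HTTP 401 or 403 when it returns `False`.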
Environment variables
| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | — | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | — | SSRF protection: comma-separated allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | — | — | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`). |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | — | — | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
| `HF_HOME` | `/app/cache` | — | — | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| `XDG_CACHE_HOME` | `/app/cache` | — | — | XDG cache root (used by some Surya components alongside `HF_HOME`). |
| `TORCH_HOME` | `/app/models/torch` | — | — | PyTorch model cache. Kept on the persistent models volume. |
| `TMPDIR` | `/app/cache/.tmp` | — | — | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs (see ADR-021). |
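
To make the `ALLOWED_PDF_HOSTS` rule concrete, the following is a rough sketch of a host allowlist check. It is not the validation code in `main.py`; the helper names `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical, and restricting schemes to `http`/`https` is an assumption.

```python
from urllib.parse import urlsplit


def parse_allowed_hosts(raw: str) -> frozenset[str]:
    """Parse a comma-separated ALLOWED_PDF_HOSTS value into a set of hosts."""
    hosts = frozenset(h.strip().lower() for h in raw.split(",") if h.strip())
    # A wildcard would defeat the SSRF protection, so refuse it outright.
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return hosts


def is_allowed_pdf_url(url: str, allowed: frozenset[str]) -> bool:
    """Return True if the URL's host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. broken IPv6 brackets) are rejected.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lowercased and excludes port and userinfo, so
    # "http://minio@evil.example/" resolves to "evil.example".
    host = parts.hostname
    return host is not None and host in allowed
```

Comparing the parsed hostname against an exact set, rather than substring-matching the raw URL, is what keeps tricks such as userinfo prefixes or look-alike subdomains from slipping through.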
Key files
| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
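
The "download if missing" behavior described for `ensure_blla_model.py` can be sketched roughly as below. This is an illustration only, not the script's actual contents: `ensure_model` and the `fetch` callable are hypothetical, and where the real script downloads the model from is not covered here.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write the model to `path` only when it is missing or empty.

    `fetch` is a hypothetical callable returning the model bytes.
    Returns True if a download happened, False if the file was already there.
    """
    if path.is_file() and path.stat().st_size > 0:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so os.replace is a same-filesystem rename
    # and a crash never leaves a half-written model at `path`.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the staging file before re-raising.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```

Making the step idempotent means the entrypoint can run it on every container start without re-downloading when the models volume already holds the file.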
Backend counterpart
backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md