ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

  • Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
  • Baseline layout analysis: Kraken BLLA model
  • Sender recognition: trained per-archive sender models
  • HTTP API on port 8000 (reachable only on the internal Docker network; no external port is published)

What this service does NOT own

  • Job lifecycle — tracked in the backend's ocr/ domain
  • MinIO storage — the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
  • Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL

API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| POST /ocr | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| POST /train | X-Training-Token header | Trigger sender-model training |
| POST /segtrain | X-Training-Token header | Trigger segmentation training |
| GET /health | None | Health check |
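
The X-Training-Token check on /train and /segtrain could look roughly like the sketch below. It is not taken from main.py: the function name is_training_request_authorized is hypothetical, and rejecting every request when TRAINING_TOKEN is empty is an assumption about the intended behaviour. It uses only the standard library so it can be read independently of FastAPI.

```python
import hmac
import os
from typing import Mapping, Optional


def is_training_request_authorized(
    headers: Mapping[str, str],
    expected_token: Optional[str] = None,
) -> bool:
    """Return True only if X-Training-Token matches the configured token.

    Hypothetical helper for illustration; the real service may differ.
    """
    if expected_token is None:
        expected_token = os.environ.get("TRAINING_TOKEN", "")
    # Assumption: an empty or unset token disables training entirely,
    # so the request is rejected instead of being silently allowed.
    if not expected_token:
        return False
    # HTTP header names are case-insensitive, so normalise before matching.
    supplied = ""
    for name, value in headers.items():
        if name.lower() == "x-training-token":
            supplied = value
            break
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_token.encode("utf-8")
    )
```

In a FastAPI handler, a False result would typically be turned into a 401 or 403 response before any training work starts.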

Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| TRAINING_TOKEN | | YES (prod) | YES | Guards /train and /segtrain. Do not leave empty in production. |
| ALLOWED_PDF_HOSTS | minio,localhost,127.0.0.1 | YES | | SSRF protection: comma-separated list of allowed PDF source hosts. Never set to *. |
| KRAKEN_MODEL_PATH | /app/models/ | | | Directory where Kraken HTR models are stored (populated by download-kraken-models.sh) |
| BLLA_MODEL_PATH | /app/models/blla.mlmodel | | | Kraken baseline layout analysis model. Auto-downloaded via ensure_blla_model.py on startup if missing. |
| HF_HOME | /app/cache | | | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| XDG_CACHE_HOME | /app/cache | | | XDG cache root (used by some Surya components alongside HF_HOME). |
| TORCH_HOME | /app/models/torch | | | PyTorch model cache. Kept on the persistent models volume. |
| TMPDIR | /app/cache/.tmp | | | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB /tmp tmpfs (see ADR-021). |
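
To illustrate how ALLOWED_PDF_HOSTS can act as an SSRF guard, here is a minimal sketch of a host allowlist check. The function names parse_allowed_hosts and is_pdf_url_allowed are hypothetical and the actual validation in main.py may differ; the sketch only assumes that the variable holds a comma-separated host list, as described in the table.

```python
import os
from urllib.parse import urlsplit

# Default taken from the table above.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str) -> frozenset:
    """Split a comma-separated host list and refuse a wildcard entry."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_pdf_url_allowed(url: str, allowed_hosts: frozenset) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are rejected.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lowercased and excludes port and userinfo, so a URL
    # such as http://minio@evil.example/ is judged by its real host.
    host = parts.hostname
    return host is not None and host in allowed_hosts


if __name__ == "__main__":
    allowed = parse_allowed_hosts(
        os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    )
    print(is_pdf_url_allowed("http://minio:9000/bucket/doc.pdf", allowed))
```

Comparing the parsed hostname rather than searching the raw URL string is the important design choice here, since substring checks are easy to bypass with userinfo or path tricks.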

Key files

| File | Purpose |
|---|---|
| main.py | FastAPI app, endpoint definitions, SSRF validation |
| engines/ | Surya and Kraken engine wrappers |
| models.py | Pydantic request/response models |
| preprocessing.py | PDF-to-image conversion before OCR |
| confidence.py | Per-block confidence scoring |
| spell_check.py | Post-OCR spell correction using historical dictionaries |
| ensure_blla_model.py | Startup script that downloads the BLLA model if missing |
| entrypoint.sh | Docker entrypoint: runs ensure_blla_model.py, then starts the server |
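
The "download if missing" step performed at startup can be sketched as follows. This is not the content of ensure_blla_model.py: the function ensure_model and its fetch callback are hypothetical, and the download source is deliberately left to the caller because the README does not state it. The sketch stages the file next to its destination and renames it into place, so an interrupted download never leaves a half-written model behind.

```python
import os
import tempfile
from typing import Callable


def ensure_model(path: str, fetch: Callable[[str], None]) -> bool:
    """Ensure a model file exists at `path`; return True if it was fetched.

    `fetch` is a caller-supplied function (hypothetical) that writes the
    model to the staging path it is given.
    """
    # Nothing to do when a non-empty model file is already present.
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return False
    target_dir = os.path.dirname(path) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Stage in the target directory so the final rename stays on one
    # filesystem and is atomic.
    fd, staging_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
    os.close(fd)
    try:
        fetch(staging_path)
        os.replace(staging_path, path)
    except BaseException:
        # Clean up the partial file before propagating the error.
        if os.path.exists(staging_path):
            os.remove(staging_path)
        raise
    return True
```

A caller would pass the value of BLLA_MODEL_PATH as path and exit with a non-zero status if ensure_model raises, so that the entrypoint does not start the server without a model.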

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md