familienarchiv/ocr-service/README.md
Marcel cfd49ff69e
docs(ocr): document TMPDIR convention and add ADR-021
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:58:10 +02:00

ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

  • Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
  • Baseline layout analysis: Kraken BLLA model
  • Sender recognition: trained per-archive sender models
  • HTTP API on port 8000 (reachable on the internal Docker network only; no port is published externally)

What this service does NOT own

  • Job lifecycle — tracked in the backend's ocr/ domain
  • MinIO storage — the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
  • Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL

API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
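
The token check on `/train` and `/segtrain` can be pictured with a minimal sketch like the one below. It is an illustration only, not the code in `main.py`: the function name `training_token_ok` is hypothetical, and rejecting every request when the configured token is empty is an assumption about the desired behaviour.

```python
import hmac
from typing import Optional


def training_token_ok(presented: Optional[str], expected: str) -> bool:
    """Return True only if the header value matches a non-empty configured token.

    `presented` is the X-Training-Token header value (None when absent);
    `expected` is the value of TRAINING_TOKEN. Both names are illustrative.
    """
    # Assumption: an empty configured token disables training entirely
    # instead of letting an empty header match it.
    if not expected or presented is None:
        return False
    # compare_digest takes time independent of where the strings differ.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

In a FastAPI handler this would typically translate into a 401 or 403 response whenever the function returns `False`.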

Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | | SSRF protection: comma-separated list of allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | | | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`). |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | | | Kraken baseline layout analysis model. Auto-downloaded by `ensure_blla_model.py` on startup if missing. |
| `HF_HOME` | `/app/cache` | | | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| `XDG_CACHE_HOME` | `/app/cache` | | | XDG cache root (used by some Surya components alongside `HF_HOME`). |
| `TORCH_HOME` | `/app/models/torch` | | | PyTorch model cache. Kept on the persistent models volume. |
| `TMPDIR` | `/app/cache/.tmp` | | | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs; see ADR-021. |
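
One practical detail behind the `TMPDIR` row: Python's `tempfile` module silently falls back to other candidates such as `/tmp` when the directory named by `TMPDIR` does not exist or is not writable. The sketch below shows one way to make sure the staging directory exists and to check which directory `tempfile` actually picked. The helper names `prepare_staging_dir` and `active_temp_dir` are hypothetical and not taken from this repository.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping

# Default taken from the environment variables table above.
DEFAULT_STAGING = "/app/cache/.tmp"


def prepare_staging_dir(env: Mapping[str, str]) -> Path:
    """Create the directory named by TMPDIR so that tempfile can use it."""
    staging = Path(env.get("TMPDIR", DEFAULT_STAGING))
    staging.mkdir(parents=True, exist_ok=True)
    return staging


def active_temp_dir() -> Path:
    """Return the directory tempfile currently resolves to."""
    # tempfile caches its choice; clearing it forces TMPDIR to be re-read.
    tempfile.tempdir = None
    return Path(tempfile.gettempdir())


if __name__ == "__main__":
    wanted = prepare_staging_dir(os.environ)
    actual = active_temp_dir()
    if actual.resolve() != wanted.resolve():
        raise SystemExit(f"tempfile uses {actual}, expected {wanted}")
```

Running a check like this at container start would surface a misconfigured volume before a GB-scale download begins, instead of partway through it.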

Key files

| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
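
As a rough illustration of the SSRF validation attributed to `main.py`, combined with the `ALLOWED_PDF_HOSTS` setting, a host allowlist check might look like the sketch below. It is not the actual implementation: the function names are hypothetical, and restricting URLs to `http`/`https` is an assumption.

```python
from urllib.parse import urlparse


def parse_allowed_hosts(raw: str) -> frozenset:
    """Parse a comma-separated ALLOWED_PDF_HOSTS value into a set of hosts."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    # A wildcard would defeat the purpose of the allowlist, so refuse it.
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_allowed_pdf_url(url: str, allowed: frozenset) -> bool:
    """Accept only http(s) URLs whose hostname is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # parsed.hostname is lowercased and excludes userinfo and port, so
    # "http://minio@evil.example/" is checked against "evil.example".
    return parsed.hostname is not None and parsed.hostname in allowed
```

Comparing the parsed hostname, not a substring of the raw URL, is what keeps userinfo and look-alike prefix tricks from slipping through.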

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md