# ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

## What this service owns

- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API on port 8000 (internal Docker network only, no external port)

## What this service does NOT own

- Job lifecycle: tracked in the backend's `ocr/` domain
- MinIO storage: the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage: results are streamed back to the backend, which writes them to PostgreSQL

## API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |

## Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | — | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | — | SSRF protection: comma-separated list of allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | — | — | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | — | — | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |

## Key files

| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |

## Backend counterpart

`backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md`
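
## Example: host allowlist check (sketch)

The `ALLOWED_PDF_HOSTS` variable is described above as SSRF protection for the PDF URL passed to `POST /ocr`. The following is a minimal sketch of how such a check might look, not the code in `main.py`. The function names `load_allowed_hosts` and `is_allowed_pdf_url` are hypothetical, and the restriction to `http`/`https` schemes is an assumption; only the variable name, its default value, and the "never `*`" rule come from this README.

```python
import os
from urllib.parse import urlsplit

# Default taken from the environment variable table in this README.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def load_allowed_hosts(raw=None):
    """Parse a comma-separated host list (hypothetical helper).

    Falls back to the ALLOWED_PDF_HOSTS environment variable, then to
    the documented default. Rejects a wildcard entry outright.
    """
    if raw is None:
        raw = os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_allowed_pdf_url(url, allowed_hosts):
    """Return True only if the URL's host is in the allowlist (hypothetical helper)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are rejected.
        return False
    # Assumption: only plain HTTP(S) sources are legitimate.
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lower-cased and excludes port and userinfo, so
    # "http://minio@evil.example/" resolves to "evil.example".
    host = parts.hostname
    return host is not None and host in allowed_hosts
```

Comparing the parsed hostname, rather than searching the raw URL string for an allowed name, is what keeps inputs such as `http://minio@evil.example/x.pdf` from slipping through.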