familienarchiv/ocr-service/CLAUDE.md
2026-05-05 12:39:20 +02:00

OCR Service — Familienarchiv

Overview

Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.

Tech Stack

  • Framework: FastAPI 0.115.6 (Python 3.11)
  • OCR Engines:
    • Surya (surya-ocr) — Transformer-based, handles typewritten and modern Latin handwriting
    • Kraken (kraken==7.0) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
  • ML: PyTorch 2.7.1 (CPU-only), torchvision, transformers
  • PDF Processing: pypdfium2 (rendering), pillow
  • Image Processing: opencv-python-headless, pyvips
  • Spell Checking: pyspellchecker
  • HTTP Client: httpx

Architecture

The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.

Interface Contract

Request:

```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

Response: Array of OcrBlock objects:

```json
[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized (0-1) relative to page dimensions.
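The contract above maps naturally onto Pydantic models. A minimal sketch (field names follow the JSON shown here; the actual definitions in models.py may differ):

```python
from typing import List, Tuple

from pydantic import BaseModel, Field


class OcrRequest(BaseModel):
    pdfUrl: str           # presigned MinIO URL
    scriptType: str       # e.g. "HANDWRITING_KURRENT"
    language: str = "de"


class OcrBlock(BaseModel):
    pageNumber: int
    # Coordinates are normalized (0-1) relative to page dimensions.
    x: float = Field(ge=0.0, le=1.0)
    y: float = Field(ge=0.0, le=1.0)
    width: float = Field(ge=0.0, le=1.0)
    height: float = Field(ge=0.0, le=1.0)
    polygon: List[Tuple[float, float]]
    text: str
```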

File Structure

```
ocr-service/
├── main.py                  # FastAPI app, endpoints, request handling
├── models.py                # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│   ├── __init__.py
│   ├── kraken.py            # Kraken engine wrapper (Kurrent models)
│   └── surya.py             # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py         # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py            # Confidence scoring and thresholding
├── spell_check.py           # Post-OCR spell correction
├── ensure_blla_model.py     # Model download / verification helper
├── dictionaries/            # Historical word lists for spell checking
├── requirements.txt         # Python dependencies
├── Dockerfile               # Production container image
└── entrypoint.sh            # Container startup script
```

Key Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns 200 only after models are loaded |
| /ocr | POST | Extract text blocks from a PDF URL |
| /ocr/stream | POST | Streaming OCR with SSE-style progress events |
| /training/submit | POST | Submit training data for model fine-tuning |
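The exact payload of the /ocr/stream progress events is not documented here. As an illustration, a client could parse an SSE-style body by collecting the "data:" lines, assuming each carries a JSON event:

```python
import json


def parse_sse_events(raw: str) -> list:
    """Collect SSE-style 'data: ...' lines into a list of JSON events.

    The event shape (keys like "page"/"progress") is an assumption for
    illustration, not the documented wire format.
    """
    events = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events
```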

Environment Variables

| Variable | Default | Description |
|---|---|---|
| KRAKEN_MODEL_PATH | /app/models/german_kurrent.mlmodel | Path to Kraken model file |
| TRAINING_TOKEN | "" | Bearer token required for training endpoints |
| OCR_CONFIDENCE_THRESHOLD | 0.3 | Minimum confidence for Latin scripts |
| OCR_CONFIDENCE_THRESHOLD_KURRENT | 0.5 | Minimum confidence for Kurrent scripts |
| RECOGNITION_BATCH_SIZE | 16 | Kraken recognition batch size |
| DETECTOR_BATCH_SIZE | 8 | Surya detector batch size |
| OCR_CLAHE_CLIP_LIMIT | 2.0 | CLAHE contrast enhancement limit |
| OCR_CLAHE_TILE_SIZE | 8 | CLAHE tile grid size |
| OCR_MAX_CACHED_MODELS | 2 | LRU model cache size (~500 MB each) |
| ALLOWED_PDF_HOSTS | minio,localhost,127.0.0.1 | SSRF protection — allowed PDF URL hosts |
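A sketch of how these variables might be read at startup, including the host allow-list check behind ALLOWED_PDF_HOSTS (the real SSRF logic in main.py may differ):

```python
import os
from urllib.parse import urlparse

# Defaults mirror the table above.
CONFIDENCE_THRESHOLD = float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))
CONFIDENCE_THRESHOLD_KURRENT = float(
    os.environ.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", "0.5")
)
ALLOWED_PDF_HOSTS = set(
    os.environ.get("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1").split(",")
)


def is_allowed_pdf_url(url: str) -> bool:
    """Reject PDF URLs whose host is not on the allow-list (SSRF protection)."""
    return urlparse(url).hostname in ALLOWED_PDF_HOSTS
```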

How to Run

Local Development (Python venv)

```shell
cd ocr-service
python -m venv .venv
source .venv/bin/activate

# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000

# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

Docker (via docker-compose)

The OCR service is included in the root docker-compose.yml:

```shell
docker-compose up -d ocr-service
```

The container:

  • Exposes port 8000 internally (not mapped to host by default)
  • Mounts ocr_models and ocr_cache volumes for persistence
  • Has a 120-second startup grace period for model loading
  • Memory limit: 12 GB
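Putting those constraints together, the compose entry might look roughly like this (the service name, build context, and volume mount paths are assumptions; check the root docker-compose.yml for the authoritative definition):

```yaml
  ocr-service:
    build: ./ocr-service
    expose:
      - "8000"                    # internal only, not published to the host
    volumes:
      - ocr_models:/app/models    # mount path assumed
      - ocr_cache:/app/cache      # mount path assumed
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      start_period: 120s          # grace period for model loading
    mem_limit: 12g
```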

Model Downloads

Use the helper script to download Kraken models:

```shell
./scripts/download-kraken-models.sh
```

Models are stored in the ocr_models Docker volume or ./ocr-service/models/ locally.

Testing

Only a subset of tests can run without the full ML stack:

```shell
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker

# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```

Tests requiring PyTorch/Kraken/Surya (e.g., test_engines.py) must be run in the Docker container or a fully provisioned venv.
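Pure-logic tests in this style need no ML imports at all. A hypothetical example of the kind of thresholding check test_confidence.py might contain (helper and test names invented for illustration):

```python
def filter_low_confidence(blocks: list, threshold: float) -> list:
    """Drop OCR blocks whose confidence is below the threshold (illustrative helper)."""
    return [b for b in blocks if b["confidence"] >= threshold]


def test_kurrent_threshold_drops_uncertain_blocks():
    blocks = [
        {"text": "Brief", "confidence": 0.62},
        {"text": "???", "confidence": 0.41},
    ]
    # Kurrent uses the stricter 0.5 threshold (OCR_CONFIDENCE_THRESHOLD_KURRENT).
    kept = filter_low_confidence(blocks, threshold=0.5)
    assert [b["text"] for b in kept] == ["Brief"]
```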

Training

The service supports in-process model fine-tuning via Kraken's ketos training pipeline. Training endpoints require the TRAINING_TOKEN bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
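The bearer check can be as simple as comparing the Authorization header against TRAINING_TOKEN. A sketch under that assumption (the real check in main.py may differ):

```python
import hmac
import os


def is_training_request_authorized(authorization_header: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header against TRAINING_TOKEN.

    Uses hmac.compare_digest for a constant-time comparison. An empty
    TRAINING_TOKEN (the default) disables the training endpoints entirely.
    """
    expected = os.environ.get("TRAINING_TOKEN", "")
    if not expected:
        return False
    return hmac.compare_digest(authorization_header, f"Bearer {expected}")
```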