# OCR Service — Familienarchiv

## Overview

Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.
## Tech Stack

- Framework: FastAPI 0.115.6 (Python 3.11)
- OCR Engines:
  - Surya (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
  - Kraken (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- ML: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- PDF Processing: `pypdfium2` (rendering), `pillow`
- Image Processing: `opencv-python-headless`, `pyvips`
- Spell Checking: `pyspellchecker`
- HTTP Client: `httpx`
## Architecture

The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.
## Interface Contract

Request:

```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

Response: Array of `OcrBlock` objects:

```json
[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized (0-1) relative to page dimensions.
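The contract above maps naturally onto Pydantic models. A hedged sketch of what `models.py` might declare — only the field names and value ranges come from the contract; the constraints and exact definitions are assumptions:

```python
# Illustrative sketch of the request/response models; field names follow the
# interface contract, everything else (validators, constraints) is assumed.
from pydantic import BaseModel, Field


class OcrRequest(BaseModel):
    pdfUrl: str
    scriptType: str  # e.g. "HANDWRITING_KURRENT"
    language: str    # ISO 639-1 code, e.g. "de"


class OcrBlock(BaseModel):
    pageNumber: int
    # Coordinates are normalized to page dimensions, so each lies in [0, 1].
    x: float = Field(ge=0.0, le=1.0)
    y: float = Field(ge=0.0, le=1.0)
    width: float = Field(ge=0.0, le=1.0)
    height: float = Field(ge=0.0, le=1.0)
    polygon: list[list[float]]
    text: str


block = OcrBlock(
    pageNumber=0,
    x=0.12, y=0.08, width=0.76, height=0.04,
    polygon=[[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    text="Sehr geehrter Herr ...",
)
print(block.pageNumber)  # 0
```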
## File Structure

```
ocr-service/
├── main.py               # FastAPI app, endpoints, request handling
├── models.py             # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│   ├── __init__.py
│   ├── kraken.py         # Kraken engine wrapper (Kurrent models)
│   └── surya.py          # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py      # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py         # Confidence scoring and thresholding
├── spell_check.py        # Post-OCR spell correction
├── ensure_blla_model.py  # Model download / verification helper
├── dictionaries/         # Historical word lists for spell checking
├── requirements.txt      # Python dependencies
├── Dockerfile            # Production container image
└── entrypoint.sh         # Container startup script
```
## Key Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |
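For quick manual testing of `/ocr` against a running instance, the request can be built with only the standard library. The service URL and the presigned PDF URL below are placeholders; only the body shape comes from the interface contract:

```python
# Hedged sketch: construct a POST request matching the /ocr contract.
import json
import urllib.request


def build_ocr_request(pdf_url: str, script_type: str, language: str,
                      service_url: str = "http://localhost:8000/ocr") -> urllib.request.Request:
    body = json.dumps({
        "pdfUrl": pdf_url,
        "scriptType": script_type,
        "language": language,
    }).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ocr_request(
    "http://minio:9000/archive-documents/abc.pdf?presigned...",
    "HANDWRITING_KURRENT",
    "de",
)
# Sending it would then be: urllib.request.urlopen(req, timeout=600)
print(req.get_method())  # POST
```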
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
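`ALLOWED_PDF_HOSTS` implies a host allowlist check before the service fetches a PDF. A hedged sketch of that kind of check — the function name and exact behaviour are assumptions, not the service's actual code:

```python
# Hypothetical SSRF guard: only fetch PDFs from explicitly allowed hosts.
import os
from urllib.parse import urlparse


def is_allowed_pdf_url(url: str) -> bool:
    allowed = os.environ.get("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1")
    hosts = {h.strip() for h in allowed.split(",") if h.strip()}
    parsed = urlparse(url)
    # Reject non-HTTP schemes (file://, gopher://, ...) and unknown hosts.
    return parsed.scheme in ("http", "https") and parsed.hostname in hosts


print(is_allowed_pdf_url("http://minio:9000/archive-documents/abc.pdf"))  # True
print(is_allowed_pdf_url("http://169.254.169.254/latest/meta-data"))      # False
```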
## How to Run

### Local Development (Python venv)

```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate

# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000

# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```
### Docker (via docker-compose)

The OCR service is included in the root `docker-compose.yml`:

```bash
docker-compose up -d ocr-service
```

The container:

- Exposes port 8000 internally (not mapped to the host by default)
- Mounts the `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB
## Model Downloads

Use the helper script to download Kraken models:

```bash
./scripts/download-kraken-models.sh
```

Models are stored in the `ocr_models` Docker volume, or in `./ocr-service/models/` locally.
## Testing

Only a subset of tests can run without the full ML stack:

```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker

# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```

Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or in a fully provisioned venv.
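The pure-logic tests above exercise helpers of roughly this shape. A hedged sketch of script-aware confidence thresholding, based only on the environment variables documented earlier; the function names and the `TYPEWRITTEN` value are assumptions:

```python
# Hypothetical threshold selection: Kurrent recognition is noisier, so it
# gets a stricter default (0.5) than Latin scripts (0.3).
import os


def confidence_threshold(script_type: str) -> float:
    if script_type == "HANDWRITING_KURRENT":
        return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", "0.5"))
    return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))


def keep_block(confidence: float, script_type: str) -> bool:
    # Blocks below the script-specific threshold are dropped.
    return confidence >= confidence_threshold(script_type)


print(keep_block(0.42, "HANDWRITING_KURRENT"))  # False
print(keep_block(0.42, "TYPEWRITTEN"))          # True
```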
## Training

The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.