# OCR Service — Familienarchiv

## Overview

Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.

## Tech Stack

- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
  - **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
  - **Kraken** (`kraken==7.0`) — historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`

## Architecture

The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.

### Interface Contract

**Request:**

```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

**Response:** an array of `OcrBlock` objects:

```json
[
  {
    "pageNumber": 0,
    "x": 0.12,
    "y": 0.08,
    "width": 0.76,
    "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized (0–1) relative to the page dimensions.
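Consumers of the response map the normalized coordinates back to pixel space for the rendered page. A minimal sketch of that conversion, assuming an A4 page rendered at 300 DPI (2480×3508 px); the `to_pixels` helper is illustrative and not part of the service:

```python
# Illustrative helper (not part of the service): converts one OcrBlock's
# normalized (0-1) bounding box back to pixel coordinates on the rendered page.

def to_pixels(block: dict, page_width: int, page_height: int) -> dict:
    """Scale the normalized box of an OcrBlock to pixel units."""
    return {
        "x": round(block["x"] * page_width),
        "y": round(block["y"] * page_height),
        "width": round(block["width"] * page_width),
        "height": round(block["height"] * page_height),
    }

# Example block taken from the response contract above
block = {"pageNumber": 0, "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
         "text": "Sehr geehrter Herr ..."}

print(to_pixels(block, page_width=2480, page_height=3508))
# → {'x': 298, 'y': 281, 'width': 1885, 'height': 140}
```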
### File Structure

```
ocr-service/
├── main.py               # FastAPI app, endpoints, request handling
├── models.py             # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│   ├── __init__.py
│   ├── kraken.py         # Kraken engine wrapper (Kurrent models)
│   └── surya.py          # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py      # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py         # Confidence scoring and thresholding
├── spell_check.py        # Post-OCR spell correction
├── ensure_blla_model.py  # Model download / verification helper
├── dictionaries/         # Historical word lists for spell checking
├── requirements.txt      # Python dependencies
├── Dockerfile            # Production container image
└── entrypoint.sh         # Container startup script
```

### Key Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to the Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |

## How to Run

### Local Development (Python venv)

```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate

# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000

# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

### Docker (via docker-compose)

The OCR service is included in the root `docker-compose.yml`:

```bash
docker-compose up -d ocr-service
```

The container:

- Exposes port 8000 internally (not mapped to the host by default)
- Mounts the `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB

### Model Downloads

Use the helper script to download Kraken models:

```bash
./scripts/download-kraken-models.sh
```

Models are stored in the `ocr_models` Docker volume, or in `./ocr-service/models/` locally.

## Testing

Only a subset of tests can run without the full ML stack:

```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker

# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```

Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or in a fully provisioned venv.

## Training

The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
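The per-script confidence thresholds from the Environment Variables table suggest a simple filtering step after recognition. A hedged sketch of the kind of logic `confidence.py` might contain; the function names, the block shape, and the `TYPEWRITTEN` script-type value are assumptions, not confirmed by the source:

```python
import os

def threshold_for(script_type: str) -> float:
    """Pick the minimum confidence based on script type (hypothetical helper)."""
    if "KURRENT" in script_type:
        # Kurrent/Sütterlin recognition is noisier, so a stricter default applies
        return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", "0.5"))
    return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))

def filter_blocks(blocks: list[dict], script_type: str) -> list[dict]:
    """Drop blocks whose confidence falls below the script-specific threshold."""
    threshold = threshold_for(script_type)
    return [b for b in blocks if b["confidence"] >= threshold]

blocks = [{"text": "Sehr", "confidence": 0.9},
          {"text": "geehrter", "confidence": 0.4}]
print(filter_blocks(blocks, "HANDWRITING_KURRENT"))  # keeps only the 0.9 block
print(filter_blocks(blocks, "TYPEWRITTEN"))          # keeps both (0.4 >= 0.3)
```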
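The `ALLOWED_PDF_HOSTS` variable described under Environment Variables guards against SSRF when the service fetches `pdfUrl`. A minimal sketch of such a check using only the standard library; the function name is an assumption:

```python
import os
from urllib.parse import urlparse

def is_allowed_pdf_url(url: str) -> bool:
    """Accept only PDF URLs whose host is on the ALLOWED_PDF_HOSTS allowlist."""
    allowed = {h.strip() for h in
               os.environ.get("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1").split(",")}
    host = urlparse(url).hostname  # strips port, userinfo, etc.
    return host is not None and host in allowed

print(is_allowed_pdf_url("http://minio:9000/archive-documents/abc.pdf"))  # True
print(is_allowed_pdf_url("http://evil.example/abc.pdf"))                  # False
```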