# OCR Service — Familienarchiv

## Overview

Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.

## Tech Stack

- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
  - **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
  - **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`

## Architecture

The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.

### Interface Contract

**Request:**
```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

**Response:** Array of `OcrBlock` objects:
```json
[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized to the range 0-1 relative to the page dimensions.
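
To draw a block on a rendered page image, a consumer has to scale the normalized values back to pixels. The following is a minimal sketch, assuming the pixel size of the rendered page is known. `OcrBlock` here is a plain dataclass that mirrors the JSON fields shown above (it is not the service's Pydantic model), and `to_pixel_box` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class OcrBlock:
    """Illustrative mirror of one response item; field names follow the JSON above."""
    pageNumber: int
    x: float
    y: float
    width: float
    height: float
    polygon: list[list[float]]
    text: str


def to_pixel_box(block: OcrBlock, page_width_px: int, page_height_px: int) -> tuple[int, int, int, int]:
    """Scale a normalized bounding box to integer pixels: (left, top, width, height)."""
    return (
        round(block.x * page_width_px),
        round(block.y * page_height_px),
        round(block.width * page_width_px),
        round(block.height * page_height_px),
    )


block = OcrBlock(
    pageNumber=0,
    x=0.12, y=0.08, width=0.76, height=0.04,
    polygon=[[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    text="Sehr geehrter Herr ...",
)
# Page size of 1000 x 2000 px is an arbitrary example value.
print(to_pixel_box(block, page_width_px=1000, page_height_px=2000))
```

The same scaling applies to each `polygon` point: multiply the first component by the page width and the second by the page height.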

### File Structure

```
ocr-service/
├── main.py                 # FastAPI app, endpoints, request handling
├── models.py               # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│   ├── __init__.py
│   ├── kraken.py           # Kraken engine wrapper (Kurrent models)
│   └── surya.py            # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py        # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py           # Confidence scoring and thresholding
├── spell_check.py          # Post-OCR spell correction
├── ensure_blla_model.py    # Model download / verification helper
├── dictionaries/           # Historical word lists for spell checking
├── requirements.txt        # Python dependencies
├── Dockerfile              # Production container image
└── entrypoint.sh           # Container startup script
```

### Key Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |
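
A client of `/ocr/stream` has to split the response into individual events. The sketch below parses standard Server-Sent Events framing (`data:` lines terminated by a blank line) and assumes each event carries a JSON payload. The payload fields `page` and `total` are made up for illustration; the service's actual event schema is not documented here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # SSE drops one optional space after the colon.
            buffer.append(line[5:].removeprefix(" "))
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield json.loads("\n".join(buffer))
            buffer = []
    if buffer:
        # Leniency chosen for this sketch: accept a final event without a trailing blank line.
        yield json.loads("\n".join(buffer))


# Hypothetical event payloads, only to exercise the parser.
sample = [
    'data: {"page": 1, "total": 2}',
    "",
    'data: {"page": 2, "total": 2}',
    "",
]
events = list(iter_sse_data(sample))
print(events)
```

With `httpx`, the lines could come from `response.iter_lines()` inside a `client.stream("POST", ...)` context, so the parser stays independent of the HTTP layer.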

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
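
As an illustration of how a host allowlist such as `ALLOWED_PDF_HOSTS` can be applied, the sketch below reads the variable with the default from the table and checks a URL's hostname against it. The function names are hypothetical and the service's real check may differ, for example in how it treats ports, redirects, or schemes.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Default taken from the table above.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def allowed_hosts(env: Optional[Mapping[str, str]] = None) -> set[str]:
    """Parse the comma-separated ALLOWED_PDF_HOSTS value into a set of lowercase hosts."""
    env = os.environ if env is None else env
    raw = env.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    return {host.strip().lower() for host in raw.split(",") if host.strip()}


def is_allowed_pdf_url(url: str, hosts: set[str]) -> bool:
    """Accept only http(s) URLs whose hostname is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # urlsplit().hostname is already lowercased and excludes the port.
    return parts.hostname is not None and parts.hostname in hosts


hosts = allowed_hosts({})  # empty mapping, so the default list applies
print(is_allowed_pdf_url("http://minio:9000/archive-documents/abc.pdf", hosts))
print(is_allowed_pdf_url("http://example.com/abc.pdf", hosts))
```

A hostname check alone does not cover every SSRF vector (redirects to other hosts are one example), so it is best treated as one layer among several.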

## How to Run

### Local Development (Python venv)

```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate

# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000

# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

### Docker (via docker-compose)

The OCR service is included in the root `docker-compose.yml`:

```bash
docker-compose up -d ocr-service
```

The container:
- Exposes port 8000 internally (not mapped to the host by default)
- Mounts the `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Has a memory limit of 12 GB
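
Because `/health` returns 200 only after the models are loaded, a caller can poll it during the startup grace period. The sketch below shows one way to do that with an injected probe function. `wait_until_healthy` and its parameters are hypothetical names, and the 120-second default simply mirrors the grace period listed above.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses; return the final result."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the container is still starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in probe that succeeds on the third attempt; no network access needed.
attempts = iter([False, False, True])
ok = wait_until_healthy(lambda: next(attempts), timeout_s=5, interval_s=0)
print(ok)
```

In real use the probe could be something like `lambda: httpx.get(health_url).status_code == 200`, where `health_url` depends on how the container's port 8000 is reached in your setup.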

### Model Downloads

Use the helper script to download Kraken models:

```bash
./scripts/download-kraken-models.sh
```

Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally.

## Testing

Only a subset of tests can run without the full ML stack:

```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker

# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```

Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv.

## Training

The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
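
A bearer-token check of this kind can be sketched as follows. This is an illustration, not the service's code: `is_authorized` is a hypothetical helper, and rejecting every request when `TRAINING_TOKEN` is empty is an assumption about the intended behaviour.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header against the configured token."""
    if not expected_token:
        # Assumption: with no token configured, training endpoints stay locked.
        return False
    scheme, _, presented = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
print(is_authorized("Bearer wrong", "s3cret"))
```

In a FastAPI app, a check like this would typically live in a dependency that raises a 401 response when it returns `False`.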