# ADR-001: OCR Python Microservice

## Status

Accepted

## Context

The Familienarchiv needs OCR capability to pre-populate transcription blocks from scanned documents. Two OCR engines are required:

- **Surya**: transformer-based, handles typewritten and modern Latin handwriting
- **Kraken**: historical HTR model support, required for pre-1941 German Kurrent/Suetterlin scripts

Both engines exist exclusively in the Python ecosystem. There are no production-quality Java bindings for either engine. Tess4J (Tesseract for Java) was considered but rejected: Tesseract has poor accuracy on degraded historical handwriting and no HTR-United model support.

The server has no GPU. CPU-only inference is the target (16-32 GB system RAM).

## Decision

Introduce a separate Python container (`ocr-service`) that exposes a simple HTTP API. Spring Boot calls this service via `RestClient`. The Python service is stateless: all job tracking and business logic remain in Spring Boot.

**Interface contract:**

Request:

```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

Response:

```json
[
  {
    "pageNumber": 0,
    "x": 0.12,
    "y": 0.08,
    "width": 0.76,
    "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized (0-1) relative to page dimensions.

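
Because the response carries normalized coordinates, any consumer that draws blocks over a rendered page has to scale them back to pixels. The following is a minimal sketch of that conversion, not the project's actual code: the names `PixelBox`, `to_pixel_box`, and `to_pixel_polygon` are hypothetical, and rounding to whole pixels and the 1000 x 2000 page size are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelBox:
    """Axis-aligned rectangle in integer pixel units (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def to_pixel_box(block: dict, page_width: int, page_height: int) -> PixelBox:
    """Scale a block's normalized (0-1) rectangle to pixels.

    Horizontal values scale with the page width, vertical values with the
    page height. Rounding to the nearest pixel is an assumption.
    """
    return PixelBox(
        x=round(block["x"] * page_width),
        y=round(block["y"] * page_height),
        width=round(block["width"] * page_width),
        height=round(block["height"] * page_height),
    )


def to_pixel_polygon(block: dict, page_width: int, page_height: int) -> list[tuple[int, int]]:
    """Scale each normalized polygon vertex to pixel coordinates."""
    return [
        (round(px * page_width), round(py * page_height))
        for px, py in block["polygon"]
    ]


# The sample block from the response above, applied to an assumed
# 1000 x 2000 pixel rendering of the page.
block = {
    "pageNumber": 0,
    "x": 0.12,
    "y": 0.08,
    "width": 0.76,
    "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ...",
}

box = to_pixel_box(block, page_width=1000, page_height=2000)
polygon = to_pixel_polygon(block, page_width=1000, page_height=2000)
```

Keeping the contract in normalized units means the same response stays valid at any render resolution; only the consumer needs to know the pixel size of the page it displays.
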
**Java-side integration:**

- `OcrClient` interface with an `extractBlocks()` method, mockable for unit tests
- `OcrHealthClient` interface with `isHealthy()`, kept separate from block extraction
- `RestClientOcrClient` implements both interfaces
- `OcrService` orchestrates presigned URL generation, the OCR call, block mapping, and delegation to `TranscriptionService`

**Docker networking:**

- `ocr-service` is on the internal Docker network only, with no host port mapping
- Spring Boot reaches it via `http://ocr-service:8000`
- The health check uses `start_period: 60s` to account for model loading (~30-60s on CPU)

## Alternatives Considered

| Alternative | Why rejected |
|---|---|
| Tess4J (Tesseract in Java) | No HTR-United model support; poor Kurrent accuracy |
| Calling Python via ProcessBuilder | Fragile, no health checks, model reloading on every call |
| Embedding Python via GraalVM | Experimental, complex dependency management for ML libraries |
| External SaaS OCR (Google Vision, AWS Textract) | Data sovereignty concern for private family documents; no Kurrent support |

## Consequences

**Easier:**

- Each engine is used via its native Python API, with no bridging complexity
- The OCR service can be updated independently of the main application
- Models can be swapped via volume mount without code changes

**Harder:**

- One additional container to operate (memory, health checks, restarts)
- Integration tests require a WireMock stub because the real OCR service is too slow for CI
- The presigned URL TTL must be managed (15-30 min recommended)

## Future Direction

- LISTEN/NOTIFY from PostgreSQL to push progress events when scaling to multiple instances
- GPU acceleration if the server is upgraded; only the Docker image needs to change