ADR-001 documents the decision to use a separate Python container for OCR (Surya + Kraken), the interface contract, and why alternatives like Tess4J were rejected. ADR-002 documents the decision to store polygon annotations as JSONB with a 4-point CHECK constraint, backed by an AttributeConverter.

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# ADR-001: OCR Python Microservice

## Status

Accepted

## Context

The Familienarchiv needs OCR capability to pre-populate transcription blocks from scanned documents. Two OCR engines are required:

- **Surya**: transformer-based; handles typewritten text and modern Latin handwriting
- **Kraken**: supports historical HTR models; required for pre-1941 German Kurrent/Sütterlin scripts

Both engines exist exclusively in the Python ecosystem, and neither has production-quality Java bindings. Tess4J (Tesseract for Java) was considered but rejected: Tesseract has poor accuracy on degraded historical handwriting and no HTR-United model support.

The server has no GPU, so CPU-only inference is the target (16-32 GB system RAM).
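The division of labour between the two engines can be pictured as a simple lookup from script type to engine. The following is a minimal sketch, not the service's actual code: `HANDWRITING_KURRENT` is the only script type named in this ADR (in the request example under Decision), and every other key, as well as the `Engine` enum and `select_engine` function, is a hypothetical name chosen for illustration.

```python
from enum import Enum


class Engine(Enum):
    """The two OCR engines named in this ADR."""
    SURYA = "surya"
    KRAKEN = "kraken"


# Only HANDWRITING_KURRENT appears in this ADR. The other script type
# names below are hypothetical placeholders.
ENGINE_BY_SCRIPT_TYPE = {
    "HANDWRITING_KURRENT": Engine.KRAKEN,
    "HANDWRITING_SUETTERLIN": Engine.KRAKEN,  # hypothetical name
    "HANDWRITING_LATIN": Engine.SURYA,        # hypothetical name
    "TYPEWRITTEN": Engine.SURYA,              # hypothetical name
}


def select_engine(script_type: str) -> Engine:
    """Return the engine for a script type, rejecting unknown values."""
    try:
        return ENGINE_BY_SCRIPT_TYPE[script_type]
    except KeyError:
        raise ValueError(f"unsupported scriptType: {script_type!r}") from None
```

Rejecting unknown values explicitly, rather than falling back to a default engine, keeps a mistyped script type from silently producing a poor transcription.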
## Decision

Introduce a separate Python container (`ocr-service`) that exposes a simple HTTP API. Spring Boot calls this service via `RestClient`. The Python service is stateless: all job tracking and business logic remain in Spring Boot.

**Interface contract:**

Request:
```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

Response:
```json
[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized to the range 0-1, relative to the page dimensions.
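To make the response contract concrete, here is a minimal sketch of how a consumer might parse and sanity-check it using only the Python standard library. The names `OcrBlock`, `parse_blocks`, and `to_pixels` are hypothetical. The four-point polygon check is an assumption based on the example above, and the page size used in the usage note is invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrBlock:
    page_number: int
    x: float
    y: float
    width: float
    height: float
    polygon: tuple  # four (x, y) pairs, normalized to [0, 1]
    text: str

    def to_pixels(self, page_width_px: int, page_height_px: int) -> tuple:
        """Scale the normalized bounding box to pixel coordinates."""
        return (
            round(self.x * page_width_px),
            round(self.y * page_height_px),
            round(self.width * page_width_px),
            round(self.height * page_height_px),
        )


def _check_unit(name: str, value: float) -> None:
    """Raise if a normalized coordinate falls outside [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be within [0, 1], got {value}")


def parse_blocks(payload: str) -> list:
    """Parse the JSON response body into validated OcrBlock objects."""
    blocks = []
    for item in json.loads(payload):
        for key in ("x", "y", "width", "height"):
            _check_unit(key, item[key])
        polygon = tuple((px, py) for px, py in item["polygon"])
        if len(polygon) != 4:  # assumption: polygons always have 4 points
            raise ValueError("polygon must have exactly 4 points")
        for px, py in polygon:
            _check_unit("polygon x", px)
            _check_unit("polygon y", py)
        blocks.append(OcrBlock(
            page_number=item["pageNumber"],
            x=item["x"], y=item["y"],
            width=item["width"], height=item["height"],
            polygon=polygon,
            text=item["text"],
        ))
    return blocks
```

For the example response above and a hypothetical page of 1000 by 2000 pixels, `parse_blocks(payload)[0].to_pixels(1000, 2000)` yields `(120, 160, 760, 80)`.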

**Java-side integration:**

- `OcrClient` interface with an `extractBlocks()` method, mockable for unit tests
- `OcrHealthClient` interface with `isHealthy()`, kept separate from block extraction
- `RestClientOcrClient` implements both interfaces
- `OcrService` orchestrates presigned URL generation, the OCR call, block mapping, and delegation to `TranscriptionService`

**Docker networking:**

- `ocr-service` is on the internal Docker network only, with no host port mapping
- Spring Boot reaches it via `http://ocr-service:8000`
- The health check uses `start_period: 60s` to allow for model loading (~30-60 s on CPU)
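The effect of the start period can be illustrated from the caller's side: connection failures while the models are still loading are tolerated, and the service only counts as unavailable once the window has passed. The sketch below is a client-side analogue, not a description of how Docker evaluates health checks. The `/health` path and the names `http_probe` and `wait_until_healthy` are assumptions, since this ADR does not specify the health endpoint.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://ocr-service:8000/health",
               timeout_s: float = 2.0) -> bool:
    """Return True if the (hypothetical) health endpoint answers 200."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status == 200


def wait_until_healthy(probe: Callable[[], bool],
                       start_period_s: float = 60.0,
                       interval_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `probe` until it succeeds or the start period has elapsed."""
    deadline = clock() + start_period_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts and HTTP errors (URLError is an
            # OSError subclass) are expected while the container starts.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the function testable without real waiting; in normal use `wait_until_healthy(http_probe)` is enough.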
## Alternatives Considered

| Alternative | Why rejected |
|---|---|
| Tess4J (Tesseract in Java) | No HTR-United model support; poor Kurrent accuracy |
| Calling Python via ProcessBuilder | Fragile, no health checks, model reloading on every call |
| Embedding Python via GraalVM | Experimental, complex dependency management for ML libraries |
| External SaaS OCR (Google Vision, AWS Textract) | Data sovereignty concern for private family documents; no Kurrent support |
## Consequences

**Easier:**

- Each engine is used via its native Python API, with no bridging complexity
- The OCR service can be updated independently of the main application
- Models can be swapped via volume mount without code changes

**Harder:**

- One additional container to operate (memory, health checks, restarts)
- Integration tests require a WireMock stub, because the real OCR service is too slow for CI
- The presigned URL TTL must be managed (15-30 min recommended)
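Managing the presigned URL TTL mostly means knowing how much validity is left when a job actually starts. The sketch below assumes the presigned URLs carry AWS Signature Version 4 query parameters (`X-Amz-Date` and `X-Amz-Expires`), as S3-compatible stores such as MinIO produce. The function name `remaining_validity` and the example URL in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def remaining_validity(presigned_url: str, now: datetime) -> timedelta:
    """Time left before a SigV4 presigned URL expires (negative if expired).

    Assumes the URL carries X-Amz-Date (UTC, format YYYYMMDDTHHMMSSZ) and
    X-Amz-Expires (lifetime in seconds) as query parameters.
    """
    params = parse_qs(urlsplit(presigned_url).query)
    signed_at = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(params["X-Amz-Expires"][0]))
    return signed_at + lifetime - now
```

For a hypothetical URL signed at 12:00:00 UTC with `X-Amz-Expires=1800`, calling the function with `now` set to 12:10:00 UTC returns a `timedelta` of 20 minutes. A caller could compare that value against the expected OCR duration before dispatching the job.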
## Future Direction

- Use PostgreSQL LISTEN/NOTIFY to push progress events when scaling to multiple instances
- Add GPU acceleration if the server is upgraded; only the Docker image would need to change