familienarchiv/docs/adr/001-ocr-python-microservice.md
Marcel ec32d225b5 docs(adr): add ADR-001 (OCR microservice) and ADR-002 (polygon JSONB)
ADR-001 documents the decision to use a separate Python container for
OCR (Surya + Kraken), the interface contract, and why alternatives
like Tess4J were rejected.

ADR-002 documents the decision to store polygon annotations as JSONB
with a 4-point CHECK constraint, backed by an AttributeConverter.

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:07:46 +02:00

ADR-001: OCR Python Microservice

Status

Accepted

Context

The Familienarchiv needs OCR capability to pre-populate transcription blocks from scanned documents. Two OCR engines are required:

  • Surya — transformer-based, handles typewritten and modern Latin handwriting
  • Kraken — historical HTR model support, required for pre-1941 German Kurrent/Suetterlin scripts

Both engines are available only in the Python ecosystem; neither has production-quality Java bindings. Tess4J (Tesseract for Java) was considered but rejected because Tesseract has poor accuracy on degraded historical handwriting and no HTR-United model support.

The server has no GPU, so the service targets CPU-only inference with 16-32 GB of system RAM.

Decision

Introduce a separate Python container (ocr-service) that exposes a simple HTTP API. Spring Boot calls this service via RestClient. The Python service is stateless — all job tracking and business logic remain in Spring Boot.

Interface contract:

Request:

{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}

Response:

[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]

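Coordinates are normalized to the range 0-1 relative to the page dimensions: x and width are fractions of the page width, y and height are fractions of the page height, and each polygon point is an [x, y] pair. In the example, x, y, width, and height match the axis-aligned bounding box of the polygon.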
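To make the contract concrete, here is a minimal Python sketch that parses one response block, checks that every coordinate lies in the 0-1 range, and scales the polygon to pixel coordinates. The names OcrBlock, parse_block, polygon_bounds, and to_pixels are hypothetical and not part of the service. The 4-point check is an assumption based on the example polygon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrBlock:
    """One text block from the OCR response (hypothetical helper type)."""
    page_number: int
    x: float
    y: float
    width: float
    height: float
    polygon: list[tuple[float, float]]
    text: str


def parse_block(raw: dict) -> OcrBlock:
    """Validate a raw response block and convert it to an OcrBlock."""
    polygon = [(float(px), float(py)) for px, py in raw["polygon"]]
    if len(polygon) != 4:
        # Assumption: polygons are quadrilaterals, as in the example response.
        raise ValueError("polygon must have exactly 4 points")
    box = [float(raw[key]) for key in ("x", "y", "width", "height")]
    coords = box + [c for point in polygon for c in point]
    if not all(0.0 <= c <= 1.0 for c in coords):
        raise ValueError("coordinates must be normalized to the range 0-1")
    return OcrBlock(int(raw["pageNumber"]), *box, polygon, str(raw["text"]))


def polygon_bounds(polygon: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Return the axis-aligned bounding box (x, y, width, height) of a polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def to_pixels(block: OcrBlock, page_width_px: int, page_height_px: int) -> list[tuple[int, int]]:
    """Scale the normalized polygon to pixel coordinates for a rendered page."""
    return [(round(px * page_width_px), round(py * page_height_px))
            for px, py in block.polygon]
```

For the example block, polygon_bounds returns approximately (0.12, 0.08, 0.76, 0.04), which agrees with the x, y, width, and height fields of the response.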

Java-side integration:

  • OcrClient interface with extractBlocks() method — mockable for unit tests
  • OcrHealthClient interface with isHealthy() — separate concern from block extraction
  • RestClientOcrClient implements both interfaces
  • OcrService orchestrates: presigned URL generation, OCR call, block mapping, TranscriptionService delegation

Docker networking:

  • ocr-service is on the internal Docker network only — no host port mapping
  • Spring Boot reaches it via http://ocr-service:8000
  • Health check with start_period: 60s to account for model loading (~30-60s on CPU)
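The effect of the start period can be illustrated from the caller's side. The Python sketch below polls a probe until it succeeds or a 60-second grace period has elapsed. It is not how Docker implements health checks, and the /health path, http_probe, and wait_until_healthy are assumptions made for illustration; the ADR does not name the health endpoint.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://ocr-service:8000/health", timeout_s: float = 2.0) -> bool:
    """Return True if the service answers HTTP 200 (the /health path is assumed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: treat as not yet healthy.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    start_period_s: float = 60.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it succeeds or the start period has elapsed."""
    deadline = clock() + start_period_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False  # model loading took longer than the grace period
        sleep(interval_s)
```

Calling wait_until_healthy(http_probe) blocks for at most roughly the start period plus one probe timeout. The sleep and clock parameters exist only so the loop can be exercised in tests without real waiting.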

Alternatives Considered

| Alternative | Why rejected |
| --- | --- |
| Tess4J (Tesseract in Java) | No HTR-United model support; poor Kurrent accuracy |
| Calling Python via ProcessBuilder | Fragile, no health checks, model reloading on every call |
| Embedding Python via GraalVM | Experimental, complex dependency management for ML libraries |
| External SaaS OCR (Google Vision, AWS Textract) | Data sovereignty concern for private family documents; no Kurrent support |

Consequences

Easier:

  • Each engine is used via its native Python API — no bridging complexity
  • OCR service can be updated independently of the main application
  • Models can be swapped via volume mount without code changes

Harder:

  • One additional container to operate (memory, health checks, restarts)
  • Integration tests require a WireMock stub because the real OCR service is too slow for CI
  • The presigned URL TTL must be chosen so the URL stays valid until the OCR service has fetched the PDF (15-30 min recommended)
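Because the service is stateless and the contract is small, a stand-in is cheap to build. The project uses WireMock on the Java side; the sketch below is only a Python standard-library equivalent that returns the example response for any POST carrying the three request fields. The /ocr path, the 400 response for missing fields, and the names StubOcrHandler, start_stub, and post_json are assumptions, not the real service's behavior.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned response taken from the interface contract example.
CANNED_BLOCKS = [{
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ...",
}]
REQUIRED_FIELDS = {"pdfUrl", "scriptType", "language"}


class StubOcrHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        missing = REQUIRED_FIELDS - request.keys()
        if missing:
            # Assumption: the stub rejects incomplete requests with 400.
            status, payload = 400, {"missing": sorted(missing)}
        else:
            status, payload = 200, CANNED_BLOCKS
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_stub() -> HTTPServer:
    """Start the stub on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), StubOcrHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_json(port: int, path: str, payload: dict) -> tuple[int, object]:
    """POST a JSON payload to the local stub and return (status, decoded body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", path, body=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

A test would call start_stub(), read the chosen port from server.server_port, send requests with post_json, and finish with server.shutdown().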

Future Direction

  • LISTEN/NOTIFY from PostgreSQL to push progress events when scaling to multiple instances
  • GPU acceleration if the server is upgraded — only the Docker image needs to change