feat(ocr): add Python OCR microservice, RestClientOcrClient, Docker Compose

Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten text and modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
  pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI
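
The Kraken bullet names gift wrapping and rotating calipers for the polygon-to-quad step, but that code is not part of the diff shown on this page. Below is a minimal pure-Python sketch of the idea, assuming the goal is a minimum-area enclosing rectangle for each line polygon. The function names are hypothetical, and the rectangle search tests every hull edge directly (the simple O(n^2) form of the rotating-calipers idea) rather than advancing calipers incrementally.

```python
import math

Point = tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z component of (a - o) x (b - o); negative means b is right of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _dist2(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def convex_hull(points: list[Point]) -> list[Point]:
    """Gift wrapping (Jarvis march); hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    hull: list[Point] = []
    current = pts[0]  # leftmost (lowest on ties) point is always on the hull
    for _ in range(len(pts)):  # a hull has at most len(pts) vertices
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p == current:
                continue
            turn = _cross(current, candidate, p)
            farther = _dist2(current, p) > _dist2(current, candidate)
            # Keep the point with everything else on its left; on ties take the farthest.
            if turn < 0 or (turn == 0 and farther):
                candidate = p
        if candidate == pts[0]:
            break
        current = candidate
    return hull


def min_area_quad(polygon: list[Point]) -> list[Point]:
    """Smallest enclosing rectangle with one side collinear with a hull edge."""
    hull = convex_hull(polygon)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")
    best_area, best_quad = math.inf, []
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length  # unit vector along the edge
        vx, vy = -uy, ux                                 # unit normal to the edge
        us = [x * ux + y * uy for x, y in hull]
        vs = [x * vx + y * vy for x, y in hull]
        u0, u1, v0, v1 = min(us), max(us), min(vs), max(vs)
        area = (u1 - u0) * (v1 - v0)
        if area < best_area:
            best_area = area
            # Map the rectangle corners back from edge coordinates to x/y.
            best_quad = [(u * ux + v * vx, u * uy + v * vy)
                         for u, v in ((u0, v0), (u1, v0), (u1, v1), (u0, v1))]
    return best_quad
```

For a diamond such as `[(0, 1), (1, 0), (2, 1), (1, 2)]` the result is a rotated rectangle that hugs the shape, where an axis-aligned bounding box would be twice as large.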

Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback
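
The Java client itself is not shown on this page. As a language-neutral illustration of a "health check with graceful fallback", here is a small Python sketch, assuming the fallback means "report the service as unavailable instead of raising". The function name is hypothetical; only the `/health` path comes from the commit message.

```python
import http.client
import urllib.error
import urllib.request


def ocr_service_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET {base_url}/health answers with HTTP 200.

    Any connection error, timeout, non-2xx status, or malformed URL is
    treated as "not healthy" so callers never see an exception.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return False
```

A caller could use the boolean to disable OCR features while the service is down rather than failing the whole request.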

Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health
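
As a rough check on the health-check settings in the compose file (start_period 60s, interval 10s, timeout 5s, retries 12), the sketch below estimates how long a container that never becomes healthy stays in the "starting" state. It assumes Docker's documented behaviour that failures during the start period do not count towards retries and that each check is scheduled one interval after the previous one completes. The function name is hypothetical and the result is an approximation, not an exact schedule.

```python
def unhealthy_window(start_period: int, interval: int, timeout: int, retries: int) -> tuple[int, int]:
    """Approximate (fastest, slowest) seconds before a never-healthy container is marked unhealthy."""
    fastest = start_period + retries * interval              # every check fails immediately
    slowest = start_period + retries * (interval + timeout)  # every check runs into the timeout
    return fastest, slowest


# Values taken from the ocr-service healthcheck in this commit.
print(unhealthy_window(start_period=60, interval=10, timeout=5, retries=12))  # (180, 240)
```

Under these assumptions the backend, which waits on `service_healthy`, has roughly three to four minutes for the OCR models to load before the dependency fails.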

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-04-12 15:26:40 +02:00
parent aea46c5fd0
commit 6737bd6db5
9 changed files with 500 additions and 0 deletions

@@ -71,6 +71,28 @@ services:
    networks:
      - archive-net

  # --- OCR: Python microservice (Surya + Kraken) ---
  ocr-service:
    build:
      context: ./ocr-service
      dockerfile: Dockerfile
    container_name: archive-ocr
    restart: unless-stopped
    mem_limit: 6g
    memswap_limit: 6g
    volumes:
      - ocr_models:/app/models
    environment:
      KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel
    networks:
      - archive-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 12
      start_period: 60s

  # --- Backend: Spring Boot ---
  backend:
    build:
@@ -89,6 +111,8 @@ services:
        condition: service_healthy
      mailpit:
        condition: service_started
      ocr-service:
        condition: service_healthy
    environment:
      SPRING_DATASOURCE_URL: jdbc:postgresql://db:5432/${POSTGRES_DB}
      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
@@ -109,6 +133,8 @@ services:
      # Mailpit needs no auth or STARTTLS; production SMTP overrides these via .env
      SPRING_MAIL_PROPERTIES_MAIL_SMTP_AUTH: ${MAIL_SMTP_AUTH:-false}
      SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-false}
      APP_OCR_BASE_URL: http://ocr-service:8000
      APP_S3_INTERNAL_URL: http://minio:9000
    ports:
      - "${PORT_BACKEND}:8080"
    networks:
@@ -155,3 +181,4 @@ networks:
volumes:
  frontend_node_modules:
  maven_cache:
  ocr_models:
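
The compose file passes the Kraken model location to the container through `KRAKEN_MODEL_PATH`, backed by the `ocr_models` volume. How the service reads it is not shown on this page. The following is a minimal sketch, assuming the service falls back to the same path when the variable is unset and treats a missing file as "Kraken unavailable". The function name and the fallback behaviour are assumptions.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional

# Same path the compose file sets; used here as an assumed default.
DEFAULT_KRAKEN_MODEL_PATH = "/app/models/german_kurrent.mlmodel"


def resolve_kraken_model(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the configured model path, or None if no file exists there."""
    env = os.environ if env is None else env
    path = Path(env.get("KRAKEN_MODEL_PATH", DEFAULT_KRAKEN_MODEL_PATH))
    # A freshly created named volume is empty, so the file may not be present yet.
    return path if path.is_file() else None
```

Returning `None` instead of raising would let the service start with only the Surya engine until a model file is placed in the volume.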