From ec32d225b59b369124ae67a81842b55d24581906 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sun, 12 Apr 2026 15:07:46 +0200
Subject: [PATCH] docs(adr): add ADR-001 (OCR microservice) and ADR-002
 (polygon JSONB)

ADR-001 documents the decision to use a separate Python container for
OCR (Surya + Kraken), the interface contract, and why alternatives
like Tess4J were rejected.

ADR-002 documents the decision to store polygon annotations as JSONB
with a 4-point CHECK constraint, backed by an AttributeConverter.

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/adr/001-ocr-python-microservice.md | 84 +++++++++++++++++++++++++
 docs/adr/002-polygon-jsonb-storage.md   | 52 +++++++++++++++
 2 files changed, 136 insertions(+)
 create mode 100644 docs/adr/001-ocr-python-microservice.md
 create mode 100644 docs/adr/002-polygon-jsonb-storage.md

diff --git a/docs/adr/001-ocr-python-microservice.md b/docs/adr/001-ocr-python-microservice.md
new file mode 100644
index 00000000..869ff950
--- /dev/null
+++ b/docs/adr/001-ocr-python-microservice.md
@@ -0,0 +1,84 @@
+# ADR-001: OCR Python Microservice
+
+## Status
+
+Accepted
+
+## Context
+
+The Familienarchiv needs OCR capability to pre-populate transcription blocks from scanned documents. Two OCR engines are required:
+
+- **Surya** — transformer-based, handles typewritten and modern Latin handwriting
+- **Kraken** — historical HTR model support, required for pre-1941 German Kurrent/Suetterlin scripts
+
+Both engines exist exclusively in the Python ecosystem. There are no production-quality Java bindings for either engine. Tess4J (Tesseract for Java) was considered but rejected: Tesseract has poor accuracy on degraded historical handwriting and no HTR-United model support.
+
+The server has no GPU, so the target is CPU-only inference within 16-32 GB of system RAM.
+
+## Decision
+
+Introduce a separate Python container (`ocr-service`) that exposes a simple HTTP API. Spring Boot calls this service via `RestClient`. The Python service is stateless — all job tracking and business logic remain in Spring Boot.
+
+**Interface contract:**
+
+Request:
+```json
+{
+  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
+  "scriptType": "HANDWRITING_KURRENT",
+  "language": "de"
+}
+```
+
+Response:
+```json
+[
+  {
+    "pageNumber": 0,
+    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
+    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
+    "text": "Sehr geehrter Herr ..."
+  }
+]
+```
+
+Coordinates are normalized (0-1) relative to page dimensions.
+
+**Java-side integration:**
+
+- `OcrClient` interface with `extractBlocks()` method — mockable for unit tests
+- `OcrHealthClient` interface with `isHealthy()` — separate concern from block extraction
+- `RestClientOcrClient` implements both interfaces
+- `OcrService` orchestrates: presigned URL generation, OCR call, block mapping, TranscriptionService delegation
+
+**Docker networking:**
+
+- `ocr-service` is on the internal Docker network only — no host port mapping
+- Spring Boot reaches it via `http://ocr-service:8000`
+- Health check with `start_period: 60s` to account for model loading (~30-60s on CPU)
+
+## Alternatives Considered
+
+| Alternative | Why rejected |
+|---|---|
+| Tess4J (Tesseract in Java) | No HTR-United model support; poor Kurrent accuracy |
+| Calling Python via ProcessBuilder | Fragile, no health checks, model reloading on every call |
+| Embedding Python via GraalVM | Experimental, complex dependency management for ML libraries |
+| External SaaS OCR (Google Vision, AWS Textract) | Data sovereignty concern for private family documents; no Kurrent support |
+
+## Consequences
+
+**Easier:**
+- Each engine is used via its native Python API — no bridging complexity
+- OCR service can be updated independently of the main application
+- Models can be swapped via volume mount without code changes
+
+**Harder:**
+- One additional container to operate (memory, health checks, restarts)
+- Integration tests require a WireMock stub because the real OCR service is too slow for CI
+- Presigned URL TTL must be managed (15-30 min recommended)
+
+## Future Direction
+
+- LISTEN/NOTIFY from PostgreSQL to push progress events when scaling to multiple instances
+- GPU acceleration if the server is upgraded — only the Docker image needs to change
diff --git a/docs/adr/002-polygon-jsonb-storage.md b/docs/adr/002-polygon-jsonb-storage.md
new file mode 100644
index 00000000..6383759c
--- /dev/null
+++ b/docs/adr/002-polygon-jsonb-storage.md
@@ -0,0 +1,52 @@
+# ADR-002: Polygon JSONB Storage for Annotations
+
+## Status
+
+Accepted
+
+## Context
+
+Document annotations currently store axis-aligned bounding boxes (`x, y, width, height`). Kraken OCR outputs polygon boundaries for text lines — historical handwriting (Kurrent, Suetterlin) produces rotated and curved text that axis-aligned rectangles approximate poorly.
+
+We need to store an optional quadrilateral (4 corner points) per annotation to represent the precise text region. The polygon is display-only; overlap detection and all server-side geometry logic continue to use the axis-aligned bounding box (AABB) fields.
+
+## Decision
+
+Add a `polygon JSONB` column to `document_annotations`:
+
+```sql
+ALTER TABLE document_annotations ADD COLUMN polygon JSONB;
+ALTER TABLE document_annotations
+ADD CONSTRAINT chk_annotation_polygon_quad
+    CHECK (polygon IS NULL OR jsonb_array_length(polygon) = 4);
+```
+
+- `null` means rectangle — render using existing `x, y, width, height` fields (fully backward compatible)
+- Non-null value is a normalized 4-point quadrilateral: `[[x1,y1],[x2,y2],[x3,y3],[x4,y4]]` with coordinates in the 0-1 range relative to page dimensions
+
+The existing AABB fields are always populated (even when a polygon is present) and remain the authoritative geometry for overlap detection.
+
+**Java entity:** `List<List<Double>> polygon` backed by a custom `AttributeConverter<List<List<Double>>, String>`. No new dependency (Hypersistence Utils is not in the project and won't be added for a single column).
+
+**Semantic invariant:** `polygon`, if present, is a 4-point quadrilateral with coordinates normalized to [0, 1] relative to page dimensions. It may originate from OCR engine output (Kraken) or from a future manual drawing tool. The AABB fields remain the geometry source of truth for server-side logic.
+
+## Alternatives Considered
+
+| Alternative | Why rejected |
+|---|---|
+| 8 `NUMERIC(8,6)` columns (x1,y1,...,x4,y4) | Verbose, no structural enforcement, awkward to query or extend |
+| Separate `annotation_polygons` join table | Unnecessary complexity for a 1:1 optional relationship |
+| PostGIS geometry column | Adds a heavyweight extension for a display-only field with no spatial queries |
+| `String polygon` on the entity | Requires manual parsing at every callsite; error-prone |
+
+## Consequences
+
+**Easier:**
+- Backward compatible — all existing annotations continue to work unchanged
+- Frontend renders `<polygon>` or `<rect>` based on a simple null check
+- Schema can accommodate N-point polygons in the future (JSONB is flexible), though the CHECK constraint currently enforces exactly 4
+
+**Harder:**
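+- Range checks (`0 <= x <= 1`) are impractical as database constraints: a CHECK cannot contain a subquery, so covering every coordinate takes either eight explicit per-coordinate expressions or a helper function; ranges are validated at the DTO layer instead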
+- No server-side geometry queries on polygon coordinates (acceptable — polygon is display-only)
+- AttributeConverter adds a small amount of serialization code to maintain