feat: local OCR pipeline (batch + per-document) with Surya and Kraken #226

Closed
opened 2026-04-12 10:43:16 +02:00 by marcel · 13 comments
Owner

Overview

Add a local OCR pipeline that pre-populates TranscriptionBlocks automatically. Two entry points:

  1. Batch OCR — triggered after a large import, processes multiple documents asynchronously in the background
  2. Per-document OCR — user-triggered on a single document, with the option to specify the script type

Documents are scans of typewritten letters and handwritten letters (including historical German scripts like Kurrent). OCR results are always editable — the existing transcription editor is the correction surface.


Engine Strategy

Script type           Engine                       Notes
typewriter            Surya                        Transformer-based, handles degraded scans well
handwriting-modern    Surya                        Acceptable quality on Latin cursive
handwriting-kurrent   Kraken + historical model    Pre-1941 German cursive; requires the right HTR-United model
unknown (default)     Surya                        Safe general-purpose fallback

No automatic confidence-based switching between engines — the engine is selected by the document's scriptType field (set by the user). This avoids fragile threshold tuning and gives the user control.
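The table above maps directly to a lookup in the Python ocr-service. A minimal sketch (names like `select_engine` and the returned engine strings are illustrative, not a fixed contract):

```python
# Engine dispatch by scriptType, mirroring the table above.
# Unrecognized values fall back to Surya, matching the UNKNOWN default.
ENGINE_BY_SCRIPT_TYPE = {
    "TYPEWRITER": "surya",
    "HANDWRITING_MODERN": "surya",
    "HANDWRITING_KURRENT": "kraken",
    "UNKNOWN": "surya",
}

def select_engine(script_type: str) -> str:
    """Resolve which OCR engine handles a document's scriptType."""
    return ENGINE_BY_SCRIPT_TYPE.get(script_type, "surya")
```

Keeping the mapping in one flat dict makes the "no automatic switching" rule explicit: the only input is the user-set scriptType.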


Data Model Changes

Document entity

Add a scriptType field:

@Enumerated(EnumType.STRING)
@Column(name = "script_type", nullable = false)
@Builder.Default
private ScriptType scriptType = ScriptType.UNKNOWN;

public enum ScriptType {
    UNKNOWN, TYPEWRITER, HANDWRITING_MODERN, HANDWRITING_KURRENT
}

Flyway migration: ALTER TABLE documents ADD COLUMN script_type VARCHAR(30) NOT NULL DEFAULT 'UNKNOWN'.

Document DTO

Expose scriptType in DocumentUpdateDTO so users can set it from the edit form and from the per-document OCR trigger.


API Endpoints

Per-document OCR

POST /api/documents/{id}/ocr
Body: { "scriptType": "TYPEWRITER" }   ← optional, overrides document's stored value
  • Sets scriptType on the document if provided
  • Triggers async OCR for that document
  • Returns 202 Accepted with a job reference (or document ID to poll)
  • Replaces existing TranscriptionBlocks for the document (with confirmation in the UI)

Batch OCR

POST /api/ocr/batch
Body: { "documentIds": ["uuid1", "uuid2", ...] }   ← explicit list, from import result
  • Accepts a list of document IDs (MassImportService knows which documents it just created)
  • Enqueues all documents; processes them sequentially or with bounded parallelism
  • Each document uses its own scriptType (defaults to UNKNOWN → Surya)
  • Returns 202 Accepted; processing happens in the background

Status polling (optional, but needed for batch UX)

GET /api/ocr/jobs/{jobId}
→ { "status": "RUNNING", "processed": 12, "total": 47, "errors": 2 }

OCR Microservice

A separate Python container (ocr-service) in docker-compose.yml:

Spring Boot → HTTP POST /ocr → Python service
                              → returns [{pageNumber, x, y, width, height, text}]

Responsibilities of the Python service:

  • Accept a PDF (as bytes or a MinIO presigned URL)
  • Run the appropriate engine based on a scriptType parameter
  • Return a flat list of block candidates (page + bounding box + text)
  • No business logic — just PDF → blocks JSON

Surya path: layout detection + recognition in one pass, returns bounding boxes natively.

Kraken path: Kraken's own baseline segmentation → recognition with the specified historical model. Models are bundled into the Docker image or mounted as a volume.

Interface (request):

{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}

Interface (response):

[
  { "pageNumber": 0, "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04, "text": "Sehr geehrter Herr ..." },
  ...
]

Coordinates are normalized (0–1) relative to page dimensions, matching how the PDF viewer handles annotations.
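Since the engines report boxes in pixel space, the ocr-service has to divide by the page dimensions before responding. A sketch of that conversion (function name is hypothetical):

```python
def normalize_box(x_px: float, y_px: float, w_px: float, h_px: float,
                  page_w: float, page_h: float) -> dict:
    """Convert a pixel-space bounding box to the normalized (0-1)
    convention the PDF viewer uses for annotations."""
    return {
        "x": x_px / page_w,
        "y": y_px / page_h,
        "width": w_px / page_w,
        "height": h_px / page_h,
    }
```

Doing this once at the service boundary means Spring Boot and the frontend never see pixel coordinates at all.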


Backend Integration

OcrService in Spring Boot:

  • Downloads the document from MinIO (or generates a presigned URL)
  • Calls the Python microservice via RestClient
  • Maps the response to CreateTranscriptionBlockDTO objects
  • Calls TranscriptionService.createBlock() for each result
  • Runs inside @Async for non-blocking execution

OcrJobService (or reuse MassImportService pattern):

  • Tracks job state in DB or in-memory map (simple first)
  • Exposes status for polling

Architectural Considerations

Block replacement on re-run

If OCR is triggered on a document that already has blocks, the existing blocks must be cleared first. Options:

  • Clear all then recreate (simple, loses manual edits) — probably the right default given OCR is a starting point, not a final result
  • Merge (complex, unclear value) — skip for now

Show a confirmation dialog in the UI if blocks already exist.

Async processing and failure handling

  • Each document's OCR should be independent — one failure must not abort the batch
  • Failures are recorded per-document (e.g. ocrError field or log entry)
  • The batch endpoint should report how many succeeded/failed
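The isolation requirement boils down to a per-document try/except around the batch loop. A language-neutral sketch (shown in Python; the real loop lives in the Spring Boot `@Async` path, and `ocr_one` stands in for whatever processes a single document):

```python
def run_batch(document_ids, ocr_one):
    """Process each document independently: a failure is recorded
    per-document and does not abort the rest of the batch."""
    processed = 0
    errors = []
    for doc_id in document_ids:
        try:
            ocr_one(doc_id)
            processed += 1
        except Exception as exc:  # record and continue
            errors.append((doc_id, str(exc)))
    return {"processed": processed, "errors": len(errors), "details": errors}
```

The returned counts are exactly what the batch endpoint needs to report succeeded/failed totals.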

Resource constraints

Surya and Kraken are CPU-heavy (GPU optional). On a home NAS:

  • Process documents sequentially, not in parallel, to avoid OOM
  • Consider a queue size limit to avoid unbounded memory growth during large imports
  • The Python service should have a memory/timeout limit set in Docker

Kraken model management

  • Models from HTR-United must be evaluated against actual documents before committing to one
  • Models are large binary files — store outside the image (volume mount) or bundle a specific pinned version
  • Model path should be configurable via environment variable

sortOrder for auto-created blocks

Blocks returned by OCR are already ordered by page then vertical position. Assign sortOrder sequentially (0, 1, 2, …) in that order.
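As a sketch of that ordering rule (field names match the microservice response; the helper name is illustrative):

```python
def assign_sort_order(blocks: list[dict]) -> list[dict]:
    """Order blocks by page, then top edge (y), and assign a
    sequential sortOrder starting at 0."""
    ordered = sorted(blocks, key=lambda b: (b["pageNumber"], b["y"]))
    for i, block in enumerate(ordered):
        block["sortOrder"] = i
    return ordered
```

Sorting defensively here (rather than trusting the engine's output order) makes the invariant hold even if an engine returns blocks out of order.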

Bounding box coordinate system

The existing DocumentAnnotation stores x, y, width, height as doubles. Confirm whether these are normalized (0–1) or pixel values — the OCR service must output the same convention. The current frontend PDF viewer appears to use normalized values.

MassImport integration

MassImportService creates PLACEHOLDER documents. OCR only makes sense on UPLOADED documents (file present). The batch OCR endpoint must skip or reject documents not in UPLOADED (or later) status.
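The eligibility guard can be a small pure function. A sketch (the status set is an assumption — only UPLOADED is listed here, and any "later" lifecycle states would need to be added):

```python
# Assumed eligible statuses; extend with later lifecycle states as needed.
OCR_ELIGIBLE_STATUSES = {"UPLOADED"}

def filter_ocr_eligible(documents: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into documents OCR can run on and those to skip
    (e.g. PLACEHOLDER entries with no file present)."""
    eligible = [d for d in documents if d["status"] in OCR_ELIGIBLE_STATUSES]
    skipped = [d for d in documents if d["status"] not in OCR_ELIGIBLE_STATUSES]
    return eligible, skipped
```

Returning the skipped list (rather than silently dropping it) lets the batch response report what was excluded and why.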


Open Questions

  • Should batch OCR be triggered automatically after import completes, or always manually by the user?
  • Which specific Kraken model from HTR-United should we start with for Kurrent? Needs evaluation against a sample document.
  • Do we expose scriptType on the document list/search, or only on the detail view?
  • Should failed OCR documents be surfaced in the UI (e.g. a badge), or just silently skipped?

Out of Scope

  • Confidence-based automatic engine switching
  • Parallel document processing in the batch
  • Handwriting recognition for non-German scripts
  • Training or fine-tuning models
marcel added the feature label 2026-04-12 10:43:21 +02:00
Author
Owner

Hardware clarification

The server has no GPU. CPU-only inference is the target. The server RAM can be upgraded to meet requirements.

Hardware assumption: 16–32 GB system RAM, no GPU.

This resolves the resource constraint concern in the architectural considerations:

  • Surya loads its models once at container start (~3–4 GB RAM) and keeps them resident. No re-loading between documents — only the first request after startup is slow.
  • The ocr-service Docker container can be given a generous memory limit (e.g. mem_limit: 6g) rather than being squeezed.
  • PostgreSQL + Spring Boot + MinIO together need ~2–3 GB, so 16 GB total leaves comfortable headroom. 32 GB is ideal.
  • Pages within a document should still be processed sequentially (no parallelism within the service). Overlapping two documents in the batch may be possible with 32 GB but is not required.

Processing speed expectations on CPU:

  • Surya: ~20–60 seconds per page depending on CPU generation
  • Kraken: ~3–10 seconds per page (lighter models)
  • A 10-page letter: ~5–10 minutes via Surya, ~1 minute via Kraken

For overnight batch imports this is fully acceptable. Per-document OCR will feel slow if triggered interactively — the UI should reflect that it is a background job (spinner / async status) rather than a synchronous response.

No GPU, no CUDA, no NVIDIA Container Toolkit needed. PyTorch CPU-only build in the Docker image is sufficient and keeps the image smaller.

Author
Owner

Progress reporting

Given that OCR is slow on CPU (minutes per document), the user needs live feedback while it runs — both for per-document and batch jobs.

SSE is the right fit here: one-way server→client stream, no WebSocket overhead, works natively with Spring Boot's SseEmitter on the existing Jetty stack, and the frontend can consume it with a plain EventSource.

GET /api/ocr/jobs/{jobId}/progress   ← SSE stream

Event shape:

{ "type": "page",     "page": 3, "totalPages": 12, "documentId": "uuid" }
{ "type": "document", "processed": 5, "total": 47, "documentId": "uuid", "status": "ok" }
{ "type": "done",     "processed": 47, "errors": 2 }
{ "type": "error",    "documentId": "uuid", "message": "OCR service timeout" }

The Python service emits progress per page (it processes pages sequentially). Spring Boot relays these to the SSE stream as it receives them from the OCR microservice.
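On the wire, each of the event shapes above becomes one SSE `data:` frame. A minimal framing sketch (the helper name is illustrative; Spring's SseEmitter does this framing itself, so this is only relevant if the Python service streams events directly):

```python
import json

def sse_frame(event: dict) -> str:
    """Serialize one progress event as a Server-Sent Events data frame.
    Each frame is 'data: <json>' terminated by a blank line."""
    return f"data: {json.dumps(event)}\n\n"
```

The blank line after each frame is what tells EventSource that the event is complete.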

Per-document OCR

Show a progress bar in the transcription panel: "Seite 3 von 12 wird analysiert…". Blocks appear incrementally as each page completes — the user sees results filling in rather than a blank screen until the end.

Batch OCR

Show a batch progress overlay (or a persistent status bar): "47 Dokumente · 5 abgeschlossen · 2 Fehler". Each completed document can be linked so the user can jump straight to its transcription.

Fallback: polling

If SSE proves complex to integrate with the async job infrastructure, a simple GET /api/ocr/jobs/{jobId} polling endpoint (every 2–3 seconds from the frontend) is an acceptable fallback. Coarser granularity but much simpler to implement. Worth starting here and upgrading to SSE if the UX feels laggy.

Connection loss

If the user closes the tab or loses the connection mid-job, the OCR job must continue running on the server. The SSE stream is display-only — job lifecycle is independent of the client connection.

Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

OcrService responsibilities are too broad
The proposed OcrService does four things: fetch from MinIO, call the Python service, map the response to DTOs, and call TranscriptionService. That's four reasons to change. I'd split it:

  • OcrService — orchestrates the flow, owns the job lifecycle
  • OcrMicroserviceClient — the RestClient wrapper; one method, one responsibility
  • Mapping stays inside OcrService as a private helper

TDD question: how do we test OcrService in isolation?
The Python microservice is an external HTTP dependency. That means OcrMicroserviceClient needs an interface so we can inject a mock in unit tests. Without an interface, OcrService is not unit-testable. Suggest:

public interface OcrClient {
    List<OcrBlockResult> extractBlocks(String pdfUrl, ScriptType scriptType);
}
// Real implementation: RestClientOcrClient
// Test implementation: StubOcrClient returning fixture data

Command-query separation on the POST /api/documents/{id}/ocr endpoint
The endpoint both triggers OCR (command) and returns a job reference (query). That's fine for REST semantics, but the service method should not mix the two internally. OcrService.startOcr(documentId) should return a job ID as a pure creation result — not a status read.

In-memory job tracking is a smell
"In-memory map (simple first)" will lose all job state on restart. That's fine for a first iteration, but the issue should explicitly flag it as tech debt with a follow-up ticket, otherwise it will never get addressed. A simple ocr_jobs table (Flyway-managed) is not much more work and gives you restart resilience from day one.

Frontend: EventSource in SvelteKit
EventSource is a browser API — it does not work in SvelteKit server load functions. Progress display must be purely client-side (an onMount-triggered EventSource or a Svelte action). This is not a problem architecturally but it means the progress UI cannot use the standard +page.server.ts data flow. Worth calling out explicitly so there's no confusion during implementation.

Naming: HANDWRITING_MODERN vs HANDWRITING_LATIN_CURSIVE
MODERN is ambiguous — modern relative to what? The Kurrent speaker in 1920 would have called their script "modern". Consider HANDWRITING_LATIN as a clearer contrast to HANDWRITING_KURRENT.

Suggestions

  • Define OcrClient as an interface from the start — TDD requires it
  • Track jobs in a DB table, not memory — restart resilience is worth the 20 extra lines
  • Add a ScriptType validator at the controller boundary that rejects unknown values with a 400, not a 500
  • The sortOrder assignment should be a named constant or helper, not an inline (i, _) -> i lambda scattered across the mapping code
Author
Owner

🏛️ Markus Keller — Application Architect

Questions & Observations

The microservice is justified — but document it
I'm generally skeptical of extracting services prematurely. In this case, the Python microservice is genuinely justified: the OCR engines (Surya, Kraken) have no Java bindings and only exist in the Python ecosystem. That's a concrete, present requirement. However, this should be captured in an ADR (docs/adr/) before implementation starts, covering: why a separate service, why not Tess4J, and what the interface contract guarantees. Otherwise future maintainers won't know why the complexity exists.

Job state belongs in PostgreSQL, not memory
The issue says "in-memory map (simple first)" for job tracking. I'd push back on this being the starting point. An ocr_jobs table is straightforward, gives you restart resilience, and can be queried by the SSE endpoint directly. In-memory state requires careful synchronization across @Async threads and disappears on any restart or crash — which is exactly when you need job state most. Consider:

CREATE TABLE ocr_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    total_documents INT NOT NULL,
    processed_documents INT NOT NULL DEFAULT 0,
    error_count INT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

The scriptType field: processing hint or document attribute?
There's a design question here. scriptType describes how the document was written (a permanent fact about the document) — not just a processing hint for OCR. That argues for it living on Document permanently, which the issue proposes. I agree with this. But it also means scriptType should appear in the document edit form, not only as part of the OCR trigger — a user should be able to set it without triggering OCR.

POST /api/ocr/batch layering
The batch endpoint sits at /api/ocr/batch, outside the document resource tree. That's clean. But make sure OcrController only injects OcrService — it must not reach into DocumentService or TranscriptionService directly. The domain boundary must be: OCR service coordinates; transcription service owns blocks.

SSE is the right transport here
Agreed with the progress reporting comment: SSE via SseEmitter is the correct choice for one-way progress streaming. One consideration: SseEmitter has a default timeout. Set it explicitly (e.g. to the max expected batch duration) or the connection will close silently partway through a long batch. Also, each SseEmitter holds a thread — for a home server processing one batch at a time, this is fine, but worth noting.

LISTEN/NOTIFY as an alternative to in-memory job tracking
A pattern worth considering: when the Python service completes a page, it calls back to Spring Boot, which updates the ocr_jobs table and sends a PostgreSQL NOTIFY ocr_progress. The SSE endpoint holds a connection listening for notifications and pushes them to the browser. This eliminates the need for a separate polling mechanism entirely and keeps state durable. Probably overkill for v1, but worth flagging as the clean long-term direction.

Suggestions

  • Write the ADR first — one page, decision + alternatives + consequences
  • Start with ocr_jobs table, not in-memory map — same effort, much more resilient
  • Set SseEmitter timeout explicitly to prevent silent connection drops
  • Ensure OcrController only depends on OcrService — no cross-domain repository or service injection

## 🧪 Sara Holt — QA Engineer

### Missing acceptance criteria

The issue has a thorough architectural description but no acceptance criteria. Before implementation starts, we need testable definitions of done. Proposed:

- [ ] `POST /api/documents/{id}/ocr` on a `PLACEHOLDER` document returns `400` (no file present)
- [ ] `POST /api/documents/{id}/ocr` on an `UPLOADED` document returns `202` and creates a job
- [ ] Transcription blocks are created in page-then-vertical-position order with sequential `sortOrder`
- [ ] Re-running OCR on a document that already has blocks clears the old blocks first
- [ ] A single document failure in a batch job does not abort processing of remaining documents
- [ ] `GET /api/ocr/jobs/{jobId}` returns `404` for an unknown job ID
- [ ] SSE stream emits a `done` event after all documents in a batch are processed
- [ ] Bounding box coordinates in created annotations match the normalized (0–1) convention used by the PDF viewer
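Two of the criteria above, block ordering and coordinate normalization, are concrete enough to sketch in plain Java. The record shapes (`Block`, `Box`) are illustrative, not the real entities:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OcrBlockMapping {
    record Block(int page, double y, String text) {}
    record Box(double x0, double y0, double x1, double y1) {}

    // Page-then-vertical ordering; the list index becomes the sortOrder.
    static List<Block> inSortOrder(List<Block> blocks) {
        List<Block> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingInt(Block::page)
                .thenComparingDouble(Block::y));
        return sorted;
    }

    // Normalize pixel coordinates to the 0–1 convention used by the PDF viewer.
    static Box normalize(Box px, double pageWidth, double pageHeight) {
        return new Box(px.x0() / pageWidth, px.y0() / pageHeight,
                px.x1() / pageWidth, px.y1() / pageHeight);
    }
}
```

Assigning `sortOrder` as the index into the sorted list keeps it sequential per document by construction, which is exactly what the acceptance criterion asks a test to verify.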

### Test strategy by layer

**Unit tests (JUnit 5 + Mockito)**

- `OcrService`: mock `OcrClient` and `TranscriptionService`; test status transitions, block mapping, failure isolation, `PLACEHOLDER` guard
- `OcrJobService` (if in-memory): test state machine — PENDING → RUNNING → DONE/FAILED
- `ScriptType` enum: validate all values accepted at the controller boundary
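The state machine in the second bullet is small enough to encode directly. A sketch: the status names come from the issue, but the exact transition rules are an assumption:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the OCR job lifecycle: PENDING -> RUNNING -> DONE | FAILED.
public enum OcrJobStatus {
    PENDING, RUNNING, DONE, FAILED;

    private static final Set<OcrJobStatus> TERMINAL = EnumSet.of(DONE, FAILED);

    public boolean isTerminal() {
        return TERMINAL.contains(this);
    }

    // Only forward transitions are legal; terminal states accept none.
    public boolean canTransitionTo(OcrJobStatus next) {
        return switch (this) {
            case PENDING -> next == RUNNING;
            case RUNNING -> next.isTerminal();
            case DONE, FAILED -> false;
        };
    }
}
```

Encoding the transitions on the enum itself gives the unit test a single method to exercise exhaustively.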

**Integration tests (Testcontainers + WireMock)**

- Wire up a WireMock stub for the Python OCR service returning fixture block data
- `POST /api/documents/{id}/ocr` full flow through Spring context: job created, blocks persisted, annotations created
- `POST /api/ocr/batch` with a mix of `UPLOADED` and `PLACEHOLDER` documents: only `UPLOADED` ones are processed
- Block replacement: verify previous blocks and annotations are deleted before new ones are created

**E2E tests (Playwright)**

- Per-document OCR: trigger OCR on a document with a real PDF fixture, verify blocks appear in the transcription panel
- Batch OCR: trigger via import flow, verify progress counter increments

### Key testability concerns

**The Python OCR service must be mockable in CI**

CI should not spin up Surya or Kraken — they are too large and slow. The `OcrClient` interface (as Felix suggests) makes this clean. WireMock handles the integration test layer. The Python service itself needs its own test suite (pytest), but that runs independently.

**Async behavior — use Awaitility, not `Thread.sleep()`**

OCR processing is `@Async`. Integration tests that verify block creation after triggering OCR must use Awaitility:

```java
await().atMost(5, SECONDS).until(() ->
    transcriptionBlockRepository.countByDocumentId(documentId) > 0
);
```

Never `Thread.sleep()` — flaky tests are worse than no tests.

**SSE is hard to test at the integration layer**

Testing `SseEmitter` behavior requires either Spring's `MockMvc` SSE support or a small `WebClient`-based integration test. Make sure the SSE endpoint is covered at the integration layer, not just manually verified in a browser.

**Block re-run test: version history cleanup**

When OCR replaces existing blocks, verify that `TranscriptionBlockVersion` records from the previous (manual) edits are also cleaned up. If they are orphaned in the DB, the history table grows without bound.

### Suggestions

- Add acceptance criteria to the issue before assigning it
- OCR service should have a dedicated CI step that runs with WireMock, not the real Python service
- Add a Flyway migration test to confirm the `script_type` column migration runs cleanly from a clean DB

## 🔒 Nora "NullX" Steiner — Application Security

### Findings & Questions

**1. Authorization on OCR endpoints — who can trigger OCR?**

The issue specifies `WRITE_ALL` permission for transcription block creation, but doesn't explicitly state what permission guards `POST /api/documents/{id}/ocr` and `POST /api/ocr/batch`. These must be annotated with `@RequirePermission` — OCR replaces all existing blocks, making it a destructive operation. Triggering it without authorization could wipe manually verified transcriptions.

Recommended: `WRITE_ALL` for per-document OCR, `ADMIN` or `WRITE_ALL` for batch (batch is higher impact).

**2. IDOR risk on `POST /api/ocr/batch`**

The batch endpoint accepts an arbitrary list of `documentIds`. The implementation must verify that every document ID in the list belongs to a document the requesting user is permitted to access. Without this check, a user could supply document IDs they don't own and trigger OCR (and block replacement) on other users' documents.

```java
// In OcrService.startBatch():
for (UUID documentId : dto.getDocumentIds()) {
    documentService.getById(documentId); // throws DomainException.notFound if not accessible
}
```

This also applies to the per-document endpoint — `documentService.getById()` must validate access, not just existence.

**3. Presigned URL lifetime**

The issue proposes passing a MinIO presigned URL to the Python service. Presigned URLs have a configurable TTL. For large documents processed slowly on CPU, a very short TTL risks expiry mid-processing. A very long TTL widens the exposure window if the URL is logged by the Python service. Recommended: set TTL to 15–30 minutes (enough for a large document on CPU) and ensure the Python service does not log the full URL.
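One way to stay inside that window is to derive the TTL from the document's size and clamp it to the 15–30 minute bounds. A sketch, where the 2× safety margin and the caller-supplied per-page estimate are assumptions, not measured numbers:

```java
import java.time.Duration;

// Clamp a size-derived TTL into the recommended 15-30 minute window.
public class PresignedTtl {
    static final Duration MIN_TTL = Duration.ofMinutes(15);
    static final Duration MAX_TTL = Duration.ofMinutes(30);

    // pageCount and perPage come from the caller; 2x is a safety margin.
    static Duration ttlFor(int pageCount, Duration perPage) {
        Duration estimate = perPage.multipliedBy(pageCount).multipliedBy(2);
        if (estimate.compareTo(MIN_TTL) < 0) return MIN_TTL;
        if (estimate.compareTo(MAX_TTL) > 0) return MAX_TTL;
        return estimate;
    }
}
```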

**4. The Python OCR service must not be publicly reachable**

The `ocr-service` container should be on the internal Docker network only — no `ports:` mapping in `docker-compose.yml`. Spring Boot calls it via the internal service name (`http://ocr-service:8000`). If someone exposes it to the host, anyone on the network can submit arbitrary PDFs for processing, potentially exhausting CPU resources.

**5. Job ID enumeration**

`GET /api/ocr/jobs/{jobId}` must use UUIDs as job IDs (which the proposed schema does). Verify that the endpoint returns `404` (not `403`) for job IDs that exist but belong to another user — `403` confirms the job exists, which leaks information.

**6. Batch endpoint size limit**

`POST /api/ocr/batch` accepts an unbounded list of document IDs. Add a `@Size(max = 500)` constraint on `documentIds` to prevent a single request from queuing an unbounded number of documents and starving the server.
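The real constraint belongs on the DTO as a `jakarta.validation` annotation; for illustration, the equivalent guard as a dependency-free sketch:

```java
import java.util.List;
import java.util.UUID;

public class BatchSizeGuard {
    static final int MAX_BATCH_SIZE = 500; // mirrors @Size(max = 500) on the DTO

    // Reject oversized batches before any job rows are created.
    static void checkBatchSize(List<UUID> documentIds) {
        if (documentIds.size() > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                    "batch limited to " + MAX_BATCH_SIZE + " documents, got " + documentIds.size());
        }
    }
}
```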

**7. OCR output is user-visible — XSS via OCR text?**

OCR output is stored as `text` in `TranscriptionBlock` and rendered in the transcription panel. If the OCR engine produces malicious strings (unlikely but possible with crafted PDFs), and the frontend renders them without escaping, XSS is possible. Verify that `TranscriptionBlock.text` is rendered as plain text (`.textContent`, not `.innerHTML`) in the Svelte components — a quick grep of `TranscriptionReadView.svelte` and `TranscriptionBlock.svelte` should confirm this.

### Suggestions

- Annotate both OCR endpoints with `@RequirePermission` explicitly in the issue and in implementation
- Add `@Size(max = 500)` to `documentIds` in the batch DTO
- Set presigned URL TTL to 15–30 min; document this in the `OcrService`
- No `ports:` on the `ocr-service` in `docker-compose.yml`
- Confirm XSS safety in the transcription renderer before shipping

## 🎨 Leonie Voss — UX Design & Accessibility

### Questions & Observations

**Where does the OCR trigger live in the existing UI?**

The issue describes the feature but not the interaction entry point. For per-document OCR, there are two plausible locations:

1. The transcription panel (when empty — replace the current empty state CTA with an "OCR starten" button)
2. The document edit form (next to `scriptType` — a paired selector + trigger)

Option 1 is better. The transcription panel is where the user is already thinking about text extraction. The `scriptType` selector and "OCR starten" button should live there together, not scattered across two pages.

**The `scriptType` selector needs a human-readable UI**

The enum values (`UNKNOWN`, `TYPEWRITER`, `HANDWRITING_MODERN`, `HANDWRITING_KURRENT`) are developer names — they should never appear in the UI as-is. Proposed German labels:

- `UNKNOWN` → "(nicht festgelegt)"
- `TYPEWRITER` → "Schreibmaschine"
- `HANDWRITING_MODERN` → "Handschrift (lateinisch)"
- `HANDWRITING_KURRENT` → "Handschrift (Kurrent/Sütterlin)"

These should be Paraglide translation keys, not hardcoded strings.

**Progress UX: incremental block appearance vs. spinner**

The issue proposes "Seite 3 von 12 wird analysiert…" with blocks appearing incrementally. This is the right pattern — incremental display gives the user immediate value and makes the wait feel shorter. However, the blocks that appear during processing should be visually marked as "draft/processing" so the user doesn't start editing a block that might be replaced when the next page finishes. A subtle muted border or opacity reduction would work.

**The confirmation dialog for block replacement needs care**

"You have existing blocks. OCR will replace them." — this dialog is the right pattern, but the phrasing must make the consequence explicit, especially for older users who may have spent hours on manual transcription:

> **Vorhandene Transkription ersetzen?**
> Alle 12 vorhandenen Blöcke werden gelöscht und durch die OCR-Ergebnisse ersetzt. Diese Aktion kann nicht rückgängig gemacht werden.
>
> [Abbrechen] [Ersetzen]

The destructive action button ("Ersetzen") should not be the primary (blue) style — use a red/danger style consistent with other destructive confirmations in the app.

**Batch OCR progress: where does it surface?**

The issue mentions "a persistent status bar" for batch progress but doesn't specify where. I'd suggest a dismissible notification banner at the top of the document list — the user can navigate away and come back. The banner should show:

- Active: "OCR läuft · 5 von 47 Dokumente abgeschlossen"
- Done: "OCR abgeschlossen · 45 erfolgreich · 2 fehlgeschlagen [Details anzeigen]"
- The "Details anzeigen" link navigates to a simple log of failed documents

**Error state in the transcription panel**

The open question "should failed OCR documents be surfaced with a badge?" — yes, they should, but keep it subtle. A small grey label "OCR fehlgeschlagen" below the document title in the list is enough. In the transcription panel itself, show the reason in the empty state: "OCR konnte nicht abgeschlossen werden. [Erneut versuchen]"

**Accessibility**

- The `scriptType` dropdown must have a visible `<label>` — not placeholder-only labeling
- The SSE progress bar must have `role="progressbar"` with `aria-valuenow`, `aria-valuemin`, `aria-valuemax`
- The "OCR starten" button must have a disabled state with `aria-disabled="true"` while a job is already running, not just visually greyed out
- The confirmation dialog must trap focus and return focus to the trigger button on close

### Suggestions

- Place the `scriptType` selector and OCR trigger together in the transcription panel, not on the edit page
- Add Paraglide translation keys for all `ScriptType` enum values before implementation
- Mark in-progress blocks as visually distinct from confirmed blocks during OCR processing
- Use a danger-style button in the replacement confirmation dialog

## 🛠️ Tobias Wendt — DevOps & Platform

### Questions & Observations

**The Docker image will be very large**

A PyTorch CPU-only image with Surya installed is typically 4–8 GB. Kraken adds less (it uses lighter models) but the base image is still heavy. A few things to keep this manageable:

- Use `pytorch/pytorch:2.x.x-cpu` as the base — **not** the default PyTorch image which includes CUDA and is 10+ GB. The CPU-only build is a fraction of that.
- Do not install both Surya and Kraken in the same image layer if Kraken is optional. Use a multi-stage build where models are downloaded at runtime (volume mount) rather than baked in.
- `.dockerignore` must exclude any model files from the build context — they will accidentally bloat the image if left loose in the directory.

**Model files must not be in the Docker image**

Large binary models (Kraken HTR-United models can be 200–500 MB each) should be stored in a named Docker volume or a bind mount, not baked into the image. This keeps the image portable and lets models be updated without rebuilding. Add a `docker-compose.yml` volume:

```yaml
ocr-service:
  volumes:
    - ocr_models:/app/models
  environment:
    - KRAKEN_MODEL_PATH=/app/models/german_kurrent.mlmodel

volumes:
  ocr_models:
```

Document the one-time model download step in the runbook.

**Startup time and health checks**

Surya loads transformer models into RAM at startup — this takes 30–60 seconds on first launch. The `healthcheck` must wait for models to be loaded, not just for the HTTP server to bind. The Python service should expose a `/health` endpoint that returns `200` only after models are ready. Spring Boot must not start sending OCR jobs until this health check passes.

```yaml
ocr-service:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 10s
    timeout: 5s
    retries: 12        # 2 minutes total wait for model loading
    start_period: 60s  # don't count failures during this window
```

And Spring Boot's `depends_on`:

```yaml
backend:
  depends_on:
    ocr-service:
      condition: service_healthy
```

**Restart behavior during a running batch**

If the `ocr-service` container crashes mid-batch (OOM, segfault — both realistic on CPU-heavy workloads), in-flight documents are lost. The job tracking must be in the database (as Markus noted) so that on restart the job can be resumed or at least marked as failed. A `restart: unless-stopped` policy on the container is the minimum.

**CI must not run the OCR service**

The real OCR service must not be part of the Gitea Actions CI pipeline — the Docker image is too large to pull and the inference is too slow. The backend integration tests should mock the OCR service (WireMock stub, as Sara describes). Add a comment in the Gitea Actions workflow file making this explicit so no one accidentally adds it later.

**`mem_limit` in Docker Compose**

Set an explicit memory limit on the `ocr-service` to prevent it from taking down the entire host on a runaway job:

```yaml
ocr-service:
  mem_limit: 6g
  memswap_limit: 6g  # disable swap for the container
```

`memswap_limit` equal to `mem_limit` disables swap — better to OOM-kill the container than to thrash the disk at 20 MB/s while the rest of the stack grinds to a halt.

**No `ports:` on the OCR service**

As NullX flagged: no host port mapping. Internal Docker network only. The service name `ocr-service` is the hostname Spring Boot uses.

### Suggestions

- Use PyTorch CPU-only base image (`pytorch/pytorch:*-cpu`) — significantly smaller than the default
- Models in a named volume, not in the image — document the one-time setup step
- Health check with generous `start_period` (60s) and `retries: 12` to account for model loading
- `mem_limit: 6g` + `memswap_limit: 6g` on the `ocr-service`
- `restart: unless-stopped` on the container
- Add a note to the CI workflow that the OCR service is intentionally excluded from CI and must remain mocked

## 🏛️ Markus Keller — Architecture Discussion Summary

Discussion of the seven open architectural items. All resolved. ADR written to `docs/adr/001-ocr-python-microservice.md`.

---

### Resolved decisions

**1. ADR first**

ADRs go in `docs/adr/`, numbered sequentially. `001-ocr-python-microservice.md` covers the Python microservice decision and must exist before implementation starts.

**2. Job persistence — `ocr_jobs` + `ocr_job_documents` tables**

No in-memory map. Two Flyway-managed tables:

- `ocr_jobs` — batch-level tracking (status, total, timestamps)
- `ocr_job_documents` — per-document status (`PENDING` / `RUNNING` / `DONE` / `FAILED`)

Python service stays stateless. Spring Boot owns all state. Batches are resumable after OOM or restart — unprocessed documents remain `PENDING` and can be retried.
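Mirroring the `ocr_jobs` schema Markus sketched earlier in the thread, the per-document table could look like this (column names are illustrative, not final):

```sql
CREATE TABLE ocr_job_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES ocr_jobs(id) ON DELETE CASCADE,
    document_id UUID NOT NULL REFERENCES documents(id),
    status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    error_message TEXT,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

With per-document status durable, resuming after a restart is just re-dispatching the rows still marked `PENDING`.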

**3. Graceful degradation in production, hard dependency in dev**

`OcrService` checks the OCR service health before each job. If it's down, a clear `DomainException` is returned; the rest of the app is unaffected. The hard `depends_on: service_healthy` is kept in the dev compose only — removed in the production overlay.
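The pre-job health gate needs nothing beyond the JDK's own HTTP client. A sketch, assuming the base URL and timeouts come from configuration and the caller raises the `DomainException`:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Ping the OCR service's /health endpoint before dispatching a job.
public class OcrHealthGate {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public boolean isHealthy(String baseUrl) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/health"))
                    .timeout(Duration.ofSeconds(3))
                    .GET()
                    .build();
            return client.send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode() == 200;
        } catch (Exception e) {
            return false; // unreachable or timed out -> caller raises DomainException
        }
    }
}
```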

**4. SSE emitter timeout — ~5 minutes with auto-reconnect**

`SseEmitter` set to ~5 minutes. Browser `EventSource` reconnects automatically. On reconnect, the endpoint reads current job state from the DB and resumes streaming. The DB is the source of truth, not the emitter.

**5. LISTEN/NOTIFY explicitly deferred**

Multiple concurrent jobs (even from a single user) are handled via a `ConcurrentHashMap<UUID, List<SseEmitter>>` — ephemeral routing table, fine to lose on restart. LISTEN/NOTIFY is the upgrade path for multi-instance deployments; not needed for a single home server instance. Noted in the ADR as future direction.
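The routing table from item 5, genericized so the concurrency pattern is visible without Spring on the classpath (`L` stands in for `SseEmitter`):

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Ephemeral per-job listener registry — fine to lose on restart,
// since the DB remains the source of truth for job state.
public class EmitterRegistry<L> {
    private final ConcurrentHashMap<UUID, List<L>> emitters = new ConcurrentHashMap<>();

    public void register(UUID jobId, L emitter) {
        // CopyOnWriteArrayList: safe to iterate while other threads register
        emitters.computeIfAbsent(jobId, id -> new CopyOnWriteArrayList<>()).add(emitter);
    }

    public void broadcast(UUID jobId, Consumer<L> send) {
        emitters.getOrDefault(jobId, List.of()).forEach(send);
    }

    public void complete(UUID jobId) {
        emitters.remove(jobId);
    }
}
```

In the real service, `broadcast` would wrap `SseEmitter.send(...)` and drop emitters whose send fails.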

**6. REST routes confirmed**

- `POST /api/documents/{id}/ocr` — per-document
- `POST /api/ocr/batch` — batch
- `GET /api/ocr/jobs/{jobId}/progress` — SSE stream

`OcrController` is a top-level controller, not nested under documents.

**7. Package placement — follow existing convention**

New classes go into the existing package-by-layer structure (`controller/`, `service/`, `model/`, `repository/`, `dto/`). ADR notes package-by-feature as the preferred future direction.

The architecture is solid. The two things that matter most before writing any code: get the ADR committed, and nail down the Flyway migrations for `ocr_jobs` and `ocr_job_documents` — those tables are the backbone everything else hangs on.


👨‍💻 Felix Brandt — Developer Discussion Summary

Nine implementation decisions resolved. These lock down the class structure, frontend pattern, and annotation integration before a line of code is written.


Resolved decisions

1. Interface split — OcrClient + OcrHealthClient
Two interfaces, not one. OcrClient exposes extractBlocks(), OcrHealthClient exposes isHealthy(). RestClientOcrClient implements both. OcrBlockResult is a record. Unit tests mock only the interface they need.

2. Class breakdown — four classes

  • OcrService — single-document work (presigned URL → OCR call → block mapping → TranscriptionService)
  • OcrBatchService — batch loop, owns ocr_job_documents state, calls OcrService per document
  • OcrProgressService — owns ConcurrentHashMap<UUID, List<SseEmitter>>, exposes register() and emit()
  • RestClientOcrClient — HTTP infrastructure only

Per-document OCR (POST /api/documents/{id}/ocr) bypasses OcrBatchService entirely — goes directly to OcrService. Creates one ocr_jobs row, no ocr_job_documents rows.

3. ScriptType enum values
UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT. These are persisted as strings in the DB column — must be correct in the Flyway migration before anything else touches the column.

4. Frontend EventSource pattern
SSE stream proxied through a SvelteKit +server.ts that pipes the Spring Boot response body. Browser opens same-origin EventSource — no Spring Boot URL exposed client-side, auth cookie included automatically. OCR trigger POSTs use the generated typed API client as normal.

5. Frontend component decomposition
New components in the transcription panel:

  • OcrTrigger — wraps ScriptTypeSelect dropdown + "OCR starten" button + replacement confirmation dialog
  • OcrProgress — owns the EventSource lifecycle, receives jobId + onDone props, renders page progress bar

TranscriptionPanel holds job ID state and switches between OcrTrigger, OcrProgress, and the existing edit/read views.

6. Batch progress — document detail page, not a list banner
No cross-page banner. Each document detail page calls GET /api/documents/{id}/ocr-status in the server load function. If status is PENDING or RUNNING, the transcription panel renders OcrProgress with the job ID. User finds batch progress where they already look for transcriptions.

7. ScriptType controller validation
No custom validator. Jackson's default behaviour throws HttpMessageNotReadableException → 400 Bad Request for unknown enum values. Service null-checks dto.scriptType() and falls back to the document's stored scriptType if not provided.
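The fallback rule is tiny but worth pinning down (Python sketch; the function name is illustrative):

```python
def effective_script_type(dto_script_type, document_script_type):
    """Request body value wins when present; otherwise fall back to the
    scriptType stored on the document. Unknown enum strings never reach
    this point — Jackson has already rejected them with a 400."""
    return dto_script_type if dto_script_type is not None else document_script_type

assert effective_script_type("TYPEWRITER", "HANDWRITING_LATIN") == "TYPEWRITER"
assert effective_script_type(None, "HANDWRITING_LATIN") == "HANDWRITING_LATIN"
```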

8. Annotation overlap check — createOcrAnnotation()
OCR creates many adjacent text line annotations that would fail the existing overlap check. Solution: a separate createOcrAnnotation() method on AnnotationService that skips the overlap check. No boolean flag argument. TranscriptionService calls createOcrAnnotation() when the block source is OCR.

9. Polygon annotation sequencing
Kraken returns polygon boundaries, not rectangles. Rather than permanently storing AABB approximations, #227 (polygon annotation support — polygon JSONB column on document_annotations) ships first. Once the DB and backend are in place, Kraken integration in this feature can output proper quadrilateral shapes from day one. createOcrAnnotation() accepts the optional polygon field from the start so no rework is needed later.
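Decision 8's two-entry-point shape (no boolean flag) can be sketched like this — a Python stand-in for the Java `AnnotationService`, with hypothetical names and a simple AABB overlap test:

```python
class OverlapError(Exception):
    pass

def rects_overlap(a, b):
    # (x, y, w, h) axis-aligned boxes; strict inequality: touching edges don't overlap
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

class AnnotationService:
    """Two entry points instead of a flag argument: create_annotation enforces
    the overlap check, create_ocr_annotation skips it, because adjacent OCR
    text lines legitimately sit edge to edge."""

    def __init__(self):
        self.annotations = []

    def create_annotation(self, rect, polygon=None):
        if any(rects_overlap(rect, a["rect"]) for a in self.annotations):
            raise OverlapError("annotation overlaps an existing one")
        return self.create_ocr_annotation(rect, polygon)

    def create_ocr_annotation(self, rect, polygon=None):
        ann = {"rect": rect, "polygon": polygon}  # optional polygon from day one
        self.annotations.append(ann)
        return ann

svc = AnnotationService()
svc.create_ocr_annotation((0, 0, 100, 20))
svc.create_ocr_annotation((0, 10, 100, 20))  # overlapping OCR line: allowed
```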


The implementation order is: #227 DB + backend → then #226. Everything above is locked in before implementation starts.


🎨 Leonie Voss — UX Design & Accessibility Discussion Summary

Eight UI decisions resolved. These cover every visual state of the OCR feature from empty panel to error recovery, with full accessibility and mobile specification.


Resolved decisions

1. Empty state & trigger placement

  • Empty panel: OcrTrigger is the primary CTA, manual drawing is secondary text below
  • With existing blocks: OCR trigger is secondary (lower visual weight) with a persistent warning indicator that re-running destroys blocks AND comments
  • ocrAvailable checked server-side in the page load function — trigger not rendered at all if the OCR service is down (no flicker, no broken button)

2. ScriptTypeSelect — three options, Paraglide
Native <select> with visible <label>. Three options only — UNKNOWN removed from the UI:

  • script_type_typewriter → "Schreibmaschine" / "Typewriter" / "Máquina de escribir"
  • script_type_handwriting_latin → "Handschrift (lateinisch)" / "Handwriting (Latin)" / "Escritura manuscrita (latina)"
  • script_type_handwriting_kurrent → "Handschrift (Kurrent/Sütterlin)" / "Handwriting (Kurrent/Sütterlin)" / "Escritura manuscrita (Kurrent/Sütterlin)"

"OCR starten" button disabled until a script type is selected. Document's stored scriptType pre-selected when available. When stored value is UNKNOWN, no option pre-selected and button stays disabled.

3. Confirmation dialog — only when blocks exist
Uses ConfirmModal from #207 (dependency — must ship first or in parallel). No dialog on first OCR run. When blocks exist: dynamic block count + explicit mention of comments in the body text. "Ersetzen" uses destructive button style.

4. In-progress block visual state
border-brand-sand (#E4E2D7) muted left border on blocks during OCR processing — signals provisional state without reducing readability. Blocks rendered as read-only preview (TranscriptionReadView) beneath OcrProgress. Edit controls not rendered while OCR runs — architecture prevents interaction, no explicit disabling needed. Normal turquoise border restored when edit mode activates on completion.

5. OcrProgress component design

  • Header: "OCR läuft" in text-xs font-bold uppercase tracking-widest text-gray-400 style
  • Progress bar: brand-mint (#A6DAD8) fill on brand-sand (#E4E2D7) track
  • Page counter: right-aligned text-sm text-gray-500
  • Script type label below for context
  • Spinning icon top-right, stops on completion
  • No cancel button — OCR runs to completion once started
  • All text via Paraglide

6. Error state

  • Mid-run failure: OcrProgress transitions to error state — red left border, failure message with page count reached, "Erneut versuchen" button
  • Batch failure on page load: dismissible inline alert above any partial blocks
  • "Erneut versuchen" pre-selects the document's stored scriptType (set when OCR was triggered — no extra state needed)

7. Accessibility

  • Progress bar: role="progressbar" with aria-valuenow, aria-valuemin, aria-valuemax, aria-label
  • "OCR starten": native disabled attribute (not just aria-disabled) when no script type selected
  • Confirmation dialog: ConfirmModal from #207 handles role="dialog", focus trapping, focus return, and touch targets ≥ 44px
  • ScriptTypeSelect: native <select> with <label for=""> — keyboard navigation and screen reader support free
  • Error state heading: <h3> element so screen readers announce it when focus moves to the error state

8. Mobile layout
Full-width stacked layout at all narrow viewports:

Schrifttyp
[Schreibmaschine            ▾]   min-h-[44px]
[        OCR starten         ]   min-h-[44px]

OcrProgress is full-width by nature — no mobile-specific changes needed.


Dependency note: The OCR confirmation dialog requires ConfirmModal from #207. Both issues should be tracked together for implementation ordering.


👨‍💻 Felix Brandt — Developer Discussion Summary (Python Microservice)

Six implementation decisions resolved, plus Docker Compose and Dockerfile spec for the ocr-service.


Resolved decisions

1. HTTP framework — FastAPI
FastAPI + Pydantic v2 for the Python microservice. Typed request/response models keep the Python interface in sync with RestClientOcrClient on the Java side. Auto-generated /docs is free. Uvicorn as the ASGI server.

2. Model loading — eager load both at startup
Both Surya and Kraken models are loaded when the container starts. The /health endpoint returns 200 only after both are ready. Tobias's start_period: 60s + retries: 12 covers model load time. No lazy-loading — first request is always fast, RAM is committed upfront.
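The health-gating idea — report ready only after *both* engines are in RAM — reduces to a small readiness flag (plain-Python sketch; the real service does this in FastAPI's lifespan handler, and `ModelRegistry` is a hypothetical name):

```python
import threading

class ModelRegistry:
    def __init__(self):
        self._ready = threading.Event()
        self.surya = None
        self.kraken = None

    def load_all(self, load_surya, load_kraken):
        # Called once at startup; blocks until both models are loaded.
        self.surya = load_surya()
        self.kraken = load_kraken()
        self._ready.set()

    def healthy(self):
        # Maps to HTTP 200 when True, 503 while still loading — which is
        # exactly what the compose healthcheck's start_period covers.
        return self._ready.is_set()

registry = ModelRegistry()
assert not registry.healthy()  # container just started, models loading
registry.load_all(lambda: "surya-model", lambda: "kraken-model")
assert registry.healthy()      # healthcheck now passes
```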

3. GET /api/documents/{id}/ocr-status response shape
New endpoint, called from the document page server load function to decide whether to render OcrProgress:

{
  "status": "RUNNING",
  "jobId": "uuid",
  "currentPage": 3,
  "totalPages": 12
}

status values: NONE / PENDING / RUNNING / DONE / FAILED. currentPage + totalPages give the progress bar an initial value on page load — avoids always starting at 0% mid-job.
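The frontend side of that payload is a one-liner worth stating precisely (sketch with a hypothetical helper name):

```python
def initial_progress_percent(status, current_page, total_pages):
    """Turns the ocr-status payload into the progress bar's starting value,
    so a page load mid-job doesn't render 0% and jump."""
    if status not in ("PENDING", "RUNNING") or not total_pages:
        return None  # nothing to show — panel renders trigger or error state
    return round(100 * current_page / total_pages)

assert initial_progress_percent("RUNNING", 3, 12) == 25
assert initial_progress_percent("NONE", 0, 0) is None
```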

4. Batch OCR trigger — always manual, via Admin system page
MassImportService stays untouched — no OCR coupling. Batch OCR is triggered manually from the Admin system page. The admin controller calls OcrBatchService.startBatch(), protected by ADMIN permission (as NullX specified).

5. Non-UPLOADED documents in batch — SKIPPED as distinct terminal state
PLACEHOLDER documents in a batch are recorded as SKIPPED in ocr_job_documents — not counted as errors. The SSE done event and ocr_jobs table carry separate skipped_count alongside processed_count and error_count. Admin sees: "45 verarbeitet · 3 übersprungen · 2 fehlgeschlagen".
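Keeping SKIPPED out of the error count is a pure fold over the per-document states (sketch; `summarize` is an illustrative name for whatever builds the SSE `done` payload):

```python
def summarize(job_documents):
    """SKIPPED is a terminal state of its own, never an error. Returns the
    three counters plus the admin summary line."""
    processed = sum(1 for s in job_documents.values() if s == "DONE")
    skipped = sum(1 for s in job_documents.values() if s == "SKIPPED")
    errors = sum(1 for s in job_documents.values() if s == "FAILED")
    line = f"{processed} verarbeitet · {skipped} übersprungen · {errors} fehlgeschlagen"
    return processed, skipped, errors, line
```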

6. Python project structure

ocr-service/
├── Dockerfile
├── requirements.txt
├── main.py              ← FastAPI app, /ocr + /health endpoints, lifespan model loading
├── models.py            ← Pydantic request/response models
├── engines/
│   ├── surya.py         ← Surya engine wrapper
│   └── kraken.py        ← Kraken engine wrapper
└── models/              ← volume-mounted Kraken model files (.gitignore'd)

Docker Compose addition

Add to docker-compose.yml services:

  # --- OCR: Python microservice (Surya + Kraken) ---
  ocr-service:
    build:
      context: ./ocr-service
      dockerfile: Dockerfile
    container_name: archive-ocr
    restart: unless-stopped
    mem_limit: 6g
    memswap_limit: 6g   # equal to mem_limit = swap disabled; OOM-kill before disk thrash
    volumes:
      - ocr_models:/app/models
    environment:
      KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel
    # No ports: — internal network only; Spring Boot reaches it via http://ocr-service:8000
    networks:
      - archive-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 12
      start_period: 60s

Add ocr-service dependency to backend:

    depends_on:
      db:
        condition: service_healthy
      minio:
        condition: service_healthy
      mailpit:
        condition: service_started
      ocr-service:
        condition: service_healthy

Add to volumes: section:

  ocr_models:

Dockerfile + dependencies

ocr-service/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# curl for healthcheck; libgomp1 for PyTorch CPU threading
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# PyTorch CPU-only — separate layer; the whl/cpu index strips all CUDA variants (~2 GB saved)
RUN pip install --no-cache-dir \
    torch==2.5.1 \
    --index-url https://download.pytorch.org/whl/cpu

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

ocr-service/requirements.txt:

fastapi[standard]==0.115.6
surya-ocr==0.6.3
kraken==5.2.9
pillow==11.1.0
pypdfium2==4.30.0
httpx==0.28.1

Notes:

  • fastapi[standard] pulls in uvicorn, pydantic v2, python-multipart
  • kraken==5.2.9 — pin to 5.x; the 4.x→5.x jump changed the model format
  • pypdfium2 converts PDF pages to PIL images before passing to Surya (no poppler system dependency, keeps image lean)
  • httpx for fetching the presigned MinIO URL

Kraken model selection — requires evaluation

The issue correctly flags this as unresolved. Two HTR-United candidates for 19th–20th century German Kurrent:

  • german_kurrent_manu_9 — trained on 19th-century German administrative Kurrent
  • kurrent-de — broader coverage, lower per-page accuracy on dense text

Do not bake the model choice into the issue. The one-time runbook step: download both models into the ocr_models volume, run against 2–3 sample documents from the actual archive, keep the better one at the path KRAKEN_MODEL_PATH points to. The env var abstraction in Docker Compose means the model file can be swapped without touching any code.
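The swap-without-code-change property comes entirely from reading the path at runtime (sketch of the engine wrapper's side of the contract):

```python
import os

def kraken_model_path():
    """Docker Compose injects KRAKEN_MODEL_PATH; swapping the winning model
    is a file rename in the ocr_models volume, never a code change. The
    default mirrors the compose value."""
    return os.environ.get("KRAKEN_MODEL_PATH", "/app/models/german_kurrent.mlmodel")
```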


Implementation Complete — OCR Pipeline

Branch: feat/issue-226-227-ocr-pipeline-polygon (implements #226 and #227 together as agreed in the discussion)

Commits

  • ec32d22 — ADR-001 (OCR microservice) and ADR-002 (polygon JSONB)
  • 878a90a — Polygon JSONB support: V23 migration, PolygonConverter, @UniquePoints validator, CreateAnnotationDTO
  • c19c41f — createOcrAnnotation() on AnnotationService (skips overlap check)
  • d194b6b — ScriptType enum, V24 migration, Document entity + DTO update
  • ff39907 — OCR infrastructure: OcrClient/OcrHealthClient interfaces, OcrBlockResult, OcrJob/OcrJobDocument entities, V25 migration, repos, DTOs, ErrorCodes
  • aea46c5 — OcrService, OcrBatchService, OcrProgressService, OcrController (19 new tests)
  • 6737bd6 — Python OCR microservice (FastAPI + Surya + Kraken), RestClientOcrClient, Docker Compose
  • cf8dc35 — Frontend: AnnotationShape.svelte extraction + polygon rendering (clip-path)
  • a4651aa — Frontend: ScriptTypeSelect, OcrTrigger, OcrProgress components + Paraglide translations (de/en/es)
  • 931fbc2 — Fix: @JdbcTypeCode(JSON) for polygon JSONB column

What was implemented

Backend (Java/Spring Boot):

  • ScriptType enum: UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT
  • OcrClient + OcrHealthClient interfaces (mockable for TDD)
  • OcrService: single-document OCR orchestration (health check → clear blocks → OCR → create annotations + blocks)
  • OcrBatchService: batch processing with @Async, per-document status, SKIPPED for PLACEHOLDER docs, failure isolation
  • OcrProgressService: SSE emitter registry per job ID
  • OcrController: POST /api/documents/{id}/ocr, POST /api/ocr/batch, GET /api/ocr/jobs/{id}, GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status
  • RestClientOcrClient: HTTP client to Python microservice
  • 3 Flyway migrations (V23 polygon, V24 script_type, V25 ocr_jobs)
  • Polygon annotation support with @JdbcTypeCode(JSON) + DB CHECK constraint

Python microservice (ocr-service/):

  • FastAPI app with /ocr and /health endpoints
  • Surya engine wrapper (typewriter + modern handwriting)
  • Kraken engine wrapper (Kurrent/Sütterlin) with pure-Python polygon-to-quad (gift wrapping + rotating calipers)
  • Docker: CPU-only PyTorch, 6GB mem_limit, health check with 60s start_period, no host ports
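The pure-Python polygon-to-quad mentioned above can be sketched roughly as follows: gift wrapping for the convex hull, then the minimum-area enclosing rectangle, which always shares a side with a hull edge — here found by trying every edge, an O(n²) stand-in for true rotating calipers. A sketch under those assumptions, not the shipped code:

```python
import math

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def convex_hull(points):
    """Gift wrapping (Jarvis march): walk the hull, always taking the most
    clockwise remaining point relative to the current edge."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull, start = [], min(pts)
    p = start
    while True:
        hull.append(p)
        q = pts[0] if pts[0] != p else pts[1]
        for r in pts:
            if r == p:
                continue
            c = cross(p, q, r)
            # r lies right of p->q, or further along a collinear edge: take it
            if c < 0 or (c == 0 and dist2(p, r) > dist2(p, q)):
                q = r
        p = q
        if p == start:
            return hull

def min_area_quad(points):
    """Minimum-area enclosing rectangle of a Kraken line polygon. Testing
    every hull edge is fine for polygons with a few dozen vertices."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return hull
    best = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        norm = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / norm, (y2 - y1) / norm  # edge direction
        vx, vy = -uy, ux                             # perpendicular
        us = [px * ux + py * uy for px, py in hull]  # projections onto axes
        vs = [px * vx + py * vy for px, py in hull]
        area = (max(us) - min(us)) * (max(vs) - min(vs))
        if best is None or area < best[0]:
            best = (area, ux, uy, vx, vy, min(us), max(us), min(vs), max(vs))
    _, ux, uy, vx, vy, u0, u1, v0, v1 = best
    # map the four rectangle corners back to image coordinates
    return [(u * ux + v * vx, u * uy + v * vy)
            for u, v in ((u0, v0), (u1, v0), (u1, v1), (u0, v1))]

print(min_area_quad([(0, 0), (4, 0), (4, 2), (0, 2), (2, 1)]))
```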

Frontend (SvelteKit):

  • AnnotationShape.svelte: renders rect or polygon via CSS clip-path: polygon()
  • ScriptTypeSelect, OcrTrigger, OcrProgress components
  • Paraglide translations for all OCR UI strings (de/en/es)
  • Error codes mapped: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND, OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED

Test results

  • Backend: 810 tests, 0 failures
  • Frontend: 687 tests (70 files), 0 failures

Next steps

  • Open PR for review
  • Download and evaluate Kraken models for Kurrent (one-time runbook step)
  • Regenerate frontend API types once backend is running with the new endpoints
Reference: marcel/familienarchiv#226