feat: OCR pipeline with NDJSON streaming and real-time progress (#226, #227, #231) #229

marcel merged 74 commits from feat/issue-226-227-ocr-pipeline-polygon into main 2026-04-13 12:39:04 +02:00
Owner

Summary

  • Local OCR pipeline with Surya (typewriter/Latin) and Kraken (Kurrent) engine support
  • NDJSON streaming (POST /ocr/stream) — Python sends one JSON line per completed page, Java consumes incrementally
  • Persist transcription blocks as each page arrives instead of waiting for all pages, eliminating the 10-minute timeout on long documents
  • Per-page progress in the frontend ("Seite 3 von 7 wird analysiert…", i.e. "Analyzing page 3 of 7…") instead of a generic spinner
  • Polygon annotation support for quadrilateral bounding boxes
  • Allow re-running OCR when transcription blocks already exist

Closes #226, closes #227, closes #231

Changes by layer

Python OCR Service

  • Surya engine (typewriter, Latin handwriting) and Kraken engine (Kurrent/Sütterlin)
  • extract_page_blocks() per-page processing for both engines
  • POST /ocr/stream NDJSON endpoint with start/page/error/done event protocol
  • Per-page error handling: log traceback, yield generic error, continue with next page
  • Confidence-based [unleserlich] marker insertion with per-script thresholds
  • X-Accel-Buffering: no + Cache-Control: no-cache for reverse-proxy compatibility
  • 38 tests, 88% coverage on production code
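The start/page/error/done event protocol can be sketched as a plain generator, one JSON object per line — a minimal illustration with hypothetical field names, not the actual service code:

```python
import json

def stream_events(page_texts):
    """Yield NDJSON lines following the start/page/error/done protocol.

    Each yielded string is one complete JSON object terminated by a
    newline, so the consumer can parse the stream line by line.
    """
    yield json.dumps({"event": "start", "totalPages": len(page_texts)}) + "\n"
    done_pages = 0
    for page_number, text in enumerate(page_texts, start=1):
        try:
            blocks = [{"text": text, "page": page_number}]  # stand-in for real OCR
            yield json.dumps({"event": "page", "page": page_number, "blocks": blocks}) + "\n"
            done_pages += 1
        except Exception:
            # Per-page error handling: emit a generic error event, keep going.
            yield json.dumps({"event": "error", "page": page_number}) + "\n"
    yield json.dumps({"event": "done", "pages": done_pages}) + "\n"
```

In the real endpoint a generator like this would be wrapped in a FastAPI StreamingResponse with an NDJSON media type, which is what lets Java consume pages incrementally.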

Java Backend

  • OcrStreamEvent sealed interface with Start/Page/Error/Done record subtypes
  • OcrClient.streamBlocks() with default fallback that synthesizes events from extractBlocks()
  • RestClientOcrClient.streamBlocks() parses NDJSON with dedicated ObjectMapper, falls back to /ocr on 404
  • OcrAsyncRunner.runSingleDocument() uses streaming with per-page block persistence
  • Batch path (runBatch) unchanged — stays on old extractBlocks()
  • Polygon JSONB column on annotations for quadrilateral shapes

Frontend

  • translateOcrProgress extracted to testable $lib/ocr/ module
  • ANALYZING_PAGE:current:total:blocks and DONE:count:skipped progress codes
  • Skipped-pages warning in amber when pages fail during OCR
  • Collapsible "OCR erneut ausführen…" ("Run OCR again…") trigger in edit mode when blocks exist
  • Script type selection (Typewriter / Latin / Kurrent)
  • i18n keys for de/en/es
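The progress codes are plain colon-separated strings; a minimal parser sketch (in Python for illustration — the real translateOcrProgress lives in TypeScript, and the result field names here are assumptions):

```python
def parse_progress(code: str) -> dict:
    """Parse ANALYZING_PAGE:current:total:blocks and DONE:count:skipped codes."""
    parts = code.split(":")
    if parts[0] == "ANALYZING_PAGE":
        current, total, blocks = map(int, parts[1:4])
        return {"kind": "analyzing", "current": current, "total": total, "blocks": blocks}
    if parts[0] == "DONE":
        count, skipped = map(int, parts[1:3])
        return {"kind": "done", "count": count, "skipped": skipped}
    # Unknown codes pass through untranslated rather than failing.
    return {"kind": "unknown", "raw": code}
```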

Test plan

  • Trigger OCR on a multi-page PDF — verify per-page progress messages appear
  • Verify transcription blocks are created incrementally (visible after each page)
  • Kill OCR mid-stream — verify partial blocks are preserved, re-trigger works
  • Verify annotations appear on correct pages (1-based)
  • Verify re-run OCR option appears in edit mode when blocks exist
  • cd backend && ./mvnw test — 831 tests pass
  • cd frontend && npm run test — 698 tests pass
  • cd ocr-service && python -m pytest — 38 tests pass
marcel added 10 commits 2026-04-12 17:55:55 +02:00
ADR-001 documents the decision to use a separate Python container for
OCR (Surya + Kraken), the interface contract, and why alternatives
like Tess4J were rejected.

ADR-002 documents the decision to store polygon annotations as JSONB
with a 4-point CHECK constraint, backed by an AttributeConverter.

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- V23 migration adds polygon JSONB column with 4-point CHECK constraint
- PolygonConverter: AttributeConverter for List<List<Double>> <-> JSONB
- @UniquePoints custom validator rejects duplicate coordinates
- CreateAnnotationDTO: validated optional polygon field
- DocumentAnnotation entity: polygon field with converter

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScriptType enum: UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT
- V24 migration adds script_type VARCHAR(30) NOT NULL DEFAULT 'UNKNOWN'
- Document entity: scriptType field with @Builder.Default UNKNOWN
- DocumentUpdateDTO: optional scriptType field
- DocumentService: wires scriptType through update method

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OcrClient + OcrHealthClient interfaces for testable OCR integration
- OcrBlockResult record for OCR engine response mapping
- OcrJob + OcrJobDocument entities with status enums
- V25 migration creates ocr_jobs and ocr_job_documents tables
- Repositories for job and job-document queries
- TriggerOcrDTO, BatchOcrDTO (@Size max=500), OcrStatusDTO
- ErrorCodes: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND,
  OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OcrService: single-document OCR (health check, block clearing,
  presigned URL, annotation + block creation)
- OcrBatchService: batch processing with @Async, per-document status
  tracking, SKIPPED for PLACEHOLDER documents, failure isolation
- OcrProgressService: SSE emitter registry per job ID with 5-min timeout
- OcrController: POST /api/documents/{id}/ocr (WRITE_ALL),
  POST /api/ocr/batch (ADMIN), GET /api/ocr/jobs/{id} (READ_ALL),
  GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status

19 tests: 6 OcrService, 4 OcrBatchService, 3 OcrProgressService, 6 OcrController

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
  pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI

Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback

Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AnnotationShape.svelte: renders a single annotation as either a
  rectangle or a polygon-clipped div (via CSS clip-path: polygon())
- AnnotationLayer.svelte: refactored to delegate rendering to
  AnnotationShape, keeping draw logic and hover state management
- Annotation type: added optional polygon field ([number, number][] | null)
- Polygon coordinates are converted from page-normalized to
  bounding-box-relative percentages for clip-path

All 687 existing frontend tests pass.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScriptTypeSelect: native select for TYPEWRITER/HANDWRITING_LATIN/KURRENT
- OcrTrigger: wraps script type select + start button + confirmation dialog
- OcrProgress: SSE-based progress display with page counter and progress bar
- Paraglide translations for OCR (de/en/es): script types, trigger labels,
  confirmation dialog, progress messages, error messages
- ErrorCode type + getErrorMessage: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND,
  OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED

All 687 frontend tests pass.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(annotations): use @JdbcTypeCode(JSON) for polygon JSONB column
Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON)
to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar.

The PolygonConverter is retained as a standalone utility but no longer
used on the entity. Hibernate 6 natively handles List<List<Double>>
serialization to JSONB.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Clean commit history, good TDD evidence (36 new backend tests), and the component extraction of AnnotationShape.svelte is exactly right. Several items I'd want addressed before merge.

Blockers

  1. OcrService.startOcr() is @Transactional but calls an external HTTP service inside the transaction (OcrService.java:39-76). The ocrClient.extractBlocks() call at line 82 (via processDocument) makes an HTTP request to the Python service. If that call takes 30–60 seconds per page on CPU, the database transaction is held open for the entire duration. This will exhaust the connection pool under concurrent use. Fix: split into two methods — one @Transactional for the DB writes (clear blocks, create blocks), and a non-transactional orchestrator that calls the OCR service outside the transaction boundary.

  2. OcrService.processDocument() is package-private but called from OcrBatchService (OcrService.java:78). Both classes are in the same package so this compiles, but it's an implicit coupling. If anyone moves OcrBatchService to a different package, it breaks silently. Make processDocument explicitly public or extract it to an interface.

  3. OcrService bypasses TranscriptionService.createBlock() and writes directly to TranscriptionBlockRepository (OcrService.java:113). The existing TranscriptionService.createBlock() handles sanitizeText(), saveVersion(), and audit fields. The OCR path skips all of this — no version history entry, no text sanitization. Use TranscriptionService or extract the shared logic.
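The split proposed in blocker 1 — keep the transaction tight around the DB writes and make the slow HTTP call outside it — can be illustrated language-agnostically. A Python sketch with stand-in functions, not the Spring code:

```python
from contextlib import contextmanager

LOG = []  # records the order of operations for illustration

@contextmanager
def transaction():
    """Stand-in for a database transaction boundary."""
    LOG.append("tx begin")
    yield
    LOG.append("tx commit")

def call_ocr_service(document_id):
    """Stand-in for the minutes-long HTTP call to the OCR service."""
    LOG.append(f"ocr {document_id}")
    return [{"text": "line", "page": 1}]

def run_ocr(document_id):
    # Non-transactional orchestrator: the slow OCR call holds no
    # database connection, so the pool is not exhausted under load.
    blocks = call_ocr_service(document_id)
    with transaction():
        # Transactional part: clear old blocks, persist new ones.
        LOG.append(f"clear blocks {document_id}")
        LOG.append(f"save {len(blocks)} blocks")
```

The point is only the ordering: the transaction opens after the external call returns and closes as soon as the writes are done.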

Suggestions

  • RestClientOcrClient has no timeout configured (RestClientOcrClient.java:21). The RestClient uses default timeouts. For an OCR call that can take minutes on CPU, add explicit connect/read timeouts: .requestFactory(new JdkClientHttpRequestFactory(HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build())) and set a read timeout proportional to expected processing time.

  • PolygonConverter is now unused on the entity but still in the codebase (PolygonConverter.java). The fix commit switched to @JdbcTypeCode(SqlTypes.JSON). Either delete the converter or add a comment explaining it's retained as a standalone utility — currently it looks like dead code.

  • OcrBatchService.processBatchAsync() does one ocrJobRepository.save(job) per document in the loop (OcrBatchService.java:95-96). For a 500-document batch, that's 1000+ individual saves. Consider batching the job state update less frequently (e.g. every 10 documents) or at least not saving the parent job on every iteration.

  • No @Valid on the TriggerOcrDTO in the controller (OcrController.java:49). The BatchOcrDTO correctly has @Valid, but TriggerOcrDTO does not. If validation annotations are added to TriggerOcrDTO later, they won't fire without @Valid.

  • AnnotationShape.svelte:53 opacity logic — the faded prop controls the dimming of non-active annotations. The naming is good, but consider documenting the prop with a JSDoc comment since the semantics of faded vs dimmed are subtle (dimmed = all annotations muted in non-transcription mode; faded = another annotation is active).


🏛️ Markus Keller — Application Architect

Verdict: ⚠️ Approved with concerns

The architecture is sound. ADRs are present, the Python microservice is justified, the job state is in PostgreSQL (not in-memory), and the interface boundaries (OcrClient, OcrHealthClient) are clean. Two structural concerns.

Blockers

  1. OcrController injects two repositories directly (OcrController.java:39-40). OcrJobRepository and OcrJobDocumentRepository are accessed directly from the controller for getJobStatus() and getDocumentOcrStatus(). This violates the layering rule: controllers delegate to services, services own repositories. Create read methods on OcrService or a dedicated OcrJobService to wrap these queries. The controller should never touch a repository.

  2. @Transactional on startOcr() wraps an external HTTP call — same as Felix flagged. From the architecture perspective: the transaction boundary must match the data boundary. The OCR HTTP call is not part of the data boundary. The method should be split: @Transactional for clearing blocks + creating blocks, non-transactional for the HTTP call.

Suggestions

  • OcrBatchService has a direct dependency on OcrService rather than on an interface. For now this is fine (both are in the same package, single deployment), but if the batch path ever needs to be swapped (e.g. for a queue-based implementation), extracting an interface for document-level OCR processing would make that possible without touching the batch service.

  • The processDocument method on OcrService takes a Document entity object as a parameter (OcrService.java:78). This leaks the persistence model across the service boundary. Consider taking just the fields needed (documentId, filePath, fileHash, scriptType) or a value object. Not a blocker — the two classes are in the same package — but worth noting for when package-by-feature refactoring happens.

  • ocr_job_documents table has no unique constraint on (job_id, document_id) (V25__add_ocr_job_tables.sql). The same document could be added to the same batch job twice. Consider adding UNIQUE(job_id, document_id).

  • SSE endpoint returns SseEmitter directly from the controller (OcrController.java:76). This works but ties the controller to the Servlet API. If SvelteKit proxies this via a +server.ts endpoint (as the discussion resolved), the proxy must correctly pass through text/event-stream content type and disable response buffering. Worth a comment in the controller or ADR noting this constraint.


🧪 Sara Holt — QA Engineer

Verdict: ⚠️ Approved with concerns

36 new backend tests with good coverage of the core OCR flow. 687 frontend tests still passing. The TDD approach is visible in the commit history. Some gaps in test coverage.

Blockers

  1. No integration test for the polygon round-trip — the PolygonConverterTest covers serialization in isolation, and AnnotationServiceTest mocks the repository. But there is no @DataJpaTest that writes an annotation with a polygon to a real PostgreSQL (Testcontainers) and reads it back. The JSONB type mismatch bug that was caught and fixed (@JdbcTypeCode commit) would have been caught earlier by such a test. This is a regression risk.

  2. OcrBatchService.processBatchAsync() is @Async but has no integration test verifying async behavior. The unit test calls processBatchAsync directly (synchronously). There should be a @SpringBootTest integration test that verifies startBatch returns immediately and the job eventually transitions to DONE using Awaitility. Without this, there's no test confirming the @Async annotation is wired correctly.

Suggestions

  • OcrServiceTest doesn't verify text sanitization on OCR-produced blocks — Felix flagged that OcrService bypasses TranscriptionService.createBlock(). Until that's fixed, there should be a test asserting that OCR text exceeding 10K characters is handled gracefully (currently it's not sanitized at all).

  • OcrControllerTest doesn't test the SSE progress endpoint (/api/ocr/jobs/{jobId}/progress). Testing SSE with MockMvc requires MockMvcBuilders.standaloneSetup() with an async dispatcher or WebTestClient. I'd add at least a test confirming the endpoint returns text/event-stream content type.

  • UniquePointsValidatorTest doesn't test the edge case of exactly 4 identical points (all four points the same coordinate). The current validator checks new HashSet<>(polygon).size() == polygon.size(), which would correctly reject this, but an explicit test documents the behavior.

  • No test for OcrController.getDocumentOcrStatus() when a RUNNING job exists — only the NONE case is tested. Add a test where ocrJobDocumentRepository.findFirstByDocumentIdAndStatusIn() returns a result and verify the DTO is populated correctly.

  • The OcrBatchServiceTest.processBatchAsync_continuesAfterSingleDocumentFailure test is excellent — it verifies failure isolation, which is the most important batch behavior. Good work.

  • Frontend tests: no component test for OcrTrigger or OcrProgress — these are new UI components with non-trivial logic (confirmation dialog flow, EventSource lifecycle). Consider adding Vitest component tests that verify: (1) OcrTrigger shows confirmation when existingBlockCount > 0, (2) OcrTrigger calls onTrigger with the selected script type, (3) OcrProgress renders the progress bar with correct aria attributes.
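The validator check referenced above (new HashSet<>(polygon).size() == polygon.size()) has a direct set-based analog, sketched here in Python together with the all-identical-points edge case:

```python
def has_unique_points(polygon: list[tuple[float, float]]) -> bool:
    """True iff no two points of the polygon share the same coordinates."""
    return len(set(polygon)) == len(polygon)
```

Four identical points collapse to a one-element set, so the size comparison correctly rejects that edge case — the suggested test would simply pin this behavior down.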


🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

Good security posture overall: @RequirePermission on all OCR endpoints, @Size(max=500) on the batch DTO, OCR service on internal network only. One blocker and several items to address.

Blockers

  1. SSRF via pdfUrl parameter in the Python OCR service (main.py:59,77-81). The OCR service downloads a PDF from whatever URL is provided in the pdfUrl field. Spring Boot constructs this URL from the S3 internal path (OcrService.java:117-118), but the Python service has no validation that the URL points to the expected MinIO host. A compromised or misconfigured Spring Boot instance could send http://169.254.169.254/latest/meta-data/ (cloud metadata) or any internal service URL. Fix: Validate the URL scheme and host in the Python service against an allowlist (e.g. ALLOWED_PDF_HOSTS=minio:9000). This is CWE-918 (Server-Side Request Forgery).
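A minimal host allowlist check for the Python side might look as follows (variable and function names are assumptions; the real service would read the allowlist from an environment variable such as ALLOWED_PDF_HOSTS):

```python
from urllib.parse import urlparse

ALLOWED_PDF_HOSTS = {"minio:9000"}  # assumed config, e.g. loaded from the environment

def validate_pdf_url(pdf_url: str) -> None:
    """Reject URLs that do not point at the expected internal MinIO host.

    Guards against SSRF (CWE-918): without this, a crafted pdfUrl could
    make the service fetch cloud metadata or other internal endpoints.
    """
    parsed = urlparse(pdf_url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"disallowed scheme: {parsed.scheme!r}")
    if parsed.netloc not in ALLOWED_PDF_HOSTS:
        raise ValueError(f"disallowed host: {parsed.netloc!r}")
```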

Suggestions

  • OcrController.resolveUserId() silently returns null on failure (OcrController.java:105-112). If the user lookup fails, OCR proceeds with userId=null. This means the created_by field on the job and all created annotations/blocks will be null — losing the audit trail. Consider throwing DomainException.unauthorized() instead of silently degrading. The same pattern exists on AnnotationController — a pre-existing issue, but worth fixing here since this is security-sensitive (who triggered the destructive block replacement?).

  • No rate limiting on POST /api/documents/{id}/ocr — each call triggers a CPU-intensive operation on the Python service. A user with WRITE_ALL permission could trigger hundreds of concurrent OCR jobs, exhausting the server. Consider adding a per-user or global concurrency check (e.g. only one per-document OCR at a time, reject if a job is already RUNNING for this document).

  • The BatchOcrDTO.documentIds list is validated for size but not for duplicates (BatchOcrDTO.java). Submitting the same document ID 500 times would pass the @Size(max=500) check but process the same document 500 times, wasting CPU. Add a @UniqueElements validation or deduplicate in the service.

  • RestClientOcrClient logs the OCR service error message at WARN level (RestClientOcrClient.java:54). If the error message contains the presigned URL (which includes S3 credentials), this leaks credentials to the log. Use log.warn("OCR service health check failed") without e.getMessage(), or sanitize the message.

  • Python service _download_and_convert_pdf has no response size limit (main.py:77-81). A crafted URL could serve a multi-GB file, exhausting the container's 6GB memory limit. Add a Content-Length check before reading the full response, or stream with a size cap.

  • Good: @RequirePermission(Permission.ADMIN) on the batch endpoint, @RequirePermission(Permission.WRITE_ALL) on per-document OCR — matches the discussion. The SSE progress endpoint correctly uses READ_ALL. BatchOcrDTO has @Size(max=500). The OCR service has no host port mapping in Docker Compose. The polygon DTO validation (@DecimalMin, @DecimalMax, @Size, @UniquePoints) is thorough.
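The size cap suggested above can be enforced over the streamed response chunks regardless of HTTP client — a sketch with an assumed 200 MB limit (the real code would feed it from httpx's streaming API):

```python
MAX_PDF_BYTES = 200 * 1024 * 1024  # assumed cap, well under the 6 GB container limit

def read_capped(chunks, max_bytes: int = MAX_PDF_BYTES) -> bytes:
    """Accumulate response chunks, aborting once the cap is exceeded.

    Checking while streaming (rather than trusting Content-Length alone)
    also catches servers that lie about or omit the header.
    """
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"response exceeds {max_bytes} bytes")
    return bytes(buf)
```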


🎨 Leonie Voss — UX Design & Accessibility

Verdict: ⚠️ Approved with concerns

The component decomposition follows the discussion agreements: ScriptTypeSelect is a native <select> with a visible <label>, OcrTrigger wraps it with the confirmation flow, OcrProgress shows the SSE-driven progress bar. Accessibility is partially addressed. One blocker.

Blockers

  1. OcrProgress.svelte progress bar is missing role="progressbar" and ARIA attributes. The discussion explicitly required role="progressbar" with aria-valuenow, aria-valuemin, aria-valuemax, and aria-label. The current implementation renders a styled <div> for the progress bar without any ARIA attributes. Screen readers will not announce progress changes. Fix: add role="progressbar" aria-valuenow={progressPercent} aria-valuemin={0} aria-valuemax={100} aria-label={m.ocr_progress_heading()} to the progress bar track element.

Suggestions

  • ScriptTypeSelect — the disabled placeholder option should have value="" so that when no script type is selected, bind:value cleanly produces an empty string. Verify this is the case; if the placeholder has no explicit value, some browsers return the text content as the value.

  • OcrTrigger — the "OCR starten" button uses disabled correctly (native attribute, not just aria-disabled), which matches the discussion agreement. Good.

  • AnnotationShape.svelte clip-path rendering — the polygon clip-path approach is clever and avoids the complexity of switching to SVG. However, clip-path: polygon() clips the background fill and the block number badge. If a polygon annotation's top-left corner is not at the bounding box top-left, the badge will be clipped and invisible. Consider positioning the badge outside the clip-path by placing it as a sibling element rather than a child of the clipped div.

  • No visual distinction between rectangular and polygon annotations — the discussion (Leonie's comment) asked: "Will users be able to distinguish polygon annotations from rectangle annotations?" The current implementation makes them visually identical. Consider a subtle visual cue — for example, a slightly different stroke style on polygon annotations (dashed border vs solid) — so users can see that Kraken produced a precise shape fit rather than a generic rectangle.

  • Mobile layout for ScriptTypeSelect + OcrTrigger — the discussion specified full-width stacked layout at narrow viewports with min-h-[44px] touch targets. The button has min-h-[44px] and w-full. The select also has min-h-[44px]. The stacked layout uses flex-col gap-3. This matches the spec. Good.

  • Error state in OcrProgress — a red left border and heading is good. The retry button text "Erneut versuchen" matches the discussion. The error heading uses <h3> as specified for screen reader announcement. Verify that focus moves to the error heading when the error state appears (currently it likely doesn't — would require programmatic focus management).


🛠️ Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The Docker Compose integration is well done: mem_limit: 6g, memswap_limit: 6g, health check with start_period: 60s, no host port exposure, named volume for models. One blocker and several operational concerns.

Blockers

  1. The backend depends_on: ocr-service: condition: service_healthy makes the entire application fail to start if the OCR service is unhealthy (docker-compose.yml). The discussion explicitly resolved that OCR should degrade gracefully — the rest of the app must work without OCR. The hard dependency means if the OCR Docker image fails to build, or if model loading crashes (OOM), or if the user hasn't downloaded Kraken models yet, the entire stack won't start. Fix: Remove the depends_on for ocr-service from the backend, or move it to a dev-only override. The OcrHealthClient.isHealthy() check already handles runtime unavailability gracefully.
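The dev-only override mentioned in the fix could look like this — a hypothetical sketch, assuming the service names from this compose file:

```yaml
# docker-compose.override.yml (dev only, hypothetical sketch)
# Keeps the hard startup dependency for local development while the
# production compose file lets the backend start without OCR.
services:
  backend:
    depends_on:
      ocr-service:
        condition: service_healthy
```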

Suggestions

  • ocr-service Dockerfile pins torch==2.5.1, which may not be the latest CPU build (ocr-service/Dockerfile:13). Pinning intentionally is fine, but add a comment noting the version choice and that it should be updated alongside the other dependencies. --no-cache-dir is already used — good.

  • No .dockerignore in ocr-service/ — the build context will include any model files, __pycache__, .git, etc. that happen to be in the directory. Create an ocr-service/.dockerignore:

    __pycache__
    *.pyc
    models/
    .git
    
  • surya-ocr==0.6.3 and kraken==5.2.9 pinned versions (requirements.txt) — good for reproducibility. Consider adding a comment noting when these were last verified, or set up Dependabot/Renovate for the Python service.

  • The ocr_models volume is empty by default — the container will start, load Surya models from pip-installed packages (bundled), but Kraken will log a warning and be unavailable. The runbook for downloading the Kraken model is mentioned in the issue discussion but not documented anywhere in the repo. Add an ocr-service/README.md or a script scripts/download-kraken-model.sh so the setup step isn't lost.

  • No restart behavior specified for the backend when ocr-service crashes — since ocr-service has restart: unless-stopped, it will auto-restart. But the backend's depends_on only applies at initial startup. If the OCR service crashes and restarts mid-operation, in-flight OCR jobs will fail. The OcrBatchService failure isolation handles this per-document, which is correct. No action needed, just noting the behavior.

  • APP_OCR_BASE_URL and APP_S3_INTERNAL_URL are hardcoded in docker-compose.yml rather than using .env variables. For consistency with the rest of the compose file (which uses ${VAR} patterns), consider APP_OCR_BASE_URL: ${OCR_BASE_URL:-http://ocr-service:8000}.

  • Good: memswap_limit: 6g equal to mem_limit disables swap — correct, OOM-kill is better than thrashing. Health check retries: 12 with interval: 10s gives 2 minutes for model loading. restart: unless-stopped on the OCR service. Named ocr_models volume for model persistence across rebuilds.

marcel added 1 commit 2026-04-12 18:41:16 +02:00
fix(ocr): relax pillow version to match surya-ocr constraint
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
d49010cd7b
surya-ocr 0.6.3 requires pillow<11.0.0,>=10.2.0. The previous
pin at 11.1.0 caused a dependency resolution failure during
Docker build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 18:49:33 +02:00
fix(ocr): upgrade kraken to 6.0.3 for torch>=2.4 compatibility
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 3s
e29c865016
kraken 5.2.9 required torch~=2.1.0, incompatible with surya-ocr's
torch>=2.3.0. kraken 6.0.3 requires torch>=2.4.0,<=2.9 which
overlaps with surya and our pinned torch==2.5.1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 18:54:33 +02:00
feat(ocr): bump to latest surya 0.17.1, kraken 7.0, torch 2.7.1
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
49975154d9
- surya-ocr 0.6.3 → 0.17.1: new predictor API (FoundationPredictor,
  RecognitionPredictor, DetectionPredictor), native polygon output
  on text lines (4-point clockwise)
- kraken 5.2.9 → 7.0: wider torch range (>=2.4,<=2.10), unpinned numpy
- torch 2.5.1 → 2.7.1: satisfies surya's >=2.7.0 requirement
- Rewrite engines/surya.py for the 0.17 predictor class API
- Surya now outputs polygons natively — no longer rectangle-only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 19:17:38 +02:00
feat(ocr): auto-insert [unleserlich] markers for low-confidence words
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
c74539b04b
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
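The marker behavior this commit describes — replace sub-threshold words, collapse adjacent markers — could be sketched roughly like this. Names follow the commit; the signature and internals are assumed, simplified:

```python
def apply_confidence_markers(words, threshold=0.3, marker="[unleserlich]"):
    """Replace words below the confidence threshold with a marker,
    collapsing runs of adjacent markers into a single one.

    `words` is a list of (text, confidence) pairs (assumed shape).
    """
    out = []
    for text, conf in words:
        token = marker if conf < threshold else text
        # Collapse: skip a marker that directly follows another marker.
        if token == marker and out and out[-1] == marker:
            continue
        out.append(token)
    return " ".join(out)
```

For example, two adjacent low-confidence words yield a single `[unleserlich]` between the readable words.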
marcel added 1 commit 2026-04-12 19:20:59 +02:00
feat(ocr): add Kraken model download and evaluation script
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 2s
41f9262238
Runbook script to download both HTR-United Kurrent model candidates
(german_kurrent_manu_9, kurrent-de) into the ocr_models Docker volume,
test them against sample documents, and activate the winner.

Usage:
  ./scripts/download-kraken-models.sh              # download both
  ./scripts/download-kraken-models.sh --activate 1  # pick model 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 19:35:00 +02:00
fix(ocr): pin transformers<5.0 and torch==2.7.1 in requirements.txt
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
6669fffead
transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing
pad_token_id. Pin to transformers>=4.56.1,<5.0.0.

Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading
it past the CPU-only build installed in the Dockerfile layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 19:42:56 +02:00
feat(ocr): extend model script with automatic OCR evaluation
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 2s
0af4749677
Downloads both Kraken models, then runs each against 4 sample PDFs
from the import folder (Eu-0693, Eu-0692, W-0150, W-0575). Output
goes to ocr-model-evaluation/<model-name>/<doc>.txt for side-by-side
comparison.

Usage:
  ./scripts/download-kraken-models.sh           # download + evaluate
  ./scripts/download-kraken-models.sh --eval-only  # re-run evaluation
  ./scripts/download-kraken-models.sh --activate 1  # pick winner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 19:47:33 +02:00
fix(ocr): install torchvision from CPU index alongside torch
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
37abc376ec
torchvision installed from PyPI expects CUDA torch operator
registrations. Installing from the CPU whl index ensures torchvision
matches the CPU-only torch build. Fixes 'torchvision::nms does not
exist' RuntimeError on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
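The install pattern this commit describes — torch and torchvision from the same CPU wheel index so their operator registrations match — looks roughly like this Dockerfile fragment (a sketch, not the exact layer; the torch version is the one pinned earlier in this PR):

```dockerfile
# Install CPU-only torch and torchvision from the same wheel index so
# torchvision's operator registrations match the CPU torch build.
RUN pip install --no-cache-dir \
    --index-url https://download.pytorch.org/whl/cpu \
    torch==2.7.1 torchvision
```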
marcel added 1 commit 2026-04-12 20:05:59 +02:00
fix(ocr): update model script for kraken 7 DOI-based downloads
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
f12b41161e
Kraken 7 uses DOIs (not short names) to identify models from Zenodo.
Updated to use actual DOIs:
- 10.5281/zenodo.7933463 — German handwriting HTR
- 10.5281/zenodo.13788177 — McCATMuS generic handwritten/printed/typed

Added -f pdf flag for PDF input, volume mounts for import dir,
and post-download copy from htrmopo cache to the models volume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 20:09:52 +02:00
fix(ocr): parse kraken 'Model dir' output to locate downloaded model
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
c0004f5e6f
The previous approach used find across the htrmopo cache which failed
because -newer /tmp ran in a separate container. Now parses the
'Model dir: <path>' line from kraken get output directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 20:11:43 +02:00
fix(ocr): add pyvips for kraken PDF input support
Some checks failed
CI / Unit & Component Tests (push) Failing after 0s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
31519af1a4
Kraken 7 requires pyvips (optional dep) for -f pdf mode.
Added libvips42 system package and pyvips Python package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 20:38:01 +02:00
fix(ocr): extract PDF pages as PNGs before running kraken OCR
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
dd078d50da
Kraken's -f pdf mode tries to write output next to the input file,
which fails on read-only mounts. Instead, extract pages as PNGs via
pypdfium2 (already installed), then run kraken on each image.
Both models run in a single container per PDF to avoid overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 20:52:12 +02:00
feat(ocr): per-script-type confidence thresholds
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
f064b27439
Kurrent OCR produces much lower confidence than typewriter/Latin.
Separate thresholds allow aggressive filtering for Kurrent (0.5)
while keeping typewriter lenient (0.3).

- OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3)
- OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5)
- apply_confidence_markers() now accepts threshold parameter
- get_threshold(script_type) selects the right threshold

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
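The threshold selection described above could be sketched as follows — env var names and defaults per the commit, function internals assumed (and assuming neither variable is set in the environment):

```python
import os

def get_threshold(script_type: str) -> float:
    """Pick the confidence threshold for a script type.

    Kurrent OCR produces much lower confidences, so it gets a stricter
    default (0.5) than the Surya typewriter/Latin paths (0.3).
    """
    if script_type == "KURRENT":  # assumed enum value
        return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", "0.5"))
    return float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))
```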
marcel added 1 commit 2026-04-12 21:45:26 +02:00
feat(transcription): add source/reviewed fields for training pipeline
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
3aaec01421
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
  OcrService source=OCR verification

The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:03:30 +02:00
feat(frontend): wire OCR trigger + review toggle into transcription panel
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
8dc9243add
- OcrTrigger component rendered in the transcription empty state when
  the document has a file and user has write permission
- Review checkmark toggle on each TranscriptionBlock (turquoise when
  reviewed, muted outline when not). Calls PUT .../review to toggle.
- TranscriptionBlockData type: added source + reviewed fields
- +page.svelte: triggerOcr() and reviewToggle() functions wired up
- Paraglide translations (de/en/es) for review toggle + reviewed count

All 687 frontend tests pass.

Refs #226, #230

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:09:59 +02:00
feat(frontend): show OcrProgress during OCR job + check status on load
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
f6667e0e15
- triggerOcr captures jobId from POST response and shows OcrProgress
- OcrProgress rendered in the transcription panel when ocrJobId is set
- handleOcrDone reloads blocks and annotations when OCR completes
- checkOcrStatus called when entering transcription mode — resumes
  progress display if a job is already running for this document

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:11:08 +02:00
fix(pdf): guard against null textLayerEl in renderPage
Some checks failed
CI / Unit & Component Tests (push) Failing after 0s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
7a4da7cb98
Prevents 'can't access property innerHTML, textDiv is null' when
the component unmounts while a render is in flight (e.g. switching
to OCR progress view tears down the panel content).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:17:26 +02:00
fix(ocr): use presigned URLs for MinIO access from OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
4500c99e40
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.

- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:21:11 +02:00
fix(ocr): increase memory limit to 10GB, reduce render DPI to 200
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
7f78bc9cf4
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR, uses ~44% less memory).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:27:22 +02:00
fix(ocr): reduce memory usage for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
902d423f3c
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200

Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:33:45 +02:00
fix(ocr): reduce mem_limit to 4g, allow 4g swap for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
e9cf2998fe
mem_limit 4g keeps more RAM free for the host. memswap_limit 8g
(= 4g swap) lets peaks spill to disk instead of OOM-killing.
Slower during peak inference but won't starve the dev machine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:36:15 +02:00
fix(ocr): increase to 8g mem_limit and larger batch sizes
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 0s
741979304c
5GB free on host while OCR runs — give the container more room.
Bump batch sizes (detector=2, recognition=4) so it processes
faster with the available memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:56:26 +02:00
refactor(ocr): make single-document OCR async, fix circular dependency
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
dd175d09e2
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.

Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies

Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.

816 backend tests pass, 687 frontend tests pass.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 22:59:01 +02:00
fix(ocr): set 10-minute read timeout on RestClientOcrClient
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
aa50951320
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:00:10 +02:00
fix(async): increase thread pool to 2 threads + queue of 10
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
b6d928e1c5
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:02:08 +02:00
fix(async): queue 100 tasks + CallerRunsPolicy instead of abort
0bfaa7540b
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:03:29 +02:00
fix(async): revert to AbortPolicy — CallerRunsPolicy blocks requests
9e01009e3d
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
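The configuration these commits converge on (2 threads, queue of 100, AbortPolicy) maps to the following JDK-level sketch; the project uses Spring's `ThreadPoolTaskExecutor`, so this is an equivalent illustration, not the actual `AsyncConfig`:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class AsyncPoolSketch {
    static ThreadPoolExecutor build() {
        return new ThreadPoolExecutor(
                2, 2,                                  // OCR + import can run in parallel
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(100),         // generous headroom for queued jobs
                new ThreadPoolExecutor.AbortPolicy()); // reject loudly rather than block the caller
    }
}
```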
marcel added 1 commit 2026-04-12 23:05:14 +02:00
fix(ocr): use camelCase field names in Pydantic models
838330b405
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:08:43 +02:00
fix(ocr): force HTTP/1.1 on RestClient to OCR service
2db1b73d5d
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
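In plain JDK terms, the fix pins the client version so no h2c upgrade header is ever sent to Uvicorn (a sketch of the setting, not the PR's `RestClient` wiring):

```java
import java.net.http.HttpClient;

class OcrHttpVersion {
    static HttpClient build() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1) // skip HTTP/2 upgrade negotiation
                .build();
    }
}
```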
marcel added 1 commit 2026-04-12 23:17:57 +02:00
fix(ocr): resume polling on page reload + track single-doc job status
c1befd3fa3
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).

Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:20:01 +02:00
perf(ocr): increase batch sizes (detector=4, recognition=8)
2cc7dcd5e3
5GB free on host during OCR, container at 3.8/8GB. Larger batches
use more memory but process faster on CPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:22:49 +02:00
fix(ocr): persist model cache across container restarts
1b7540143e
Surya downloads models from HuggingFace to /root/.cache on first use.
Without a volume, every container restart re-downloads ~73MB+.
Added ocr_cache volume to persist the cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:23:54 +02:00
perf(ocr): double batch sizes (detector=8, recognition=16)
0b0d4a7d5e
4GB headroom in the container. Doubling batches should use ~2GB
more RAM but significantly speed up inference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:32:03 +02:00
feat(ocr): show translated progress messages during OCR processing
971527a50e
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.

- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-12 23:33:19 +02:00
fix(ocr): disable manual annotation drawing while OCR is running
ef11e4af09
Prevents users from drawing annotations that would be cleared when
the OCR job finishes. transcribeMode is set to false for the PDF
viewer while ocrRunning is true.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 18 commits 2026-04-13 10:54:50 +02:00
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
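A hypothetical shape for that hierarchy (field names are assumptions, not the PR's exact records): a sealed interface whose record subtypes mirror the start/page/error/done NDJSON events, so a switch over events is compiler-checked for exhaustiveness.

```java
import java.util.List;

sealed interface OcrStreamEvent
        permits OcrStreamEvent.Start, OcrStreamEvent.Page,
                OcrStreamEvent.Error, OcrStreamEvent.Done {

    record Start(int totalPages) implements OcrStreamEvent {}
    record Page(int pageNumber, int totalPages, List<String> blocks) implements OcrStreamEvent {}
    record Error(int pageNumber, String message) implements OcrStreamEvent {}
    record Done(int pageCount, int skippedPages) implements OcrStreamEvent {}
}
```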
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
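A simplified sketch of that fallback (the string events here stand in for the real Start/Page/Done records): the interface's default `streamBlocks()` replays a one-shot `extractBlocks()` result as the same event sequence a native streaming implementation would emit.

```java
import java.util.List;
import java.util.function.Consumer;

interface OcrClientSketch {
    List<String> extractBlocks(String pdfUrl); // legacy non-streaming call

    default void streamBlocks(String pdfUrl, Consumer<String> onEvent) {
        List<String> blocks = extractBlocks(pdfUrl); // one blocking call...
        onEvent.accept("start");
        for (String block : blocks) {
            onEvent.accept("page:" + block);         // ...replayed as per-page events
        }
        onEvent.accept("done");
    }
}
```

Streaming-capable implementations simply override `streamBlocks()`; everyone else gets correct, if non-incremental, behavior for free.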
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
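The core of NDJSON consumption is just line-at-a-time handling: each line is one complete JSON object. The PR parses lines with a dedicated Jackson ObjectMapper; in this dependency-free sketch a regex extracts only the event type.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NdjsonReader {
    private static final Pattern TYPE = Pattern.compile("\"type\"\\s*:\\s*\"(\\w+)\"");

    static List<String> eventTypes(BufferedReader body) {
        List<String> types = new ArrayList<>();
        try {
            String line;
            while ((line = body.readLine()) != null) { // each line is one complete event
                if (line.isBlank()) continue;
                Matcher m = TYPE.matcher(line);
                if (m.find()) types.add(m.group(1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return types;
    }
}
```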
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
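An illustrative consumer for that streaming path (type and field names are assumptions, not the PR's code): each page's blocks are persisted the moment the event arrives, and a page error is recorded without failing the job.

```java
import java.util.ArrayList;
import java.util.List;

sealed interface PageEvent permits PageEvent.Page, PageEvent.Error, PageEvent.Done {
    record Page(int pageNumber, int totalPages, List<String> blocks) implements PageEvent {}
    record Error(int pageNumber, String message) implements PageEvent {}
    record Done(int pageCount, int skippedPages) implements PageEvent {}
}

class StreamingRunnerSketch {
    final List<String> persisted = new ArrayList<>(); // stand-in for createSingleBlock()
    final List<String> progress = new ArrayList<>();

    void onEvent(PageEvent event) {
        switch (event) { // exhaustive: the compiler knows all subtypes
            case PageEvent.Page p -> {
                persisted.addAll(p.blocks());         // persist immediately, not at the end
                progress.add("ANALYZING_PAGE:" + p.pageNumber() + ":" + p.totalPages()
                        + ":" + p.blocks().size());
            }
            case PageEvent.Error e ->
                progress.add("page " + e.pageNumber() + " failed, continuing");
            case PageEvent.Done d ->
                progress.add("DONE:" + d.pageCount() + ":" + d.skippedPages());
        }
    }
}
```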
Move translateOcrProgress from page.svelte to a testable module.
Return structured result with currentPage/totalPages/skippedPages
for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages
parsing. Add i18n keys for de/en/es.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Progress bar shows brand-mint fill on brand-sand background with
smooth transition. Displays page counter with tabular-nums and
skipped-pages warning in amber when applicable. Only renders when
totalPages > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace inline translateOcrProgress with the extracted module. Add
OcrProgressBar below the spinner during OCR. Parse page numbers from
ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill
bar to 100% briefly before clearing the overlay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PDF viewer uses 1-based currentPage, but the OCR engines produced
0-based pageNumber from enumerate(). Annotations created by OCR were
assigned to page 0, which doesn't exist in the viewer. Change
enumerate() to start=1 in both engines and the streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a collapsible OCR trigger below the block list in edit mode.
Uses a <details> element so it's unobtrusive — the primary workflow
is editing existing blocks, but users can expand to re-run OCR with
a confirmation dialog that warns about replacing existing blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ANALYZING message appeared while the Python service was still
downloading the PDF and loading models. Remove it so the LOADING
message ("Lade Modell und Dokument…") stays visible until the first
ANALYZING_PAGE event arrives from the stream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The thin bar without a border looked broken at low progress values.
The text counter (e.g. "1 / 6") already communicates progress clearly
so the bar is unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The progress message already says "Seite 3 von 7 wird analysiert…"
so the separate "3 / 7" counter was redundant. Remove the
OcrProgressBar from the page and inline only the skipped-pages
warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(ocr): delete unused OcrProgressBar component
410ef88e1a
The skipped-pages warning is inlined directly in +page.svelte.
The component and its tests are no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
marcel changed title from feat: local OCR pipeline + polygon annotations (#226, #227) to feat: OCR pipeline with NDJSON streaming and real-time progress (#226, #227, #231) 2026-04-13 10:55:59 +02:00
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Solid feature implementation with good test coverage across all three layers. The sealed interface for OcrStreamEvent with pattern matching is clean Java 21 usage, and the component decomposition on the frontend (AnnotationShape, OcrTrigger, OcrProgress, ScriptTypeSelect) follows the visual boundary rule well. I have a few concerns:

Blockers

None — the code is functional and well-tested.

Suggestions

  1. OcrController injects repositories directly (OcrController.java:95-96). The controller injects OcrJobRepository and OcrJobDocumentRepository directly and queries them in getJobStatus() and getDocumentOcrStatus(). This violates the layering rule: Controllers never inject or call repositories directly. These queries belong in OcrService or a dedicated read service.

  2. OcrAsyncRunner is doing too much (OcrAsyncRunner.java, 230 lines). This class handles single-document streaming, batch processing, block creation, existing block cleanup, and progress updates. That's at least 3 responsibilities. Consider splitting: OcrBlockCreator for the annotation+block creation logic, keeping OcrAsyncRunner as the async orchestrator only.

  3. resolveUserId swallows exceptions silently (OcrController.java:162-169). The catch-all Exception handler returns null — if user resolution fails for an unexpected reason, the OCR job runs without a user ID and there's no log entry. At minimum, log at warn level.

  4. clearExistingBlocks iterates and deletes one-by-one (OcrAsyncRunner.java:131-135). For documents with many blocks, this generates N delete queries. A bulk deleteAllByDocumentId on the repository would be cleaner and faster.

  5. OcrProgress.svelte retry button resets state but doesn't re-create EventSource (line 81). Setting status = 'running' doesn't reconnect the SSE stream — the $effect won't re-run because jobId hasn't changed. The retry is cosmetic only.

  6. Inconsistent import style in OcrTrigger.svelte — uses import * as m while OcrProgress.svelte uses import { m }. Should be consistent across the codebase.

  7. AnnotationShape.svelte uses inline styles extensively — the shapeStyle computed string is 10+ CSS properties concatenated. This is functional but harder to read than a class-based approach. Not a blocker given the dynamic nature of the values.

  8. NDJSON_MAPPER has FAIL_ON_UNKNOWN_PROPERTIES = true (RestClientOcrClient.java:1393). This makes the Java consumer brittle — if the Python service adds a new field to the NDJSON events, the Java side will break. For a streaming protocol between two services you control, false is safer and more forward-compatible.
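The bulk delete in suggestion 4 could be a single JPQL statement on the repository. A sketch assuming Spring Data JPA with hypothetical entity and field names; note that a derived delete method without `@Modifying`/`@Query` would still load and remove entities one at a time:

```java
public interface TranscriptionBlockRepository extends JpaRepository<TranscriptionBlock, UUID> {

    // One DELETE statement instead of N per-entity deletes
    @Modifying
    @Query("delete from TranscriptionBlock b where b.document.id = :documentId")
    void deleteAllByDocumentId(@Param("documentId") UUID documentId);
}
```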

Overall: clean implementation, good TDD evidence in the test suite, and the streaming protocol is well-designed. The controller-repository layering violation is the most important item to address.


🏗️ Markus Keller — Application Architect

Verdict: ⚠️ Approved with concerns

The architectural decisions here are sound. The Python OCR microservice is justified — Surya and Kraken have no Java equivalents, and the ADRs (001 and 002) clearly document the reasoning. NDJSON streaming is the right transport choice: simpler than WebSockets, works through reverse proxies, and the per-page persistence eliminates the timeout problem. Good call on keeping the batch path on the old extractBlocks() for now.

Blockers

None.

Suggestions

  1. Controller-to-repository coupling in OcrController. The controller directly queries OcrJobRepository and OcrJobDocumentRepository. This bypasses the service layer and creates a shortcut that will bite you when you need to add authorization checks or caching on job status reads. Move these queries behind OcrService.

  2. OcrService.startOcr() mutates the Document entity without saving (OcrService.java:1311-1313). doc.setScriptType(scriptTypeOverride) modifies the entity in the persistence context, but there's no explicit documentService.save(doc) call. This relies on JPA dirty checking within a transaction — but startOcr() is not annotated @Transactional. The mutation might or might not persist depending on the Hibernate flush mode.

  3. Thread pool sizing (AsyncConfig.java). The pool was changed from 1/1/1 to 2/2/100. A queue capacity of 100 with max 2 threads means up to 100 OCR jobs can be queued. Each OCR job can consume 5-8 GB of RAM on the Python side. Consider whether the queue should be smaller (e.g., 5-10) to prevent resource exhaustion, or add backpressure signaling.

  4. No updated_at trigger on ocr_jobs/ocr_job_documents. The migration creates these tables with updated_at TIMESTAMPTZ NOT NULL DEFAULT now() but doesn't add a trigger to auto-update it. The JPA @UpdateTimestamp handles it on the Java side, but any direct SQL updates (admin queries, debugging) won't update the timestamp.

  5. Presigned URL TTL (FileService.java). The 15-minute TTL is documented as "enough for OCR processing on CPU." But the issue description mentions a 7-page document taking 2+ hours. If a single page can take 10+ minutes on CPU, the presigned URL might expire before the Python service finishes downloading. Consider increasing to 30 minutes or making it configurable.

  6. SSE timeout is 5 minutes (OcrProgressService.java:1216). For long OCR jobs, the SSE connection will time out and the frontend will lose progress updates. The OcrProgress.svelte error handler fires but the retry doesn't reconnect (as Felix noted). Consider a longer timeout or a reconnect mechanism.

  7. ADR-001 mentions WireMock for integration tests but none are present. The ADR states "Integration tests require WireMock stub" but the test suite only has unit tests with mocks. This is fine for now, but worth noting as a gap.
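For suggestion 2, either making the method transactional or saving explicitly closes the gap. A minimal sketch with hypothetical names, assuming Spring and JPA:

```java
@Transactional // opens a persistence context, so dirty checking flushes the change
public UUID startOcr(UUID documentId, ScriptType scriptTypeOverride) {
    Document doc = documentService.getRequiredDocument(documentId);
    if (scriptTypeOverride != null) {
        doc.setScriptType(scriptTypeOverride);
        // alternatively, without @Transactional, persist explicitly:
        // documentService.save(doc);
    }
    // ... create the job, dispatch to OcrAsyncRunner ...
}
```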

The NDJSON protocol design with start/page/error/done events is clean and extensible. The polygon JSONB storage with a CHECK constraint is the right balance of flexibility and safety. Good architectural work overall.


🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

Impressive test coverage for a PR of this size. The backend adds 14 new test classes with thorough coverage of the OCR pipeline, streaming protocol, and edge cases. The frontend translateOcrProgress.spec.ts is well-structured. 38 Python tests with 88% coverage is solid. My concerns are about gaps and test quality.

Blockers

None.

Suggestions

  1. No integration tests for the OCR pipeline. All Java tests are unit tests with mocks. The critical path — OcrAsyncRunner calling streamBlocks → creating annotations → creating transcription blocks — is only tested with mocked dependencies. A @SpringBootTest integration test with Testcontainers that exercises the full flow (minus the actual Python service) would catch JPA mapping issues, Flyway migration problems, and transaction boundary bugs. This is especially important given the JSONB polygon column and the new @JdbcTypeCode(SqlTypes.JSON) annotation.

  2. OcrProgress.svelte has no tests. This component manages SSE connections, parses events, and drives the progress UI — all untested. At minimum, test the state transitions (running → done, running → error).

  3. OcrTrigger.svelte has no tests. The confirmation dialog flow (existing blocks → confirm → trigger) is a critical user path that should have component tests.

  4. AnnotationShape.svelte — no tests for polygon rendering. The clipPath derived computation with coordinate transforms is non-trivial math. A unit test for the clip-path calculation would prevent regression.

  5. FileServiceTest passes null for S3Presigner (FileServiceTest.java:508). This means the presigned URL generation path is completely untested at the unit level. If any test calls generatePresignedUrl, it will NPE.

  6. RestClientOcrClientStreamTest tests parsing but not HTTP fallback. The 404 → fallback-to-/ocr path in streamBlocks() is tested only in the static parseNdjsonStream method. The HTTP-level fallback logic is untested.

  7. Python tests for the streaming endpoint (/ocr/stream) are not visible in the diff. A test_stream.py may exist, but only test_confidence.py and test_engines.py appear here. The stream endpoint's error handling (per-page failure → continue) should have explicit test coverage.

  8. No test for reviewBlock toggle behavior in the frontend. The TranscriptionBlock.svelte review toggle is connected but never tested from the component level.

  9. Batch path (runBatch) uses old extractBlocks() but processDocument is only tested via OcrAsyncRunnerTest — the batch loop's skip/fail/success state transitions are tested in OcrBatchServiceTest but only at the service level, not the runner level.

Test quality is good where tests exist — descriptive names, single assertions per behavior, proper use of @ExtendWith(MockitoExtension.class) and AssertJ. The gaps are in integration coverage and frontend component testing.


🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

The OCR pipeline introduces a new external service boundary and presigned URL generation — both expand the attack surface. Overall the security posture is reasonable, but there are items to address.

Blockers

None — no critical vulnerabilities found.

Suggestions

  1. Presigned URL exposure via OCR service (FileService.java:generatePresignedUrl). The presigned URL is passed to the Python service over the internal Docker network, which is acceptable. However, the 15-minute TTL means the URL is valid for anyone who intercepts it. Verify that:

    • The OCR service does not log the full presigned URL (check Python logger calls)
    • The presigned URL is not returned in any API response to the frontend
    • The internal Docker network is not exposed to the host

    Risk: Low (internal network only), but worth a security smell note.

  2. No input validation on TriggerOcrDTO.scriptType (TriggerOcrDTO.java). The DTO accepts any ScriptType enum value including UNKNOWN. If UNKNOWN is passed, Surya processes it (which is fine), but there's no explicit @NotNull validation. A null scriptType is silently accepted and may cause unexpected behavior in OcrService.startOcr() where scriptTypeOverride != null is the guard.

  3. resolveUserId silently swallows authentication failures (OcrController.java:162-169). If the user resolution throws, the OCR job runs with userId = null. This means:

    • createdBy on OcrJob is null — audit trail is broken
    • createdBy on annotations is null — attribution is lost
    • No log entry for the failure

    CWE-778 (Insufficient Logging): At minimum, log the exception at warn level.

  4. Batch endpoint requires ADMIN but single-document requires WRITE_ALL (OcrController.java). This is intentional (batch is more destructive), but verify that WRITE_ALL users cannot abuse the single-document endpoint to re-run OCR repeatedly and cause resource exhaustion. Consider rate limiting or a check for concurrent OCR jobs on the same document.

  5. Python _download_and_convert_pdf follows redirects by default (main.py). The httpx.AsyncClient default follows redirects. If an attacker can inject a crafted presigned URL (unlikely given it's generated server-side), they could redirect the Python service to an arbitrary URL. This is a defense-in-depth note — add follow_redirects=False to harden the client.

  6. No SSRF protection on the pdfUrl parameter (OcrRequest.pdfUrl). The Python service downloads whatever URL it receives. In the current architecture, the URL is always a presigned MinIO URL generated by Spring Boot, so the input is trusted. But if the /ocr or /ocr/stream endpoints are ever exposed beyond the internal network, this becomes an SSRF vector. Add a comment documenting the trust assumption.

  7. NDJSON_MAPPER has FAIL_ON_UNKNOWN_PROPERTIES = true. From a security perspective, strict parsing is actually good — it prevents deserialization of unexpected fields. However, this also means a malicious or buggy Python service sending extra fields will crash the Java consumer. Trade-off is acceptable.

  8. OcrBlockResult record fields are not validated (OcrBlockResult.java). The x, y, width, height values from the Python service are trusted and stored directly. A malicious OCR response could inject coordinates outside [0,1], which the frontend would render as overflowing annotations. The DTO validation on CreateAnnotationDTO.polygon has range checks, but the AABB fields don't. Low risk (internal service), but worth noting.

No critical vulnerabilities. The main action items are logging improvements and documenting trust boundaries.
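Suggestion 8 could also be enforced on the producer side. A minimal Python sketch — the function name and dict layout are illustrative, not the service's actual code — that rejects blocks whose normalized AABB falls outside [0, 1] before they are emitted:

```python
def validate_block_coords(block: dict) -> bool:
    """Return True if the block's normalized AABB lies within [0, 1].

    Coordinates are assumed to be page fractions, so x + width and
    y + height must not exceed 1.0. Malformed or missing fields fail
    closed (the block is rejected rather than passed through).
    """
    try:
        x, y = float(block["x"]), float(block["y"])
        w, h = float(block["width"]), float(block["height"])
    except (KeyError, TypeError, ValueError):
        return False
    return (
        0.0 <= x <= 1.0
        and 0.0 <= y <= 1.0
        and w > 0.0
        and h > 0.0
        and x + w <= 1.0
        and y + h <= 1.0
    )
```

Validating on both sides (here and in the Java DTOs) gives defense in depth: neither service has to trust the other's coordinate ranges.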

Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Verdict: ⚠️ Approved with concerns

The frontend additions follow the project's brand guidelines well — correct use of brand-navy, brand-mint, brand-sand, proper font-serif/font-sans separation, and the card patterns match the existing design language. The component decomposition is clean: OcrTrigger, OcrProgress, ScriptTypeSelect, and AnnotationShape each represent one visual region.

Blockers

None.

Suggestions

  1. Progress bar has no text alternative for screen readers (OcrProgress.svelte:459-467). The role="progressbar" with aria-valuenow is correct, but there's no aria-label or visible text describing what the progress bar represents. Add aria-label={m.ocr_progress_heading()} to the progressbar div.

  2. ScriptTypeSelect.svelte — the <select> touch target is correct (min-h-[44px]), good. But the label text at text-xs (12px) is at the absolute minimum. For the senior user audience, consider text-sm (14px) for form labels.

  3. Error state border treatment (OcrProgress.svelte:474). The border-l-4 border-l-red-500 is a good visual signal, but red alone as the error indicator may not be sufficient for color-blind users. The heading text "OCR fehlgeschlagen" provides the textual cue, which is good — but consider adding an error icon (⚠ or ✕) for redundant signaling.

  4. "OCR erneut ausführen…" trigger (TranscriptionEditView.svelte:716-730). The <details> element is semantically correct for progressive disclosure. The text-xs summary text is very small — at 12px it's the minimum. Acceptable but tight for the target audience.

  5. Review toggle icon (TranscriptionBlock.svelte:243-268). The checkmark circle icon toggles between filled and outlined states — this is a clean interaction pattern. The text-turquoise color for the reviewed state aligns with the brand accent. The aria-label toggles between review/unreview states, which is correct for accessibility.

  6. Block number badge in AnnotationShape.svelte (line 344-367). The badge uses font-size: 11px which is below the 12px minimum. This is a regression from the original code, but worth flagging. At 20px circle diameter with 11px text, the numbers will be hard to read on mobile.

  7. OcrTrigger.svelte button uses min-h-[44px] — correct touch target. The disabled state uses opacity-50 which reduces contrast below WCAG AA for the white-on-navy text. Consider using a distinct disabled color instead of opacity.

  8. Polygon clip-path rendering (AnnotationShape.svelte:290-300). The CSS clip-path: polygon(...) approach is correct for displaying quadrilateral shapes. Note that clip-path clips the entire element including the box-shadow, so active/hover states will not show the inset shadow on polygon-clipped annotations. This may be intentional but is worth verifying visually.

Overall the UI additions are clean and follow the established patterns. The accessibility items are minor — the fundamentals (ARIA roles, touch targets, semantic HTML) are in place.

Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The Docker Compose additions for the OCR service are well-structured. The healthcheck with start_period: 60s accounts for model loading, the mem_limit: 8g is a sensible guard, and the internal-only networking is correct. Two ADRs documenting the architectural decisions — that's rare and appreciated.

Blockers

None.

Suggestions

  1. ocr-service has no host port mapping — good, it's internal only. But there's also no expose directive. While Docker Compose services on the same network can reach each other by service name without expose, adding expose: ["8000"] makes the intent explicit and helps with documentation.

  2. Setting both mem_limit and memswap_limit to 8g means no swap for the OCR container. On a 16-32 GB server running the full stack (PostgreSQL + MinIO + Spring Boot + SvelteKit + OCR), this is tight. If Surya loads at ~5 GB idle and then processes a large document, the OOM killer may hit. Consider either:

    • Increasing memswap_limit to 10g to allow some swap headroom
    • Adding a comment documenting the minimum server RAM requirement

  3. Backend now depends on ocr-service: condition: service_healthy (docker-compose.yml:1359-1360). This means the entire backend won't start if the OCR service is unhealthy. OCR is a non-essential feature — the app should work without it. Consider changing to condition: service_started or removing the dependency entirely, since the OcrHealthClient.isHealthy() check already handles the unavailable case gracefully.

  4. Dockerfile uses unpinned python:3.11-slim. For reproducible builds, pin to a specific digest or at least a patch version: python:3.11.12-slim. Also, torch==2.7.1 is pinned but installed from the CPU index in a separate RUN layer and then listed again in requirements.txt. If pip install -r requirements.txt re-resolves torch from PyPI (not the CPU index), it could pull in the CUDA variant (2+ GB larger). Verify that the pre-installed CPU wheel satisfies the requirements.txt constraint.

  5. ocr_models and ocr_cache named volumes are correct for persistence. But the Kraken model (german_kurrent.mlmodel) needs to be downloaded separately — there's a scripts/download-kraken-models.sh script in the diff but it's not integrated into the Docker build or documented in the compose file. First-time users will get a healthy OCR service that fails on Kurrent requests with a confusing "model not loaded" warning.

  6. No resource limits on the existing containers. The OCR service has mem_limit but PostgreSQL, MinIO, and the backend don't. On a shared VPS, the OCR service's 8 GB allocation could starve the other services. Consider adding mem_limit to the backend and PostgreSQL containers as well, or at minimum documenting the total RAM budget.

  7. APP_S3_INTERNAL_URL environment variable is added (docker-compose.yml:1369) but I don't see it referenced in the Spring Boot code. Is this used? If it's for the presigned URL base, the presigned URL is generated by S3Presigner which uses the endpoint from MinioConfig. Verify this isn't a dead env var.

  8. No CI workflow changes. The OCR service adds a Dockerfile and tests but the Gitea Actions workflow isn't updated to build/test the Python service. The cd ocr-service && python -m pytest step in the test plan is manual-only.

Good infrastructure work. The main operational concern is the hard dependency on OCR service health blocking the backend startup.
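The dependency, exposure, and swap-headroom suggestions above combine into a compose fragment along these lines — a sketch under assumed names and values (the `/health` endpoint and service names are inferred from the review, not copied from the repo's actual file):

```yaml
services:
  ocr-service:
    # internal-only: reachable by service name; no host port mapping
    expose:
      - "8000"
    mem_limit: 8g
    memswap_limit: 10g   # small swap headroom above mem_limit
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      start_period: 60s   # allow time for model loading

  backend:
    depends_on:
      ocr-service:
        condition: service_started   # don't block backend startup on OCR health
```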

marcel added 11 commits 2026-04-13 12:30:10 +02:00
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents the same document from being added to an OCR job twice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The retry button set status='running' but didn't re-trigger the $effect
because jobId hadn't changed. Added retryCount state so the effect
re-runs and creates a fresh EventSource on retry. Also added aria-label
to the progress bar for accessibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed OcrTrigger and ScriptTypeSelect from 'import * as m' to
'import { m }' to match the rest of the codebase. Increased
ScriptTypeSelect label to text-sm and annotation badge font to 12px
for better readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed ocr-service dependency from service_healthy to service_started
since the backend already handles OCR unavailability gracefully. Removed
unused APP_S3_INTERNAL_URL env var. Added expose directive and
.dockerignore for ocr-service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(ocr): add SSRF protection for PDF URL downloads
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
70689b8f7b
Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist
(default: minio,localhost,127.0.0.1) and disables redirect following
to prevent redirect-based SSRF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
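The allowlist check this commit describes can be sketched as follows. Function and variable names are illustrative (only the `ALLOWED_PDF_HOSTS` env var and its default are taken from the commit message); the merged implementation may differ in structure:

```python
import os
from urllib.parse import urlparse

# Comma-separated allowlist; default matches the commit's documented value
ALLOWED_PDF_HOSTS = {
    h.strip()
    for h in os.environ.get("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1").split(",")
    if h.strip()
}


def is_allowed_pdf_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_PDF_HOSTS


# Pair this with a client that refuses redirects, so a redirect from an
# allowed host cannot bounce the download to an arbitrary target, e.g.:
#   httpx.AsyncClient(follow_redirects=False)
```

Rejecting non-http(s) schemes also blocks `file://`-style tricks; checking `hostname` (rather than the raw netloc) ignores port and userinfo components.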
marcel merged commit 70689b8f7b into main 2026-04-13 12:39:04 +02:00
marcel deleted branch feat/issue-226-227-ocr-pipeline-polygon 2026-04-13 12:39:07 +02:00
Reference: marcel/familienarchiv#229