perf: page-by-page streaming OCR with real-time progress #231

Closed
opened 2026-04-12 23:57:51 +02:00 by marcel · 13 comments
Owner

Problem

The Python OCR service processes all pages of a PDF before returning the result as a single HTTP response. For multi-page documents this takes 10+ minutes on CPU, exceeding the Java backend's 10-minute read timeout. The backend marks the job as FAILED, but the Python service keeps grinding — wasting CPU and producing results that are never received.

Additionally, during processing the frontend only shows a generic "OCR-Analyse läuft" message with no page-level progress.

Root cause discovered on feat/issue-226-227-ocr-pipeline-polygon: A 7+ page typewriter document caused the OCR service to run for over 2 hours. The Java backend timed out after 10 minutes. New OCR requests failed silently because the single-threaded event loop was blocked (fixed separately with asyncio.to_thread in d8dcba1).

Depends on #226 / #227 (OCR pipeline).


Solution: NDJSON streaming

Replace the all-at-once HTTP response with NDJSON (newline-delimited JSON) streaming. The Python service sends one JSON line per completed page, the Java backend consumes the stream incrementally — creating transcription blocks as each page arrives and updating progress in real time.

Protocol

→ POST /ocr/stream  {pdfUrl, scriptType, language}
← {"type":"start","totalPages":7}
← {"type":"page","pageNumber":0,"blocks":[...]}
← {"type":"page","pageNumber":1,"blocks":[...]}
← ...
← {"type":"done"}

On per-page error:

← {"type":"error","pageNumber":3,"message":"..."}

Processing continues with the next page (partial results are kept).


Part 1: Python — new streaming endpoint

File: ocr-service/main.py

Add POST /ocr/stream returning a StreamingResponse with media_type="application/x-ndjson":

  1. Download and convert PDF (same _download_and_convert_pdf)
  2. Yield {"type":"start","totalPages":N}
  3. For each page: run engine via asyncio.to_thread() on that single page, apply confidence markers, yield {"type":"page","pageNumber":i,"blocks":[...]}
  4. Yield {"type":"done"}
  5. If a page fails, yield {"type":"error",...} and continue
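The five steps above can be sketched as the async generator that the StreamingResponse would wrap. This is a minimal sketch: `fake_extract_page_blocks` is a stand-in for the real per-page engine call, and the PDF download/convert step is omitted.

```python
import asyncio
import json

def fake_extract_page_blocks(image, page_idx, language):
    # placeholder for the real per-page engine call (extract_page_blocks)
    if image == "bad":
        raise ValueError("decode failed")
    return [{"pageNumber": page_idx, "text": "..."}]

async def ndjson_events(images, language):
    """Yield one NDJSON line per protocol event: start, page/error per page, done."""
    yield json.dumps({"type": "start", "totalPages": len(images)}) + "\n"
    for i, image in enumerate(images):
        try:
            # run the blocking engine call off the event loop
            blocks = await asyncio.to_thread(fake_extract_page_blocks, image, i, language)
            yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
        except Exception as exc:
            # per-page failure: report it and continue with the next page
            yield json.dumps({"type": "error", "pageNumber": i, "message": str(exc)}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

async def demo():
    lines = [line async for line in ndjson_events(["page0", "bad", "page2"], "de")]
    return [json.loads(line)["type"] for line in lines]

print(asyncio.run(demo()))  # → ['start', 'page', 'error', 'page', 'done']
```

In the real endpoint this generator is handed to StreamingResponse(..., media_type="application/x-ndjson"); the event sequence stays identical.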

Files: ocr-service/engines/surya.py, ocr-service/engines/kraken.py

Extract extract_page_blocks(image, page_idx, language) -> list[dict] from the inner loop of extract_blocks(). The existing extract_blocks() becomes a thin wrapper that calls extract_page_blocks() in a loop (preserving backward compatibility for the old /ocr endpoint).
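A minimal sketch of that split — the per-page body is stubbed here; in the real engines it is the existing recognition code moved out of the loop:

```python
def extract_page_blocks(image, page_idx, language):
    """Per-page OCR; the body is the old inner loop (stubbed for illustration)."""
    return [{"page": page_idx, "source": image}]

def extract_blocks(images, language):
    """Old entry point, now a thin wrapper, so the old /ocr endpoint is unchanged."""
    blocks = []
    for page_idx, image in enumerate(images):
        blocks.extend(extract_page_blocks(image, page_idx, language))
    return blocks

print(extract_blocks(["img0", "img1"], "de"))
# → [{'page': 0, 'source': 'img0'}, {'page': 1, 'source': 'img1'}]
```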

Keep the old POST /ocr endpoint as fallback.


Part 2: Java — streaming consumer

New interface

New file: service/OcrStreamHandler.java

public interface OcrStreamHandler {
    void onStart(int totalPages);
    void onPage(int pageNumber, List<OcrBlockResult> blocks);
    void onError(int pageNumber, String message);
    void onDone();
}

OcrClient extension

File: service/OcrClient.java

Add:

void streamBlocks(String pdfUrl, ScriptType scriptType, OcrStreamHandler handler);

Keep existing extractBlocks() with a default implementation that collects via streamBlocks (backward compat for batch path).

RestClientOcrClient

File: service/RestClientOcrClient.java

Implement streamBlocks() using java.net.http.HttpClient directly (already available via constructor). Read response InputStream with BufferedReader.readLine(), parse each JSON line with Jackson, dispatch to handler callbacks.

No per-request total timeout is needed, since data arrives incrementally rather than in one long silent request. Individual page processing can still take up to 5 minutes before the next line arrives.

OcrAsyncRunner — per-page block creation

File: service/OcrAsyncRunner.java

Replace the single extractBlocks() call in runSingleDocument() with streamBlocks(). In the onPage callback:

  • Create annotation + transcription block for each block in that page (same logic, just incremental)
  • Update OcrJobDocument.currentPage / totalPages (fields already exist in DB, currently unused)
  • Update OcrJob.progressMessage with the new page-level format

Part 3: Frontend — page-level progress

File: frontend/src/routes/documents/[id]/+page.svelte

Update translateOcrProgress() to handle a new progress code:

ANALYZING_PAGE:<current>:<total>:<blockCount>

Example: ANALYZING_PAGE:3:7:45

Map to the existing per-page message pattern, reusing current i18n keys where possible. Add one new key:

ocr_status_analyzing_page: "Seite {current} von {total} — {blocks} Blöcke erkannt"

Add a progress bar (brand-mint fill on brand-sand background) showing current / total pages. The existing spinner + text layout stays, the progress bar is added below the text.

The polling mechanism (pollOcrJob every 2s reading job.progressMessage) stays unchanged — it already picks up progress message updates.


Part 4: Error handling

Scenario → Behavior

  • Page N fails, others succeed → Log warning, skip page, continue. User gets partial results.
  • Connection drops mid-stream → Blocks from completed pages are already persisted. Job marked FAILED with message "Verbindung unterbrochen nach Seite X von Y". User can re-trigger OCR.
  • Python crash before first page → onStart never fires. Standard error path, job → FAILED.
  • /ocr/stream returns 404 (old Python image) → Fall back to extractBlocks() with the old /ocr endpoint.

Open questions

  • Should the batch path (runBatch) also use streaming, or keep the old extractBlocks() for now?
  • Should we add X-Accel-Buffering: no header on the Python response for future reverse-proxy compatibility?
marcel added the feature label 2026-04-12 23:58:03 +02:00
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

  • OcrStreamHandler callback interface: The onPage callback does DB writes (create annotation + transcription block) — that's side-effecting inside a callback invoked from RestClientOcrClient. This couples the HTTP parsing layer to the persistence layer. I'd prefer the handler to be a thin adapter that the OcrAsyncRunner implements or provides as a lambda — not a standalone interface that gets injected across layers. The runner owns the orchestration, so the callbacks should be defined there.

  • extract_page_blocks extraction in both engines: Both surya.py and kraken.py already process one page at a time in their for page_idx, image in enumerate(images) loop. Extracting the loop body into extract_page_blocks(image, page_idx, language) is clean. But Surya's load_models() is called inside extract_blocks — make sure the new extract_page_blocks either calls load_models() or has a guard. If the streaming endpoint calls extract_page_blocks directly for each page, the lazy load must happen on the first call.

  • Progress message format ANALYZING_PAGE:3:7:45: Three colons in one string is a parsing smell. The translateOcrProgress function splits on : and takes code.split(':'). With ANALYZING_PAGE:3:7:45, that's ['ANALYZING_PAGE', '3', '7', '45'] — it works, but any future message with more params becomes fragile. Consider whether this should stay as the colon protocol (KISS) or switch to JSON in progressMessage. Given the existing pattern uses colons (CREATING_BLOCKS:42, DONE:100), staying consistent is the right call — just document that ANALYZING_PAGE has 3 params.

  • createSingleBlock extraction: The issue mentions splitting createTranscriptionBlocks into a per-block method. Good — each block creation involves annotationService.createOcrAnnotation() + blockRepository.save(). The per-block method should take a sortOrder parameter so the caller (the onPage callback) can maintain a running counter across pages.

Suggestions

  • Test strategy for RestClientOcrClient.streamBlocks(): This is the most testable part of the Java side. Write a test that feeds a multi-line NDJSON string to the parsing logic and asserts the handler receives onStart(7), onPage(0, [...]), onPage(1, [...]), onDone() in order. Mock the HttpClient to return a canned InputStream. This test should exist before the implementation.

  • Test for partial failure: Feed NDJSON with a {"type":"error","pageNumber":2,...} between two page messages. Assert the handler receives onPage(0,...), onPage(1,...), onError(2,...), onPage(3,...), onDone(). This validates the "continue on page error" contract.

  • Frontend translateOcrProgress: The new ANALYZING_PAGE case should destructure cleanly:

    case 'ANALYZING_PAGE': {
        const parts = code.split(':');
        return m.ocr_status_analyzing_page({ current: parts[1], total: parts[2], blocks: parts[3] });
    }
    

    Keep it in the same switch block — don't extract a separate function for one case.

Author
Owner

🏗️ Markus Keller — Application Architect

Questions & Observations

  • NDJSON is the right transport choice. SSE would add complexity (event IDs, reconnection logic) for a request-scoped stream. Chunked HTTP with NDJSON is the simplest thing that works — one JSON object per line, BufferedReader.readLine() on the Java side. No new dependencies, no new failure modes. Good call.

  • Using java.net.http.HttpClient directly instead of Spring's RestClient: Spring's RestClient doesn't support incremental streaming well. The project already constructs an HttpClient in RestClientOcrClient's constructor. Using it directly for streaming is natural. But this means RestClientOcrClient now has two HTTP clients — the RestClient (for the old endpoint and health checks) and the raw HttpClient (for streaming). Document this clearly in the class. Consider whether isHealthy() should also use the raw client for consistency.

  • Domain boundary: OcrStreamHandler lives in the service package. Its callbacks receive List<OcrBlockResult> — a service-layer type. The handler is invoked from inside RestClientOcrClient (infrastructure layer). This is acceptable because the handler is defined as an interface in the service layer and the implementation lives in OcrAsyncRunner — the dependency points inward. Clean.

  • DB writes in callbacks on the async thread: Each onPage call does annotationService.createOcrAnnotation() + blockRepository.save(). These are individual JPA saves, each auto-committed. If the stream breaks after page 3 of 7, pages 0-3 are persisted. The issue explicitly says "partial results are kept" and the user re-triggers OCR (which clears existing blocks first). This is the right trade-off — rollback would require collecting all created IDs and deleting them, which adds complexity for an edge case.

  • extractBlocks() default implementation via streamBlocks(): The issue says keep extractBlocks() with a default that collects via streamBlocks(). This is correct for interface evolution — existing callers (batch path) don't need to change. But make the default method final or clearly document that implementations should override streamBlocks(), not extractBlocks().

Suggestions

  • For the open question on batch: Keep runBatch using the collecting extractBlocks() for now. Batch already has its own progress mechanism (per-document SSE events). Per-page streaming within batch would be a nested progress model — not worth the complexity until batch timeout is also a problem.

  • For the X-Accel-Buffering header: Yes, add it. Cost is one line. Risk of not adding it: the stream silently buffers behind a reverse proxy, and you spend hours debugging why progress doesn't update. response.headers["X-Accel-Buffering"] = "no" in the FastAPI StreamingResponse.

  • Socket-level idle timeout: Java's HttpClient doesn't have a per-read timeout out of the box. If the Python service hangs mid-page (e.g. Kraken enters an infinite loop on a corrupted image), Java blocks on readLine() forever. Consider wrapping the read loop in a Future with a 5-minute timeout, or setting SO_TIMEOUT on the underlying socket. This is the one timeout that still matters in the streaming model.
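The read-loop-in-a-Future pattern can be shown with a Python stand-in (the Java version would use an ExecutorService the same way; `read_lines_with_idle_timeout` is an illustrative name, not project code):

```python
import concurrent.futures
import io

def read_lines_with_idle_timeout(stream, on_line, idle_timeout=300.0):
    """Bound each blocking readline() with a per-read timeout so a hung
    producer cannot block the consumer forever. Sketch only: on timeout the
    stuck reader thread still needs cleanup in a real implementation."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            future = pool.submit(stream.readline)
            line = future.result(timeout=idle_timeout)  # raises TimeoutError when idle too long
            if not line:
                return  # EOF: stream closed cleanly
            on_line(line.rstrip("\n"))

received = []
read_lines_with_idle_timeout(io.StringIO('{"type":"start"}\n{"type":"done"}\n'), received.append)
print(received)  # → ['{"type":"start"}', '{"type":"done"}']
```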

Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

  • This change touches all three layers (Python, Java, Frontend) with a new streaming protocol between them. The protocol is the contract — if Python sends something Java doesn't expect, the whole thing breaks silently. Contract testing is essential here.

  • NDJSON parsing is the riskiest new code on the Java side. BufferedReader.readLine() + Jackson readTree() per line. Edge cases to consider:

    • Empty lines between JSON objects (the issue says if (line.isBlank()) continue — good)
    • Malformed JSON on one line — should it abort or skip?
    • Very long lines (page with 50+ blocks, each with polygon data) — BufferedReader handles this, but worth a test with a realistic payload size
    • Stream ends without a {"type":"done"} message — connection drop. How does the handler know?
  • Partial results persistence: If pages 0-3 are persisted and the stream breaks, the user sees partial transcription. When they re-trigger OCR, clearExistingBlocks(documentId) deletes everything and starts fresh. This is correct — but test it: trigger OCR, kill the stream mid-way, verify partial blocks exist, re-trigger, verify old blocks are cleared and new ones replace them.

  • The existing OcrAsyncRunner tests mock ocrClient.extractBlocks(). After this change, they need to mock ocrClient.streamBlocks() using doAnswer to invoke handler callbacks synchronously. This is a different mocking pattern — make sure the test is readable.

Suggestions

  • Test plan by layer:

    • Python unit test: Call POST /ocr/stream with a mock engine that returns known blocks for 2 pages. Assert NDJSON lines: start (totalPages=2), page (x2), done. Use httpx.AsyncClient with stream=True to read line-by-line.
    • Python error test: Mock engine to raise on page 1 of 3. Assert lines: start, page(0), error(1), page(2), done.
    • Java unit test: Feed canned NDJSON to RestClientOcrClient.streamBlocks() (mock HttpClient). Assert handler callbacks fire in order with correct payloads.
    • Java unit test: Feed NDJSON that ends abruptly (no done line). Assert the handler still receives pages that were sent, and the method throws or logs appropriately.
    • Java integration test: OcrAsyncRunner.runSingleDocument() with mocked streamBlocks. Verify OcrJobDocument.currentPage/totalPages are updated after each page callback. Verify progressMessage contains ANALYZING_PAGE:N:M:K.
    • Frontend (Vitest): translateOcrProgress('ANALYZING_PAGE:3:7:45') returns the expected German string. Test all existing codes still work (regression).
  • Edge case checklist:

    • 0-page PDF → {"type":"start","totalPages":0} then {"type":"done"} → what does Java do? Should return 204 or create zero blocks gracefully.
    • 1-page PDF → no streaming benefit, but the protocol should still work identically.
    • Python sends {"type":"page","pageNumber":0,"blocks":[]} (page with no text) → should Java handle empty blocks list without error.
    • totalPages in start message doesn't match actual page count sent → should Java trust the count for progress display but handle the actual pages received?
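The ordering contract from the test plan above can be prototyped with a Python stand-in for the Java handler — `dispatch_ndjson` and `RecordingHandler` are illustrative names, not project code:

```python
import json

def dispatch_ndjson(text, handler):
    """Feed canned NDJSON lines to handler callbacks in stream order."""
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between objects
        msg = json.loads(line)
        kind = msg.get("type")
        if kind == "start":
            handler.on_start(msg["totalPages"])
        elif kind == "page":
            handler.on_page(msg["pageNumber"], msg["blocks"])
        elif kind == "error":
            handler.on_error(msg["pageNumber"], msg["message"])
        elif kind == "done":
            handler.on_done()
        # unknown types fall through silently, mirroring the Java switch

class RecordingHandler:
    def __init__(self):
        self.events = []
    def on_start(self, total): self.events.append(("start", total))
    def on_page(self, n, blocks): self.events.append(("page", n))
    def on_error(self, n, message): self.events.append(("error", n))
    def on_done(self): self.events.append(("done",))

canned = (
    '{"type":"start","totalPages":3}\n'
    '\n'
    '{"type":"page","pageNumber":0,"blocks":[]}\n'
    '{"type":"error","pageNumber":1,"message":"boom"}\n'
    '{"type":"page","pageNumber":2,"blocks":[]}\n'
    '{"type":"done"}\n'
)
handler = RecordingHandler()
dispatch_ndjson(canned, handler)
print(handler.events)
# → [('start', 3), ('page', 0), ('error', 1), ('page', 2), ('done',)]
```

The same canned input covers the partial-failure ordering check and the blank-line tolerance in one test.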
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

  • New endpoint POST /ocr/stream — same auth posture as POST /ocr? The existing /ocr endpoint has no authentication (internal Docker network only). The new /ocr/stream should have the same posture, but double-check: if the OCR service port is ever exposed (e.g., for debugging), both endpoints are open. The prior review on #230 already flagged this — consider a shared secret header for all OCR service endpoints, not just /train.

  • NDJSON parsing with Jackson readTree() on untrusted input: The Java side reads each line from the Python response and parses with Jackson. While the Python service is trusted (internal), defense in depth applies:

    • Set DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES to avoid silently accepting extra fields
    • Validate the type field is one of the expected values before processing — the switch with case "start", "page", "error", "done" handles this implicitly (unknown types are ignored), which is fine
    • Ensure blocks array parsing doesn't accept arbitrary nested objects — the TypeReference<List<OcrBlockResult>> constrains this
  • Streaming response keeps the HTTP connection open for minutes: During this time, if the Python process crashes, the Java side holds an open socket. This is standard for long-lived connections, but be aware that connection pool exhaustion is possible if multiple concurrent OCR streams are open (thread pool size is 2, so max 2 concurrent streams — acceptable).

  • Error messages in the stream: {"type":"error","pageNumber":3,"message":"..."} — the message field contains the Python exception message. Make sure this doesn't leak internal paths, model file locations, or stack traces to the Java side's logs or (worse) to the frontend via progressMessage. The Java onError handler should log the message at WARN level but not propagate the raw Python error to the user-facing progressMessage.

Suggestions

  • Validate totalPages in onStart: A malicious or buggy response could send totalPages: -1 or totalPages: 999999. Sanity-check: if totalPages < 0 or totalPages > 500 (reasonable max for a family document), log a warning. Don't use it for memory allocation — it's just a progress display hint.

  • Log the stream protocol events at DEBUG level: Log onStart(totalPages=7), onPage(pageNumber=3, blocks=12), onDone() at DEBUG. This is invaluable for debugging production issues without exposing sensitive block content.

  • Don't propagate raw Python exception messages to the frontend: In the onError callback, use a generic message like "Seite X konnte nicht verarbeitet werden" for the user-facing progress, and log the actual Python error server-side only.
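The consumer-side checks above (known `type` values only, `totalPages` as a display hint with a sanity bound) can be sketched language-agnostically. This is a hypothetical illustration, not the Java/Jackson implementation — `parse_event`, `KNOWN_TYPES`, and `MAX_PAGES` are names invented here:

```python
import json

KNOWN_TYPES = {"start", "page", "error", "done"}
MAX_PAGES = 500  # sanity bound for a family document; progress-display hint only

def parse_event(line: str):
    """Parse one NDJSON line; return the event dict, or None to skip the line."""
    line = line.strip()
    if not line:
        return None                      # tolerate blank keep-alive lines
    event = json.loads(line)
    if event.get("type") not in KNOWN_TYPES:
        return None                      # unknown event types are ignored
    if event["type"] == "start":
        total = event.get("totalPages", 0)
        if not (0 <= total <= MAX_PAGES):
            print(f"WARN: implausible totalPages={total}")  # never used for allocation
    return event
```

The same shape maps onto the Java side: Jackson `readTree()` per line, a `switch` on the `type` field, and a logged warning for implausible page counts.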


🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

  • Progress bar specification: The issue says "brand-mint fill on brand-sand background" — good, consistent with the project palette. But I need more detail:

    • Height: h-1.5 (6px) for a subtle bar, or h-2 (8px) for more visibility? Given the senior audience (60+), I'd recommend h-2 minimum.
    • Border radius: rounded-full for a pill shape, consistent with other UI elements.
    • Animation: Should the fill animate smoothly (transition-all duration-500) or jump per page? Smooth is better UX — gives a sense of continuous progress even though updates are discrete (every 2s poll).
    • Position: Below the spinner and text, with mt-4 spacing. Full width of the progress area, max-w-xs to keep it compact.
  • The new message "Seite {current} von {total} — {blocks} Blöcke erkannt": This replaces the generic "OCR-Analyse läuft — dies kann einige Minuten dauern…" during page processing. Good — but the initial state before the first page completes still shows the generic message (the ANALYZING code fires before streaming starts). Make sure there's a smooth transition:

    1. PREPARING → "Dokument wird vorbereitet…"
    2. LOADING → "Lade Modell und Dokument…"
    3. ANALYZING → "OCR-Analyse läuft…" (brief, while PDF downloads in Python)
    4. ANALYZING_PAGE:1:7:12 → "Seite 1 von 7 — 12 Blöcke erkannt" + progress bar appears

    The progress bar should not appear during steps 1-3 (no total page count yet). It appears on the first ANALYZING_PAGE message.

  • Partial failure display: If page 3 of 7 fails, the progress bar jumps from page 2 to page 4. The user might think something glitched. Consider: should the bar still show 3/7 (error pages count toward progress) or skip? I'd count error pages toward progress — the service tried, it moved on. But a small warning icon or text noting "1 Seite übersprungen" would help set expectations that the transcription is incomplete.

Suggestions

  • Progress bar markup:

    {#if ocrTotalPages > 0}
        <div class="mt-4 h-2 w-full max-w-xs rounded-full bg-brand-sand">
            <div
                class="h-full rounded-full bg-brand-mint transition-all duration-500"
                style="width: {(ocrCurrentPage / ocrTotalPages) * 100}%"
            ></div>
        </div>
        <p class="mt-1 text-xs text-gray-400">{ocrCurrentPage} / {ocrTotalPages}</p>
    {/if}
    
  • Accessibility: The progress bar needs role="progressbar", aria-valuenow={ocrCurrentPage}, aria-valuemin="0", aria-valuemax={ocrTotalPages}, and an aria-label like "OCR-Fortschritt". Screen readers should announce progress changes.

  • The numeric fraction below the bar (3 / 7) should use tabular numerals (Tailwind utility: tabular-nums) so the digits don't shift in width as the numbers change.


⚙️ Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

  • Chunked transfer encoding through reverse proxies: This is my primary concern. If Caddy (production reverse proxy) or any intermediate proxy buffers the chunked response, the Java side won't see lines until the buffer fills or the response ends — defeating the entire purpose of streaming. The issue's open question about X-Accel-Buffering: no is relevant, but that's nginx-specific. For Caddy:

    • Caddy 2 passes chunked responses through by default — no config change needed. But verify this in the actual deployment.
    • If using Cloudflare as CDN in front, Cloudflare also buffers by default for non-SSE responses. NDJSON with application/x-ndjson content type may or may not trigger their streaming behavior. Test this.
    • In the Docker network (dev), there's no proxy — the Java backend talks directly to the OCR service. No issue there.
  • Health check interaction with streaming: The asyncio.to_thread() fix (d8dcba1) keeps the event loop free during OCR. With streaming, the event loop is also free between yield calls (each yield in the async generator gives back control). So /health responds during streaming — good. But confirm: does asyncio.to_thread() still work inside an async def generate() context? It should, since the generator is running on the event loop and to_thread offloads to a thread pool.

  • OCR service single-worker constraint: Uvicorn runs one worker. With streaming, a single OCR request holds one connection open for potentially 30+ minutes. If a second OCR request comes in while the first is streaming, it queues behind the first because asyncio.to_thread() runs in the default thread pool (max_workers defaults to min(32, os.cpu_count() + 4)). The thread pool is fine, but the OCR engines likely aren't thread-safe (global model state). Two concurrent extract_page_blocks calls could corrupt the model state. This was already true before streaming — just more visible now.

  • Container restart during streaming: If the OCR container restarts mid-stream (OOM, Docker restart policy), Java's BufferedReader.readLine() returns null or throws IOException. The error handling table covers this ("Connection drops mid-stream"). Make sure the Java side closes the InputStream in a finally block to avoid resource leaks.
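On the open question above — whether `asyncio.to_thread()` works inside an `async def` generator — a minimal self-contained sketch confirms it does (here `ocr_page` is a stand-in for the blocking engine call, not the real function):

```python
import asyncio
import json

def ocr_page(i: int) -> list[dict]:
    """Stand-in for the blocking per-page OCR call; runs in a worker thread."""
    return [{"text": f"block-{i}"}]

async def generate(total_pages: int):
    """Async generator: the event loop is free between yields and during to_thread."""
    yield json.dumps({"type": "start", "totalPages": total_pages}) + "\n"
    for i in range(total_pages):
        blocks = await asyncio.to_thread(ocr_page, i)   # offload blocking work
        yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

async def main() -> list[str]:
    return [line async for line in generate(2)]

lines = asyncio.run(main())
```

While `to_thread` runs, the event loop can serve `/health`; each `yield` also hands control back so the chunk is flushed to the client.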

Suggestions

  • Add X-Accel-Buffering: no and Cache-Control: no-cache to the streaming response headers. Cost: two lines. Prevents buffering by any proxy layer:

    return StreamingResponse(
        generate(),
        media_type="application/x-ndjson",
        headers={"X-Accel-Buffering": "no", "Cache-Control": "no-cache"},
    )
    
  • Log stream lifecycle at INFO level: On the Python side, log when streaming starts ("Starting OCR stream for {totalPages} pages"), each page completion ("Page {i+1}/{totalPages} complete, {len(blocks)} blocks"), and stream end. This makes it trivial to diagnose timeout/hang issues from docker compose logs ocr-service.

  • Monitor memory during streaming: Streaming doesn't reduce peak memory much because _download_and_convert_pdf still loads all page images at once. For truly large documents, consider converting pages lazily (one at a time using pypdfium2). But this is a separate optimization — not needed for this issue.

  • For the open question on batch: Agree with keeping batch on the old extractBlocks() for now. Batch already has per-document progress. Per-page-within-batch would be confusing in the UI and adds no real value until individual documents in batch also time out.


👨‍💻 Felix Brandt — Implementation Plan

After reading every comment, here's how I'd address each teammate's concerns in a TDD implementation. Every concern maps to at least one task; nothing is left unaddressed.


Design Decisions

  1. No standalone OcrStreamHandler interface. Instead, OcrClient gets streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer) where OcrStreamEvent is a sealed interface with Start, Page, Error, Done record subtypes. The runner provides the consumer lambda. This keeps HTTP parsing in RestClientOcrClient and orchestration in OcrAsyncRunner — clean separation without a callback interface crossing layers. (My own concern #1, Markus: domain boundary clean)

  2. Colon protocol stays (KISS). ANALYZING_PAGE:current:total:blocks — consistent with CREATING_BLOCKS:42 and DONE:100. Three params documented. (My concern #3)

  3. Batch stays on old extractBlocks() path. Both Markus and Tobias agree — batch has its own per-document SSE progress. Nested per-page streaming adds complexity for no real benefit right now. (Markus suggestion, Tobias #7)

  4. Fallback for old Python image. If /ocr/stream returns 404, fall back to extractBlocks() via the old /ocr endpoint. The OcrClient default method synthesizes Start/Page/Done events from the collected result. (Issue spec, Markus #4: backward compat)

  5. Skipped pages in DONE message. Encode as DONE:12:2 (12 blocks, 2 pages skipped). Frontend parses and shows warning. Avoids adding a new field to the poll response. (Leonie #3)

  6. Answers to the open questions:

    • Batch: keep old path (see #3).
    • X-Accel-Buffering: no + Cache-Control: no-cache: yes, add both. One line each. (Markus #6, Tobias #5)
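The extended colon protocol from decisions 2 and 5 parses mechanically. A sketch of the parsing shape (the real implementation lives in the frontend's translateOcrProgress(); parse_progress is a name invented here):

```python
def parse_progress(code: str) -> dict:
    """Parse colon-delimited progress codes:
    ANALYZING_PAGE:current:total:blocks and DONE:blocks[:skipped]."""
    key, *params = code.split(":")
    if key == "ANALYZING_PAGE":
        current, total, blocks = map(int, params)
        return {"key": key, "current": current, "total": total, "blocks": blocks}
    if key == "DONE":
        blocks = int(params[0])
        skipped = int(params[1]) if len(params) > 1 else 0  # old DONE:N still parses
        return {"key": key, "blocks": blocks, "skipped": skipped}
    return {"key": key}  # parameterless codes: PREPARING, LOADING, ANALYZING, ...
```

Making the skipped count an optional trailing parameter keeps the existing DONE:100 messages valid, so old and new backends interoperate with the same frontend parser.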

Task List

Layer 1: Python OCR Service

P1 — [python/test] Extract extract_page_blocks() from Surya engine

  • Extract inner loop (surya.py lines 47-87) into extract_page_blocks(image, page_idx, language).
  • Ensure load_models() guard runs on first call (lazy load preserved).
  • extract_blocks() becomes a thin wrapper.
  • Addresses: Felix #2 (lazy loading), Felix #4 concept (per-page extraction)

P2 — [python/test] Extract extract_page_blocks() from Kraken engine

  • Same extraction from kraken.py lines 47-91.
  • Kraken loads at startup, so no lazy-load concern.
  • Addresses: Felix #2

P3 — [python/test] Add POST /ocr/stream NDJSON streaming endpoint

  • StreamingResponse with media_type="application/x-ndjson".
  • Headers: X-Accel-Buffering: no, Cache-Control: no-cache. (Markus #6, Tobias #5)
  • Validate totalPages after PDF conversion: reject < 0 or > 500 with HTTP 422. (Nora #4)
  • Per-page errors: yield {"type":"error",...} with generic message (not raw Python traceback), log raw error server-side only. (Nora #3)
  • Log stream lifecycle at INFO: start, each page completion, done. (Tobias #6)
  • asyncio.to_thread() per page to keep event loop free.
  • Tests with FastAPI TestClient:
    • Contract: start → page(s) → done in order (Sara #1)
    • Edge: 0-page PDF, 1-page PDF, empty blocks list, per-page error continues processing (Sara #2)
    • Security: error messages don't contain tracebacks (Nora #3)
    • Headers present (Tobias #5)
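The per-page error path in P3 — emit a sanitized error event, keep the raw traceback server-side, continue with the next page — can be sketched as a plain generator (page_events and run_page are hypothetical names; the real endpoint wraps this in a StreamingResponse):

```python
import json
import logging

logger = logging.getLogger("ocr-stream")

def page_events(pages, run_page):
    """Yield one NDJSON line per page; a failing page emits a sanitized
    error event and processing continues with the next page."""
    for i, page in enumerate(pages):
        try:
            blocks = run_page(page)
        except Exception:
            logger.exception("page %d failed", i)   # raw traceback stays server-side
            yield json.dumps({"type": "error", "pageNumber": i,
                              "message": f"Page {i} could not be processed"}) + "\n"
            continue
        yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
```

The key property: model paths and exception details never enter the stream, and a mid-document failure still yields all remaining pages.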

Layer 2: Java Backend

J1 — [backend/test] Define OcrStreamEvent sealed interface with Jackson deserialization

  • Sealed interface: Start(int totalPages), Page(int pageNumber, int totalPages, List<OcrBlockResult> blocks), Error(int pageNumber, int totalPages, String message), Done(int totalPages, int totalBlocks).
  • ObjectMapper configured with FAIL_ON_UNKNOWN_PROPERTIES = true. (Nora #2)
  • Tests: deserialize each JSON variant, reject unknown fields, handle unknown type gracefully.
  • Addresses: Sara #1 (contract), Nora #2 (Jackson defense)

J2 — [backend] Add streamBlocks() to OcrClient interface

  • default void streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer) — default implementation calls extractBlocks() and synthesizes events.
  • Addresses: Markus #4 (backward compat), Felix #1 (Consumer pattern)

J3 — [backend/test] Implement streamBlocks() in RestClientOcrClient

  • Use raw HttpClient.send() with BodyHandlers.ofInputStream() for streaming.
  • Document dual-client approach in class Javadoc. (Markus #1)
  • BufferedReader.readLine() loop, parse each line into OcrStreamEvent.
  • Per-read timeout: wrap readLine() in CompletableFuture with 5-minute timeout. (Markus #7)
  • Close InputStream in finally block. (Tobias #4)
  • Log stream events at DEBUG. (Nora #5)
  • On 404 from /ocr/stream, fall back to extractBlocks().
  • Tests (mock HttpClient with canned InputStream):
    • Multi-page NDJSON → events in order (Sara #1)
    • Empty lines skipped (Sara #2)
    • Malformed JSON line logged and skipped (Sara #2, Nora #2)
    • Stream ends without done → no hang, pages received (Sara #2)
    • 404 → fallback to old endpoint (spec)

J4 — [backend/test] Refactor createTranscriptionBlocks into per-block method

  • Extract createSingleBlock(UUID documentId, OcrBlockResult block, UUID userId, String fileHash, int sortOrder).
  • Test: verify sortOrder parameter used correctly.
  • Addresses: Felix #4 (sortOrder for running counter across pages)

J5 — [backend/test] Wire streamBlocks() into runSingleDocument()

  • Replace ocrClient.extractBlocks() call with ocrClient.streamBlocks().
  • Consumer lambda:
    • Start → set jobDoc.totalPages, progress = ANALYZING
    • Page → increment jobDoc.currentPage, progress = ANALYZING_PAGE:{current}:{total}:{blocksSoFar}, call createSingleBlock per block (partial persistence — Markus #3)
    • Error → increment jobDoc.currentPage, log warning, track skipped count. Error pages count toward progress. (Leonie #3)
    • Done → progress = DONE:{totalBlocks}:{skippedCount}
  • Batch runBatch() stays on extractBlocks(). (Markus #5, Tobias #7)
  • Tests:
    • Mock streamBlocks with multi-page sequence → verify currentPage/totalPages updated, createSingleBlock called with correct sortOrder, progress messages correct (Sara #4)
    • Error page in middle → progress advances, final blocks correct (Leonie #3)
    • Fallback via default method → old behavior works (spec)
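The consumer lambda in J5 is a small state machine. A language-agnostic sketch (field names mirror the plan; the class and its shape are hypothetical — the real code is a Java Consumer<OcrStreamEvent>):

```python
class StreamConsumer:
    """Tracks per-page progress as stream events arrive (sketch of the J5 lambda)."""
    def __init__(self):
        self.current_page = 0
        self.total_pages = 0
        self.blocks_so_far = 0
        self.skipped = 0
        self.progress = "ANALYZING"

    def on_event(self, event: dict):
        t = event["type"]
        if t == "start":
            self.total_pages = event["totalPages"]
        elif t == "page":
            self.current_page += 1
            self.blocks_so_far += len(event["blocks"])
            self.progress = (f"ANALYZING_PAGE:{self.current_page}"
                             f":{self.total_pages}:{self.blocks_so_far}")
        elif t == "error":
            self.current_page += 1          # error pages count toward progress
            self.skipped += 1
        elif t == "done":
            self.progress = f"DONE:{self.blocks_so_far}:{self.skipped}"
```

Note that error events advance current_page (per Leonie #3) so the progress bar never appears to jump backwards, and the skipped count surfaces only in the final DONE message.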

J6 — [backend/test] Defense-in-depth validation

  • In consumer: if totalPages < 0 || totalPages > 500, log warning. (Nora #4 — Java side too)
  • In onError: use generic message "Seite X konnte nicht verarbeitet werden" for user-facing progressMessage, log raw Python error server-side only. (Nora #3)

Layer 3: Frontend

F1 — [frontend] Add new i18n keys

  • ocr_status_analyzing_page: DE "Seite {current} von {total} — {blocks} Blöcke erkannt" / EN "Page {current} of {total} — {blocks} blocks detected" / ES "Página {current} de {total} — {blocks} bloques detectados"
  • ocr_status_pages_skipped: DE "{count} Seite(n) übersprungen" / EN "{count} page(s) skipped" / ES "{count} página(s) omitida(s)"
  • Addresses: Leonie #3

F2 — [frontend/test] Update translateOcrProgress() for ANALYZING_PAGE

  • New case in the switch: const [key, current, total, blocks] = code.split(':'), then call m.ocr_status_analyzing_page({ current, total, blocks }).
  • Update DONE case to handle optional skipped param: DONE:12:2 → done message + skipped warning.
  • Tests (Vitest):
    • ANALYZING_PAGE:3:7:45 → localized string (Sara #4)
    • DONE:12:2 → done + skipped warning
    • All existing codes still work (regression)

F3 — [frontend/test] Add progress bar to OCR status display

  • Derived state from ocrProgressMessage:
    ocrCurrentPage = $derived(...)
    ocrTotalPages = $derived(...)
    
  • Progress bar: h-2, rounded-full, bg-brand-mint on bg-brand-sand, transition-all duration-500, max-w-xs, mx-auto mt-4. (Leonie #1)
  • Only visible when ocrTotalPages > 0 — no bar during PREPARING/LOADING/ANALYZING. (Leonie #2)
  • Accessibility: role="progressbar", aria-valuenow={ocrCurrentPage}, aria-valuemin={0}, aria-valuemax={ocrTotalPages}, aria-label="OCR-Fortschritt". (Leonie #4)
  • Numeric fraction {current} / {total} with tabular-nums. (Leonie #5)

Layer 4: Integration

I1 — [backend/test] Integration test: mock Python, real Java flow

  • @SpringBootTest test: mock streamBlocks() → emit multi-page sequence → verify job status transitions, OcrJobDocument fields, transcription blocks in DB, progress messages.
  • Addresses: Sara #4 (Java integration test)

I2 — [backend/test] Partial results resilience test

  • Stream breaks mid-page (consumer throws) → partial blocks persisted.
  • Re-trigger OCR → old blocks cleared, new blocks created.
  • Addresses: Sara #3

Concern Traceability

| Reviewer | Concern | Task(s) |
|---|---|---|
| Felix | #1 No standalone handler, Consumer pattern | J2, J3, J5 |
| | #2 Engine lazy loading guard | P1, P2 |
| | #3 ANALYZING_PAGE colon protocol | J5, F2 |
| | #4 sortOrder param for cross-page counter | J4, J5 |
| Markus | #1 Dual HttpClient documented | J3 |
| | #3 DB writes in callbacks (partial persistence) | J5 |
| | #4 extractBlocks backward compat | J2 |
| | #5 Batch on old path | J5 |
| | #6 X-Accel-Buffering header | P3 |
| | #7 Per-read idle timeout (5 min) | J3 |
| Sara | #1 Contract testing | P3, J1, J3 |
| | #2 NDJSON edge cases | P3, J3 |
| | #3 Partial results + re-trigger | J5, I2 |
| | #4 Test plan by layer | P3, J1, J3, J5, F2, I1 |
| Nora | #2 Jackson defense-in-depth | J1, J6 |
| | #3 Don't leak Python errors | P3, J6 |
| | #4 Validate totalPages bounds | P3, J6 |
| | #5 Log stream events at DEBUG | J3 |
| Leonie | #1 Progress bar styling | F3 |
| | #2 State transitions (no bar before ANALYZING_PAGE) | F3 |
| | #3 Partial failure display + skipped warning | J5, F2 |
| | #4 Accessibility (aria attributes) | F3 |
| | #5 tabular-nums | F3 |
| Tobias | #4 Close InputStream in finally | J3 |
| | #5 X-Accel-Buffering + Cache-Control | P3 |
| | #6 Log stream lifecycle at INFO | P3 |
| | #7 Batch on old path | J5 |
## 👨‍💻 Felix Brandt — Implementation Plan After reading every comment, here's how I'd address each teammate's concerns in a TDD implementation. Every task maps to at least one concern; nothing is left unaddressed. --- ### Design Decisions 1. **No standalone `OcrStreamHandler` interface.** Instead, `OcrClient` gets `streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer)` where `OcrStreamEvent` is a sealed interface with `Start`, `Page`, `Error`, `Done` record subtypes. The runner provides the consumer lambda. This keeps HTTP parsing in `RestClientOcrClient` and orchestration in `OcrAsyncRunner` — clean separation without a callback interface crossing layers. *(My own concern #1, Markus: domain boundary clean)* 2. **Colon protocol stays (KISS).** `ANALYZING_PAGE:current:total:blocks` — consistent with `CREATING_BLOCKS:42` and `DONE:100`. Three params documented. *(My concern #3)* 3. **Batch stays on old `extractBlocks()` path.** Both Markus and Tobias agree — batch has its own per-document SSE progress. Nested per-page streaming adds complexity for no real benefit right now. *(Markus suggestion, Tobias #7)* 4. **Fallback for old Python image.** If `/ocr/stream` returns 404, fall back to `extractBlocks()` via the old `/ocr` endpoint. The `OcrClient` default method synthesizes `Start`/`Page`/`Done` events from the collected result. *(Issue spec, Markus #4: backward compat)* 5. **Skipped pages in DONE message.** Encode as `DONE:12:2` (12 blocks, 2 pages skipped). Frontend parses and shows warning. Avoids adding a new field to the poll response. *(Leonie #3)* 6. **Answer to open questions:** - Batch: keep old path (see #3). - `X-Accel-Buffering: no` + `Cache-Control: no-cache`: yes, add both. One line each. 
*(Markus #6, Tobias #5)* --- ### Task List #### Layer 1: Python OCR Service **P1 — [python/test] Extract `extract_page_blocks()` from Surya engine** - Extract inner loop (surya.py lines 47-87) into `extract_page_blocks(image, page_idx, language)`. - Ensure `load_models()` guard runs on first call (lazy load preserved). - `extract_blocks()` becomes a thin wrapper. - *Addresses: Felix #2 (lazy loading), Felix #4 concept (per-page extraction)* **P2 — [python/test] Extract `extract_page_blocks()` from Kraken engine** - Same extraction from kraken.py lines 47-91. - Kraken loads at startup, so no lazy-load concern. - *Addresses: Felix #2* **P3 — [python/test] Add `POST /ocr/stream` NDJSON streaming endpoint** - `StreamingResponse` with `media_type="application/x-ndjson"`. - Headers: `X-Accel-Buffering: no`, `Cache-Control: no-cache`. *(Markus #6, Tobias #5)* - Validate `totalPages` after PDF conversion: reject < 0 or > 500 with HTTP 422. *(Nora #4)* - Per-page errors: yield `{"type":"error",...}` with **generic message** (not raw Python traceback), log raw error server-side only. *(Nora #3)* - Log stream lifecycle at INFO: start, each page completion, done. *(Tobias #6)* - `asyncio.to_thread()` per page to keep event loop free. - Tests with FastAPI TestClient: - Contract: start → page(s) → done in order *(Sara #1)* - Edge: 0-page PDF, 1-page PDF, empty blocks list, per-page error continues processing *(Sara #2)* - Security: error messages don't contain tracebacks *(Nora #3)* - Headers present *(Tobias #5)* #### Layer 2: Java Backend **J1 — [backend/test] Define `OcrStreamEvent` sealed interface with Jackson deserialization** - Sealed interface: `Start(int totalPages)`, `Page(int pageNumber, int totalPages, List<OcrBlockResult> blocks)`, `Error(int pageNumber, int totalPages, String message)`, `Done(int totalPages, int totalBlocks)`. - ObjectMapper configured with `FAIL_ON_UNKNOWN_PROPERTIES = true`. 
*(Nora #2)* - Tests: deserialize each JSON variant, reject unknown fields, handle unknown `type` gracefully. - *Addresses: Sara #1 (contract), Nora #2 (Jackson defense)* **J2 — [backend] Add `streamBlocks()` to `OcrClient` interface** - `default void streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer)` — default implementation calls `extractBlocks()` and synthesizes events. - *Addresses: Markus #4 (backward compat), Felix #1 (Consumer pattern)* **J3 — [backend/test] Implement `streamBlocks()` in `RestClientOcrClient`** - Use raw `HttpClient.send()` with `BodyHandlers.ofInputStream()` for streaming. - Document dual-client approach in class Javadoc. *(Markus #1)* - `BufferedReader.readLine()` loop, parse each line into `OcrStreamEvent`. - **Per-read timeout**: wrap `readLine()` in `CompletableFuture` with 5-minute timeout. *(Markus #7)* - Close `InputStream` in `finally` block. *(Tobias #4)* - Log stream events at DEBUG. *(Nora #5)* - On 404 from `/ocr/stream`, fall back to `extractBlocks()`. - Tests (mock `HttpClient` with canned `InputStream`): - Multi-page NDJSON → events in order *(Sara #1)* - Empty lines skipped *(Sara #2)* - Malformed JSON line logged and skipped *(Sara #2, Nora #2)* - Stream ends without `done` → no hang, pages received *(Sara #2)* - 404 → fallback to old endpoint *(spec)* **J4 — [backend/test] Refactor `createTranscriptionBlocks` into per-block method** - Extract `createSingleBlock(UUID documentId, OcrBlockResult block, UUID userId, String fileHash, int sortOrder)`. - Test: verify sortOrder parameter used correctly. - *Addresses: Felix #4 (sortOrder for running counter across pages)* **J5 — [backend/test] Wire `streamBlocks()` into `runSingleDocument()`** - Replace `ocrClient.extractBlocks()` call with `ocrClient.streamBlocks()`. 
- Consumer lambda: - `Start` → set `jobDoc.totalPages`, progress = `ANALYZING` - `Page` → increment `jobDoc.currentPage`, progress = `ANALYZING_PAGE:{current}:{total}:{blocksSoFar}`, call `createSingleBlock` per block *(partial persistence — Markus #3)* - `Error` → increment `jobDoc.currentPage`, log warning, track skipped count. Error pages count toward progress. *(Leonie #3)* - `Done` → progress = `DONE:{totalBlocks}:{skippedCount}` - Batch `runBatch()` stays on `extractBlocks()`. *(Markus #5, Tobias #7)* - Tests: - Mock `streamBlocks` with multi-page sequence → verify `currentPage`/`totalPages` updated, `createSingleBlock` called with correct sortOrder, progress messages correct *(Sara #4)* - Error page in middle → progress advances, final blocks correct *(Leonie #3)* - Fallback via default method → old behavior works *(spec)* **J6 — [backend/test] Defense-in-depth validation** - In consumer: if `totalPages < 0 || totalPages > 500`, log warning. *(Nora #4 — Java side too)* - In `onError`: use generic message `"Seite X konnte nicht verarbeitet werden"` for user-facing `progressMessage`, log raw Python error server-side only. *(Nora #3)* #### Layer 3: Frontend **F1 — [frontend] Add new i18n keys** - `ocr_status_analyzing_page`: DE "Seite {current} von {total} — {blocks} Blöcke erkannt" / EN "Page {current} of {total} — {blocks} blocks detected" / ES "Página {current} de {total} — {blocks} bloques detectados" - `ocr_status_pages_skipped`: DE "{count} Seite(n) übersprungen" / EN "{count} page(s) skipped" / ES "{count} página(s) omitida(s)" - *Addresses: Leonie #3* **F2 — [frontend/test] Update `translateOcrProgress()` for `ANALYZING_PAGE`** - New case in the switch: `const [key, current, total, blocks] = code.split(':')` → `m.ocr_status_analyzing_page({ current, total, blocks })`. - Update `DONE` case to handle optional skipped param: `DONE:12:2` → done message + skipped warning. 
- Tests (Vitest):
  - `ANALYZING_PAGE:3:7:45` → localized string *(Sara #4)*
  - `DONE:12:2` → done + skipped warning
  - All existing codes still work (regression)

**F3 — [frontend/test] Add progress bar to OCR status display**
- Derived state from `ocrProgressMessage`:

  ```
  ocrCurrentPage = $derived(...)
  ocrTotalPages = $derived(...)
  ```

- Progress bar: `h-2`, `rounded-full`, `bg-brand-mint` on `bg-brand-sand`, `transition-all duration-500`, `max-w-xs`, `mx-auto mt-4`. *(Leonie #1)*
- **Only visible when `ocrTotalPages > 0`** — no bar during PREPARING/LOADING/ANALYZING. *(Leonie #2)*
- Accessibility: `role="progressbar"`, `aria-valuenow`, `aria-valuemin={0}`, `aria-valuemax={100}`, `aria-label="OCR-Fortschritt"`. *(Leonie #4)*
- Numeric fraction `{current} / {total}` with `tabular-nums`. *(Leonie #5)*

#### Layer 4: Integration

**I1 — [backend/test] Integration test: mock Python, real Java flow**
- `@SpringBootTest` test: mock `streamBlocks()` → emit multi-page sequence → verify job status transitions, `OcrJobDocument` fields, transcription blocks in DB, progress messages.
- *Addresses: Sara #4 (Java integration test)*

**I2 — [backend/test] Partial results resilience test**
- Stream breaks mid-page (consumer throws) → partial blocks persisted.
- Re-trigger OCR → old blocks cleared, new blocks created.
- *Addresses: Sara #3*

---

### Concern Traceability

| Reviewer | Concern | Task(s) |
|---|---|---|
| **Felix** | #1 No standalone handler, Consumer pattern | J2, J3, J5 |
| | #2 Engine lazy loading guard | P1, P2 |
| | #3 ANALYZING_PAGE colon protocol | J5, F2 |
| | #4 sortOrder param for cross-page counter | J4, J5 |
| **Markus** | #1 Dual HttpClient documented | J3 |
| | #3 DB writes in callbacks (partial persistence) | J5 |
| | #4 extractBlocks backward compat | J2 |
| | #5 Batch on old path | J5 |
| | #6 X-Accel-Buffering header | P3 |
| | #7 Per-read idle timeout (5min) | J3 |
| **Sara** | #1 Contract testing | P3, J1, J3 |
| | #2 NDJSON edge cases | P3, J3 |
| | #3 Partial results + re-trigger | J5, I2 |
| | #4 Test plan by layer | P3, J1, J3, J5, F2, I1 |
| **Nora** | #2 Jackson defense-in-depth | J1, J6 |
| | #3 Don't leak Python errors | P3, J6 |
| | #4 Validate totalPages bounds | P3, J6 |
| | #5 Log stream events at DEBUG | J3 |
| **Leonie** | #1 Progress bar styling | F3 |
| | #2 State transitions (no bar before ANALYZING_PAGE) | F3 |
| | #3 Partial failure display + skipped warning | J5, F4 → F2 |
| | #4 Accessibility (aria attributes) | F3 |
| | #5 tabular-nums | F3 |
| **Tobias** | #4 Close InputStream in finally | J3 |
| | #5 X-Accel-Buffering + Cache-Control | P3 |
| | #6 Log stream lifecycle at INFO | P3 |
| | #7 Batch on old path | J5 |
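The J3 line-parsing contract (skip blank lines, log and skip malformed lines, tolerate unknown event types, keep partial results when the stream ends without `done`) will be implemented in Java; as a compact, language-neutral illustration, here is a hedged Python sketch. The `consume_ndjson` name, the `on_event` callback, and the sample lines are hypothetical, not part of the plan.

```python
import json
import logging

logger = logging.getLogger("ndjson")

KNOWN_TYPES = {"start", "page", "error", "done"}

def consume_ndjson(lines, on_event):
    """Dispatch one event per NDJSON line; tolerant of junk lines."""
    for raw in lines:
        line = raw.strip()
        if not line:  # skip empty keep-alive lines
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            logger.warning("skipping malformed NDJSON line: %.80s", line)
            continue
        if event.get("type") not in KNOWN_TYPES:
            logger.debug("ignoring unknown event type: %r", event.get("type"))
            continue
        on_event(event)
    # Reaching end-of-stream without a "done" event is not an error:
    # events dispatched so far are kept (partial results).

# A stream with an empty line, a junk line, and no trailing "done":
events = []
consume_ndjson(
    ['{"type":"start","totalPages":2}', '', 'not json',
     '{"type":"page","pageNumber":0,"blocks":[]}'],
    events.append,
)
# events now holds only the start and page events
```

The same tolerance rules map one-to-one onto the J3 test names below (empty lines, malformed lines, abrupt stream end).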
Author
Owner

## 👨‍💻 Felix Brandt — Senior Fullstack Developer (Round 2)

Reviewing my own plan against the original issue and the team's feedback.

### Questions & Observations

- **`OcrStreamEvent` sealed interface vs simple records**: My plan uses a sealed interface with `Start`, `Page`, `Error`, `Done` record subtypes. This is clean for pattern matching in Java 21 (`switch (event) { case Page p -> ... }`). However, deserialization needs a custom Jackson deserializer or `@JsonTypeInfo` — non-trivial code. The test in J1 must cover the deserializer itself, not just the records. Make sure the deserializer is a standalone testable class, not an anonymous inner class buried in `RestClientOcrClient`.
- **`translateOcrProgress` — extracting for testability**: Task F2 says to test `translateOcrProgress` via Vitest. Currently it's a function inside `+page.svelte`'s `<script>` block (line 136). To unit-test it, I need to either extract it into a separate `.ts` file or test it through the component. Extracting is cleaner — but the function then needs access to Paraglide's `m` object. The simplest approach: extract to `$lib/ocr/translateOcrProgress.ts`, import `m` from `$lib/paraglide/messages`, and test by mocking `m`. This is a minor refactor but worth calling out.
- **`DONE:12:2` backward compatibility**: My plan changes the `DONE` progress message from `DONE:<count>` to `DONE:<count>:<skipped>`. The frontend's current `translateOcrProgress` splits on `:` and takes `parts[1]` as count. If a new backend sends `DONE:12:2` to an old frontend, the old frontend shows "12 Blöcke erstellt" and ignores the `:2` — safe. If an old backend sends `DONE:12` to a new frontend, the new frontend parses `skipped` as `undefined` — the `if (skipped > 0)` check is false. **Both directions are backward-compatible.** Good.
- **`sortOrder` continuity across pages**: Task J4 extracts `createSingleBlock` with a `sortOrder` param. In J5, the consumer maintains a running counter across pages. I should make this explicit: `int[] blockCounter = {0}; ... onPage: for (block : blocks) createSingleBlock(..., blockCounter[0]++);`. The array wrapper is needed because lambdas can only capture effectively final variables. Alternative: use `AtomicInteger` — cleaner; whichever is chosen, document it.

### Suggestions

- **Test naming for J3**: The NDJSON parser tests are the most critical new code. Name them precisely:
  - `streamBlocks_parsesStartPageDoneSequence_dispatchesEventsInOrder`
  - `streamBlocks_skipsEmptyLines_betweenJsonObjects`
  - `streamBlocks_logsMalformedLine_andContinues`
  - `streamBlocks_handlesAbruptStreamEnd_withoutDoneEvent`
  - `streamBlocks_fallsBackToExtractBlocks_on404`
- **Missing task: update existing `OcrAsyncRunnerTest`**. The plan has J5 adding new tests for `streamBlocks`, but the existing tests that mock `extractBlocks()` need updating too — they should verify the old behavior still works via the default `streamBlocks()` method. Don't leave stale test mocks.
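Felix's backward-compatibility argument for the `DONE:<count>:<skipped>` extension can be checked in a few lines. The real parser is the TypeScript `translateOcrProgress`; this Python sketch only mirrors its split-on-colon logic as an illustration, with a hypothetical `parse_done` helper.

```python
def parse_done(code: str):
    """Parse DONE:<count>[:<skipped>]; a missing skipped field (old backend) means 0."""
    parts = code.split(":")
    count = int(parts[1])
    skipped = int(parts[2]) if len(parts) > 2 else 0
    return count, skipped

# New backend → new frontend: both fields present
assert parse_done("DONE:12:2") == (12, 2)
# Old backend → new frontend: skipped absent, treated as 0, so no warning is shown
assert parse_done("DONE:12") == (12, 0)
```

The old-frontend direction needs no code change at all: it reads `parts[1]` and never looks at `parts[2]`.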


## 🏗️ Markus Keller — Application Architect (Round 2)

Reviewing Felix's implementation plan for architectural soundness.

### Questions & Observations

- **Sealed interface + Consumer pattern is the right call.** The original issue proposed `OcrStreamHandler` as a standalone interface injected across layers. Felix's revision — `Consumer<OcrStreamEvent>` with a sealed interface for events — is better. The consumer is provided by the caller (`OcrAsyncRunner`), so the dependency points inward: `RestClientOcrClient` depends on the event types (service layer), not on the runner. The runner depends on nothing new. Clean inversion.
- **`default void streamBlocks(...)` on `OcrClient` — one concern.** The default method calls `extractBlocks()` and synthesizes events. This means `OcrClient` now has two methods with a non-obvious contract: implementations *should* override `streamBlocks()`, and `extractBlocks()` *could* be derived from it. But the default goes the other direction — `streamBlocks()` calls `extractBlocks()`. This creates an asymmetry:
  - `RestClientOcrClient` overrides `streamBlocks()` (talks to `/ocr/stream`), and its `extractBlocks()` is the old direct call to `/ocr`.
  - A hypothetical second implementation would need to understand which method to override.
  - **Suggestion**: Add a Javadoc comment on `streamBlocks()` saying "Implementations should override this method. The default exists only for backward compatibility during migration." This is documentation, not code — but it prevents the next developer from being confused.
- **`CompletableFuture` for per-read timeout (J3)**: This addresses my socket-level idle-timeout concern, but the implementation detail matters. `CompletableFuture.supplyAsync(() -> reader.readLine(), executor).get(5, TimeUnit.MINUTES)` creates a new task per line on an executor. If the read blocks, the thread stays blocked — `CompletableFuture` doesn't interrupt it. If `readLine()` blocks forever, the future times out but the underlying thread is still stuck. For a thread pool of size 2 (the OCR async pool), this could exhaust the pool. **Alternative**: enforce the timeout at the socket level, so the blocked read is actually interrupted and `readLine()` throws. Caveat: the JDK `HttpClient` exposes only `connectTimeout` and a per-request total timeout, not a per-read timeout — so in practice this means either an HTTP client that supports `readTimeout` or a watchdog that closes the `InputStream` when no data arrives for 5 minutes (which makes the blocked `readLine()` throw `IOException`). Either way: simpler and more correct than an uninterruptible `CompletableFuture`.
- **No migration needed.** `OcrJobDocument.currentPage` and `totalPages` columns already exist in the DB (they're just always 0). The plan populates them during streaming. No Flyway migration — good, one fewer moving part.
- **Batch path unchanged — confirmed correct.** Batch uses `extractBlocks()`, which still calls the old `/ocr` endpoint. The plan explicitly keeps this. Both Tobias and I flagged this; Felix addressed it. Satisfied.

### Suggestions

- **Consider making `extractBlocks()` a `default` method too**, derived from `streamBlocks()` (the inverse of what the plan proposes). Then there's only one "real" method to implement: `streamBlocks()`. The default `extractBlocks()` collects pages into a list. This is cleaner long-term — one method does the work, the other is convenience. But it changes the current contract, so only do this if the batch path is the only caller of `extractBlocks()`.
- **The socket-level read timeout for the idle timeout** (above) eliminates the `CompletableFuture` wrapper entirely. The streaming client would enforce a 5-minute idle limit while the existing `RestClient`-wrapped one keeps the current 10-minute timeout. Two clients, two timeout strategies — document it clearly.


## 🧪 Sara Holt — QA Engineer & Test Strategist (Round 2)

Reviewing Felix's implementation plan for test coverage completeness.

### Questions & Observations

- **Test plan is thorough — every layer is covered.** P3 (Python contract tests), J1 (event deserialization), J3 (NDJSON parsing), J5 (runner integration), F2 (frontend progress), I1 (end-to-end mock), I2 (partial resilience). My original test plan is fully represented. Good.
- **One gap: Python streaming test with `httpx` streaming.** Task P3 says "Tests with FastAPI TestClient." FastAPI's `TestClient` (which uses `httpx` internally) buffers the response by default. To test the streaming behavior (line-by-line NDJSON), you need `httpx.AsyncClient` with `client.stream(...)` and to iterate the response line by line. Otherwise you're testing the assembled response, not the stream. Verify that the test actually reads lines incrementally, not just checks the final concatenated response.
- **Mocking pattern for J5 (`streamBlocks` in `OcrAsyncRunner`).** The plan says "mock `streamBlocks` with multi-page sequence." With the `Consumer<OcrStreamEvent>` pattern, the mock setup is:

  ```java
  doAnswer(invocation -> {
      Consumer<OcrStreamEvent> consumer = invocation.getArgument(2);
      consumer.accept(new OcrStreamEvent.Start(3));
      consumer.accept(new OcrStreamEvent.Page(0, 3, List.of(block1)));
      consumer.accept(new OcrStreamEvent.Page(1, 3, List.of(block2)));
      consumer.accept(new OcrStreamEvent.Page(2, 3, List.of(block3)));
      consumer.accept(new OcrStreamEvent.Done(3, 3));
      return null;
  }).when(ocrClient).streamBlocks(any(), any(), any());
  ```

  This is synchronous — all callbacks fire immediately. In production, they fire incrementally over minutes. The test doesn't validate that DB writes survive between callbacks (that's I1's job), but it should at least verify the runner saves `jobDoc` after each `Page` event, not just at the end.
- **Edge case still missing from the plan: `totalPages` mismatch.** My round 1 comment asked: "What if `totalPages` in `start` doesn't match the actual page count sent?" The plan doesn't explicitly address this. The progress bar shows `current / total` using the `start` value. If Python sends `start(totalPages=7)` but only sends 5 page events before `done`, the bar stops at 5/7. Not catastrophic — but the user sees an incomplete bar before it disappears. **Suggestion**: When `Done` arrives, set `currentPage = totalPages` so the bar fills to 100% briefly before clearing. Or accept the mismatch as a display-only cosmetic issue.
- **Test for 0-page PDF graceful handling.** Task P3 lists this as an edge case, but the Java side (J5) doesn't explicitly test it. `Start(0)` → `Done(0, 0)` → the runner should mark the job as DONE with 0 blocks, not fail. Add a test case in J5 for this.

### Suggestions

- **Add a test case to J5**: `runSingleDocument_handlesZeroPagePdf_completesWithZeroBlocks()` — consumer emits `Start(0)`, `Done(0, 0)`. Verify job status = DONE, no blocks created, no errors.
- **Add a test case to J3**: `streamBlocks_handlesTotalPagesMismatch_completesNormally()` — send `Start(5)` but only 3 page events, then `Done`. Verify all 3 pages are dispatched, no error thrown.
- **The Python streaming test (P3) should use `async for line in response.aiter_lines()`** to validate true streaming behavior, not just `response.json()` or `response.text`.


## 🔒 Nora "NullX" Steiner — Application Security Engineer (Round 2)

Reviewing Felix's implementation plan for security coverage.

### Questions & Observations

- **All my round 1 concerns are mapped to tasks.** Jackson defense-in-depth (J1), error message leakage (P3, J6), totalPages validation (P3, J6), DEBUG logging (J3). The traceability table is clean. No gaps from my perspective.
- **`FAIL_ON_UNKNOWN_PROPERTIES = true` on the NDJSON ObjectMapper (J1)**: Good. But make sure this is a **dedicated ObjectMapper instance**, not the application's global one. Spring Boot auto-configures a global `ObjectMapper` bean. If you set `FAIL_ON_UNKNOWN_PROPERTIES = true` on that, it'll break deserialization elsewhere in the app where unknown properties are silently ignored (e.g., OpenAPI-generated DTOs). Create a local `ObjectMapper` in `RestClientOcrClient` or a small `NdjsonParser` utility class.
- **Generic error messages in `onError` (J6)**: The plan says to use `"Seite X konnte nicht verarbeitet werden"` ("Page X could not be processed") for user-facing progress. Good. But verify this on the Python side too (P3): the `message` field in the error event should already be sanitized. The plan says "generic message (not raw Python traceback)." Make sure the Python code catches the exception, logs the full traceback with `logger.exception()`, and yields only a short string like `f"OCR engine failed on page {page_idx}"` — not `str(e)`, which can include file paths, model paths, or internal state.
- **Auth posture for `/ocr/stream`**: My round 1 comment asked about this. The plan doesn't explicitly address it — it inherits the same "internal Docker network only" posture as `/ocr`. That's acceptable for now, but I want to note: this is the second endpoint on the OCR service without any authentication. If a shared-secret header is added later (as suggested on #230), it should cover both `/ocr` and `/ocr/stream` simultaneously.
- **`CompletableFuture` per-read timeout (J3)**: Markus's round 2 suggestion to use a socket-level read timeout instead is better from a security perspective too. A `CompletableFuture` that times out but doesn't interrupt the underlying thread means the thread is still blocked reading from a potentially hung socket. In a pathological case (a malicious OCR service sending data very slowly — slowloris-style), this could exhaust the thread pool. A socket-level timeout is the correct defense.

### Suggestions

- **Dedicated `ObjectMapper` for NDJSON parsing** — instantiate in the `RestClientOcrClient` constructor or a static factory. Configure: `FAIL_ON_UNKNOWN_PROPERTIES = true`, `FAIL_ON_NULL_FOR_PRIMITIVES = true` (reject `totalPages: null`), and register the `OcrStreamEvent` deserializer.
- **Python error message template**: In the streaming endpoint, use a fixed template for error events:

  ```python
  yield json.dumps({"type": "error", "pageNumber": i, "totalPages": total, "message": f"Page {i + 1} could not be processed"}) + "\n"
  ```

  Log the actual exception separately via `logger.exception(f"OCR failed on page {i}")`. Never include `str(e)` in the yielded JSON.
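Nora's template slots into the page loop of the streaming generator roughly as follows. This is a minimal sketch, not the service's actual code: `stream_pages`, `run_engine_on_page`, and the fake engine are hypothetical names, and the real endpoint would wrap the generator in a `StreamingResponse(media_type="application/x-ndjson")` and run each page via `asyncio.to_thread()` per the plan.

```python
import json
import logging

logger = logging.getLogger("ocr")

def stream_pages(pages, run_engine_on_page):
    """Yield one NDJSON line per page; sanitize per-page errors."""
    total = len(pages)
    yield json.dumps({"type": "start", "totalPages": total}) + "\n"
    for i, page in enumerate(pages):
        try:
            blocks = run_engine_on_page(page)
            yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
        except Exception:
            # Full traceback stays server-side; the client gets a fixed template,
            # never str(e) (which may leak file or model paths).
            logger.exception("OCR failed on page %d", i)
            yield json.dumps({"type": "error", "pageNumber": i,
                              "message": f"Page {i + 1} could not be processed"}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

# A failing middle page produces a sanitized error event and processing continues:
def fake_engine(page):
    if page == "bad":
        raise RuntimeError("/models/secret/path.onnx exploded")
    return []

lines = list(stream_pages(["ok", "bad", "ok"], fake_engine))
```

Note that the exception details (`/models/secret/path.onnx` here) never reach the yielded lines; partial results are preserved because the loop continues after the `error` event.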


## 🎨 Leonie Voss — UI/UX Design Lead (Round 2)

Reviewing Felix's implementation plan for UI/UX completeness.

### Questions & Observations

- **All my round 1 specs are captured in F3.** `h-2`, `rounded-full`, `brand-mint` on `brand-sand`, `transition-all duration-500`, `max-w-xs`, `mx-auto mt-4`, accessibility attributes, `tabular-nums`. The plan is precise and matches my requirements.
- **State transition coverage (F3, Leonie #2) — confirmed handled.** The plan says "only visible when `ocrTotalPages > 0`", which means no bar during PREPARING/LOADING/ANALYZING. The bar appears on the first `ANALYZING_PAGE` message. This is exactly right. But one detail: when the job transitions from `ANALYZING_PAGE:7:7:88` to `DONE:88:0`, the bar should briefly show 100% (7/7) before the entire OCR overlay disappears. Currently, when `pollOcrJob` detects status `DONE`, it sets `ocrRunning = false` and clears `ocrProgressMessage` — the bar vanishes instantly. Consider: should there be a 1-2 second delay showing the completed bar before it clears? This is a polish detail, not a blocker — but it avoids the jarring jump from "Seite 7 von 7" to the transcription view.
- **Skipped pages warning (F2/F4 → Leonie #3) — good approach.** Encoding it in `DONE:12:2` and showing "2 Seite(n) übersprungen" ("2 page(s) skipped") is clean. But where does this warning appear visually? Below the progress bar? After the bar disappears? I'd recommend: show it briefly (2-3 s) below the progress bar when the `DONE` message arrives, before the overlay clears. Use `text-amber-600` (warning color) with a small warning icon to distinguish it from normal progress text. If the user misses it, the partial transcription itself makes the gap visible.
- **`aria-valuenow` semantics**: The plan uses a percentage (0-100) for `aria-valuenow` and `aria-valuemax={100}`. This is correct per the ARIA spec — screen readers announce "43% complete." An alternative would be `aria-valuenow={ocrCurrentPage}` with `aria-valuemax={ocrTotalPages}` (announces "3 of 7"), which is more meaningful for this context. Either works; I'd prefer the page-based semantics since the user understands pages, not percentages. The percentage approach is fine too — just be consistent.
- **`font-[tabular-nums]` vs `tabular-nums`**: Tailwind CSS 4 supports `tabular-nums` as a direct utility class. No need for the bracket syntax. Use `tabular-nums` directly.

### Suggestions

- **Consider page-based ARIA semantics**: `aria-valuenow={ocrCurrentPage}`, `aria-valuemin={0}`, `aria-valuemax={ocrTotalPages}` — more meaningful than a percentage for this specific progress bar. The screen reader announces "3 of 7" instead of "43%."
- **Skipped pages warning visual**: Below the numeric fraction, `text-amber-600 text-xs`, with a small `⚠` prefix:

  ```svelte
  {#if ocrSkippedPages > 0}
    <p class="mt-1 text-xs text-amber-600">⚠ {m.ocr_status_pages_skipped({ count: String(ocrSkippedPages) })}</p>
  {/if}
  ```

- **Brief completion flash**: When `DONE` arrives, show the bar at 100% for ~1.5 s before clearing. This gives the user a sense of completion rather than an abrupt cut.


⚙️ Tobias Wendt — DevOps & Platform Engineer (Round 2)

Reviewing Felix's implementation plan for operational concerns.

Questions & Observations

  • All my round 1 concerns are addressed. X-Accel-Buffering: no + Cache-Control: no-cache (P3), InputStream closed in finally (J3), batch on old path (J5), stream lifecycle logging at INFO (P3). The traceability table is complete.

  • HttpClient.readTimeout vs CompletableFuture for idle timeout (J3): Markus raised this in round 2 and I agree: HttpClient.Builder.readTimeout(Duration.ofMinutes(5)) is simpler and works at the socket level. The CompletableFuture approach adds complexity (thread management, uncancellable blocked reads) for no benefit. If the Python service hangs mid-page, the socket read times out after five minutes, readLine() throws a SocketTimeoutException, and the catch block in the streaming loop handles it as a connection drop. Clean.
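The idle-timeout semantics can be illustrated with a small Python stand-in (the backend client is Java, but the socket behaviour is the same): a per-read timeout only fires when the peer goes silent, not while NDJSON pages keep arriving. The 0.1s value is a test stand-in for the proposed 5 minutes.

```python
import socket

# Minimal sketch of an idle (per-read) timeout. settimeout() applies to each
# blocking recv() individually, so a stream that keeps emitting NDJSON pages
# never trips it, while a hung producer does once the timeout elapses.
producer, consumer = socket.socketpair()
consumer.settimeout(0.1)  # stand-in for the proposed 5-minute read timeout

producer.sendall(b'{"type":"page","pageNumber":0}\n')
first = consumer.recv(4096)  # data is available: returns immediately

try:
    consumer.recv(4096)      # producer is silent: raises after 0.1s
    timed_out = False
except socket.timeout:       # alias of TimeoutError since Python 3.10
    timed_out = True

print(first.decode().strip(), timed_out)
# → {"type":"page","pageNumber":0} True
```

A total-request timeout would instead abort even a healthy stream once the overall deadline passes, which is exactly what we do not want for long multi-page jobs.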

  • Two HttpClient instances — operational implication: The streaming HttpClient would have readTimeout(5min) and no total request timeout. The existing one (wrapped in RestClient) has readTimeout(10min). In production, monitoring should track both: the streaming client's active connections (how many concurrent streams are open) and the old client's usage (batch and health checks). Currently there's no metrics exposure for the OCR client — this is pre-existing, not new to this issue.

  • Memory during streaming: My round 1 comment noted that _download_and_convert_pdf still loads all pages at once. Felix's plan doesn't address lazy page loading — and it shouldn't for this issue. But once streaming is in place, lazy page conversion (load page N, process page N, free page N, load page N+1) becomes a natural follow-up. Worth a separate issue.

  • Docker healthcheck during long streams: The Python OCR service healthcheck is GET /health. During streaming, asyncio.to_thread offloads the OCR work to a thread, keeping the event loop free, and the yield in the async generator also hands back control between pages, so /health should respond during streaming. But verify the initial load: the download itself in _download_and_convert_pdf is await-based (httpx.AsyncClient) and should not block, yet the conversion to PIL Images via pypdfium2.render() is synchronous and runs before asyncio.to_thread kicks in, so a large PDF (100MB+) could stall the event loop there. For the streaming endpoint, consider wrapping the PDF download+conversion in asyncio.to_thread too, so /health responds even during the initial load.

Suggestions

  • Use HttpClient.readTimeout(Duration.ofMinutes(5)) for the streaming client — drop the CompletableFuture wrapper. Simpler, actually enforces the timeout at the OS level.

  • Wrap the PDF download+conversion in a worker thread inside the streaming endpoint: The current /ocr endpoint already uses asyncio.to_thread for the engine call but runs the PDF download on the event loop (via httpx.AsyncClient). The actual blocking part is the pypdfium2 rendering. For the streaming endpoint, the whole download+conversion could run in the thread pool to guarantee /health always responds. One caveat: asyncio.to_thread only accepts synchronous callables, and _download_and_convert_pdf is async, so it would need a synchronous variant (or an asyncio.run wrapper inside the thread). A small change, but slightly more than one line.
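A runnable sketch of the pattern, with time.sleep standing in for the synchronous pypdfium2 rendering and a trivial coroutine standing in for GET /health (both helpers are hypothetical names, not the service's actual code):

```python
import asyncio
import time

def convert_pdf_blocking(n_pages: int) -> list[str]:
    """Stand-in for a synchronous download + pypdfium2 render (hypothetical)."""
    time.sleep(0.2)  # simulate CPU-bound page rasterisation
    return [f"page-{i}" for i in range(n_pages)]

async def health() -> str:
    """Stand-in for the GET /health handler."""
    return "ok"

async def main() -> tuple[str, int]:
    # Offload the blocking conversion to the default thread pool; the event
    # loop stays free and answers /health while the conversion thread runs.
    task = asyncio.create_task(asyncio.to_thread(convert_pdf_blocking, 3))
    status = await health()   # served immediately, not after 0.2s
    pages = await task
    return status, len(pages)

status, n_pages = asyncio.run(main())
print(status, n_pages)  # → ok 3
```

Without the to_thread call, the 0.2s sleep would run directly on the event loop and /health could not be answered until it finished.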

  • Future issue: lazy page conversion. Once streaming lands, file an issue for loading and converting pages one at a time instead of all at once. This would cut peak memory from O(pages) to O(1) for images. Not in scope for #231, but streaming makes it architecturally possible.
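The follow-up idea can be sketched as a generator that loads, yields, and releases one page at a time. Everything here is hypothetical instrumentation (the counters stand in for pypdfium2 rendering and memory tracking); it only demonstrates that at most one page is live at once:

```python
# Hypothetical sketch of lazy page conversion: O(1) pages in memory instead
# of O(pages). load_page/free_page are stand-ins that count live pages.
live_pages = 0
peak_pages = 0

def load_page(i: int) -> str:
    global live_pages, peak_pages
    live_pages += 1
    peak_pages = max(peak_pages, live_pages)
    return f"page-{i}"  # stand-in for a rendered PIL image

def free_page(_page: str) -> None:
    global live_pages
    live_pages -= 1

def iter_pages(page_count: int):
    for i in range(page_count):
        page = load_page(i)
        try:
            yield i, page       # downstream: OCR the page, stream its JSON line
        finally:
            free_page(page)     # released before page i+1 is loaded

processed = [i for i, _ in iter_pages(7)]
print(len(processed), peak_pages)  # → 7 1
```

This shape slots naturally into the streaming endpoint, since the NDJSON generator already consumes pages one at a time.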

Reference: marcel/familienarchiv#231