perf: page-by-page streaming OCR with real-time progress #231
Problem
The Python OCR service processes all pages of a PDF before returning the result as a single HTTP response. For multi-page documents this takes 10+ minutes on CPU, exceeding the Java backend's 10-minute read timeout. The backend marks the job as FAILED, but the Python service keeps grinding — wasting CPU and producing results that are never received.

Additionally, during processing the frontend only shows a generic "OCR-Analyse läuft" ("OCR analysis running") message with no page-level progress.
Root cause discovered on feat/issue-226-227-ocr-pipeline-polygon: a 7+ page typewriter document caused the OCR service to run for over 2 hours. The Java backend timed out after 10 minutes. New OCR requests failed silently because the single-threaded event loop was blocked (fixed separately with `asyncio.to_thread` in d8dcba1).

Depends on #226 / #227 (OCR pipeline).
Solution: NDJSON streaming
Replace the all-at-once HTTP response with NDJSON (newline-delimited JSON) streaming. The Python service sends one JSON line per completed page; the Java backend consumes the stream incrementally, creating transcription blocks as each page arrives and updating progress in real time.
Protocol

One JSON object per line, four event types:

- `{"type":"start","totalPages":N}` — sent once, after the PDF has been converted.
- `{"type":"page","pageNumber":i,"blocks":[...]}` — one per completed page.
- `{"type":"error","pageNumber":i,"message":"..."}` — a page failed.
- `{"type":"done"}` — stream complete.

On a per-page error, processing continues with the next page (partial results are kept).
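A producer-side sketch of this protocol, with a stub in place of the real engine call (`fake_extract_page_blocks` is hypothetical; the real endpoint would wrap such a generator in a FastAPI `StreamingResponse` with `media_type="application/x-ndjson"`):

```python
import asyncio
import json


def fake_extract_page_blocks(image, page_idx, language):
    """Stand-in for the per-page engine call; raises to simulate a bad page."""
    if image == "corrupt":
        raise ValueError("unreadable page")
    return [{"text": f"line on page {page_idx}", "confidence": 0.9}]


async def ocr_stream(images, language="de"):
    """Yield one NDJSON line per protocol event: start, page/error per page, done."""
    yield json.dumps({"type": "start", "totalPages": len(images)}) + "\n"
    for i, image in enumerate(images):
        try:
            # Off-load the blocking OCR call so the event loop stays responsive.
            blocks = await asyncio.to_thread(fake_extract_page_blocks, image, i, language)
            yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
        except Exception as exc:
            # A failed page is reported, then processing continues with the next page.
            yield json.dumps({"type": "error", "pageNumber": i, "message": str(exc)}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"
```

In the real endpoint the response would also carry the anti-buffering headers discussed below (`X-Accel-Buffering: no`, `Cache-Control: no-cache`).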
Part 1: Python — new streaming endpoint

File: `ocr-service/main.py`

Add `POST /ocr/stream` returning a `StreamingResponse` with `media_type="application/x-ndjson"`:

- Download and convert the PDF (`_download_and_convert_pdf`).
- Emit `{"type":"start","totalPages":N}`.
- Per page: run OCR via `asyncio.to_thread()` on that single page, apply confidence markers, yield `{"type":"page","pageNumber":i,"blocks":[...]}`.
- Finish with `{"type":"done"}`.
- On a per-page error: emit `{"type":"error",...}` and continue.

Files: `ocr-service/engines/surya.py`, `ocr-service/engines/kraken.py`

Extract `extract_page_blocks(image, page_idx, language) -> list[dict]` from the inner loop of `extract_blocks()`. The existing `extract_blocks()` becomes a thin wrapper that calls `extract_page_blocks()` in a loop (preserving backward compatibility for the old `/ocr` endpoint).

Keep the old `POST /ocr` endpoint as a fallback.

Part 2: Java — streaming consumer
New interface

New file: `service/OcrStreamHandler.java`

OcrClient extension

File: `service/OcrClient.java`

Add a streaming method, `streamBlocks()`. Keep the existing `extractBlocks()` with a default implementation that collects via `streamBlocks()` (backward compat for the batch path).

RestClientOcrClient

File: `service/RestClientOcrClient.java`

Implement `streamBlocks()` using `java.net.http.HttpClient` directly (already available via the constructor). Read the response `InputStream` with `BufferedReader.readLine()`, parse each JSON line with Jackson, and dispatch to the handler callbacks. No per-request total timeout is needed — data flows continuously. Individual page processing can take up to 5 minutes before the next line arrives.
OcrAsyncRunner — per-page block creation

File: `service/OcrAsyncRunner.java`

Replace the single `extractBlocks()` call in `runSingleDocument()` with `streamBlocks()`. In the `onPage` callback:

- Update `OcrJobDocument.currentPage` / `totalPages` (the fields already exist in the DB, currently unused).
- Update `OcrJob.progressMessage` with the new page-level format.

Part 3: Frontend — page-level progress
File: `frontend/src/routes/documents/[id]/+page.svelte`

Update `translateOcrProgress()` to handle a new progress code. Example: `ANALYZING_PAGE:3:7:45`. Map it to the existing per-page message pattern, reusing current i18n keys where possible, and add one new key.

Add a progress bar (brand-mint fill on brand-sand background) showing `current / total` pages. The existing spinner + text layout stays; the progress bar is added below the text.

The polling mechanism (`pollOcrJob` every 2s reading `job.progressMessage`) stays unchanged — it already picks up progress message updates.

Part 4: Error handling
- Connection drops mid-stream: job → FAILED with message "Verbindung unterbrochen nach Seite X von Y" ("connection lost after page X of Y"). The user can re-trigger OCR.
- Python fails before streaming starts: `onStart` never fires. Standard error path, job → FAILED.
- `/ocr/stream` returns 404 (old Python image): fall back to `extractBlocks()` with the old `/ocr` endpoint.

Open questions
- Should batch (`runBatch`) also use streaming, or keep the old `extractBlocks()` for now?
- Add the `X-Accel-Buffering: no` header on the Python response for future reverse-proxy compatibility?

👨💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
`OcrStreamHandler` callback interface: The `onPage` callback does DB writes (create annotation + transcription block) — that's side-effecting inside a callback invoked from `RestClientOcrClient`. This couples the HTTP parsing layer to the persistence layer. I'd prefer the handler to be a thin adapter that the `OcrAsyncRunner` implements or provides as a lambda — not a standalone interface that gets injected across layers. The runner owns the orchestration, so the callbacks should be defined there.

`extract_page_blocks` extraction in both engines: Both `surya.py` and `kraken.py` already process one page at a time in their `for page_idx, image in enumerate(images)` loop. Extracting the loop body into `extract_page_blocks(image, page_idx, language)` is clean. But Surya's `load_models()` is called inside `extract_blocks` — make sure the new `extract_page_blocks` either calls `load_models()` or has a guard. If the streaming endpoint calls `extract_page_blocks` directly for each page, the lazy load must happen on the first call.

Progress message format `ANALYZING_PAGE:3:7:45`: Three colons in one string is a parsing smell. The `translateOcrProgress` function splits on `:` via `code.split(':')`. With `ANALYZING_PAGE:3:7:45`, that's `['ANALYZING_PAGE', '3', '7', '45']` — it works, but any future message with more params becomes fragile. Consider whether this should stay as the colon protocol (KISS) or switch to JSON in `progressMessage`. Given that the existing pattern uses colons (`CREATING_BLOCKS:42`, `DONE:100`), staying consistent is the right call — just document that `ANALYZING_PAGE` has 3 params.

`createSingleBlock` extraction: The issue mentions splitting `createTranscriptionBlocks` into a per-block method. Good — each block creation involves `annotationService.createOcrAnnotation()` + `blockRepository.save()`. The per-block method should take a `sortOrder` parameter so the caller (the `onPage` callback) can maintain a running counter across pages.

Suggestions
Test strategy for `RestClientOcrClient.streamBlocks()`: This is the most testable part of the Java side. Write a test that feeds a multi-line NDJSON string to the parsing logic and asserts the handler receives `onStart(7)`, `onPage(0, [...])`, `onPage(1, [...])`, `onDone()` in order. Mock the `HttpClient` to return a canned `InputStream`. This test should exist before the implementation.

Test for partial failure: Feed NDJSON with a `{"type":"error","pageNumber":2,...}` between two page messages. Assert the handler receives `onPage(0,...)`, `onPage(1,...)`, `onError(2,...)`, `onPage(3,...)`, `onDone()`. This validates the "continue on page error" contract.

Frontend `translateOcrProgress`: The new `ANALYZING_PAGE` case should destructure cleanly. Keep it in the same `switch` block — don't extract a separate function for one case.

🏗️ Markus Keller — Application Architect
Questions & Observations
NDJSON is the right transport choice. SSE would add complexity (event IDs, reconnection logic) for a request-scoped stream. Chunked HTTP with NDJSON is the simplest thing that works — one JSON object per line, `BufferedReader.readLine()` on the Java side. No new dependencies, no new failure modes. Good call.

Using `java.net.http.HttpClient` directly instead of Spring's `RestClient`: Spring's `RestClient` doesn't support incremental streaming well. The project already constructs an `HttpClient` in `RestClientOcrClient`'s constructor. Using it directly for streaming is natural. But this means `RestClientOcrClient` now has two HTTP clients — the `RestClient` (for the old endpoint and health checks) and the raw `HttpClient` (for streaming). Document this clearly in the class. Consider whether `isHealthy()` should also use the raw client for consistency.

Domain boundary: `OcrStreamHandler` lives in the service package. Its callbacks receive `List<OcrBlockResult>` — a service-layer type. The handler is invoked from inside `RestClientOcrClient` (infrastructure layer). This is acceptable because the handler is defined as an interface in the service layer and the implementation lives in `OcrAsyncRunner` — the dependency points inward. Clean.

DB writes in callbacks on the async thread: Each `onPage` call does `annotationService.createOcrAnnotation()` + `blockRepository.save()`. These are individual JPA saves, each auto-committed. If the stream breaks after page 3 of 7, pages 0-3 are persisted. The issue explicitly says "partial results are kept" and the user re-triggers OCR (which clears existing blocks first). This is the right trade-off — rollback would require collecting all created IDs and deleting them, which adds complexity for an edge case.

`extractBlocks()` default implementation via `streamBlocks()`: The issue says to keep `extractBlocks()` with a default that collects via `streamBlocks()`. This is correct for interface evolution — existing callers (batch path) don't need to change. Since Java doesn't allow `final` default methods, clearly document that implementations should override `streamBlocks()`, not `extractBlocks()`.

Suggestions
For the open question on batch: Keep `runBatch` using the collecting `extractBlocks()` for now. Batch already has its own progress mechanism (per-document SSE events). Per-page streaming within batch would be a nested progress model — not worth the complexity until batch timeout is also a problem.

For the `X-Accel-Buffering` header: Yes, add it. The cost is one line. The risk of not adding it: the stream silently buffers behind a reverse proxy, and you spend hours debugging why progress doesn't update. Set `response.headers["X-Accel-Buffering"] = "no"` on the FastAPI `StreamingResponse`.

Socket-level idle timeout: Java's `HttpClient` doesn't have a per-read timeout out of the box. If the Python service hangs mid-page (e.g. Kraken enters an infinite loop on a corrupted image), Java blocks on `readLine()` forever. Consider wrapping the read loop in a `Future` with a 5-minute timeout, or enforcing a timeout on the underlying socket. This is the one timeout that still matters in the streaming model.

🧪 Sara Holt — QA Engineer & Test Strategist
Questions & Observations
This change touches all three layers (Python, Java, Frontend) with a new streaming protocol between them. The protocol is the contract — if Python sends something Java doesn't expect, the whole thing breaks silently. Contract testing is essential here.
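One way to pin that contract down: a canned stream plus ordering assertions that both producer and consumer tests can share. A sketch (event shapes follow this issue's protocol; the helper name is illustrative):

```python
import json

# A 4-page document where page 2 fails, exactly as the protocol specifies.
CANNED_STREAM = "\n".join([
    '{"type":"start","totalPages":4}',
    '{"type":"page","pageNumber":0,"blocks":[{"text":"a"}]}',
    '{"type":"page","pageNumber":1,"blocks":[{"text":"b"}]}',
    '{"type":"error","pageNumber":2,"message":"engine failure"}',
    '{"type":"page","pageNumber":3,"blocks":[{"text":"c"}]}',
    '{"type":"done"}',
])


def check_stream_contract(ndjson: str) -> list:
    """Assert the ordering invariants both sides must agree on."""
    events = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
    assert events[0]["type"] == "start", "stream must open with start"
    assert events[-1]["type"] == "done", "a healthy stream must close with done"
    pages = [e["pageNumber"] for e in events[1:-1]]
    # An error event must not terminate the stream: pages may follow it, in order.
    assert pages == sorted(pages), "pages and errors arrive in page order"
    return events
```

The Python endpoint test and the Java `streamBlocks()` test can both be run against fixtures shaped like `CANNED_STREAM`, which is what makes it a contract test rather than two unrelated unit tests.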
NDJSON parsing is the riskiest new code on the Java side. `BufferedReader.readLine()` + Jackson `readTree()` per line. Edge cases to consider:

- Blank lines (`if (line.isBlank()) continue` — good)
- Very long lines — `BufferedReader` handles this, but worth a test with a realistic payload size
- Stream ends without a `{"type":"done"}` message — connection drop. How does the handler know?

Partial results persistence: If pages 0-3 are persisted and the stream breaks, the user sees a partial transcription. When they re-trigger OCR, `clearExistingBlocks(documentId)` deletes everything and starts fresh. This is correct — but test it: trigger OCR, kill the stream mid-way, verify partial blocks exist, re-trigger, verify old blocks are cleared and new ones replace them.

The existing `OcrAsyncRunner` tests mock `ocrClient.extractBlocks()`. After this change, they need to mock `ocrClient.streamBlocks()` using `doAnswer` to invoke handler callbacks synchronously. This is a different mocking pattern — make sure the test is readable.

Suggestions
Test plan by layer:

- Python: hit `POST /ocr/stream` with a mock engine that returns known blocks for 2 pages. Assert NDJSON lines: `start` (totalPages=2), `page` (x2), `done`. Use `httpx.AsyncClient` with `stream=True` to read line-by-line.
- Python, partial failure: `start`, `page(0)`, `error(1)`, `page(2)`, `done`.
- Java: unit-test `RestClientOcrClient.streamBlocks()` (mock `HttpClient`). Assert handler callbacks fire in order with correct payloads.
- Java, truncated stream (no `done` line): assert the handler still receives the pages that were sent, and the method throws or logs appropriately.
- Java: `OcrAsyncRunner.runSingleDocument()` with mocked `streamBlocks`. Verify `OcrJobDocument.currentPage`/`totalPages` are updated after each page callback. Verify `progressMessage` contains `ANALYZING_PAGE:N:M:K`.
- Frontend: `translateOcrProgress('ANALYZING_PAGE:3:7:45')` returns the expected German string. Test that all existing codes still work (regression).

Edge case checklist:

- `{"type":"start","totalPages":0}` then `{"type":"done"}` → what does Java do? It should return 204 or create zero blocks gracefully.
- `{"type":"page","pageNumber":0,"blocks":[]}` (page with no text) → Java should handle an empty blocks list without error.
- `totalPages` in the `start` message doesn't match the actual page count sent → should Java trust the count for progress display but handle the actual pages received?

🔒 Nora "NullX" Steiner — Application Security Engineer
Questions & Observations
New endpoint `POST /ocr/stream` — same auth posture as `POST /ocr`? The existing `/ocr` endpoint has no authentication (internal Docker network only). The new `/ocr/stream` should have the same posture, but double-check: if the OCR service port is ever exposed (e.g., for debugging), both endpoints are open. The prior review on #230 already flagged this — consider a shared secret header for all OCR service endpoints, not just `/train`.

NDJSON parsing with Jackson `readTree()` on untrusted input: The Java side reads each line from the Python response and parses it with Jackson. While the Python service is trusted (internal), defense in depth applies:

- Enable `DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES` to avoid silently accepting extra fields.
- Validate that the `type` field is one of the expected values before processing — the `switch` with `case "start", "page", "error", "done"` handles this implicitly (unknown types are ignored), which is fine.
- Make sure the `blocks` array parsing doesn't accept arbitrary nested objects — the `TypeReference<List<OcrBlockResult>>` constrains this.

The streaming response keeps the HTTP connection open for minutes: During this time, if the Python process crashes, the Java side holds an open socket. This is standard for long-lived connections, but be aware that connection-pool exhaustion is possible if multiple concurrent OCR streams are open (the thread pool size is 2, so max 2 concurrent streams — acceptable).

Error messages in the stream: `{"type":"error","pageNumber":3,"message":"..."}` — the `message` field contains the Python exception message. Make sure this doesn't leak internal paths, model file locations, or stack traces to the Java side's logs or (worse) to the frontend via `progressMessage`. The Java `onError` handler should log the message at WARN level but not propagate the raw Python error to the user-facing `progressMessage`.

Suggestions
Validate `totalPages` in `onStart`: A malicious or buggy response could send `totalPages: -1` or `totalPages: 999999`. Sanity-check it: if `totalPages < 0` or `totalPages > 500` (a reasonable max for a family document), log a warning. Don't use it for memory allocation — it's just a progress display hint.

Log the stream protocol events at DEBUG level: Log `onStart(totalPages=7)`, `onPage(pageNumber=3, blocks=12)`, `onDone()` at DEBUG. This is invaluable for debugging production issues without exposing sensitive block content.

Don't propagate raw Python exception messages to the frontend: In the `onError` callback, use a generic message like "Seite X konnte nicht verarbeitet werden" ("page X could not be processed") for the user-facing progress, and log the actual Python error server-side only.

🎨 Leonie Voss — UI/UX Design Lead
Questions & Observations
Progress bar specification: The issue says "brand-mint fill on brand-sand background" — good, consistent with the project palette. But I need more detail:

- Height: `h-1.5` (6px) for a subtle bar, or `h-2` (8px) for more visibility? Given the senior audience (60+), I'd recommend `h-2` minimum.
- Shape: `rounded-full` for a pill shape, consistent with other UI elements.
- Animation: smooth fill (`transition-all duration-500`) or jump per page? Smooth is better UX — it gives a sense of continuous progress even though updates are discrete (every 2s poll).
- Placement: below the text with `mt-4` spacing. Full width of the progress area, `max-w-xs` to keep it compact.

The new message "Seite {current} von {total} — {blocks} Blöcke erkannt" ("page {current} of {total} — {blocks} blocks detected"): This replaces the generic "OCR-Analyse läuft — dies kann einige Minuten dauern…" during page processing. Good — but the initial state before the first page completes still shows the generic message (the `ANALYZING` code fires before streaming starts). Make sure there's a smooth transition:

1. `PREPARING` → "Dokument wird vorbereitet…"
2. `LOADING` → "Lade Modell und Dokument…"
3. `ANALYZING` → "OCR-Analyse läuft…" (brief, while the PDF downloads in Python)
4. `ANALYZING_PAGE:1:7:12` → "Seite 1 von 7 — 12 Blöcke erkannt" + progress bar appears

The progress bar should not appear during steps 1-3 (no total page count yet). It appears on the first `ANALYZING_PAGE` message.

Partial failure display: If page 3 of 7 fails, the progress bar jumps from page 2 to page 4. The user might think something glitched. Consider: should the bar still show 3/7 (error pages count toward progress) or skip? I'd count error pages toward progress — the service tried, it moved on. But a small warning icon or text noting "1 Seite übersprungen" ("1 page skipped") would help set expectations that the transcription is incomplete.
Suggestions
Progress bar markup:

Accessibility: The progress bar needs `role="progressbar"`, `aria-valuenow={ocrCurrentPage}`, `aria-valuemin="0"`, `aria-valuemax={ocrTotalPages}`, and an `aria-label` like "OCR-Fortschritt". Screen readers should announce progress changes.

The numeric fraction below the bar (`3 / 7`) should use `tabular-nums` (Tailwind: `font-[tabular-nums]`) so the digits don't shift as numbers change width.

⚙️ Tobias Wendt — DevOps & Platform Engineer
Questions & Observations
Chunked transfer encoding through reverse proxies: This is my primary concern. If Caddy (the production reverse proxy) or any intermediate proxy buffers the chunked response, the Java side won't see lines until the buffer fills or the response ends — defeating the entire purpose of streaming. The issue's open question about `X-Accel-Buffering: no` is relevant, but that header is nginx-specific. For Caddy, the `application/x-ndjson` content type may or may not trigger its streaming behavior. Test this.

Health check interaction with streaming: The `asyncio.to_thread()` fix (d8dcba1) keeps the event loop free during OCR. With streaming, the event loop is also free between `yield` calls (each `yield` in the async generator gives back control). So `/health` responds during streaming — good. But confirm: does `asyncio.to_thread()` still work inside an `async def generate()` context? It should, since the generator runs on the event loop and `to_thread` offloads to a thread pool.

OCR service single-worker constraint: Uvicorn runs one worker. With streaming, a single OCR request holds one connection open for potentially 30+ minutes. If a second OCR request comes in while the first is streaming, it queues behind the first because `asyncio.to_thread()` runs in the default thread pool (`max_workers` defaults to `min(32, os.cpu_count() + 4)`). The thread pool is fine, but the OCR engines likely aren't thread-safe (global model state). Two concurrent `extract_page_blocks` calls could corrupt the model state. This was already true before streaming — it's just more visible now.

Container restart during streaming: If the OCR container restarts mid-stream (OOM, Docker restart policy), Java's `BufferedReader.readLine()` returns null or throws an `IOException`. The error handling table covers this ("Connection drops mid-stream"). Make sure the Java side closes the `InputStream` in a `finally` block to avoid resource leaks.

Suggestions
Add `X-Accel-Buffering: no` and `Cache-Control: no-cache` to the streaming response headers. Cost: two lines. This prevents buffering by any proxy layer.

Log the stream lifecycle at INFO level: On the Python side, log when streaming starts ("Starting OCR stream for {totalPages} pages"), each page completion ("Page {i+1}/{totalPages} complete, {len(blocks)} blocks"), and the stream end. This makes it trivial to diagnose timeout/hang issues from `docker compose logs ocr-service`.

Monitor memory during streaming: Streaming doesn't reduce peak memory much because `_download_and_convert_pdf` still loads all page images at once. For truly large documents, consider converting pages lazily (one at a time using `pypdfium2`). But this is a separate optimization — not needed for this issue.

For the open question on batch: Agree with keeping batch on the old `extractBlocks()` for now. Batch already has per-document progress. Per-page-within-batch would be confusing in the UI and adds no real value until individual documents in batch also time out.

👨💻 Felix Brandt — Implementation Plan
After reading every comment, here's how I'd address each teammate's concerns in a TDD implementation. Every task maps to at least one concern; nothing is left unaddressed.
Design Decisions
No standalone `OcrStreamHandler` interface. Instead, `OcrClient` gets `streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer)`, where `OcrStreamEvent` is a sealed interface with `Start`, `Page`, `Error`, `Done` record subtypes. The runner provides the consumer lambda. This keeps HTTP parsing in `RestClientOcrClient` and orchestration in `OcrAsyncRunner` — clean separation without a callback interface crossing layers. (My own concern #1; Markus: domain boundary clean)

The colon protocol stays (KISS). `ANALYZING_PAGE:current:total:blocks` — consistent with `CREATING_BLOCKS:42` and `DONE:100`. Three params, documented. (My concern #3)

Batch stays on the old `extractBlocks()` path. Both Markus and Tobias agree — batch has its own per-document SSE progress. Nested per-page streaming adds complexity for no real benefit right now. (Markus suggestion, Tobias #7)

Fallback for the old Python image. If `/ocr/stream` returns 404, fall back to `extractBlocks()` via the old `/ocr` endpoint. The `OcrClient` default method synthesizes `Start`/`Page`/`Done` events from the collected result. (Issue spec; Markus #4: backward compat)

Skipped pages in the DONE message. Encode as `DONE:12:2` (12 blocks, 2 pages skipped). The frontend parses this and shows a warning. This avoids adding a new field to the poll response. (Leonie #3)

Answer to the open questions: `X-Accel-Buffering: no` + `Cache-Control: no-cache` — yes, add both. One line each. (Markus #6, Tobias #5)

Task List
Layer 1: Python OCR Service

P1 — [python/test] Extract `extract_page_blocks()` from the Surya engine. New method `extract_page_blocks(image, page_idx, language)`. The `load_models()` guard runs on the first call (lazy load preserved). `extract_blocks()` becomes a thin wrapper.

P2 — [python/test] Extract `extract_page_blocks()` from the Kraken engine.

P3 — [python/test] Add the `POST /ocr/stream` NDJSON streaming endpoint.

- `StreamingResponse` with `media_type="application/x-ndjson"`.
- Headers: `X-Accel-Buffering: no`, `Cache-Control: no-cache`. (Markus #6, Tobias #5)
- Validate `totalPages` after PDF conversion: reject < 0 or > 500 with HTTP 422. (Nora #4)
- On per-page error: emit `{"type":"error",...}` with a generic message (not the raw Python traceback), and log the raw error server-side only. (Nora #3)
- `asyncio.to_thread()` per page to keep the event loop free.

Layer 2: Java Backend
J1 — [backend/test] Define the `OcrStreamEvent` sealed interface with Jackson deserialization. Records: `Start(int totalPages)`, `Page(int pageNumber, int totalPages, List<OcrBlockResult> blocks)`, `Error(int pageNumber, int totalPages, String message)`, `Done(int totalPages, int totalBlocks)`. `FAIL_ON_UNKNOWN_PROPERTIES = true`. (Nora #2) Handle an unknown `type` gracefully.

J2 — [backend] Add `streamBlocks()` to the `OcrClient` interface. `default void streamBlocks(String pdfUrl, ScriptType scriptType, Consumer<OcrStreamEvent> eventConsumer)` — the default implementation calls `extractBlocks()` and synthesizes events.

J3 — [backend/test] Implement `streamBlocks()` in `RestClientOcrClient`.

- `HttpClient.send()` with `BodyHandlers.ofInputStream()` for streaming.
- `BufferedReader.readLine()` loop; parse each line into an `OcrStreamEvent`.
- Wrap `readLine()` in a `CompletableFuture` with a 5-minute timeout. (Markus #7)
- Close the `InputStream` in a `finally` block. (Tobias #4)
- On 404 from `/ocr/stream`, fall back to `extractBlocks()`.
- Tests (mock `HttpClient` with a canned `InputStream`): truncated stream without `done` → no hang, pages received. (Sara #2)

J4 — [backend/test] Refactor `createTranscriptionBlocks` into a per-block method: `createSingleBlock(UUID documentId, OcrBlockResult block, UUID userId, String fileHash, int sortOrder)`.

J5 — [backend/test] Wire `streamBlocks()` into `runSingleDocument()`.

- Replace the `ocrClient.extractBlocks()` call with `ocrClient.streamBlocks()`.
- `Start` → set `jobDoc.totalPages`, progress = `ANALYZING`
- `Page` → increment `jobDoc.currentPage`, progress = `ANALYZING_PAGE:{current}:{total}:{blocksSoFar}`, call `createSingleBlock` per block (partial persistence — Markus #3)
- `Error` → increment `jobDoc.currentPage`, log a warning, track the skipped count. Error pages count toward progress. (Leonie #3)
- `Done` → progress = `DONE:{totalBlocks}:{skippedCount}`
- `runBatch()` stays on `extractBlocks()`. (Markus #5, Tobias #7)
- Test: mock `streamBlocks` with a multi-page sequence → verify `currentPage`/`totalPages` updated, `createSingleBlock` called with the correct sortOrder, progress messages correct. (Sara #4)

J6 — [backend/test] Defense-in-depth validation.

- Sanity-check `totalPages < 0 || totalPages > 500`, log a warning. (Nora #4 — Java side too)
- In `onError`: use the generic message "Seite X konnte nicht verarbeitet werden" for the user-facing `progressMessage`; log the raw Python error server-side only. (Nora #3)

Layer 3: Frontend
F1 — [frontend] Add new i18n keys.

- `ocr_status_analyzing_page`: DE "Seite {current} von {total} — {blocks} Blöcke erkannt" / EN "Page {current} of {total} — {blocks} blocks detected" / ES "Página {current} de {total} — {blocks} bloques detectados"
- `ocr_status_pages_skipped`: DE "{count} Seite(n) übersprungen" / EN "{count} page(s) skipped" / ES "{count} página(s) omitida(s)"

F2 — [frontend/test] Update `translateOcrProgress()` for `ANALYZING_PAGE`.

- `const [key, current, total, blocks] = code.split(':')` → `m.ocr_status_analyzing_page({ current, total, blocks })`.
- Extend the `DONE` case to handle an optional skipped param: `DONE:12:2` → done message + skipped warning.
- Tests: `ANALYZING_PAGE:3:7:45` → localized string (Sara #4); `DONE:12:2` → done + skipped warning.

F3 — [frontend/test] Add a progress bar to the OCR status display.

- Driven by `ocrProgressMessage`: `h-2`, `rounded-full`, `bg-brand-mint` on `bg-brand-sand`, `transition-all duration-500`, `max-w-xs`, `mx-auto mt-4`. (Leonie #1)
- Shown only when `ocrTotalPages > 0` — no bar during PREPARING/LOADING/ANALYZING. (Leonie #2)
- `role="progressbar"`, `aria-valuenow`, `aria-valuemin={0}`, `aria-valuemax={100}`, `aria-label="OCR-Fortschritt"`. (Leonie #4)
- Numeric fraction `{current} / {total}` with `tabular-nums`. (Leonie #5)

Layer 4: Integration
I1 — [backend/test] Integration test: mock Python, real Java flow. A `@SpringBootTest` test: mock `streamBlocks()` → emit a multi-page sequence → verify job status transitions, `OcrJobDocument` fields, transcription blocks in the DB, progress messages.

I2 — [backend/test] Partial results resilience test.

Concern Traceability
👨💻 Felix Brandt — Senior Fullstack Developer (Round 2)
Reviewing my own plan against the original issue and the team's feedback.
Questions & Observations
`OcrStreamEvent` sealed interface vs simple records: My plan uses a sealed interface with `Start`, `Page`, `Error`, `Done` record subtypes. This is clean for pattern matching in Java 21 (`switch (event) { case Page p -> ... }`). However, the deserialization needs a custom Jackson deserializer or `@JsonTypeInfo` — this is non-trivial code. The test in J1 must cover the deserializer itself, not just the records. Make sure the deserializer is a standalone testable class, not an anonymous inner class buried in `RestClientOcrClient`.

`translateOcrProgress` — extracting for testability: Task F2 says to test `translateOcrProgress` via Vitest. Currently it's a function inside `+page.svelte`'s `<script>` block (line 136). To unit-test it, I need to either extract it into a separate `.ts` file or test it through the component. Extracting is cleaner — but it means the function needs access to Paraglide's `m` object. The simplest approach: extract it to `$lib/ocr/translateOcrProgress.ts`, import `m` from `$lib/paraglide/messages`, and test by mocking `m`. This is a minor refactor but worth calling out.

`DONE:12:2` backward compatibility: My plan changes the `DONE` progress message from `DONE:<count>` to `DONE:<count>:<skipped>`. The frontend's current `translateOcrProgress` splits on `:` and takes `parts[1]` as the count. If a new backend sends `DONE:12:2` to an old frontend, the old frontend shows "12 Blöcke erstellt" and ignores the `:2` — safe. If an old backend sends `DONE:12` to a new frontend, the new frontend parses `skipped` as `undefined` — the `if (skipped > 0)` check would be false. Both directions are backward-compatible. Good.

`sortOrder` continuity across pages: Task J4 extracts `createSingleBlock` with a `sortOrder` param. In J5, the consumer maintains a running counter across pages. I should make this explicit: `int[] blockCounter = {0}; ... onPage: for (block : blocks) createSingleBlock(..., blockCounter[0]++);`. The array wrapper is needed because lambdas can only capture effectively final variables. Alternative: use an `AtomicInteger` — cleaner; document it.

Suggestions
Test naming for J3: The NDJSON parser tests are the most critical new code. Name them precisely:

- `streamBlocks_parsesStartPageDoneSequence_dispatchesEventsInOrder`
- `streamBlocks_skipsEmptyLines_betweenJsonObjects`
- `streamBlocks_logsMalformedLine_andContinues`
- `streamBlocks_handlesAbruptStreamEnd_withoutDoneEvent`
- `streamBlocks_fallsBackToExtractBlocks_on404`

Missing task: update the existing `OcrAsyncRunnerTest`. The plan has J5 adding new tests for `streamBlocks`, but the existing tests that mock `extractBlocks()` need updating too — they should verify the old behavior still works via the default `streamBlocks()` method. Don't leave stale test mocks.

🏗️ Markus Keller — Application Architect (Round 2)
Reviewing Felix's implementation plan for architectural soundness.
Questions & Observations
Sealed interface + Consumer pattern is the right call. The original issue proposed `OcrStreamHandler` as a standalone interface injected across layers. Felix's revision — `Consumer<OcrStreamEvent>` with a sealed interface for events — is better. The consumer is provided by the caller (`OcrAsyncRunner`), so the dependency points inward: `RestClientOcrClient` depends on the event types (service layer), not on the runner. The runner depends on nothing new. Clean inversion.

`default void streamBlocks(...)` on `OcrClient` — one concern. The default method calls `extractBlocks()` and synthesizes events. This means `OcrClient` now has two methods with a non-obvious contract: implementations should override `streamBlocks()`, and `extractBlocks()` could be derived from it. But the default goes the other direction — `streamBlocks()` calls `extractBlocks()`. This creates an asymmetry: `RestClientOcrClient` overrides `streamBlocks()` (talks to `/ocr/stream`), and its `extractBlocks()` is the old direct call to `/ocr`. Add Javadoc to `streamBlocks()` saying "Implementations should override this method. The default exists only for backward compatibility during migration." This is documentation, not code — but it prevents the next developer from being confused.

`CompletableFuture` for the per-read timeout (J3): This addresses my socket-level idle-timeout concern. But the implementation detail matters. `CompletableFuture.supplyAsync(() -> reader.readLine(), executor).get(5, TimeUnit.MINUTES)` creates a new task per line on an executor. If the read blocks, the thread stays blocked — `CompletableFuture` doesn't interrupt it. To truly enforce the timeout, you'd need to close the `InputStream` from a timer thread, which causes the blocked `readLine()` to throw an `IOException`. Consider: is `CompletableFuture.get(timeout)` sufficient here? If `readLine()` blocks forever, the future times out but the underlying thread is still stuck. For a thread pool of size 2 (the OCR async pool), this could exhaust the pool. Alternative: a scheduled watchdog that closes the streaming `InputStream` when no line has arrived for 5 minutes, so the blocked read is actually interrupted and the thread freed. (Note: `java.net.http.HttpClient` offers no per-read timeout — only `connectTimeout` on the builder and the whole-exchange `HttpRequest.timeout()`.) More robust than a timed-out future whose thread is still stuck.

No migration needed. `OcrJobDocument.currentPage` and `totalPages` columns already exist in the DB (they're just always 0). The plan populates them during streaming. No Flyway migration — good, one fewer moving part.

Batch path unchanged — confirmed correct. Batch uses `extractBlocks()`, which still calls the old `/ocr` endpoint. The plan explicitly keeps this. Both Tobias and I flagged this; Felix addressed it. Satisfied.

Suggestions
Consider making `extractBlocks()` a `default` method too, derived from `streamBlocks()` (the inverse of what the plan proposes). Then there's only one "real" method to implement: `streamBlocks()`. The default `extractBlocks()` collects pages into a list. This is cleaner long-term — one method does the work, the other is convenience. But it changes the current contract, so only do this if the batch path is the only caller of `extractBlocks()`.

A socket-level idle-timeout mechanism (above) would eliminate the `CompletableFuture` wrapper entirely: the streaming read path gets a 5-minute idle limit while the existing `RestClient`-wrapped client keeps the current 10-minute timeout. Two clients, two timeout strategies — document it clearly.

🧪 Sara Holt — QA Engineer & Test Strategist (Round 2)
Reviewing Felix's implementation plan for test coverage completeness.
Questions & Observations
Test plan is thorough — every layer is covered. P3 (Python contract tests), J1 (event deserialization), J3 (NDJSON parsing), J5 (runner integration), F2 (frontend progress), I1 (end-to-end mock), I2 (partial resilience). My original test plan is fully represented. Good.
One gap: Python streaming test with `httpx` streaming. Task P3 says "Tests with FastAPI TestClient." FastAPI's `TestClient` (which uses `httpx` internally) buffers the response by default. To test the streaming behavior (line-by-line NDJSON), you need to use `httpx.AsyncClient` with `stream=True` and iterate the response line-by-line. Otherwise, you're testing the assembled response, not the stream. Verify that the test actually reads lines incrementally, not just checks the final concatenated response.

Mocking pattern for J5 (`streamBlocks` in `OcrAsyncRunner`). The plan says "mock `streamBlocks` with multi-page sequence." With the `Consumer<OcrStreamEvent>` pattern, the stubbed `streamBlocks` simply invokes the consumer with a scripted multi-page event sequence. This is synchronous — all callbacks fire immediately. In production, they fire incrementally over minutes. The test doesn't validate that DB writes survive between callbacks (that's I1's job), but it should at least verify the runner saves `jobDoc` after each `Page` event, not just at the end.

Edge case still missing from the plan: `totalPages` mismatch. My round 1 comment asked: "What if `totalPages` in `start` doesn't match the actual page count sent?" The plan doesn't explicitly address this. The progress bar shows `current / total` using the `start` value. If Python sends `start(totalPages=7)` but only sends 5 page events before `done`, the bar stops at 5/7. Not catastrophic — but the user sees an incomplete bar before it disappears. Suggestion: when `Done` arrives, set `currentPage = totalPages` so the bar fills to 100% briefly before clearing. Or accept the mismatch as a display-only cosmetic issue.

Test for 0-page PDF graceful handling. Task P3 lists this as an edge case, but the Java side (J5) doesn't explicitly test it. `Start(0)` → `Done(0, 0)` → the runner should mark the job as DONE with 0 blocks, not fail. Add a test case in J5 for this.

Suggestions

Add a test case to J5: `runSingleDocument_handlesZeroPagePdf_completesWithZeroBlocks()` — consumer emits `Start(0)`, `Done(0, 0)`. Verify job status = DONE, no blocks created, no errors.

Add a test case to J3: `streamBlocks_handlesTotalPagesMismatch_completesNormally()` — send `Start(5)` but only 3 page events, then `Done`. Verify all 3 pages are dispatched, no error thrown.

The Python streaming test (P3) should use `async for line in response.aiter_lines()` to validate true streaming behavior, not just `response.json()` or `response.text`.

🔒 Nora "NullX" Steiner — Application Security Engineer (Round 2)
Reviewing Felix's implementation plan for security coverage.
Questions & Observations
All my round 1 concerns are mapped to tasks: Jackson defense-in-depth (J1), error message leakage (P3, J6), `totalPages` validation (P3, J6), DEBUG logging (J3). The traceability table is clean. No gaps from my perspective.

`FAIL_ON_UNKNOWN_PROPERTIES = true` on the NDJSON ObjectMapper (J1): Good. But make sure this is a dedicated `ObjectMapper` instance, not the application's global one. Spring Boot auto-configures a global `ObjectMapper` bean. If you set `FAIL_ON_UNKNOWN_PROPERTIES = true` on that, it'll break deserialization elsewhere in the app where unknown properties are silently ignored (e.g., OpenAPI-generated DTOs). Create a local `ObjectMapper` in `RestClientOcrClient` or a small `NdjsonParser` utility class.

Generic error messages in `onError` (J6): The plan says to use "Seite X konnte nicht verarbeitet werden" ("page X could not be processed") for user-facing progress. Good. But verify this on the Python side too (P3): the `message` field in the error event should already be sanitized. The plan says "generic message (not raw Python traceback)." Make sure the Python code catches the exception, logs the full traceback with `logger.exception()`, and yields only a short string like `f"OCR engine failed on page {page_idx}"` — not `str(e)`, which can include file paths, model paths, or internal state.

Auth posture for `/ocr/stream`: My round 1 comment asked about this. The plan doesn't explicitly address it — it inherits the same "internal Docker network only" posture as `/ocr`. That's acceptable for now, but I want to note: this is the second endpoint on the OCR service without any authentication. If a shared-secret header is added later (as suggested on #230), it should cover both `/ocr` and `/ocr/stream` simultaneously.

`CompletableFuture` per-read timeout (J3): Markus's round 2 suggestion to use `HttpClient.readTimeout` instead is better from a security perspective too. A `CompletableFuture` that times out but doesn't interrupt the underlying thread means the thread is still blocked reading from a potentially hung socket. In a pathological case (a malicious OCR service sending data very slowly — slowloris-style), this could exhaust the thread pool. Socket-level timeout is the correct defense.

Suggestions
Dedicated `ObjectMapper` for NDJSON parsing — instantiate it in the `RestClientOcrClient` constructor or a static factory. Configure `FAIL_ON_UNKNOWN_PROPERTIES = true` and `FAIL_ON_NULL_FOR_PRIMITIVES = true` (reject `totalPages: null`), and register the `OcrStreamEvent` deserializer.

Python error message template: In the streaming endpoint, use a fixed template for error events:
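A minimal sketch of such a template (the helper name is illustrative; the event shape follows the protocol's `{"type":"error",...}` line):

```python
import json
import logging

logger = logging.getLogger("ocr")

def error_event(page_idx: int) -> str:
    # Fixed template only: never interpolate str(e), paths, or engine state.
    return json.dumps({
        "type": "error",
        "pageNumber": page_idx,
        "message": f"OCR engine failed on page {page_idx}",
    }) + "\n"

# In the per-page loop (sketch):
try:
    raise RuntimeError("/models/surya/weights.bin: tensor mismatch")  # simulated engine failure
except Exception:
    logger.exception("OCR failed on page 3")  # full traceback goes to the logs only
    line = error_event(3)                     # sanitized event goes to the client
```

The yielded line carries no trace of the underlying exception text, so model paths and internal state stay out of the stream.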
Log the actual exception separately via `logger.exception(f"OCR failed on page {i}")`. Never include `str(e)` in the yielded JSON.

🎨 Leonie Voss — UI/UX Design Lead (Round 2)
Reviewing Felix's implementation plan for UI/UX completeness.
Questions & Observations
All my round 1 specs are captured in F3: `h-2`, `rounded-full`, `brand-mint` on `brand-sand`, `transition-all duration-500`, `max-w-xs`, `mx-auto mt-4`, accessibility attributes, `tabular-nums`. The plan is precise and matches my requirements.

State transition coverage (F3, Leonie #2) — confirmed handled. The plan says "only visible when `ocrTotalPages > 0`", which means no bar during PREPARING/LOADING/ANALYZING. The bar appears on the first `ANALYZING_PAGE` message. This is exactly right. But one detail: when the job transitions from `ANALYZING_PAGE:7:7:88` to `DONE:88:0`, the bar should briefly show 100% (7/7) before the entire OCR overlay disappears. Currently, when `pollOcrJob` detects status `DONE`, it sets `ocrRunning = false` and clears `ocrProgressMessage` — the bar vanishes instantly. Consider: should there be a 1-2 second delay showing the completed bar before it clears? This is a polish detail, not a blocker — but it avoids the jarring jump from "Seite 7 von 7" ("page 7 of 7") to the transcription view.

Skipped pages warning (F2/F4 → Leonie #3) — good approach. Encoding it in `DONE:12:2` and showing "2 Seite(n) übersprungen" ("2 page(s) skipped") is clean. But where does this warning appear visually? Below the progress bar? After the bar disappears? I'd recommend: show it briefly (2-3s) below the progress bar when the `DONE` message arrives, before the overlay clears. Use `text-amber-600` (warning color) with a small warning icon to distinguish it from normal progress text. If the user misses it, the partial transcription itself makes the gap visible.

`aria-valuenow` semantics: The plan uses a percentage (0-100) for `aria-valuenow` with `aria-valuemax=100`. This is correct per the ARIA spec — screen readers announce "43% complete." An alternative would be `aria-valuenow={ocrCurrentPage}` with `aria-valuemax={ocrTotalPages}` (announces "3 of 7"), which is more meaningful for this context. Either works; I'd prefer the page-based semantics, since the user understands pages, not percentages. The percentage approach is fine too — just be consistent.

`font-[tabular-nums]` vs `tabular-nums`: Tailwind CSS 4 supports `tabular-nums` as a direct utility class (it was added in v3.4). No need for the bracket syntax. Use `tabular-nums` directly.

Suggestions

Consider page-based ARIA semantics: `aria-valuenow={ocrCurrentPage}`, `aria-valuemin={0}`, `aria-valuemax={ocrTotalPages}` — more meaningful than a percentage for this specific progress bar. The screen reader announces "3 of 7" instead of "43%."

Skipped pages warning visual: below the numeric fraction, `text-amber-600 text-xs` with a small ⚠ prefix.

Brief completion flash: when `DONE` arrives, show the bar at 100% for ~1.5s before clearing. This gives the user a sense of completion rather than an abrupt cut.

⚙️ Tobias Wendt — DevOps & Platform Engineer (Round 2)
Reviewing Felix's implementation plan for operational concerns.
Questions & Observations
All my round 1 concerns are addressed: `X-Accel-Buffering: no` + `Cache-Control: no-cache` (P3), `InputStream` closed in finally (J3), batch on the old path (J5), stream lifecycle logging at INFO (P3). The traceability table is complete.

`HttpClient.readTimeout` vs `CompletableFuture` for idle timeout (J3): Markus raised this in round 2 and I agree — `HttpClient.Builder.readTimeout(Duration.ofMinutes(5))` is simpler and works at the socket level. The `CompletableFuture` approach adds complexity (thread management, uncancellable blocked reads) for no benefit. If the Python service hangs mid-page, the socket read times out after 5 minutes, `readLine()` throws `SocketTimeoutException`, and the catch block in the streaming loop handles it as a connection drop. Clean.

Two `HttpClient` instances — operational implication: The streaming `HttpClient` would have `readTimeout(5min)` and no total request timeout. The existing one (wrapped in `RestClient`) has `readTimeout(10min)`. In production, monitoring should track both: the streaming client's active connections (how many concurrent streams are open) and the old client's usage (batch and health checks). Currently there's no metrics exposure for the OCR client — this is pre-existing, not new to this issue.

Memory during streaming: My round 1 comment noted that `_download_and_convert_pdf` still loads all pages at once. Felix's plan doesn't address lazy page loading — and it shouldn't for this issue. But once streaming is in place, lazy page conversion (load page N, process page N, free page N, load page N+1) becomes a natural follow-up. Worth a separate issue.

Docker healthcheck during long streams: The Python OCR service healthcheck is `GET /health`. During streaming, `asyncio.to_thread` offloads OCR work to a thread, keeping the event loop free. The `yield` in the async generator also gives back control between pages. So `/health` should respond during streaming. But verify: does `_download_and_convert_pdf` (which uses `httpx.AsyncClient`) block the event loop during the initial download? It's `await`-based, so it shouldn't — but if the PDF is large (100MB+), the conversion to PIL Images via `pypdfium2.render()` might be synchronous and block. The existing code runs this before `asyncio.to_thread` kicks in. For the streaming endpoint, consider wrapping the PDF download+conversion in `asyncio.to_thread` too, so `/health` responds even during the initial load.

Suggestions

Use `HttpClient.readTimeout(Duration.ofMinutes(5))` for the streaming client — drop the `CompletableFuture` wrapper. Simpler, and it actually enforces the timeout at the OS level.

Wrap `_download_and_convert_pdf` in `asyncio.to_thread` inside the streaming endpoint: The current `/ocr` endpoint already uses `asyncio.to_thread` for the engine call but runs the PDF download on the event loop (via `httpx.AsyncClient`). The actual blocking part is the `pypdfium2` rendering. For the streaming endpoint, the entire `_download_and_convert_pdf` could run in the thread pool to guarantee `/health` always responds. This is a one-line change and worth doing.

Future issue: lazy page conversion. Once streaming lands, file an issue for loading and converting pages one at a time instead of all at once. This would cut peak memory from O(pages) to O(1) for images. Not in scope for #231, but streaming makes it architecturally possible.
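How these pieces fit together can be sketched as a single async generator: the blocking render and each per-page engine call run in worker threads via `asyncio.to_thread`, and every `yield` hands control back to the event loop so `/health` stays responsive. All helper names below are illustrative stand-ins, not the actual `main.py` code:

```python
import asyncio
import json

def render_pdf_pages(pdf_bytes: bytes) -> list:
    # Stand-in for the synchronous pypdfium2 rendering step.
    return ["page-image-1", "page-image-2"]

def extract_page_blocks(image, page_idx: int, language: str) -> list:
    # Stand-in for the per-page engine call (surya/kraken).
    return [{"text": f"block on page {page_idx}"}]

async def stream_ocr(pdf_bytes: bytes, language: str = "de"):
    # Blocking render runs in a worker thread, so GET /health can still
    # be served while a large PDF is converted.
    images = await asyncio.to_thread(render_pdf_pages, pdf_bytes)
    yield json.dumps({"type": "start", "totalPages": len(images)}) + "\n"
    for i, image in enumerate(images, start=1):
        # One page at a time in a worker thread; each `yield` returns
        # control to the event loop between pages.
        blocks = await asyncio.to_thread(extract_page_blocks, image, i, language)
        yield json.dumps({"type": "page", "pageNumber": i, "blocks": blocks}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

async def collect() -> list:
    return [line async for line in stream_ocr(b"%PDF-")]

lines = asyncio.run(collect())
```

In the real endpoint this generator would be handed to FastAPI's `StreamingResponse` with `media_type="application/x-ndjson"`.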
marcel referenced this issue 2026-04-13 10:55:59 +02:00