feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #252
Problem
Scanned family letters from the 19th and 20th centuries typically have yellowed paper and faded ink. Today the OCR engines (Surya, Kraken) receive raw page images with no preprocessing — the low contrast between aged ink and yellowed background leads to missed or garbled text.
Solution
Add a transparent preprocessing step before every OCR pass using OpenCV:

- CLAHE (contrast-limited adaptive histogram equalization) on the L channel in LAB color space to lift faded ink off the yellowed background
- Grayscale conversion
- A light 3×3 Gaussian blur to suppress scan noise

No binarization (Otsu/Sauvola) — too aggressive for always-on use, and it would hurt Surya, which was trained on grayscale images, not binary ones.
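A sketch of what such a `preprocess_page` could look like (the CLAHE parameters are the defaults discussed in this thread; the exact structure is illustrative, not the shipped implementation):

```python
import logging

from PIL import Image

log = logging.getLogger(__name__)


def preprocess_page(image: Image.Image) -> Image.Image:
    """CLAHE on the LAB L channel, then grayscale + light blur.

    Falls back to the unmodified input on any OpenCV error, since a
    skipped enhancement is better than a failed OCR run.
    """
    try:
        import cv2
        import numpy as np

        # PIL (RGB) -> OpenCV BGR array
        bgr = cv2.cvtColor(np.asarray(image.convert("RGB")), cv2.COLOR_RGB2BGR)

        # Boost local contrast on the lightness channel only
        lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
        l_ch, a_ch, b_ch = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        merged = cv2.merge((clahe.apply(l_ch), a_ch, b_ch))

        # Grayscale + light denoise before handing off to the OCR engine
        gray = cv2.cvtColor(cv2.cvtColor(merged, cv2.COLOR_LAB2BGR), cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (3, 3), 0)

        result = Image.fromarray(gray)
        # Drop references to numpy intermediates before returning, so the
        # original and processed page are never both held with the buffers
        del bgr, lab, l_ch, a_ch, b_ch, merged, gray
        return result
    except Exception as exc:  # fallback is the contract: never break OCR
        log.warning("preprocess_page failed (%s): %s", type(exc).__name__, exc)
        return image
```

If OpenCV is unavailable or raises, the caller receives the original image unchanged, which is what a fallback test can assert against.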
Preprocessing runs inside `asyncio.to_thread()` so it never blocks the FastAPI event loop.

Progress Spinner
A new `preprocessing` NDJSON event is emitted per page before the OCR event. This flows through the full stack: Python → Spring Boot `OcrStreamEvent.Preprocessing` → `progressMessage = "PREPROCESSING_PAGE:1:5"` → `translateOcrProgress` → "Seite 1 von 5 wird vorverarbeitet…" in the spinner.
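For illustration, a plausible shape for the new event line on the wire (only the `"preprocessing"` type discriminator comes from this thread; the other field names are assumptions):

```python
import json

# Hypothetical "preprocessing" NDJSON event as the Python service might emit it
event = {"type": "preprocessing", "page_number": 1, "total_pages": 5}
line = json.dumps(event) + "\n"  # NDJSON: one JSON object per line

assert json.loads(line)["type"] == "preprocessing"
```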
Changes

- `ocr-service/preprocessing.py` — `preprocess_page(image)` function
- `ocr-service/main.py` — `preprocessing` event + `preprocess_page` call per page in both stream generators; preprocess silently in the non-streaming `/ocr` endpoint
- `ocr-service/requirements.txt` — `opencv-python-headless`
- `service/OcrStreamEvent.java` — `record Preprocessing(int pageNumber)`
- `service/RestClientOcrClient.java` — `"preprocessing"` NDJSON type in `parseNdjsonStream()`
- `service/OcrAsyncRunner.java` — `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")`
- `lib/ocr/translateOcrProgress.ts` — `PREPROCESSING_PAGE` case
- `messages/de.json`, `en.json`, `es.json` — `ocr_status_preprocessing_page` key

Tests
- `test_preprocessing.py`: output shape, L-channel mean increases on a synthetic yellowed image, fallback on OpenCV error
- `RestClientOcrClientTest`: `parseNdjsonStream` dispatches the `Preprocessing` event
- `OcrAsyncRunnerTest`: `Preprocessing` event sets the correct `progressMessage`
- `translateOcrProgress.test.ts`: `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }`
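The colon-delimited format the last test exercises amounts to a three-way split (illustrative Python; the real implementation is the TypeScript `translateOcrProgress`):

```python
def parse_progress(msg: str) -> dict:
    """Split a "PREPROCESSING_PAGE:X:Y" progress message into its parts."""
    kind, current, total = msg.split(":")
    return {"kind": kind, "currentPage": int(current), "totalPages": int(total)}


parsed = parse_progress("PREPROCESSING_PAGE:3:10")
assert parsed["currentPage"] == 3 and parsed["totalPages"] == 10
```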
No API Changes

The `OcrRequest` model is unchanged. No flag, no user action required — preprocessing is always-on and transparent.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations
- The service already has `test_stream.py` (pytest + `AsyncMock`, `patch()` for engine deps). The `test_preprocessing.py` additions slot in cleanly.
- In the `generate()` loop, once `preprocess_page` returns the processed image, the original PIL Image should be deleted immediately before the OCR call. Currently `del image` happens in the `finally` block — that's fine for the original, but the intermediate numpy array inside `preprocess_page` needs to be explicitly deleted before returning, to avoid holding both the original and the processed image in memory simultaneously (a ~20MB spike per page). This should be handled inside `preprocess_page` itself.
- The `generate_guided()` path processes all regions on one page in a loop. Preprocessing should happen once per page (before the region loop), not once per region. The issue states this correctly, but it's easy to misimplement.
- In the non-streaming `/ocr` endpoint, the preprocessed image should replace the original in place (`image = await asyncio.to_thread(preprocess_page, image)`) with a `del` on the intermediate, to avoid doubling peak memory for large documents.
- `translateOcrProgress.spec.ts` already mocks messages with simple strings — `ocr_status_preprocessing_page` should follow the same mock pattern (`({ current, total }) => \`Vorverarbeitung Seite ${current} von ${total}\`` or similar).

Recommendations
- Explicit `del` of intermediate numpy/cv2 arrays inside `preprocess_page` before returning the final PIL Image. This keeps peak memory per page at ~10MB instead of ~20MB.
- The `test_preprocess_falls_back_on_error` test should assert not just that no exception is raised, but that the returned image is pixel-identical to the input. That's the contract.
- Write the `test_preprocessing.py` tests before touching `main.py`. The preprocessing module is pure Python with no I/O — red/green/refactor takes minutes here.

🏗️ Markus Keller — Senior Application Architect
Observations
- `preprocessing.py` is a pure image-transformation module with one responsibility and no cross-cutting dependencies. The addition of `OcrStreamEvent.Preprocessing` to the sealed interface is textbook — the interface already models the stream protocol, and preprocessing is a legitimate protocol event.
- `RestClientOcrClient.parseNdjsonStream()` already has a `default -> log.debug(...)` branch for unknown event types. This means that if the Python service is deployed first (emitting `preprocessing` events) before the Java code is updated, those events are silently ignored — no crash, no partial data loss. Rolling deployments in either order are safe.
- The CLAHE parameters are hardcoded (`clipLimit=2.0`, `tileGridSize=(8,8)`). These are sensible defaults, but the right values are document-corpus-specific. Since the OCR service already uses environment variables for all tunable parameters (`OCR_CONFIDENCE_THRESHOLD`, batch sizes), the preprocessing parameters should follow the same pattern.
- The `progressMessage` colon-delimited string encoding is already an established protocol in `OcrAsyncRunner`. Adding `PREPROCESSING_PAGE:X:Y` follows the existing convention — no architecture concern there.
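Following that existing convention, reading the parameters might look like this (the `OCR_CLAHE_*` names are a suggestion mirroring the service's pattern; the defaults match the hardcoded values):

```python
import os

# Hypothetical env-var names following the service's existing convention;
# defaults mirror the hardcoded clipLimit=2.0 / tileGridSize=(8, 8)
CLAHE_CLIP_LIMIT = float(os.getenv("OCR_CLAHE_CLIP_LIMIT", "2.0"))
CLAHE_TILE_SIZE = int(os.getenv("OCR_CLAHE_TILE_SIZE", "8"))
```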
Recommendations

- Externalise the CLAHE parameters as environment variables read in `preprocessing.py`, and document them in `docker-compose.yml` with their defaults. This avoids a code change when tuning the pipeline after the first real-world test.

🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
This is low-risk from a security standpoint. No new endpoints, no new HTTP parameters, no user input reaches the preprocessing code path.
- `preprocess_page(image: Image.Image)` takes a PIL Image that was already downloaded via `_download_and_convert_pdf()`, which has existing SSRF protection (`ALLOWED_PDF_HOSTS`). The preprocessing step operates on an in-memory object, not on paths or URLs.
- The OpenCV fallback logs at `WARNING` level — ensure the warning includes the page number and exception type so it's actionable in Loki.
- `opencv-python-headless` is a new third-party dependency in `requirements.txt`.

Recommendations
- Run a dependency audit: the point where `opencv-python-headless` is introduced is a good checkpoint to scan the full dependency tree.

🧪 Sara Holt — Senior QA Engineer
Observations
The test plan covers the four main layers correctly. A few gaps and one edge case worth addressing:
**Edge case not covered:** `totalPages` is 0 when the first `Preprocessing` event arrives. `OcrStreamEvent.Start` sets the `totalPages` that `OcrAsyncRunner` reads via `.get()`, and the stream order is Start → Preprocessing → Page. This is correct — Start always precedes Preprocessing per the protocol. But the test `runSingleDocument_updatesProgressOnPreprocessingEvent` should simulate both events in sequence (Start, then Preprocessing) to exercise the realistic path. Testing Preprocessing alone would give `"PREPROCESSING_PAGE:1:0"`, which tests a degenerate case.

**`translateOcrProgress.spec.ts` — existing mock pattern to follow:** the spec file mocks `$lib/paraglide/messages.js` inline. The new `ocr_status_preprocessing_page` mock should match this pattern exactly (the mock returns a function that accepts `{ current, total }` and returns a formatted string).

**Missing test: the non-streaming `/ocr` endpoint preprocesses silently.**
test_stream.py(or a newtest_preprocessing_integration.py) that calls/ocrdirectly and verifies the engine receives a preprocessed image (mockpreprocess_pageand verify it was called).OcrAsyncRunnerTest— useArgumentCaptorcorrectly:The existing test file uses
verify(ocrJobRepository, atLeastOnce()).save(any()). The Preprocessing progress test should useArgumentCaptor<OcrJob>to capture allsave()calls and assert that at least one call hadprogressMessagematching"PREPROCESSING_PAGE:1:5"— not just"PREPROCESSING_PAGE:1:0".Recommendations
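The mock-and-count pattern behind the missing `/ocr` test boils down to a self-contained sketch like this (the real test would use the FastAPI test client and patch the actual module path, both omitted here):

```python
from unittest.mock import MagicMock


# Stand-in for the /ocr page loop: the real endpoint iterates PDF pages
def run_ocr(pages, preprocess):
    return [preprocess(page) for page in pages]


# In the real test this would be unittest.mock.patch on the preprocess_page
# import inside main.py (module path assumed)
mock_preprocess = MagicMock(side_effect=lambda page: page)
run_ocr(["p1", "p2", "p3"], mock_preprocess)

# The assertion Sara asks for: invoked exactly once per page
assert mock_preprocess.call_count == 3
```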
Recommendations

- Add `test_preprocessing_called_for_ocr_endpoint` in `test_stream.py` (mock `preprocess_page`, call `/ocr`, assert it was invoked once per page).
- In `runSingleDocument_updatesProgressOnPreprocessingEvent`, simulate the full event sequence `Start(5)` → `Preprocessing(1)` → `Page(1, [])` to test realistic `totalPages` propagation.
- `test_preprocessing.py` adds ~4 fast unit tests (<1s total). No new Testcontainers, no new Docker layers in CI.

🎨 Leonie Voss — UX Design Lead
Observations
This is a backend/infrastructure change with a narrow frontend surface — only the progress spinner text. My concerns are limited but worth raising.
- **Dual-message pattern per page may feel choppy.** For a 10-page document, users will see the spinner cycle through 20 messages: "Seite 1 von 10 wird vorverarbeitet…" → "Seite 1 von 10 wird analysiert…" → "Seite 2 von 10 wird vorverarbeitet…" → … If preprocessing is fast (<200ms/page), these messages will flash so quickly that users may not be able to read them, which adds noise without meaningful information.
- **`aria-live` frequency.** The page counter in `+page.svelte` has `aria-live="polite"`. With preprocessing events it will fire twice per page instead of once. Screen readers will announce each change — for a 10-page document, that's 20 announcements, which may be overwhelming for users relying on assistive technology.
- **"vorverarbeitet" is a technical term.** It's accurate German but more jargon-heavy than needed. Users scanning the spinner don't need to know how the system is improving the image — they need to know progress is happening. "Seite X von Y wird aufbereitet…" is slightly more user-friendly, or even a combined phrase could work.
Recommendations
- `ocr_status_preprocessing_page` → `"Seite {current} von {total} wird aufbereitet…"` (de), `"Preparing page {current} of {total}…"` (en).
- Reduce the `aria-live` polling frequency or debounce updates to avoid announcing every intermediate state.

Open Decisions
⚙️ Tobias Wendt — DevOps & Platform Engineer
Observations
- **Dockerfile: `opencv-python-headless` needs a system library.** The current base image (`python:3.11.9-slim`) does not ship `libglib2.0-0`, which `opencv-python-headless` requires on Debian slim, so the `apt-get` block in the Dockerfile needs to be extended. Without this, `import cv2` will raise `ImportError: libglib-2.0.so.0: cannot open shared object file` at container startup and the models will never load.
- **Image rebuild time.** Adding `opencv-python-headless` (~50MB wheel) to `requirements.txt` will extend the `pip install` layer in CI and on first deploy. Since the requirements layer is cached, this is a one-time cost. No concern.
- **Memory impact is negligible.** CLAHE on a 200 DPI page image (roughly 1600×2000 px) allocates ~10MB of numpy arrays temporarily (1600 × 2000 px × 3 channels × 1 byte ≈ 9.6 MB). The container already has a 12GB hard limit. No change needed to `mem_limit`.
- **Health check timing is unaffected.** `cv2` imports at module load time (fast, <1s), not during startup model loading. The existing `start_period: 120s` is driven by PyTorch/Surya model loading, which is unchanged.
- **Rolling deployment is safe.** Python emitting `preprocessing` events before Java is updated hits the existing `default -> log.debug(...)` branch in `parseNdjsonStream()`. Java updated before Python: Python doesn't emit the new events yet. Both orderings are safe.
- **No new environment variables in `docker-compose.yml`.** Per Markus's note on externalizing the CLAHE parameters — if those env vars are added, they should follow the existing pattern in the `environment:` block, with inline default comments explaining their purpose.
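A sketch of the extended `apt-get` block (the surrounding Dockerfile and any packages it already installs are not shown in this thread, so only the new package is concrete):

```dockerfile
# Add libglib2.0-0, which opencv-python-headless needs at
# import time on Debian slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
```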
Recommendations

- Add `libglib2.0-0` to the Dockerfile `apt-get` block before the PR lands — this will otherwise cause a silent container startup failure that's annoying to debug in production.
- Pin the `opencv-python-headless` version in `requirements.txt` (e.g., `opencv-python-headless==4.11.0.86`) for reproducible builds. Renovate will handle version bump PRs.

🗳️ Decision Queue — Action Required
1 decision needs your input before implementation starts.
**UX — surface preprocessing in the spinner?** Option A: show the preprocessing message per page. Option B: handle the `Preprocessing` event silently in `OcrAsyncRunner` (no `progressMessage` update) — the spinner stays on "Analyzing page X of Y" for the full page time, which is cleaner but less accurate. (Raised by: Leonie)

🎨 Leonie Voss — UX Design Lead (follow-up discussion)
Worked through the four open UX items with Marcel. All resolved or delegated.
Resolved
Surface preprocessing in spinner? → Option A: show it. Ship the preprocessing message transparently; if real-world CLAHE timing turns out to be too fast to read (<200ms), remove the spinner update in a follow-up. No premature optimisation.
Word choice → Use "aufbereitet" (not "vorverarbeitet") for all three locales:

- de: `"Seite {current} von {total} wird aufbereitet…"`
- en: `"Preparing page {current} of {total}…"`
- es: `"Preparando página {current} de {total}…"`

`aria-live` frequency → The preprocessing spinner update should live outside the `aria-live="polite"` region. Sighted users see both messages; screen readers only announce the OCR phase ("Analysiere Seite X von Y") — keeping announcements at one per page instead of two.

Delegated to implementation
Overall this is a well-scoped change with a minimal frontend surface. The word choice and `aria-live` boundary are the only things that needed explicit decisions — both are now settled.

✅ Implementation complete — branch `feat/issue-252-ocr-preprocessing`

What was implemented
5 commits across the full stack:
Python (`ocr-service/`)

- `preprocessing.py` — `preprocess_page(image)`: CLAHE on the L channel (LAB), grayscale, Gaussian blur (3×3). Explicit `del` of numpy intermediates to keep peak memory at ~10MB/page. Falls back to the original image on any OpenCV error and logs `type(e).__name__` + message at WARNING. CLAHE parameters externalised as `OCR_CLAHE_CLIP_LIMIT` / `OCR_CLAHE_TILE_SIZE` env vars.
- `test_preprocessing.py` — 3 unit tests: output dimensions, L-channel mean increase on a dark synthetic yellowed image, pixel-identical fallback on `cv2` error.
- `main.py` — `generate()`: emits the `preprocessing` event, then preprocesses each page before OCR. `generate_guided()`: preprocesses once per page before the region loop (not per region). `/ocr` endpoint: preprocesses each image in-place, no event emitted.
- `test_stream.py` — 3 integration tests covering all three paths above.
- `requirements.txt` — `opencv-python-headless==4.11.0.86` (pinned for reproducible builds).
- `Dockerfile` — `libglib2.0-0` added to the `apt-get` block (required by opencv on Debian slim).

Infrastructure
- `docker-compose.yml` — `OCR_CLAHE_CLIP_LIMIT: "2.0"` and `OCR_CLAHE_TILE_SIZE: "8"` with inline comments explaining their purpose.

Java (`backend/`)
- `OcrStreamEvent.java` — `record Preprocessing(int pageNumber)` added to the sealed interface.
- `RestClientOcrClient.java` — the `"preprocessing"` NDJSON type dispatches `OcrStreamEvent.Preprocessing`.
- `OcrAsyncRunner.java` — `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")` using the `totalPages` already set by the preceding `Start` event (so `totalPages` is never 0).
- `RestClientOcrClientStreamTest`, `OcrStreamEventTest`, `OcrAsyncRunnerTest` — 4 new test cases.

Frontend
- `translateOcrProgress.ts` — `PREPROCESSING_PAGE:X:Y` case returns `{ message, currentPage, total }`.
- `translateOcrProgress.spec.ts` — new test: `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }`.
- `messages/de.json` — `"ocr_status_preprocessing_page": "Seite {current} von {total} wird aufbereitet…"` ("aufbereitet" per Leonie's word choice decision).
- `messages/en.json` — `"Preparing page {current} of {total}…"`
- `messages/es.json` — `"Preparando página {current} de {total}…"`

The `ocrProgressMessage` in `+page.svelte` is rendered in a plain `<p>` tag outside any `aria-live` region, so the preprocessing message does not double screen reader announcements — matching Leonie's accessibility requirement.

320px wrap check: "Seite 10 von 10 wird aufbereitet…" (37 chars) in `text-sm` with `text-center` wraps cleanly at 320px — no fixed-width constraint clips it.

Test results
- `translateOcrProgress.spec.ts` tests green

Open items
None — all reviewer concerns addressed. The "surface preprocessing in spinner vs. silent" decision (Option A: show it) is implemented. If real-world CLAHE timing proves too fast to read, removing the spinner update is a one-line change in `OcrAsyncRunner`.