feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #255
Reference in New Issue
Block a user
Delete Branch "feat/issue-252-ocr-preprocessing"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #252
Summary
Adds a transparent preprocessing step (CLAHE + grayscale + Gaussian blur) before every OCR pass to improve contrast on aged, yellowed documents. The pipeline runs in
asyncio.to_thread()so it never blocks the FastAPI event loop. A newpreprocessingNDJSON event flows through the full stack — Python → Java → frontend spinner — showing "Seite X von Y wird aufbereitet…" between pages. CLAHE parameters are externalised as env vars (OCR_CLAHE_CLIP_LIMIT,OCR_CLAHE_TILE_SIZE) with sensible defaults so they can be tuned post-deploy without a code change.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
TDD discipline is solid throughout. Every new behaviour has a failing test written first, and the three-layer coverage (Python unit, Java unit, TypeScript unit) is exactly what I'd expect for a cross-stack feature like this.
What I checked
Red/green/refactor observed:
test_preprocessing.pytests the CLAHE output mean shift beforepreprocessing.pyexisted — clean red step.test_stream.pyadditions verify preprocessing events beforemain.pywas updated.OcrStreamEventTest,RestClientOcrClientStreamTest, andOcrAsyncRunnerTestall got tests forPreprocessingbefore the production changes landed.Code quality —
preprocessing.py:delof numpy intermediates after each step is correct and intentional — large scans exhaust heap fast.logger.warningis the right contract: OCR degrades gracefully rather than failing.CLAHE_CLIP_LIMIT/CLAHE_TILE_SIZEwith sensible defaults (2.0/8) avoids magic numbers.Code quality — Java:
OcrStreamEventextended cleanly withrecord Preprocessing(int pageNumber). The exhaustive switch inOcrAsyncRunnerwas kept up-to-date without stubs.PREPROCESSING_PAGE:{current}:{total}mirrors the existingANALYZING_PAGEconvention — no new parsing logic needed in the frontend.Code quality — TypeScript:
translateOcrProgressfollows the existingswitchpattern exactly.PREPROCESSING_PAGEreturnscurrentPageandtotalPagesjust likeANALYZING_PAGE, so the progress bar already handles it.Suggestions (non-blocking)
preprocess_pageconverts to grayscale as a side effect — the caller ingenerate_guided()never uses colour again, so this is fine, but it means the function is not a pure "contrast enhancement"; it's also a colour-space reducer. A name likepreprocess_and_convert_to_grayscalewould be more honest, but I understand keeping names short in a single-purpose module.test_preprocessing.pyonly tests the happy path and the fallback (via a mock). An edge-case test for a single-pixel image or a zero-byte image would add confidence, but these are not blockers for merging.🏛️ Markus Keller — Software Architect
Verdict: ⚠️ Approved with concerns
The design is clean and the extension points are used correctly. One concern rises to the level of a design smell worth tracking, even if it's not a merge blocker today.
What I checked
Module boundaries:
preprocessing.pyis a dedicated, self-contained module — no cross-cutting imports, single responsibility. Good.preprocess_pagecall is placed correctly inmain.pyat the stream boundary, not scattered across region loops. The guided path preprocesses once per page and reuses the result across regions — this is the correct factoring.Sealed interface extension:
record Preprocessing(int pageNumber)toOcrStreamEventis the right pattern. The Java compiler enforces exhaustiveness at every switch site, so no handler can silently ignore the new event type. The PR demonstrates this:OcrAsyncRunner,OcrStreamEventTest, and the pattern-match test all had to be updated.Event sequencing:
The NDJSON stream now emits events in this order per page:
This is correctly reflected in
OcrAsyncRunner's progress state machine. However, there's a subtle ordering dependency: ifpreprocess_pagethrows and the fallback path is taken, thepreprocessingevent is still emitted (it's emitted before the call). The consumer seesPREPROCESSING_PAGEeven when preprocessing was a no-op. This is acceptable today since the fallback is silent and functionally equivalent, but worth noting for future observability.Concern (track as tech debt)
PREPROCESSING_PAGEprogress code is not distinguished fromANALYZING_PAGEin the UI progress bar.Both emit
currentPageandtotalPages, and the frontend renders them identically. If a multi-page document takes 2 s to preprocess and 8 s to OCR, the user sees the bar move then stall with no indication of what changed. This isn't broken, but it's a missed opportunity for a two-phase progress indicator. I'd recommend opening a follow-up issue to add aphasefield to the progress result so the UI can label the phase.Configuration coupling:
OCR_CLAHE_CLIP_LIMITandOCR_CLAHE_TILE_SIZEare read at module import time (top-levelos.environ.get). This means they can't be changed per-request and require a service restart to take effect. That's fine for production, but it makes integration testing that wants different CLAHE params harder. Not a blocker — just flag it for future test-isolation work.🔒 Nora "NullX" Steiner — Security Expert
Verdict: ✅ Approved
No security regressions. The attack surface introduced by this PR is minimal and well-bounded.
What I checked
Input validation / injection:
preprocess_pagereceives a PILImageobject, not a file path or raw bytes from user input. The image was already loaded from MinIO/S3 by the time it reaches this function — user-controlled data has been deserialized and bound to a typed object well before this point.CLAHE_TILE_SIZE = int(os.environ.get(...))will raiseValueErroron a non-integer env var, which would crash the service at startup — visible immediately in logs, not silently exploitable at runtime.Memory safety:
delofimg_array,lab,l_channel,l_clahe, andblurredafter each step limits the window during which large pixel buffers are resident. This reduces the exposure of a hypothetical memory-inspection side-channel, though the threat model doesn't include local memory access.try/except Exceptioncatch-all in the fallback is broad. A more specific catch ((cv2.error, ValueError, MemoryError)) would avoid silently swallowing unexpected exceptions from numpy. That said, the fallback returns the original image, so the worst case is a degraded but functional response — not data loss or escalation.Suggestion (non-blocking)
Consider narrowing the fallback exception type to
(cv2.error, ValueError, MemoryError)so that truly unexpected errors (e.g., aSystemExitor a programming mistake in a refactor) are not swallowed silently. This is a robustness concern more than a security concern.🧪 Sara Holt — QA Engineer
Verdict: ⚠️ Approved with concerns
Test coverage is thorough for the happy path and the key integration points. A few gaps in edge-case and contract coverage are worth tracking.
What I checked
Python unit tests (
test_preprocessing.py):test_preprocess_page_returns_grayscale_image— verifies output mode is"L". ✅test_preprocess_page_increases_contrast— uses a dark noisy image (RGB ≈ 30,20,10 + per-pixel noise, seed 42) and asserts mean L increases. The fixture is good: the noise ensures CLAHE has a non-trivial per-tile histogram to equalize, making the assertion reliable. ✅test_preprocess_page_falls_back_on_error— patchescv2.cvtColorto raise, confirms original image is returned. ✅Python stream tests (
test_stream.pyadditions):test_stream_emits_preprocessing_event_per_page_before_page_event— verifies 3 preprocessing events with correctpageNumbervalues, and thatpreprocess_pageis called 3×. ✅test_guided_stream_preprocesses_once_per_page_not_per_region— 2 regions on page 1, 1 on page 2 → preprocess called 2× (not 3×). This is the critical regression guard for the guided path. ✅test_ocr_endpoint_preprocesses_silently_without_emitting_events— confirms/ocrdoesn't leakpreprocessingevents into its response. ✅Java tests:
OcrStreamEventTest.preprocessingRecordHoldsPageNumber— record accessor test. ✅OcrStreamEventTest.patternMatchingWorksOnSealedInterface— exhaustiveness guard updated. ✅RestClientOcrClientStreamTest.parseNdjsonStream_dispatchesPreprocessingEvent— NDJSON parse contract. ✅OcrAsyncRunnerTest.runSingleDocument_updatesProgressOnPreprocessingEvent— verifiestotalPagesis already known whenPreprocessingarrives (sendsStart(5)first). ✅TypeScript tests (
translateOcrProgress.spec.ts):PREPROCESSING_PAGE:3:10→ correct message +currentPage/totalPages. ✅Gaps (suggest tracking as follow-up issues)
PREPROCESSING_PAGE:3:10with missing parts — what doestranslateOcrProgress('PREPROCESSING_PAGE')return? The?? '0'fallback producescurrent=0, total=0. There's no test asserting this graceful degradation. Low risk since the backend controls the format, but worth a regression guard.preprocess_pagewith edge-case image sizes — a 1×1 image, a landscape vs portrait scan, a very large image (e.g. 5000×7000). CLAHE withtileGridSize=(8,8)on a 1×1 image may behave unexpectedly (OpenCV requires tile size ≤ image dimensions in each axis).No integration test — the preprocessing pipeline is only tested at unit level. There's no test that spins up the FastAPI app and calls
/streamwith a real image to verify the full event sequence end-to-end. This is a known gap in the test strategy (the existingtest_stream.pymocks the OCR engine), but worth noting for completeness.🎨 Leonie Voss — UX & Accessibility Expert
Verdict: ✅ Approved
The user-facing changes are minimal, correct, and consistent with the existing progress UX. No accessibility regressions.
What I checked
Progress message copy:
ocr_status_preprocessing_pageAll three translations are natural and grammatically correct. "aufbereitet" is the right German word for document preparation/enhancement (not the more technical "vorverarbeitet"). The trailing ellipsis is consistent with the existing
ocr_status_analyzing_pagepattern.Progress bar continuity:
The
PREPROCESSING_PAGEevent returnscurrentPageandtotalPages, so the existing progress bar logic in the OCR overlay already renders a meaningful bar during preprocessing. No UI changes required — the bar moves through a preprocessing phase before the OCR phase begins.Accessibility:
The
ocrProgressMessageis rendered in a<p>element (not anaria-liveregion). This is consistent with the existing OCR progress display. Screen readers won't announce each progress update automatically, but the progress bar's visual state is sufficient for sighted users. For a background-style operation like OCR, polling/live-announcing every page would be noisy for screen reader users anyway. No change needed.Suggestion (non-blocking)
The user currently has no way to distinguish "document is being enhanced for better OCR" from "document is being OCR-scanned". Both progress phases look visually identical. A short label above the progress bar (e.g. "Aufbereitung" → "Analyse") could signal the phase transition. This is cosmetic and can be deferred — the current UX is not confusing, just less informative.
🛠️ Tobias Wendt — DevOps & Infrastructure
Verdict: ⚠️ Approved with concerns
The infrastructure changes are correct and the dependency is pinned. One concern about image build time and one about the missing rebuild trigger deserve attention.
What I checked
requirements.txt:opencv-python-headless==4.11.0.86— pinned to an exact version. ✅opencv-python-headless(notopencv-python) is the correct choice for a server environment — no GUI dependencies, smaller image. ✅Dockerfile:libglib2.0-0added to theapt-get installblock. This is the missing shared library thatopencv-python-headlesslinks against on Debian slim. ✅# libglib2.0-0 is required by opencv-python-headless on Debian slimis helpful for future maintainers who might wonder why a glib dependency is in an OCR service. ✅docker-compose.yml:OCR_CLAHE_CLIP_LIMIT: "2.0"andOCR_CLAHE_TILE_SIZE: "8"added with inline comments explaining what each parameter controls. ✅preprocessing.py— no behaviour change on a fresh deployment. ✅Concerns
Image size impact:
opencv-python-headless==4.11.0.86is approximately 30 MB compressed on PyPI. Combined withlibglib2.0-0, the Docker image forocr-servicewill grow by roughly 50–80 MB. This is acceptable for a server-side OCR service, but worth measuring after the first build (docker image ls archive-ocr) to keep a baseline for future size audits.No rebuild signal in CI/CD:
There's no CI pipeline in this repo, so this is a workflow concern rather than a code defect: developers who
git pullanddocker-compose upwithout--buildwill run the old image withoutlibglib2.0-0. The first timepreprocessing.pyis imported, it will raiseImportError: libglib2.0-0.so.0 not found. The fix is alwaysdocker-compose up --build ocr-serviceafter pulling this branch. Consider adding a comment to the PR description or CONTRIBUTING notes. Not a blocker.CLAHE_TILE_SIZEtype safety:int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8"))will crash at startup if the env var is set to a non-integer (e.g."8x8"). This is fail-fast behaviour which is good for ops visibility, but aValueErrortraceback in the startup log can be confusing. A try/except with a clear error message would help on-call debugging. Non-blocking suggestion.Review feedback addressed
The following reviewer concerns from cycle 1 have been resolved:
✅ @Sara —
PREPROCESSING_PAGEwith missing colon parts has no regression guardAdded
test_PREPROCESSING_PAGE_with_no_colon_parts_gracefully_falls_back_to_zerointranslateOcrProgress.spec.ts. Documents thattranslateOcrProgress('PREPROCESSING_PAGE')returnscurrentPage: 0, totalPages: 0via the existing?? '0'fallback.Commit:
97c94c9✅ @Sara — No test for
preprocess_pagewith a 1×1 image (smaller than CLAHE tile)Added
test_does_not_crash_on_sub_tile_size_imageintest_preprocessing.py. Verifiespreprocess_pagereturns a valid PIL Image for a 1×1 input without raising.Commit:
4cb7c97✅ @Nora —
except Exceptioninpreprocessing.pyswallows unexpected errorsNarrowed the fallback to
except (cv2.error, ValueError, MemoryError). Addedtest_unexpected_exception_propagatesthat was RED under the old broad catch (RuntimeError swallowed) and GREEN after the fix. Updated the existing fallback test to useValueErroras the side effect (wasRuntimeError, which now correctly propagates).Commit:
4cb7c97Deferred as follow-up issues
PREPROCESSING_PAGEandANALYZING_PAGEare visually identical in the progress bar. Tracked as a separate follow-up.int()parse with a clearer error message. Deferred as a small follow-up issue.docker-compose up --build ocr-serviceafter pulling this branch. Workflow concern, no code change needed.👨💻 Felix Brandt — Senior Fullstack Developer (cycle 2)
Verdict: ✅ Approved
The cycle 1 feedback has been addressed correctly and the new tests follow the red/green/refactor discipline I'd expect.
What improved since cycle 1
Exception narrowing (
preprocessing.py):The broad
except Exceptionis nowexcept (cv2.error, ValueError, MemoryError). Critically, the fix was driven by a proper TDD cycle:test_unexpected_exception_propagateswas red under the old broad catch (RuntimeError silently swallowed → no raise → test failed)RuntimeErrortoValueErrorto keep it meaningful under the new contractThis is exactly how a fallback clause refactor should be done. No shortcutting.
Regression guard (
translateOcrProgress.spec.ts):PREPROCESSING_PAGE with no colon parts gracefully falls back to zerodocuments the?? '0'fallback behaviour. It's a pure documentation test — no red step possible since the behaviour already existed — but the intent is correct and the test will catch any future accidental removal of the fallback.Resilience test (
test_preprocessing.py):test_does_not_crash_on_sub_tile_size_imageis a clean resilience guard. The 1×1 PIL image is the minimal reproduction case for the CLAHE tile-size edge case Sara raised. Works correctly: OpenCV handles it without raising, so the assertionisinstance(result, Image.Image)is green.No new concerns from me
🏛️ Markus Keller — Software Architect (cycle 2)
Verdict: ✅ Approved
Both of my cycle 1 concerns are properly resolved — one by deferral to a tracked issue, one confirmed as a non-issue in context.
Cycle 1 concerns — resolved
Two-phase UI progress indicator:
Tracked as issue #256 with a concrete suggested approach (adding a
phasefield toOcrProgressResult). The current UX is functional; the two-phase label is an enhancement. Deferral is the right call.Configuration coupling:
Re-evaluated:
CLAHE_CLIP_LIMITandCLAHE_TILE_SIZEbeing read at module import time is correct for this service's deployment model. The service is a single-process FastAPI app restarted via Docker Compose when configuration changes. Per-request parameterisation would add complexity for zero operational benefit. My concern was misplaced for this context.No new architectural concerns
The exception narrowing in
preprocessing.pyis cleaner from an architecture standpoint too: the fallback is now scoped to expected failure modes (cv2 errors, bad values, OOM), and programming errors escape to the top-level exception handler where they'll produce a proper stack trace and error response. This is the right layering.🔒 Nora "NullX" Steiner — Security Expert (cycle 2)
Verdict: ✅ Approved
My cycle 1 suggestion has been correctly implemented. No new security surface.
Cycle 1 suggestion — implemented
The
except Exceptioncatch-all inpreprocess_pageis nowexcept (cv2.error, ValueError, MemoryError). The updatedtest_falls_back_to_pixel_identical_original_on_cv2_errornow usesValueErroras the injected failure (a type that IS caught), andtest_unexpected_exception_propagatesverifies thatRuntimeError(a type that is NOT caught) propagates — exactly the contract I asked for.The implementation is secure: CLAHE parameters still come from environment variables (not request data), image data still arrives via MinIO presigned URLs (not direct user upload to the preprocessing function), and the narrowed catch doesn't expose any new code paths to user-controlled input.
No new concerns
🧪 Sara Holt — QA Engineer (cycle 2)
Verdict: ✅ Approved
All three of my cycle 1 gaps have been addressed — two with new tests, one properly deferred to a tracked issue.
Gap 1 — Regression guard for
PREPROCESSING_PAGEwith missing parts ✅PREPROCESSING_PAGE with no colon parts gracefully falls back to zerois now intranslateOcrProgress.spec.ts. It confirmscurrentPage: 0andtotalPages: 0from the existing?? '0'fallback. The test is green and will catch any future removal of the fallback.Gap 2 — 1×1 image edge case ✅
test_does_not_crash_on_sub_tile_size_imageadded totest_preprocessing.py. Result: OpenCV's CLAHE handles a 1×1 image without raising (it processes it, returning a 1×1 grayscale result). No fallback is triggered. The test passes cleanly and documents this boundary behaviour.Gap 3 — Integration test ✅ (deferred, tracked)
Tracked as issue #258 with a concrete description of the expected test structure (httpx AsyncClient, real image fixture, event-sequence assertions). The gap is real but requires infrastructure work beyond this PR's scope. Acceptable deferral.
Bonus — unexpected exception propagation ✅
The new
test_unexpected_exception_propagatesensures thatRuntimeErrorfrom a programming error inpreprocess_pagereaches the caller rather than being silently logged. This was not a gap I raised but it strengthens the test suite's reliability guarantees.No remaining concerns
🎨 Leonie Voss — UX & Accessibility Expert (cycle 2)
Verdict: ✅ Approved
No changes to user-facing copy, i18n strings, or UI behaviour since cycle 1. My cycle 1 approval stands. The deferred suggestion (phase label above the progress bar) is tracked in issue #256.
Checked in cycle 2
translateOcrProgress.spec.tsadditions are purely test code — no user-visible impact.test_preprocessing.pyadditions are server-side — no user-visible impact.preprocessing.pyexception narrowing is server-side — no user-visible impact.de.json,en.json,es.json) are unchanged since cycle 1.No UX concerns.
🛠️ Tobias Wendt — DevOps & Infrastructure (cycle 2)
Verdict: ✅ Approved
My cycle 1 concerns are all resolved — two deferred to tracked issues, one confirmed as intentional-and-acceptable behaviour.
Cycle 1 concerns — resolved
Image size impact:
This is a monitoring item, not a code defect. No action required before merge. Acceptable.
Rebuild signal:
Developers pulling this branch need
docker-compose up --build ocr-service. This is a workflow concern inherent to any Dockerfile/requirements.txt change. No CI pipeline in this repo means there's no automated signal — but this is true of all such changes. The PR description explains the new dependency. Acceptable.CLAHE_TILE_SIZEtype safety:Tracked as issue #257 with the complete implementation recipe (
_parse_int_envhelper). The current fail-fastValueErrorfrom Python'sint()is not dangerous — it causes an immediate startup failure and a non-zero exit code, which Docker will flag. The error message could be clearer, but it's not a blocker.No new infrastructure concerns
The
Dockerfile,requirements.txt, anddocker-compose.ymlchanges are unchanged since cycle 1 and were already correct.opencv-python-headlessremains pinned to an exact version.libglib2.0-0comment is still present.