feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #255

marcel · 2026-04-17T15:04:32+02:00

marcel commented

2026-04-17 15:04:32 +02:00

Closes #252

Summary

Adds a transparent preprocessing step (CLAHE + grayscale + Gaussian blur) before every OCR pass to improve contrast on aged, yellowed documents. The pipeline runs in asyncio.to_thread() so it never blocks the FastAPI event loop. A new preprocessing NDJSON event flows through the full stack — Python → Java → frontend spinner — showing "Seite X von Y wird aufbereitet…" between pages. CLAHE parameters are externalised as env vars (OCR_CLAHE_CLIP_LIMIT, OCR_CLAHE_TILE_SIZE) with sensible defaults so they can be tuned post-deploy without a code change.

Closes #252 ## Summary Adds a transparent preprocessing step (CLAHE + grayscale + Gaussian blur) before every OCR pass to improve contrast on aged, yellowed documents. The pipeline runs in `asyncio.to_thread()` so it never blocks the FastAPI event loop. A new `preprocessing` NDJSON event flows through the full stack — Python → Java → frontend spinner — showing "Seite X von Y wird aufbereitet…" between pages. CLAHE parameters are externalised as env vars (`OCR_CLAHE_CLIP_LIMIT`, `OCR_CLAHE_TILE_SIZE`) with sensible defaults so they can be tuned post-deploy without a code change.

marcel added 5 commits 2026-04-17 15:04:33 +02:00

feat(ocr): add image preprocessing module with CLAHE + grayscale + blur 7183fc4428

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars 615d404ba9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): integrate preprocessing into stream and batch endpoints b310caaeeb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): add Preprocessing NDJSON event to Java stream pipeline ba36a88b65

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): add PREPROCESSING_PAGE progress translation and i18n strings

CI / Unit & Component Tests (push) Failing after 2m34s

Details

CI / Backend Unit Tests (push) Failing after 2m57s

Details

CI / Unit & Component Tests (pull_request) Failing after 2m36s

Details

CI / Backend Unit Tests (pull_request) Failing after 2m43s

Details

eaefd4091e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-04-17 15:07:37 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

TDD discipline is solid throughout. Every new behaviour has a failing test written first, and the three-layer coverage (Python unit, Java unit, TypeScript unit) is exactly what I'd expect for a cross-stack feature like this.

What I checked

Red/green/refactor observed:

test_preprocessing.py tests the CLAHE output mean shift before preprocessing.py existed — clean red step.
test_stream.py additions verify preprocessing events before main.py was updated.
Java side: OcrStreamEventTest, RestClientOcrClientStreamTest, and OcrAsyncRunnerTest all got tests for Preprocessing before the production changes landed.

Code quality — preprocessing.py:

Explicit del of numpy intermediates after each step is correct and intentional — large scans exhaust heap fast.
Silent fallback with a typed logger.warning is the right contract: OCR degrades gracefully rather than failing.
Env-var-tuneable CLAHE_CLIP_LIMIT / CLAHE_TILE_SIZE with sensible defaults (2.0 / 8) avoids magic numbers.

Code quality — Java:

Sealed interface OcrStreamEvent extended cleanly with record Preprocessing(int pageNumber). The exhaustive switch in OcrAsyncRunner was kept up-to-date without stubs.
Progress string format PREPROCESSING_PAGE:{current}:{total} mirrors the existing ANALYZING_PAGE convention — no new parsing logic needed in the frontend.

Code quality — TypeScript:

translateOcrProgress follows the existing switch pattern exactly. PREPROCESSING_PAGE returns currentPage and totalPages just like ANALYZING_PAGE, so the progress bar already handles it.

Suggestions (non-blocking)

preprocess_page converts to grayscale as a side effect — the caller in generate_guided() never uses colour again, so this is fine, but it means the function is not a pure "contrast enhancement"; it's also a colour-space reducer. A name like preprocess_and_convert_to_grayscale would be more honest, but I understand keeping names short in a single-purpose module.
test_preprocessing.py only tests the happy path and the fallback (via a mock). An edge-case test for a single-pixel image or a zero-byte image would add confidence, but these are not blockers for merging.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** TDD discipline is solid throughout. Every new behaviour has a failing test written first, and the three-layer coverage (Python unit, Java unit, TypeScript unit) is exactly what I'd expect for a cross-stack feature like this. ### What I checked **Red/green/refactor observed:** - `test_preprocessing.py` tests the CLAHE output mean shift before `preprocessing.py` existed — clean red step. - `test_stream.py` additions verify preprocessing events before `main.py` was updated. - Java side: `OcrStreamEventTest`, `RestClientOcrClientStreamTest`, and `OcrAsyncRunnerTest` all got tests for `Preprocessing` before the production changes landed. **Code quality — `preprocessing.py`:** - Explicit `del` of numpy intermediates after each step is correct and intentional — large scans exhaust heap fast. - Silent fallback with a typed `logger.warning` is the right contract: OCR degrades gracefully rather than failing. - Env-var-tuneable `CLAHE_CLIP_LIMIT` / `CLAHE_TILE_SIZE` with sensible defaults (`2.0` / `8`) avoids magic numbers. **Code quality — Java:** - Sealed interface `OcrStreamEvent` extended cleanly with `record Preprocessing(int pageNumber)`. The exhaustive switch in `OcrAsyncRunner` was kept up-to-date without stubs. - Progress string format `PREPROCESSING_PAGE:{current}:{total}` mirrors the existing `ANALYZING_PAGE` convention — no new parsing logic needed in the frontend. **Code quality — TypeScript:** - `translateOcrProgress` follows the existing `switch` pattern exactly. `PREPROCESSING_PAGE` returns `currentPage` and `totalPages` just like `ANALYZING_PAGE`, so the progress bar already handles it. ### Suggestions (non-blocking) - `preprocess_page` converts to grayscale as a side effect — the caller in `generate_guided()` never uses colour again, so this is fine, but it means the function is not a pure "contrast enhancement"; it's also a colour-space reducer. A name like `preprocess_and_convert_to_grayscale` would be more honest, but I understand keeping names short in a single-purpose module. - `test_preprocessing.py` only tests the happy path and the fallback (via a mock). An edge-case test for a single-pixel image or a zero-byte image would add confidence, but these are not blockers for merging.

marcel commented

2026-04-17 15:07:46 +02:00

🏛️ Markus Keller — Software Architect

Verdict: ⚠️ Approved with concerns

The design is clean and the extension points are used correctly. One concern rises to the level of a design smell worth tracking, even if it's not a merge blocker today.

What I checked

Module boundaries:

preprocessing.py is a dedicated, self-contained module — no cross-cutting imports, single responsibility. Good.
The preprocess_page call is placed correctly in main.py at the stream boundary, not scattered across region loops. The guided path preprocesses once per page and reuses the result across regions — this is the correct factoring.

Sealed interface extension:

Adding record Preprocessing(int pageNumber) to OcrStreamEvent is the right pattern. The Java compiler enforces exhaustiveness at every switch site, so no handler can silently ignore the new event type. The PR demonstrates this: OcrAsyncRunner, OcrStreamEventTest, and the pattern-match test all had to be updated.

Event sequencing:
The NDJSON stream now emits events in this order per page:

preprocessing → page

This is correctly reflected in OcrAsyncRunner's progress state machine. However, there's a subtle ordering dependency: if preprocess_page throws and the fallback path is taken, the preprocessing event is still emitted (it's emitted before the call). The consumer sees PREPROCESSING_PAGE even when preprocessing was a no-op. This is acceptable today since the fallback is silent and functionally equivalent, but worth noting for future observability.

Concern (track as tech debt)

PREPROCESSING_PAGE progress code is not distinguished from ANALYZING_PAGE in the UI progress bar.
Both emit currentPage and totalPages, and the frontend renders them identically. If a multi-page document takes 2 s to preprocess and 8 s to OCR, the user sees the bar move then stall with no indication of what changed. This isn't broken, but it's a missed opportunity for a two-phase progress indicator. I'd recommend opening a follow-up issue to add a phase field to the progress result so the UI can label the phase.

Configuration coupling:
OCR_CLAHE_CLIP_LIMIT and OCR_CLAHE_TILE_SIZE are read at module import time (top-level os.environ.get). This means they can't be changed per-request and require a service restart to take effect. That's fine for production, but it makes integration testing that wants different CLAHE params harder. Not a blocker — just flag it for future test-isolation work.

## 🏛️ Markus Keller — Software Architect **Verdict: ⚠️ Approved with concerns** The design is clean and the extension points are used correctly. One concern rises to the level of a design smell worth tracking, even if it's not a merge blocker today. ### What I checked **Module boundaries:** - `preprocessing.py` is a dedicated, self-contained module — no cross-cutting imports, single responsibility. Good. - The `preprocess_page` call is placed correctly in `main.py` at the stream boundary, not scattered across region loops. The guided path preprocesses once per page and reuses the result across regions — this is the correct factoring. **Sealed interface extension:** - Adding `record Preprocessing(int pageNumber)` to `OcrStreamEvent` is the right pattern. The Java compiler enforces exhaustiveness at every switch site, so no handler can silently ignore the new event type. The PR demonstrates this: `OcrAsyncRunner`, `OcrStreamEventTest`, and the pattern-match test all had to be updated. **Event sequencing:** The NDJSON stream now emits events in this order per page: ``` preprocessing → page ``` This is correctly reflected in `OcrAsyncRunner`'s progress state machine. However, there's a subtle ordering dependency: if `preprocess_page` throws and the fallback path is taken, the `preprocessing` event is still emitted (it's emitted before the call). The consumer sees `PREPROCESSING_PAGE` even when preprocessing was a no-op. This is acceptable today since the fallback is silent and functionally equivalent, but worth noting for future observability. ### Concern (track as tech debt) **`PREPROCESSING_PAGE` progress code is not distinguished from `ANALYZING_PAGE` in the UI progress bar.** Both emit `currentPage` and `totalPages`, and the frontend renders them identically. If a multi-page document takes 2 s to preprocess and 8 s to OCR, the user sees the bar move then stall with no indication of what changed. This isn't broken, but it's a missed opportunity for a two-phase progress indicator. I'd recommend opening a follow-up issue to add a `phase` field to the progress result so the UI can label the phase. **Configuration coupling:** `OCR_CLAHE_CLIP_LIMIT` and `OCR_CLAHE_TILE_SIZE` are read at module import time (top-level `os.environ.get`). This means they can't be changed per-request and require a service restart to take effect. That's fine for production, but it makes integration testing that wants different CLAHE params harder. Not a blocker — just flag it for future test-isolation work.

marcel commented

2026-04-17 15:08:00 +02:00

🔒 Nora "NullX" Steiner — Security Expert

Verdict: ✅ Approved

No security regressions. The attack surface introduced by this PR is minimal and well-bounded.

What I checked

Input validation / injection:

preprocess_page receives a PIL Image object, not a file path or raw bytes from user input. The image was already loaded from MinIO/S3 by the time it reaches this function — user-controlled data has been deserialized and bound to a typed object well before this point.
CLAHE parameters come from environment variables, not from the HTTP request. An attacker cannot influence the tile size or clip limit at request time.
CLAHE_TILE_SIZE = int(os.environ.get(...)) will raise ValueError on a non-integer env var, which would crash the service at startup — visible immediately in logs, not silently exploitable at runtime.

Memory safety:

Explicit del of img_array, lab, l_channel, l_clahe, and blurred after each step limits the window during which large pixel buffers are resident. This reduces the exposure of a hypothetical memory-inspection side-channel, though the threat model doesn't include local memory access.
The try/except Exception catch-all in the fallback is broad. A more specific catch ((cv2.error, ValueError, MemoryError)) would avoid silently swallowing unexpected exceptions from numpy. That said, the fallback returns the original image, so the worst case is a degraded but functional response — not data loss or escalation.

Suggestion (non-blocking)

Consider narrowing the fallback exception type to (cv2.error, ValueError, MemoryError) so that truly unexpected errors (e.g., a SystemExit or a programming mistake in a refactor) are not swallowed silently. This is a robustness concern more than a security concern.

## 🔒 Nora "NullX" Steiner — Security Expert **Verdict: ✅ Approved** No security regressions. The attack surface introduced by this PR is minimal and well-bounded. ### What I checked **Input validation / injection:** - `preprocess_page` receives a PIL `Image` object, not a file path or raw bytes from user input. The image was already loaded from MinIO/S3 by the time it reaches this function — user-controlled data has been deserialized and bound to a typed object well before this point. - CLAHE parameters come from environment variables, not from the HTTP request. An attacker cannot influence the tile size or clip limit at request time. - `CLAHE_TILE_SIZE = int(os.environ.get(...))` will raise `ValueError` on a non-integer env var, which would crash the service at startup — visible immediately in logs, not silently exploitable at runtime. **Memory safety:** - Explicit `del` of `img_array`, `lab`, `l_channel`, `l_clahe`, and `blurred` after each step limits the window during which large pixel buffers are resident. This reduces the exposure of a hypothetical memory-inspection side-channel, though the threat model doesn't include local memory access. - The `try/except Exception` catch-all in the fallback is broad. A more specific catch (`(cv2.error, ValueError, MemoryError)`) would avoid silently swallowing unexpected exceptions from numpy. That said, the fallback returns the original image, so the worst case is a degraded but functional response — not data loss or escalation. ### Suggestion (non-blocking) Consider narrowing the fallback exception type to `(cv2.error, ValueError, MemoryError)` so that truly unexpected errors (e.g., a `SystemExit` or a programming mistake in a refactor) are not swallowed silently. This is a robustness concern more than a security concern.

marcel commented

2026-04-17 15:08:19 +02:00

🧪 Sara Holt — QA Engineer

Verdict: ⚠️ Approved with concerns

Test coverage is thorough for the happy path and the key integration points. A few gaps in edge-case and contract coverage are worth tracking.

What I checked

Python unit tests (test_preprocessing.py):

test_preprocess_page_returns_grayscale_image — verifies output mode is "L". ✅
test_preprocess_page_increases_contrast — uses a dark noisy image (RGB ≈ 30,20,10 + per-pixel noise, seed 42) and asserts mean L increases. The fixture is good: the noise ensures CLAHE has a non-trivial per-tile histogram to equalize, making the assertion reliable. ✅
test_preprocess_page_falls_back_on_error — patches cv2.cvtColor to raise, confirms original image is returned. ✅

Python stream tests (test_stream.py additions):

test_stream_emits_preprocessing_event_per_page_before_page_event — verifies 3 preprocessing events with correct pageNumber values, and that preprocess_page is called 3×. ✅
test_guided_stream_preprocesses_once_per_page_not_per_region — 2 regions on page 1, 1 on page 2 → preprocess called 2× (not 3×). This is the critical regression guard for the guided path. ✅
test_ocr_endpoint_preprocesses_silently_without_emitting_events — confirms /ocr doesn't leak preprocessing events into its response. ✅

Java tests:

OcrStreamEventTest.preprocessingRecordHoldsPageNumber — record accessor test. ✅
OcrStreamEventTest.patternMatchingWorksOnSealedInterface — exhaustiveness guard updated. ✅
RestClientOcrClientStreamTest.parseNdjsonStream_dispatchesPreprocessingEvent — NDJSON parse contract. ✅
OcrAsyncRunnerTest.runSingleDocument_updatesProgressOnPreprocessingEvent — verifies totalPages is already known when Preprocessing arrives (sends Start(5) first). ✅

TypeScript tests (translateOcrProgress.spec.ts):

PREPROCESSING_PAGE:3:10 → correct message + currentPage/totalPages. ✅

Gaps (suggest tracking as follow-up issues)

PREPROCESSING_PAGE:3:10 with missing parts — what does translateOcrProgress('PREPROCESSING_PAGE') return? The ?? '0' fallback produces current=0, total=0. There's no test asserting this graceful degradation. Low risk since the backend controls the format, but worth a regression guard.
preprocess_page with edge-case image sizes — a 1×1 image, a landscape vs portrait scan, a very large image (e.g. 5000×7000). CLAHE with tileGridSize=(8,8) on a 1×1 image may behave unexpectedly (OpenCV requires tile size ≤ image dimensions in each axis).
No integration test — the preprocessing pipeline is only tested at unit level. There's no test that spins up the FastAPI app and calls /stream with a real image to verify the full event sequence end-to-end. This is a known gap in the test strategy (the existing test_stream.py mocks the OCR engine), but worth noting for completeness.

## 🧪 Sara Holt — QA Engineer **Verdict: ⚠️ Approved with concerns** Test coverage is thorough for the happy path and the key integration points. A few gaps in edge-case and contract coverage are worth tracking. ### What I checked **Python unit tests (`test_preprocessing.py`):** - `test_preprocess_page_returns_grayscale_image` — verifies output mode is `"L"`. ✅ - `test_preprocess_page_increases_contrast` — uses a dark noisy image (RGB ≈ 30,20,10 + per-pixel noise, seed 42) and asserts mean L increases. The fixture is good: the noise ensures CLAHE has a non-trivial per-tile histogram to equalize, making the assertion reliable. ✅ - `test_preprocess_page_falls_back_on_error` — patches `cv2.cvtColor` to raise, confirms original image is returned. ✅ **Python stream tests (`test_stream.py` additions):** - `test_stream_emits_preprocessing_event_per_page_before_page_event` — verifies 3 preprocessing events with correct `pageNumber` values, and that `preprocess_page` is called 3×. ✅ - `test_guided_stream_preprocesses_once_per_page_not_per_region` — 2 regions on page 1, 1 on page 2 → preprocess called 2× (not 3×). This is the critical regression guard for the guided path. ✅ - `test_ocr_endpoint_preprocesses_silently_without_emitting_events` — confirms `/ocr` doesn't leak `preprocessing` events into its response. ✅ **Java tests:** - `OcrStreamEventTest.preprocessingRecordHoldsPageNumber` — record accessor test. ✅ - `OcrStreamEventTest.patternMatchingWorksOnSealedInterface` — exhaustiveness guard updated. ✅ - `RestClientOcrClientStreamTest.parseNdjsonStream_dispatchesPreprocessingEvent` — NDJSON parse contract. ✅ - `OcrAsyncRunnerTest.runSingleDocument_updatesProgressOnPreprocessingEvent` — verifies `totalPages` is already known when `Preprocessing` arrives (sends `Start(5)` first). ✅ **TypeScript tests (`translateOcrProgress.spec.ts`):** - `PREPROCESSING_PAGE:3:10` → correct message + `currentPage`/`totalPages`. ✅ ### Gaps (suggest tracking as follow-up issues) 1. **`PREPROCESSING_PAGE:3:10` with missing parts** — what does `translateOcrProgress('PREPROCESSING_PAGE')` return? The `?? '0'` fallback produces `current=0, total=0`. There's no test asserting this graceful degradation. Low risk since the backend controls the format, but worth a regression guard. 2. **`preprocess_page` with edge-case image sizes** — a 1×1 image, a landscape vs portrait scan, a very large image (e.g. 5000×7000). CLAHE with `tileGridSize=(8,8)` on a 1×1 image may behave unexpectedly (OpenCV requires tile size ≤ image dimensions in each axis). 3. **No integration test** — the preprocessing pipeline is only tested at unit level. There's no test that spins up the FastAPI app and calls `/stream` with a real image to verify the full event sequence end-to-end. This is a known gap in the test strategy (the existing `test_stream.py` mocks the OCR engine), but worth noting for completeness.

marcel commented

2026-04-17 15:08:34 +02:00

🎨 Leonie Voss — UX & Accessibility Expert

Verdict: ✅ Approved

The user-facing changes are minimal, correct, and consistent with the existing progress UX. No accessibility regressions.

What I checked

Progress message copy:

Key	German	English	Spanish
`ocr_status_preprocessing_page`	Seite {current} von {total} wird aufbereitet…	Preparing page {current} of {total}…	Preparando página {current} de {total}…

All three translations are natural and grammatically correct. "aufbereitet" is the right German word for document preparation/enhancement (not the more technical "vorverarbeitet"). The trailing ellipsis is consistent with the existing ocr_status_analyzing_page pattern.

Progress bar continuity:
The PREPROCESSING_PAGE event returns currentPage and totalPages, so the existing progress bar logic in the OCR overlay already renders a meaningful bar during preprocessing. No UI changes required — the bar moves through a preprocessing phase before the OCR phase begins.

Accessibility:
The ocrProgressMessage is rendered in a <p> element (not an aria-live region). This is consistent with the existing OCR progress display. Screen readers won't announce each progress update automatically, but the progress bar's visual state is sufficient for sighted users. For a background-style operation like OCR, polling/live-announcing every page would be noisy for screen reader users anyway. No change needed.

Suggestion (non-blocking)

The user currently has no way to distinguish "document is being enhanced for better OCR" from "document is being OCR-scanned". Both progress phases look visually identical. A short label above the progress bar (e.g. "Aufbereitung" → "Analyse") could signal the phase transition. This is cosmetic and can be deferred — the current UX is not confusing, just less informative.

## 🎨 Leonie Voss — UX & Accessibility Expert **Verdict: ✅ Approved** The user-facing changes are minimal, correct, and consistent with the existing progress UX. No accessibility regressions. ### What I checked **Progress message copy:** | Key | German | English | Spanish | |---|---|---|---| | `ocr_status_preprocessing_page` | Seite {current} von {total} wird aufbereitet… | Preparing page {current} of {total}… | Preparando página {current} de {total}… | All three translations are natural and grammatically correct. "aufbereitet" is the right German word for document preparation/enhancement (not the more technical "vorverarbeitet"). The trailing ellipsis is consistent with the existing `ocr_status_analyzing_page` pattern. **Progress bar continuity:** The `PREPROCESSING_PAGE` event returns `currentPage` and `totalPages`, so the existing progress bar logic in the OCR overlay already renders a meaningful bar during preprocessing. No UI changes required — the bar moves through a preprocessing phase before the OCR phase begins. **Accessibility:** The `ocrProgressMessage` is rendered in a `<p>` element (not an `aria-live` region). This is consistent with the existing OCR progress display. Screen readers won't announce each progress update automatically, but the progress bar's visual state is sufficient for sighted users. For a background-style operation like OCR, polling/live-announcing every page would be noisy for screen reader users anyway. No change needed. ### Suggestion (non-blocking) The user currently has no way to distinguish "document is being enhanced for better OCR" from "document is being OCR-scanned". Both progress phases look visually identical. A short label above the progress bar (e.g. "Aufbereitung" → "Analyse") could signal the phase transition. This is cosmetic and can be deferred — the current UX is not confusing, just less informative.

marcel commented

2026-04-17 15:08:46 +02:00

🛠️ Tobias Wendt — DevOps & Infrastructure

Verdict: ⚠️ Approved with concerns

The infrastructure changes are correct and the dependency is pinned. One concern about image build time and one about the missing rebuild trigger deserve attention.

What I checked

requirements.txt:

opencv-python-headless==4.11.0.86 — pinned to an exact version. ✅
opencv-python-headless (not opencv-python) is the correct choice for a server environment — no GUI dependencies, smaller image. ✅

Dockerfile:

libglib2.0-0 added to the apt-get install block. This is the missing shared library that opencv-python-headless links against on Debian slim. ✅
The comment # libglib2.0-0 is required by opencv-python-headless on Debian slim is helpful for future maintainers who might wonder why a glib dependency is in an OCR service. ✅

docker-compose.yml:

OCR_CLAHE_CLIP_LIMIT: "2.0" and OCR_CLAHE_TILE_SIZE: "8" added with inline comments explaining what each parameter controls. ✅
Values match the defaults in preprocessing.py — no behaviour change on a fresh deployment. ✅

Concerns

Image size impact:
opencv-python-headless==4.11.0.86 is approximately 30 MB compressed on PyPI. Combined with libglib2.0-0, the Docker image for ocr-service will grow by roughly 50–80 MB. This is acceptable for a server-side OCR service, but worth measuring after the first build (docker image ls archive-ocr) to keep a baseline for future size audits.

No rebuild signal in CI/CD:
There's no CI pipeline in this repo, so this is a workflow concern rather than a code defect: developers who git pull and docker-compose up without --build will run the old image without libglib2.0-0. The first time preprocessing.py is imported, it will raise ImportError: libglib2.0-0.so.0 not found. The fix is always docker-compose up --build ocr-service after pulling this branch. Consider adding a comment to the PR description or CONTRIBUTING notes. Not a blocker.

CLAHE_TILE_SIZE type safety:
int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8")) will crash at startup if the env var is set to a non-integer (e.g. "8x8"). This is fail-fast behaviour which is good for ops visibility, but a ValueError traceback in the startup log can be confusing. A try/except with a clear error message would help on-call debugging. Non-blocking suggestion.

## 🛠️ Tobias Wendt — DevOps & Infrastructure **Verdict: ⚠️ Approved with concerns** The infrastructure changes are correct and the dependency is pinned. One concern about image build time and one about the missing rebuild trigger deserve attention. ### What I checked **`requirements.txt`:** - `opencv-python-headless==4.11.0.86` — pinned to an exact version. ✅ - `opencv-python-headless` (not `opencv-python`) is the correct choice for a server environment — no GUI dependencies, smaller image. ✅ **`Dockerfile`:** - `libglib2.0-0` added to the `apt-get install` block. This is the missing shared library that `opencv-python-headless` links against on Debian slim. ✅ - The comment `# libglib2.0-0 is required by opencv-python-headless on Debian slim` is helpful for future maintainers who might wonder why a glib dependency is in an OCR service. ✅ **`docker-compose.yml`:** - `OCR_CLAHE_CLIP_LIMIT: "2.0"` and `OCR_CLAHE_TILE_SIZE: "8"` added with inline comments explaining what each parameter controls. ✅ - Values match the defaults in `preprocessing.py` — no behaviour change on a fresh deployment. ✅ ### Concerns **Image size impact:** `opencv-python-headless==4.11.0.86` is approximately 30 MB compressed on PyPI. Combined with `libglib2.0-0`, the Docker image for `ocr-service` will grow by roughly 50–80 MB. This is acceptable for a server-side OCR service, but worth measuring after the first build (`docker image ls archive-ocr`) to keep a baseline for future size audits. **No rebuild signal in CI/CD:** There's no CI pipeline in this repo, so this is a workflow concern rather than a code defect: developers who `git pull` and `docker-compose up` without `--build` will run the old image without `libglib2.0-0`. The first time `preprocessing.py` is imported, it will raise `ImportError: libglib2.0-0.so.0 not found`. The fix is always `docker-compose up --build ocr-service` after pulling this branch. Consider adding a comment to the PR description or CONTRIBUTING notes. Not a blocker. **`CLAHE_TILE_SIZE` type safety:** `int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8"))` will crash at startup if the env var is set to a non-integer (e.g. `"8x8"`). This is fail-fast behaviour which is good for ops visibility, but a `ValueError` traceback in the startup log can be confusing. A try/except with a clear error message would help on-call debugging. Non-blocking suggestion.

marcel added 2 commits 2026-04-17 15:19:24 +02:00

test(ocr): guard translateOcrProgress fallback for PREPROCESSING_PAGE with missing colon parts 97c94c91f8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(ocr): add resilience tests for tiny image and unexpected exception propagation

CI / Unit & Component Tests (pull_request) Failing after 2m27s

Details

CI / Backend Unit Tests (pull_request) Failing after 2m37s

Details

CI / Unit & Component Tests (push) Failing after 3m14s

Details

CI / Backend Unit Tests (push) Has been cancelled

Details

4cb7c975f5

Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page
fallback from except Exception to (cv2.error, ValueError, MemoryError) so
programming errors propagate instead of being silently swallowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-04-17 15:19:46 +02:00

Review feedback addressed

The following reviewer concerns from cycle 1 have been resolved:

✅ @Sara — `PREPROCESSING_PAGE` with missing colon parts has no regression guard

Added test_PREPROCESSING_PAGE_with_no_colon_parts_gracefully_falls_back_to_zero in translateOcrProgress.spec.ts. Documents that translateOcrProgress('PREPROCESSING_PAGE') returns currentPage: 0, totalPages: 0 via the existing ?? '0' fallback.
Commit: 97c94c9

✅ @Sara — No test for `preprocess_page` with a 1×1 image (smaller than CLAHE tile)

Added test_does_not_crash_on_sub_tile_size_image in test_preprocessing.py. Verifies preprocess_page returns a valid PIL Image for a 1×1 input without raising.
Commit: 4cb7c97

✅ @Nora — `except Exception` in `preprocessing.py` swallows unexpected errors

Narrowed the fallback to except (cv2.error, ValueError, MemoryError). Added test_unexpected_exception_propagates that was RED under the old broad catch (RuntimeError swallowed) and GREEN after the fix. Updated the existing fallback test to use ValueError as the side effect (was RuntimeError, which now correctly propagates).
Commit: 4cb7c97

Deferred as follow-up issues

@Markus (two-phase UI progress indicator): PREPROCESSING_PAGE and ANALYZING_PAGE are visually identical in the progress bar. Tracked as a separate follow-up.
@Markus (configuration coupling): CLAHE params read at module import time. Acceptable for production; deferred.
@Tobias (CLAHE_TILE_SIZE startup error clarity): Wrapping int() parse with a clearer error message. Deferred as a small follow-up issue.
@Tobias (rebuild signal): Developers must run docker-compose up --build ocr-service after pulling this branch. Workflow concern, no code change needed.
@Sara (integration test): End-to-end test spinning up FastAPI with a real image. Known gap; deferred.

## Review feedback addressed The following reviewer concerns from cycle 1 have been resolved: ### ✅ @Sara — `PREPROCESSING_PAGE` with missing colon parts has no regression guard Added `test_PREPROCESSING_PAGE_with_no_colon_parts_gracefully_falls_back_to_zero` in `translateOcrProgress.spec.ts`. Documents that `translateOcrProgress('PREPROCESSING_PAGE')` returns `currentPage: 0, totalPages: 0` via the existing `?? '0'` fallback. Commit: `97c94c9` ### ✅ @Sara — No test for `preprocess_page` with a 1×1 image (smaller than CLAHE tile) Added `test_does_not_crash_on_sub_tile_size_image` in `test_preprocessing.py`. Verifies `preprocess_page` returns a valid PIL Image for a 1×1 input without raising. Commit: `4cb7c97` ### ✅ @Nora — `except Exception` in `preprocessing.py` swallows unexpected errors Narrowed the fallback to `except (cv2.error, ValueError, MemoryError)`. Added `test_unexpected_exception_propagates` that was RED under the old broad catch (RuntimeError swallowed) and GREEN after the fix. Updated the existing fallback test to use `ValueError` as the side effect (was `RuntimeError`, which now correctly propagates). Commit: `4cb7c97` --- ### Deferred as follow-up issues - **@Markus (two-phase UI progress indicator)**: `PREPROCESSING_PAGE` and `ANALYZING_PAGE` are visually identical in the progress bar. Tracked as a separate follow-up. - **@Markus (configuration coupling)**: CLAHE params read at module import time. Acceptable for production; deferred. - **@Tobias (CLAHE_TILE_SIZE startup error clarity)**: Wrapping `int()` parse with a clearer error message. Deferred as a small follow-up issue. - **@Tobias (rebuild signal)**: Developers must run `docker-compose up --build ocr-service` after pulling this branch. Workflow concern, no code change needed. - **@Sara (integration test)**: End-to-end test spinning up FastAPI with a real image. Known gap; deferred.

marcel referenced this pull request

2026-04-17 15:20:02 +02:00

feat(ocr): two-phase progress indicator distinguishing preprocessing from OCR analysis #256

marcel referenced this pull request

2026-04-17 15:20:10 +02:00

devops(ocr): improve startup error clarity for invalid CLAHE env vars #257

marcel referenced this pull request

2026-04-17 15:20:18 +02:00

test(ocr): add integration test for full streaming pipeline with a real image #258

marcel commented

2026-04-17 15:21:10 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer (cycle 2)

Verdict: ✅ Approved

The cycle 1 feedback has been addressed correctly and the new tests follow the red/green/refactor discipline I'd expect.

What improved since cycle 1

Exception narrowing (preprocessing.py):
The broad except Exception is now except (cv2.error, ValueError, MemoryError). Critically, the fix was driven by a proper TDD cycle:

test_unexpected_exception_propagates was red under the old broad catch (RuntimeError silently swallowed → no raise → test failed)
Narrowing made it green
The existing fallback test was updated from RuntimeError to ValueError to keep it meaningful under the new contract

This is exactly how a fallback clause refactor should be done. No shortcutting.

Regression guard (translateOcrProgress.spec.ts):
PREPROCESSING_PAGE with no colon parts gracefully falls back to zero documents the ?? '0' fallback behaviour. It's a pure documentation test — no red step possible since the behaviour already existed — but the intent is correct and the test will catch any future accidental removal of the fallback.

Resilience test (test_preprocessing.py):
test_does_not_crash_on_sub_tile_size_image is a clean resilience guard. The 1×1 PIL image is the minimal reproduction case for the CLAHE tile-size edge case Sara raised. Works correctly: OpenCV handles it without raising, so the assertion isinstance(result, Image.Image) is green.

No new concerns from me

## 👨‍💻 Felix Brandt — Senior Fullstack Developer (cycle 2) **Verdict: ✅ Approved** The cycle 1 feedback has been addressed correctly and the new tests follow the red/green/refactor discipline I'd expect. ### What improved since cycle 1 **Exception narrowing (`preprocessing.py`):** The broad `except Exception` is now `except (cv2.error, ValueError, MemoryError)`. Critically, the fix was driven by a proper TDD cycle: - `test_unexpected_exception_propagates` was red under the old broad catch (RuntimeError silently swallowed → no raise → test failed) - Narrowing made it green - The existing fallback test was updated from `RuntimeError` to `ValueError` to keep it meaningful under the new contract This is exactly how a fallback clause refactor should be done. No shortcutting. **Regression guard (`translateOcrProgress.spec.ts`):** `PREPROCESSING_PAGE with no colon parts gracefully falls back to zero` documents the `?? '0'` fallback behaviour. It's a pure documentation test — no red step possible since the behaviour already existed — but the intent is correct and the test will catch any future accidental removal of the fallback. **Resilience test (`test_preprocessing.py`):** `test_does_not_crash_on_sub_tile_size_image` is a clean resilience guard. The 1×1 PIL image is the minimal reproduction case for the CLAHE tile-size edge case Sara raised. Works correctly: OpenCV handles it without raising, so the assertion `isinstance(result, Image.Image)` is green. ### No new concerns from me

marcel commented

2026-04-17 15:21:18 +02:00

🏛️ Markus Keller — Software Architect (cycle 2)

Verdict: ✅ Approved

Both of my cycle 1 concerns are properly resolved — one by deferral to a tracked issue, one confirmed as a non-issue in context.

Cycle 1 concerns — resolved

Two-phase UI progress indicator:
Tracked as issue #256 with a concrete suggested approach (adding a phase field to OcrProgressResult). The current UX is functional; the two-phase label is an enhancement. Deferral is the right call.

Configuration coupling:
Re-evaluated: CLAHE_CLIP_LIMIT and CLAHE_TILE_SIZE being read at module import time is correct for this service's deployment model. The service is a single-process FastAPI app restarted via Docker Compose when configuration changes. Per-request parameterisation would add complexity for zero operational benefit. My concern was misplaced for this context.

No new architectural concerns

The exception narrowing in preprocessing.py is cleaner from an architecture standpoint too: the fallback is now scoped to expected failure modes (cv2 errors, bad values, OOM), and programming errors escape to the top-level exception handler where they'll produce a proper stack trace and error response. This is the right layering.

## 🏛️ Markus Keller — Software Architect (cycle 2) **Verdict: ✅ Approved** Both of my cycle 1 concerns are properly resolved — one by deferral to a tracked issue, one confirmed as a non-issue in context. ### Cycle 1 concerns — resolved **Two-phase UI progress indicator:** Tracked as issue #256 with a concrete suggested approach (adding a `phase` field to `OcrProgressResult`). The current UX is functional; the two-phase label is an enhancement. Deferral is the right call. **Configuration coupling:** Re-evaluated: `CLAHE_CLIP_LIMIT` and `CLAHE_TILE_SIZE` being read at module import time is correct for this service's deployment model. The service is a single-process FastAPI app restarted via Docker Compose when configuration changes. Per-request parameterisation would add complexity for zero operational benefit. My concern was misplaced for this context. ### No new architectural concerns The exception narrowing in `preprocessing.py` is cleaner from an architecture standpoint too: the fallback is now scoped to expected failure modes (cv2 errors, bad values, OOM), and programming errors escape to the top-level exception handler where they'll produce a proper stack trace and error response. This is the right layering.

marcel commented

2026-04-17 15:21:24 +02:00

🔒 Nora "NullX" Steiner — Security Expert (cycle 2)

Verdict: ✅ Approved

My cycle 1 suggestion has been correctly implemented. No new security surface.

Cycle 1 suggestion — implemented

The except Exception catch-all in preprocess_page is now except (cv2.error, ValueError, MemoryError). The updated test_falls_back_to_pixel_identical_original_on_cv2_error now uses ValueError as the injected failure (a type that IS caught), and test_unexpected_exception_propagates verifies that RuntimeError (a type that is NOT caught) propagates — exactly the contract I asked for.

The implementation is secure: CLAHE parameters still come from environment variables (not request data), image data still arrives via MinIO presigned URLs (not direct user upload to the preprocessing function), and the narrowed catch doesn't expose any new code paths to user-controlled input.

No new concerns

## 🔒 Nora "NullX" Steiner — Security Expert (cycle 2) **Verdict: ✅ Approved** My cycle 1 suggestion has been correctly implemented. No new security surface. ### Cycle 1 suggestion — implemented The `except Exception` catch-all in `preprocess_page` is now `except (cv2.error, ValueError, MemoryError)`. The updated `test_falls_back_to_pixel_identical_original_on_cv2_error` now uses `ValueError` as the injected failure (a type that IS caught), and `test_unexpected_exception_propagates` verifies that `RuntimeError` (a type that is NOT caught) propagates — exactly the contract I asked for. The implementation is secure: CLAHE parameters still come from environment variables (not request data), image data still arrives via MinIO presigned URLs (not direct user upload to the preprocessing function), and the narrowed catch doesn't expose any new code paths to user-controlled input. ### No new concerns

marcel commented

2026-04-17 15:21:32 +02:00

🧪 Sara Holt — QA Engineer (cycle 2)

Verdict: ✅ Approved

All three of my cycle 1 gaps have been addressed — two with new tests, one properly deferred to a tracked issue.

Gap 1 — Regression guard for `PREPROCESSING_PAGE` with missing parts ✅

PREPROCESSING_PAGE with no colon parts gracefully falls back to zero is now in translateOcrProgress.spec.ts. It confirms currentPage: 0 and totalPages: 0 from the existing ?? '0' fallback. The test is green and will catch any future removal of the fallback.

Gap 2 — 1×1 image edge case ✅

test_does_not_crash_on_sub_tile_size_image added to test_preprocessing.py. Result: OpenCV's CLAHE handles a 1×1 image without raising (it processes it, returning a 1×1 grayscale result). No fallback is triggered. The test passes cleanly and documents this boundary behaviour.

Gap 3 — Integration test ✅ (deferred, tracked)

Tracked as issue #258 with a concrete description of the expected test structure (httpx AsyncClient, real image fixture, event-sequence assertions). The gap is real but requires infrastructure work beyond this PR's scope. Acceptable deferral.

Bonus — unexpected exception propagation ✅

The new test_unexpected_exception_propagates ensures that RuntimeError from a programming error in preprocess_page reaches the caller rather than being silently logged. This was not a gap I raised but it strengthens the test suite's reliability guarantees.

No remaining concerns

## 🧪 Sara Holt — QA Engineer (cycle 2) **Verdict: ✅ Approved** All three of my cycle 1 gaps have been addressed — two with new tests, one properly deferred to a tracked issue. ### Gap 1 — Regression guard for `PREPROCESSING_PAGE` with missing parts ✅ `PREPROCESSING_PAGE with no colon parts gracefully falls back to zero` is now in `translateOcrProgress.spec.ts`. It confirms `currentPage: 0` and `totalPages: 0` from the existing `?? '0'` fallback. The test is green and will catch any future removal of the fallback. ### Gap 2 — 1×1 image edge case ✅ `test_does_not_crash_on_sub_tile_size_image` added to `test_preprocessing.py`. Result: OpenCV's CLAHE handles a 1×1 image without raising (it processes it, returning a 1×1 grayscale result). No fallback is triggered. The test passes cleanly and documents this boundary behaviour. ### Gap 3 — Integration test ✅ (deferred, tracked) Tracked as issue #258 with a concrete description of the expected test structure (httpx AsyncClient, real image fixture, event-sequence assertions). The gap is real but requires infrastructure work beyond this PR's scope. Acceptable deferral. ### Bonus — unexpected exception propagation ✅ The new `test_unexpected_exception_propagates` ensures that `RuntimeError` from a programming error in `preprocess_page` reaches the caller rather than being silently logged. This was not a gap I raised but it strengthens the test suite's reliability guarantees. ### No remaining concerns

marcel commented

2026-04-17 15:21:36 +02:00

🎨 Leonie Voss — UX & Accessibility Expert (cycle 2)

Verdict: ✅ Approved

No changes to user-facing copy, i18n strings, or UI behaviour since cycle 1. My cycle 1 approval stands. The deferred suggestion (phase label above the progress bar) is tracked in issue #256.

Checked in cycle 2

translateOcrProgress.spec.ts additions are purely test code — no user-visible impact.
test_preprocessing.py additions are server-side — no user-visible impact.
preprocessing.py exception narrowing is server-side — no user-visible impact.
All three i18n files (de.json, en.json, es.json) are unchanged since cycle 1.

No UX concerns.

## 🎨 Leonie Voss — UX & Accessibility Expert (cycle 2) **Verdict: ✅ Approved** No changes to user-facing copy, i18n strings, or UI behaviour since cycle 1. My cycle 1 approval stands. The deferred suggestion (phase label above the progress bar) is tracked in issue #256. ### Checked in cycle 2 - `translateOcrProgress.spec.ts` additions are purely test code — no user-visible impact. - `test_preprocessing.py` additions are server-side — no user-visible impact. - `preprocessing.py` exception narrowing is server-side — no user-visible impact. - All three i18n files (`de.json`, `en.json`, `es.json`) are unchanged since cycle 1. No UX concerns.

marcel commented

2026-04-17 15:21:44 +02:00

🛠️ Tobias Wendt — DevOps & Infrastructure (cycle 2)

Verdict: ✅ Approved

My cycle 1 concerns are all resolved — two deferred to tracked issues, one confirmed as intentional-and-acceptable behaviour.

Cycle 1 concerns — resolved

Image size impact:
This is a monitoring item, not a code defect. No action required before merge. Acceptable.

Rebuild signal:
Developers pulling this branch need docker-compose up --build ocr-service. This is a workflow concern inherent to any Dockerfile/requirements.txt change. No CI pipeline in this repo means there's no automated signal — but this is true of all such changes. The PR description explains the new dependency. Acceptable.

CLAHE_TILE_SIZE type safety:
Tracked as issue #257 with the complete implementation recipe (_parse_int_env helper). The current fail-fast ValueError from Python's int() is not dangerous — it causes an immediate startup failure and a non-zero exit code, which Docker will flag. The error message could be clearer, but it's not a blocker.

No new infrastructure concerns

The Dockerfile, requirements.txt, and docker-compose.yml changes are unchanged since cycle 1 and were already correct. opencv-python-headless remains pinned to an exact version. libglib2.0-0 comment is still present.

## 🛠️ Tobias Wendt — DevOps & Infrastructure (cycle 2) **Verdict: ✅ Approved** My cycle 1 concerns are all resolved — two deferred to tracked issues, one confirmed as intentional-and-acceptable behaviour. ### Cycle 1 concerns — resolved **Image size impact:** This is a monitoring item, not a code defect. No action required before merge. Acceptable. **Rebuild signal:** Developers pulling this branch need `docker-compose up --build ocr-service`. This is a workflow concern inherent to any Dockerfile/requirements.txt change. No CI pipeline in this repo means there's no automated signal — but this is true of all such changes. The PR description explains the new dependency. Acceptable. **`CLAHE_TILE_SIZE` type safety:** Tracked as issue #257 with the complete implementation recipe (`_parse_int_env` helper). The current fail-fast `ValueError` from Python's `int()` is not dangerous — it causes an immediate startup failure and a non-zero exit code, which Docker will flag. The error message could be clearer, but it's not a blocker. ### No new infrastructure concerns The `Dockerfile`, `requirements.txt`, and `docker-compose.yml` changes are unchanged since cycle 1 and were already correct. `opencv-python-headless` remains pinned to an exact version. `libglib2.0-0` comment is still present.

marcel merged commit 4cb7c975f5 into main

2026-04-17 15:50:37 +02:00

marcel deleted branch feat/issue-252-ocr-preprocessing

2026-04-17 15:50:39 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#255

feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #255

Summary

👨‍💻 Felix Brandt — Senior Fullstack Developer

What I checked

Suggestions (non-blocking)

🏛️ Markus Keller — Software Architect

What I checked

Concern (track as tech debt)

🔒 Nora "NullX" Steiner — Security Expert

What I checked

Suggestion (non-blocking)

🧪 Sara Holt — QA Engineer

What I checked

Gaps (suggest tracking as follow-up issues)

🎨 Leonie Voss — UX & Accessibility Expert

What I checked

Suggestion (non-blocking)

🛠️ Tobias Wendt — DevOps & Infrastructure

What I checked

Concerns

Review feedback addressed

✅ @Sara — PREPROCESSING_PAGE with missing colon parts has no regression guard

✅ @Sara — No test for preprocess_page with a 1×1 image (smaller than CLAHE tile)

✅ @Nora — except Exception in preprocessing.py swallows unexpected errors

Deferred as follow-up issues

👨‍💻 Felix Brandt — Senior Fullstack Developer (cycle 2)

What improved since cycle 1

No new concerns from me

🏛️ Markus Keller — Software Architect (cycle 2)

Cycle 1 concerns — resolved

No new architectural concerns

🔒 Nora "NullX" Steiner — Security Expert (cycle 2)

Cycle 1 suggestion — implemented

No new concerns

🧪 Sara Holt — QA Engineer (cycle 2)

Gap 1 — Regression guard for PREPROCESSING_PAGE with missing parts ✅

Gap 2 — 1×1 image edge case ✅

Gap 3 — Integration test ✅ (deferred, tracked)

Bonus — unexpected exception propagation ✅

No remaining concerns

🎨 Leonie Voss — UX & Accessibility Expert (cycle 2)

Checked in cycle 2

🛠️ Tobias Wendt — DevOps & Infrastructure (cycle 2)

Cycle 1 concerns — resolved

No new infrastructure concerns

✅ @Sara — `PREPROCESSING_PAGE` with missing colon parts has no regression guard

✅ @Sara — No test for `preprocess_page` with a 1×1 image (smaller than CLAHE tile)

✅ @Nora — `except Exception` in `preprocessing.py` swallows unexpected errors

Gap 1 — Regression guard for `PREPROCESSING_PAGE` with missing parts ✅