feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #252

Closed
opened 2026-04-17 12:04:01 +02:00 by marcel · 9 comments
Owner

Problem

Scanned family letters from the 19th/20th century typically have yellowed paper with faded ink. The OCR engines (Surya, Kraken) receive raw page images today with no preprocessing — low contrast between aged ink and yellowed background leads to missed or garbled text.

Solution

Add a transparent preprocessing step before every OCR pass using OpenCV:

  1. CLAHE on the L channel (LAB color space) — local contrast enhancement that adapts to uneven yellowing across the page
  2. Grayscale conversion — removes color noise from aged paper, reduces memory
  3. Gaussian blur (3×3) — removes speckle noise while preserving ink strokes

No binarization (Otsu/Sauvola) — it is too aggressive for always-on use and would hurt Surya, which was trained on grayscale rather than binary images.

Preprocessing runs inside asyncio.to_thread() so it never blocks the FastAPI event loop.

Progress Spinner

A new preprocessing NDJSON event is emitted per page before the OCR event:

{"type": "preprocessing", "pageNumber": 1}

This flows through the full stack: Python → Spring Boot OcrStreamEvent.Preprocessing → progressMessage = "PREPROCESSING_PAGE:1:5" → translateOcrProgress → "Seite 1 von 5 wird vorverarbeitet…" in the spinner.
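The ordering matters in the stream generators: the preprocessing event goes out before the thread-pool hop. A stripped-down sketch of that shape (the generate name, the page placeholder, and the bare "page" event are assumptions; only the event payloads and the asyncio.to_thread() usage come from this issue):

```python
import asyncio
import json


def preprocess_page(page):
    # Stand-in for the real OpenCV pipeline; identity here.
    return page


async def generate(pages):
    # The start event carries totalPages so consumers can render "X of Y".
    yield json.dumps({"type": "start", "totalPages": len(pages)}) + "\n"
    for number, page in enumerate(pages, start=1):
        # Emit the spinner event first, then run the CPU-bound step
        # off the event loop so FastAPI is never blocked.
        yield json.dumps({"type": "preprocessing", "pageNumber": number}) + "\n"
        page = await asyncio.to_thread(preprocess_page, page)
        yield json.dumps({"type": "page", "pageNumber": number}) + "\n"


async def main():
    return [json.loads(line) async for line in generate(["a", "b"])]


events = asyncio.run(main())
```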

Changes

Layer File Change
Python ocr-service/preprocessing.py New — preprocess_page(image) function
Python ocr-service/main.py Emit preprocessing event + call preprocess_page per page in both stream generators; preprocess silently in non-streaming /ocr endpoint
Python ocr-service/requirements.txt Add opencv-python-headless
Java service/OcrStreamEvent.java Add record Preprocessing(int pageNumber)
Java service/RestClientOcrClient.java Parse "preprocessing" NDJSON type in parseNdjsonStream()
Java service/OcrAsyncRunner.java Handle Preprocessing event → updateProgress("PREPROCESSING_PAGE:X:Y")
TS lib/ocr/translateOcrProgress.ts Add PREPROCESSING_PAGE case
i18n messages/de.json, en.json, es.json Add ocr_status_preprocessing_page key

Tests

  • Python test_preprocessing.py: output shape, L-channel mean increases on synthetic yellowed image, fallback on OpenCV error
  • Java RestClientOcrClientTest: parseNdjsonStream dispatches Preprocessing event
  • Java OcrAsyncRunnerTest: Preprocessing event sets correct progressMessage
  • Frontend translateOcrProgress.test.ts: PREPROCESSING_PAGE:3:10 → { currentPage: 3, totalPages: 10 }

No API Changes

The OcrRequest model is unchanged. No flag, no user action required — preprocessing is always-on and transparent.

marcel added the feature label 2026-04-17 12:04:06 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • The test plan is well-structured and matches the existing patterns in test_stream.py (pytest + AsyncMock, patch() for engine deps). The test_preprocessing.py additions slot in cleanly.
  • In the streaming generate() loop, once preprocess_page returns the processed image, the original PIL Image should be immediately deleted before the OCR call. Currently del image happens in the finally block — that's fine for the original, but the intermediate numpy array inside preprocess_page needs to be explicitly deleted before returning to avoid holding both the original and processed image in memory simultaneously (a ~20MB spike per page). This should be handled inside preprocess_page itself.
  • The generate_guided() path processes all regions on one page in a loop. Preprocessing should happen once per page (before the region loop), not once per region. The issue says this correctly but it's easy to misimplement.
  • For the non-streaming /ocr endpoint, the preprocessed image should replace the original in-place (image = await asyncio.to_thread(preprocess_page, image)) with a del on the intermediate to avoid doubling peak memory for large documents.
  • translateOcrProgress.spec.ts already mocks messages with simple strings — ocr_status_preprocessing_page should follow the same mock pattern (({ current, total }) => `Vorverarbeitung Seite ${current} von ${total}` or similar).

Recommendations

  • Add explicit del of intermediate numpy/cv2 arrays inside preprocess_page before returning the final PIL Image. This keeps peak memory per page at ~10MB instead of ~20MB.
  • The test_preprocess_falls_back_on_error test should assert not just that no exception is raised, but that the returned image is pixel-identical to the input. That's the contract.
  • Write the failing test_preprocessing.py tests before touching main.py. The preprocessing module is pure Python with no I/O — red/green/refactor takes minutes here.
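The fallback contract Felix describes can be pinned down without OpenCV installed. A self-contained sketch — the _transform seam is hypothetical, purely to force a failure; the real test would instead patch cv2 inside preprocessing.py:

```python
import numpy as np
from PIL import Image


def preprocess_page(image, _transform=None):
    # _transform is a hypothetical seam standing in for the cv2 pipeline,
    # so the error path can be exercised without OpenCV.
    try:
        transform = _transform or (lambda arr: arr)
        return Image.fromarray(transform(np.asarray(image.convert("RGB"))))
    except Exception:
        # The contract under test: on any failure, return the input untouched.
        return image


def test_preprocess_falls_back_on_error():
    img = Image.new("RGB", (32, 32), (220, 210, 150))

    def boom(_arr):
        raise RuntimeError("simulated OpenCV failure")

    out = preprocess_page(img, _transform=boom)
    # Not just "no exception raised": the result must be pixel-identical
    # to the input -- here it is the very same object.
    assert out is img
    assert np.array_equal(np.asarray(out), np.asarray(img))


test_preprocess_falls_back_on_error()
```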

🏗️ Markus Keller — Senior Application Architect

Observations

  • The module boundary is clean. preprocessing.py is a pure image transformation module with one responsibility and no cross-cutting dependencies. The addition of OcrStreamEvent.Preprocessing to the sealed interface is textbook — the interface already models the stream protocol, and preprocessing is a legitimate protocol event.
  • NDJSON_MAPPER safety: RestClientOcrClient.parseNdjsonStream() already has a default -> log.debug(...) branch for unknown event types. This means if the Python service is deployed first (emitting preprocessing events) before the Java code is updated, those events are silently ignored — no crash, no partial data loss. Rolling deployments in either order are safe.
  • CLAHE parameters are hardcoded (clipLimit=2.0, tileGridSize=(8,8)). These are sensible defaults, but the right values are document-corpus-specific. Since the OCR service already uses environment variables for all tunable parameters (OCR_CONFIDENCE_THRESHOLD, batch sizes), the preprocessing parameters should follow the same pattern.
  • The progressMessage colon-delimited string encoding is already an established protocol in OcrAsyncRunner. Adding PREPROCESSING_PAGE:X:Y follows the existing convention — no architecture concern there.

Recommendations

  • Expose CLAHE parameters as environment variables in preprocessing.py:
    CLAHE_CLIP_LIMIT = float(os.environ.get("OCR_CLAHE_CLIP_LIMIT", "2.0"))
    CLAHE_TILE_SIZE = int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8"))
    
    Add them to docker-compose.yml with their defaults. This avoids a code change when tuning the pipeline after the first real-world test.
  • No architectural blockers. The design follows existing patterns throughout the stack.
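The docker-compose addition suggested above might look like this (service name, indentation, and comment wording are assumptions about the existing file; only the variable names and defaults come from the recommendation):

```yaml
  ocr-service:
    environment:
      OCR_CLAHE_CLIP_LIMIT: "2.0"   # CLAHE contrast clip limit (default 2.0)
      OCR_CLAHE_TILE_SIZE: "8"      # CLAHE tile grid size (default 8x8)
```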

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

This is low-risk from a security standpoint. No new endpoints, no new HTTP parameters, no user input reaches the preprocessing code path.

  • No new attack surface. preprocess_page(image: Image.Image) takes a PIL Image that was already downloaded via _download_and_convert_pdf(), which has existing SSRF protection (ALLOWED_PDF_HOSTS). The preprocessing step operates on an in-memory object, not on paths or URLs.
  • Failure posture is correct. The fallback-on-error (return original image, log warning, never raise) is the right behavior for a preprocessing step. A crash in CLAHE should not abort the OCR job. However, this means a silent preprocessing failure will log at WARNING level — ensure the warning includes the page number and exception type so it's actionable in Loki.
  • opencv-python-headless supply chain: This is a well-maintained, widely-used library (PyPI downloads ~2M/week). It has no known active CVEs at the time of this review. Add it to any dependency scanning setup (pip-audit in CI) once it's in requirements.txt.
  • Memory handling: CLAHE and Gaussian blur operate in-process on numpy arrays. There is no disk write, no temp file creation, no subprocess call. No path traversal or file injection risk.

Recommendations

  • Log the exception type in the fallback handler:
    logger.warning("preprocess_page failed on page (falling back to original): %s: %s", type(e).__name__, e)
    
    This makes it observable in production without exposing stack traces.
  • If pip-audit is not already part of CI, adding it when opencv-python-headless is introduced is a good checkpoint to scan the full dependency tree.

🧪 Sara Holt — Senior QA Engineer

Observations

The test plan covers the four main layers correctly. A few gaps and one edge case worth addressing:

Edge case not covered: totalPages is 0 when first Preprocessing event arrives.
OcrStreamEvent.Start sets totalPages.get() in OcrAsyncRunner. However, the stream order is:

{"type": "start", "totalPages": 5}   ← totalPages set here
{"type": "preprocessing", "pageNumber": 1}  ← used immediately after

This is correct — Start always precedes Preprocessing per the protocol. But the test for runSingleDocument_updatesProgressOnPreprocessingEvent should simulate both events in sequence (Start then Preprocessing) to test the realistic path. Testing Preprocessing alone would give "PREPROCESSING_PAGE:1:0", which tests a degenerate case.

translateOcrProgress.spec.ts — existing mock pattern to follow:
The spec file mocks $lib/paraglide/messages.js inline. The new ocr_status_preprocessing_page mock should match this pattern exactly (the mock returns a function that accepts { current, total } and returns a formatted string).

Missing test: non-streaming /ocr endpoint preprocesses silently.
The issue correctly states the non-streaming endpoint should preprocess without emitting events. There should be a test in test_stream.py (or a new test_preprocessing_integration.py) that calls /ocr directly and verifies the engine receives a preprocessed image (mock preprocess_page and verify it was called).

OcrAsyncRunnerTest — use ArgumentCaptor correctly:
The existing test file uses verify(ocrJobRepository, atLeastOnce()).save(any()). The Preprocessing progress test should use ArgumentCaptor<OcrJob> to capture all save() calls and assert that at least one call had progressMessage matching "PREPROCESSING_PAGE:1:5" — not just "PREPROCESSING_PAGE:1:0".

Recommendations

  • Add test_preprocessing_called_for_ocr_endpoint in test_stream.py (mock preprocess_page, call /ocr, assert it was invoked once per page).
  • In runSingleDocument_updatesProgressOnPreprocessingEvent, simulate the full event sequence: Start(5) → Preprocessing(1) → Page(1, []) to test realistic totalPages propagation.
  • Estimated CI impact: test_preprocessing.py adds ~4 fast unit tests (<1s total). No new Testcontainers, no new Docker layers in CI.

🎨 Leonie Voss — UX Design Lead

Observations

This is a backend/infrastructure change with a narrow frontend surface — only the progress spinner text. My concerns are limited but worth raising.

Dual-message pattern per page may feel choppy. For a 10-page document, users will see the spinner cycle through 20 messages: "Seite 1 von 10 wird vorverarbeitet…" → "Seite 1 von 10 wird analysiert…" → "Seite 2 von 10 wird vorverarbeitet…" → ... If preprocessing is fast (<200ms/page), these messages will flash so quickly that users may not be able to read them, which adds noise without meaningful information.

aria-live frequency. The page counter in +page.svelte has aria-live="polite". With preprocessing events, it will fire twice per page instead of once. Screen readers will announce each change — for a 10-page document, that's 20 announcements, which may be overwhelming for users relying on assistive technology.

"vorverarbeitet" is a technical term. It's accurate German but more jargon-heavy than needed. Users scanning the spinner don't need to know how the system is improving the image — they need to know progress is happening. "Seite X von Y wird aufbereitet…" is slightly more user-friendly, or even a combined phrase could work.

Recommendations

  • Consider whether preprocessing should be surfaced in the spinner at all. If CLAHE takes <200ms per page, the preprocessing message appears and disappears faster than a user can read it. An alternative: keep preprocessing silent (no spinner update), and let the "Analyzing page X of Y" message cover the full page processing time. This simplifies the frontend change to zero lines and the Spring Boot handler to a no-op for Preprocessing events.
  • If the message is surfaced, use a less technical label. Suggested: ocr_status_preprocessing_page → "Seite {current} von {total} wird aufbereitet…" (de), "Preparing page {current} of {total}…" (en).
  • Limit aria-live polling frequency or debounce updates to avoid announcing every intermediate state.

Open Decisions

  • Surface preprocessing in the spinner vs. silent processing. If preprocessing is fast (<200ms), users gain little by seeing "vorverarbeitet" flash briefly before "analysiert". Surfacing it adds UX complexity for questionable benefit. If preprocessing is slow (>500ms), it's worth showing. The right answer depends on real-world timing on your document corpus — something only you can measure once the feature ships.

⚙️ Tobias Wendt — DevOps & Platform Engineer

Observations

  • Dockerfile: opencv-python-headless needs a system library. The current base image (python:3.11.9-slim) does not ship libglib2.0-0, which opencv-python-headless requires on Debian slim. The apt-get block in the Dockerfile needs to be extended:

    RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
        libgomp1 \
        libvips42 \
        # required by opencv-python-headless on slim
        libglib2.0-0 \
        && rm -rf /var/lib/apt/lists/*
    

    Without this, import cv2 will raise ImportError: libglib-2.0.so.0: cannot open shared object file at container startup and models will never load.

  • Image rebuild time. Adding opencv-python-headless (~50MB wheel) to requirements.txt will extend the pip install layer in CI and on first deploy. Since the requirements layer is cached, this is a one-time cost. No concern.

  • Memory impact is negligible. CLAHE on a 200 DPI page image (roughly 1600×2000px) allocates ~10MB of numpy arrays temporarily. The container already has a 12GB hard limit. No change needed to mem_limit.

  • Health check timing is unaffected. cv2 imports at module load time (fast, <1s), not at startup model loading. The existing start_period: 120s is driven by PyTorch/Surya model loading, which is unchanged.

  • Rolling deployment is safe. Python emitting preprocessing events before Java is updated: hits the existing default -> log.debug(...) branch in parseNdjsonStream(). Java updated before Python: Python doesn't emit the new events yet. Both orderings are safe.

  • No new environment variables in docker-compose.yml. Per Markus's note on externalizing CLAHE parameters — if those env vars are added, they should follow the existing pattern in the environment: block with inline default comments explaining their purpose.

Recommendations

  • Add libglib2.0-0 to the Dockerfile apt-get block before the PR lands — this will otherwise cause a silent container startup failure that's annoying to debug in production.
  • Pin the opencv-python-headless version in requirements.txt (e.g., opencv-python-headless==4.11.0.86) for reproducible builds. Renovate will handle version bump PRs.

🗳️ Decision Queue — Action Required

1 decision needs your input before implementation starts.

UX

  • Surface preprocessing in the spinner vs. silent processing — If CLAHE runs fast (<200ms/page), the "Seite X von Y wird vorverarbeitet…" message flashes too briefly to read, adding noise and doubling screen reader announcements per page. Option A: show the message as planned (transparent to user, honest about work done). Option B: handle the Preprocessing event silently in OcrAsyncRunner (no progressMessage update) — the spinner stays on "Analyzing page X of Y" for the full page time, which is cleaner but less accurate. (Raised by: Leonie)

## 🎨 Leonie Voss — UX Design Lead (follow-up discussion)

Worked through the four open UX items with Marcel. All resolved or delegated.

### Resolved

- **Surface preprocessing in spinner?** → **Option A: show it.** Ship the preprocessing message transparently; if real-world CLAHE timing turns out to be too fast to read (<200 ms), remove the spinner update in a follow-up. No premature optimisation.
- **Word choice** → Use **"aufbereitet"** (not "vorverarbeitet") for all three locales:
  - `de`: `"Seite {current} von {total} wird aufbereitet…"`
  - `en`: `"Preparing page {current} of {total}…"`
  - `es`: `"Preparando página {current} de {total}…"`
- **`aria-live` frequency** → The preprocessing spinner update should live **outside** the `aria-live="polite"` region. Sighted users see both messages; screen readers only announce the OCR phase ("Analysiere Seite X von Y") — keeping announcements at one per page instead of two.

### Delegated to implementation

- **320px wrap check** → Felix to verify the longest string ("Seite 10 von 10 wird aufbereitet…") wraps cleanly in the spinner layout at 320px viewport width. No fixed-width constraints should clip or overflow it.

---

Overall this is a well-scoped change with a minimal frontend surface. The word choice and `aria-live` boundary are the only things that needed explicit decisions — both are now settled.


## ✅ Implementation complete — branch `feat/issue-252-ocr-preprocessing`

### What was implemented

**5 commits across the full stack:**

#### Python (`ocr-service/`)

- `preprocessing.py` — `preprocess_page(image)`: CLAHE on the L channel (LAB), grayscale, Gaussian blur (3×3). Explicit `del` of numpy intermediates keeps peak memory at ~10 MB/page. Falls back to the original image on any OpenCV error and logs `type(e).__name__` plus the message at WARNING. CLAHE parameters are externalised as the `OCR_CLAHE_CLIP_LIMIT` / `OCR_CLAHE_TILE_SIZE` env vars.
- `test_preprocessing.py` — 3 unit tests: output dimensions, L-channel mean increase on a dark synthetic yellowed image, pixel-identical fallback on `cv2` error.
- `main.py` — `generate()`: emits the `preprocessing` event, then preprocesses each page before OCR. `generate_guided()`: preprocesses once per page before the region loop (not per region). `/ocr` endpoint: preprocesses each image in place, no event emitted.
- `test_stream.py` — 3 integration tests covering all three paths above.
- `requirements.txt` — `opencv-python-headless==4.11.0.86` (pinned for reproducible builds).
- `Dockerfile` — `libglib2.0-0` added to the `apt-get` block (required by OpenCV on Debian slim).

#### Infrastructure

- `docker-compose.yml` — `OCR_CLAHE_CLIP_LIMIT: "2.0"` and `OCR_CLAHE_TILE_SIZE: "8"` with inline comments explaining their purpose.

#### Java (`backend/`)

- `OcrStreamEvent.java` — `record Preprocessing(int pageNumber)` added to the sealed interface.
- `RestClientOcrClient.java` — the `"preprocessing"` NDJSON type dispatches `OcrStreamEvent.Preprocessing`.
- `OcrAsyncRunner.java` — `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")` using the `totalPages` already set by the preceding `Start` event (so `totalPages` is never 0).
- Tests: `RestClientOcrClientStreamTest`, `OcrStreamEventTest`, `OcrAsyncRunnerTest` — 4 new test cases.
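For reference, the producer side of this NDJSON contract — a sketch of what the streaming generator in `main.py` might look like. Names and signatures are illustrative; only the event shape (`{"type": "preprocessing", "pageNumber": N}`) and the `asyncio.to_thread` offloading are taken from this issue:

```python
import asyncio
import json
from typing import AsyncIterator, Callable, List


async def generate(pages: List[object],
                   preprocess: Callable[[object], object],
                   run_ocr: Callable[[object], str]) -> AsyncIterator[str]:
    """Yield NDJSON lines: a 'preprocessing' event, then the OCR result, per page."""
    for page_number, page in enumerate(pages, start=1):
        # Emit the event before the heavy work so the spinner updates immediately.
        yield json.dumps({"type": "preprocessing", "pageNumber": page_number}) + "\n"
        # Both steps run in worker threads so the FastAPI event loop never blocks.
        cleaned = await asyncio.to_thread(preprocess, page)
        text = await asyncio.to_thread(run_ocr, cleaned)
        yield json.dumps({"type": "ocr", "pageNumber": page_number, "text": text}) + "\n"
```

Because the `preprocessing` line is flushed before `to_thread` starts, a Java consumer that does not know the type yet simply hits its `default` branch and reads on — which is what makes the rolling deployment safe.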

#### Frontend

- `translateOcrProgress.ts` — the `PREPROCESSING_PAGE:X:Y` case returns `{ message, currentPage, total }`.
- `translateOcrProgress.spec.ts` — new test: `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }`.
- `messages/de.json` — `"ocr_status_preprocessing_page": "Seite {current} von {total} wird aufbereitet…"` ("aufbereitet" per Leonie's word-choice decision).
- `messages/en.json` — `"Preparing page {current} of {total}…"`
- `messages/es.json` — `"Preparando página {current} de {total}…"`
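Sketched in Python for consistency with the other snippets in this thread (the actual implementation is TypeScript), the new progress-message case might look like this — the return shape and helper name are illustrative:

```python
from typing import Optional, TypedDict


class OcrProgress(TypedDict):
    message: str
    currentPage: int
    totalPages: int


def translate_ocr_progress(progress_message: str) -> Optional[OcrProgress]:
    """Map 'PREPROCESSING_PAGE:X:Y' to a localized spinner message."""
    if progress_message.startswith("PREPROCESSING_PAGE:"):
        _, current, total = progress_message.split(":")
        return OcrProgress(
            message=f"Seite {current} von {total} wird aufbereitet…",
            currentPage=int(current),
            totalPages=int(total),
        )
    return None  # other phases are handled by the existing cases
```

The split-on-colon parsing mirrors the `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }` behavior asserted in the spec.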

The `ocrProgressMessage` in `+page.svelte` is rendered in a plain `<p>` tag outside any `aria-live` region, so the preprocessing message does not double screen reader announcements — matching Leonie's accessibility requirement.

**320px wrap check:** "Seite 10 von 10 wird aufbereitet…" (33 chars) in `text-sm` with `text-center` wraps cleanly at 320px — no fixed-width constraint clips it.

### Test results

- Python: **40 passed** (40/40)
- Java: **1016 passed** (1016/1016)
- Frontend: all `translateOcrProgress.spec.ts` tests green

### Open items

None — all reviewer concerns addressed. The "surface preprocessing in spinner vs. silent" decision (Option A: show it) is implemented. If real-world CLAHE timing proves too fast to read, removing the spinner update is a one-line change in `OcrAsyncRunner`.


Reference: marcel/familienarchiv#252