feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #252

New Issue

marcel · 2026-04-17T12:04:01+02:00

marcel commented

2026-04-17 12:04:01 +02:00

Problem

Scanned family letters from the 19th/20th century typically have yellowed paper with faded ink. The OCR engines (Surya, Kraken) receive raw page images today with no preprocessing — low contrast between aged ink and yellowed background leads to missed or garbled text.

Solution

Add a transparent preprocessing step before every OCR pass using OpenCV:

CLAHE on the L channel (LAB color space) — local contrast enhancement that adapts to uneven yellowing across the page
Grayscale conversion — removes color noise from aged paper, reduces memory
Gaussian blur (3×3) — removes speckle noise while preserving ink strokes

No binarization (Otsu/Sauvola) — too aggressive for always-on use and would hurt Surya which was trained on grayscale, not binary images.

Preprocessing runs inside asyncio.to_thread() so it never blocks the FastAPI event loop.

Progress Spinner

A new preprocessing NDJSON event is emitted per page before the OCR event:

{"type": "preprocessing", "pageNumber": 1}

This flows through the full stack: Python → Spring Boot OcrStreamEvent.Preprocessing → progressMessage = "PREPROCESSING_PAGE:1:5" → translateOcrProgress → "Seite 1 von 5 wird vorverarbeitet…" in the spinner.

Changes

Layer	File	Change
Python	`ocr-service/preprocessing.py`	New — `preprocess_page(image)` function
Python	`ocr-service/main.py`	Emit `preprocessing` event + call `preprocess_page` per page in both stream generators; preprocess silently in non-streaming `/ocr` endpoint
Python	`ocr-service/requirements.txt`	Add `opencv-python-headless`
Java	`service/OcrStreamEvent.java`	Add `record Preprocessing(int pageNumber)`
Java	`service/RestClientOcrClient.java`	Parse `"preprocessing"` NDJSON type in `parseNdjsonStream()`
Java	`service/OcrAsyncRunner.java`	Handle `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")`
TS	`lib/ocr/translateOcrProgress.ts`	Add `PREPROCESSING_PAGE` case
i18n	`messages/de.json`, `en.json`, `es.json`	Add `ocr_status_preprocessing_page` key

Tests

Python test_preprocessing.py: output shape, L-channel mean increases on synthetic yellowed image, fallback on OpenCV error
Java RestClientOcrClientTest: parseNdjsonStream dispatches Preprocessing event
Java OcrAsyncRunnerTest: Preprocessing event sets correct progressMessage
Frontend translateOcrProgress.test.ts: PREPROCESSING_PAGE:3:10 → { currentPage: 3, totalPages: 10 }

No API Changes

The OcrRequest model is unchanged. No flag, no user action required — preprocessing is always-on and transparent.

## Problem Scanned family letters from the 19th/20th century typically have yellowed paper with faded ink. The OCR engines (Surya, Kraken) receive raw page images today with no preprocessing — low contrast between aged ink and yellowed background leads to missed or garbled text. ## Solution Add a transparent preprocessing step before every OCR pass using OpenCV: 1. **CLAHE** on the L channel (LAB color space) — local contrast enhancement that adapts to uneven yellowing across the page 2. **Grayscale conversion** — removes color noise from aged paper, reduces memory 3. **Gaussian blur (3×3)** — removes speckle noise while preserving ink strokes No binarization (Otsu/Sauvola) — too aggressive for always-on use and would hurt Surya which was trained on grayscale, not binary images. Preprocessing runs inside `asyncio.to_thread()` so it never blocks the FastAPI event loop. ## Progress Spinner A new `preprocessing` NDJSON event is emitted per page before the OCR event: ```json {"type": "preprocessing", "pageNumber": 1} ``` This flows through the full stack: Python → Spring Boot `OcrStreamEvent.Preprocessing` → `progressMessage = "PREPROCESSING_PAGE:1:5"` → `translateOcrProgress` → **"Seite 1 von 5 wird vorverarbeitet…"** in the spinner. ## Changes | Layer | File | Change | |---|---|---| | Python | `ocr-service/preprocessing.py` | **New** — `preprocess_page(image)` function | | Python | `ocr-service/main.py` | Emit `preprocessing` event + call `preprocess_page` per page in both stream generators; preprocess silently in non-streaming `/ocr` endpoint | | Python | `ocr-service/requirements.txt` | Add `opencv-python-headless` | | Java | `service/OcrStreamEvent.java` | Add `record Preprocessing(int pageNumber)` | | Java | `service/RestClientOcrClient.java` | Parse `"preprocessing"` NDJSON type in `parseNdjsonStream()` | | Java | `service/OcrAsyncRunner.java` | Handle `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")` | | TS | `lib/ocr/translateOcrProgress.ts` | Add `PREPROCESSING_PAGE` case | | i18n | `messages/de.json`, `en.json`, `es.json` | Add `ocr_status_preprocessing_page` key | ## Tests - **Python** `test_preprocessing.py`: output shape, L-channel mean increases on synthetic yellowed image, fallback on OpenCV error - **Java** `RestClientOcrClientTest`: `parseNdjsonStream` dispatches `Preprocessing` event - **Java** `OcrAsyncRunnerTest`: `Preprocessing` event sets correct `progressMessage` - **Frontend** `translateOcrProgress.test.ts`: `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }` ## No API Changes The `OcrRequest` model is unchanged. No flag, no user action required — preprocessing is always-on and transparent.

marcel added the feature label 2026-04-17 12:04:06 +02:00

marcel commented

2026-04-17 12:06:35 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

The test plan is well-structured and matches the existing patterns in test_stream.py (pytest + AsyncMock, patch() for engine deps). The test_preprocessing.py additions slot in cleanly.
In the streaming generate() loop, once preprocess_page returns the processed image, the original PIL Image should be immediately deleted before the OCR call. Currently del image happens in the finally block — that's fine for the original, but the intermediate numpy array inside preprocess_page needs to be explicitly deleted before returning to avoid holding both the original and processed image in memory simultaneously (a ~20MB spike per page). This should be handled inside preprocess_page itself.
The generate_guided() path processes all regions on one page in a loop. Preprocessing should happen once per page (before the region loop), not once per region. The issue says this correctly but it's easy to misimplement.
For the non-streaming /ocr endpoint, the preprocessed image should replace the original in-place (image = await asyncio.to_thread(preprocess_page, image)) with a del on the intermediate to avoid doubling peak memory for large documents.
translateOcrProgress.spec.ts already mocks messages with simple strings — ocr_status_preprocessing_page should follow the same mock pattern (({ current, total }) => \Vorverarbeitung Seite ${current} von ${total}`` or similar).

Recommendations

Add explicit del of intermediate numpy/cv2 arrays inside preprocess_page before returning the final PIL Image. This keeps peak memory per page at ~10MB instead of ~20MB.
The test_preprocess_falls_back_on_error test should assert not just that no exception is raised, but that the returned image is pixel-identical to the input. That's the contract.
Write the failing test_preprocessing.py tests before touching main.py. The preprocessing module is pure Python with no I/O — red/green/refactor takes minutes here.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Observations - The test plan is well-structured and matches the existing patterns in `test_stream.py` (pytest + `AsyncMock`, `patch()` for engine deps). The `test_preprocessing.py` additions slot in cleanly. - In the streaming `generate()` loop, once `preprocess_page` returns the processed image, the original PIL Image should be immediately deleted before the OCR call. Currently `del image` happens in the `finally` block — that's fine for the original, but the intermediate numpy array inside `preprocess_page` needs to be explicitly deleted before returning to avoid holding both the original and processed image in memory simultaneously (a ~20MB spike per page). This should be handled inside `preprocess_page` itself. - The `generate_guided()` path processes all regions on one page in a loop. Preprocessing should happen once per page (before the region loop), not once per region. The issue says this correctly but it's easy to misimplement. - For the non-streaming `/ocr` endpoint, the preprocessed image should replace the original in-place (`image = await asyncio.to_thread(preprocess_page, image)`) with a `del` on the intermediate to avoid doubling peak memory for large documents. - `translateOcrProgress.spec.ts` already mocks messages with simple strings — `ocr_status_preprocessing_page` should follow the same mock pattern (`({ current, total }) => \`Vorverarbeitung Seite ${current} von ${total}\`` or similar). ### Recommendations - Add explicit `del` of intermediate numpy/cv2 arrays inside `preprocess_page` before returning the final PIL Image. This keeps peak memory per page at ~10MB instead of ~20MB. - The `test_preprocess_falls_back_on_error` test should assert not just that no exception is raised, but that the **returned image is pixel-identical to the input**. That's the contract. - Write the failing `test_preprocessing.py` tests before touching `main.py`. The preprocessing module is pure Python with no I/O — red/green/refactor takes minutes here.

marcel commented

2026-04-17 12:06:44 +02:00

🏗️ Markus Keller — Senior Application Architect

Observations

The module boundary is clean. preprocessing.py is a pure image transformation module with one responsibility and no cross-cutting dependencies. The addition of OcrStreamEvent.Preprocessing to the sealed interface is textbook — the interface already models the stream protocol, and preprocessing is a legitimate protocol event.
NDJSON_MAPPER safety: RestClientOcrClient.parseNdjsonStream() already has a default -> log.debug(...) branch for unknown event types. This means if the Python service is deployed first (emitting preprocessing events) before the Java code is updated, those events are silently ignored — no crash, no partial data loss. Rolling deployments in either order are safe.
CLAHE parameters are hardcoded (clipLimit=2.0, tileGridSize=(8,8)). These are sensible defaults, but the right values are document-corpus-specific. Since the OCR service already uses environment variables for all tunable parameters (OCR_CONFIDENCE_THRESHOLD, batch sizes), the preprocessing parameters should follow the same pattern.
The progressMessage colon-delimited string encoding is already an established protocol in OcrAsyncRunner. Adding PREPROCESSING_PAGE:X:Y follows the existing convention — no architecture concern there.

Recommendations

Expose CLAHE parameters as environment variables in preprocessing.py:
```
CLAHE_CLIP_LIMIT = float(os.environ.get("OCR_CLAHE_CLIP_LIMIT", "2.0"))
CLAHE_TILE_SIZE = int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8"))
```
Add them to docker-compose.yml with their defaults. This avoids a code change when tuning the pipeline after the first real-world test.
No architectural blockers. The design follows existing patterns throughout the stack.

## 🏗️ Markus Keller — Senior Application Architect ### Observations - The module boundary is clean. `preprocessing.py` is a pure image transformation module with one responsibility and no cross-cutting dependencies. The addition of `OcrStreamEvent.Preprocessing` to the sealed interface is textbook — the interface already models the stream protocol, and preprocessing is a legitimate protocol event. - **NDJSON_MAPPER safety**: `RestClientOcrClient.parseNdjsonStream()` already has a `default -> log.debug(...)` branch for unknown event types. This means if the Python service is deployed first (emitting `preprocessing` events) before the Java code is updated, those events are silently ignored — no crash, no partial data loss. Rolling deployments in either order are safe. - **CLAHE parameters are hardcoded** (`clipLimit=2.0`, `tileGridSize=(8,8)`). These are sensible defaults, but the right values are document-corpus-specific. Since the OCR service already uses environment variables for all tunable parameters (`OCR_CONFIDENCE_THRESHOLD`, batch sizes), the preprocessing parameters should follow the same pattern. - The `progressMessage` colon-delimited string encoding is already an established protocol in `OcrAsyncRunner`. Adding `PREPROCESSING_PAGE:X:Y` follows the existing convention — no architecture concern there. ### Recommendations - Expose CLAHE parameters as environment variables in `preprocessing.py`: ```python CLAHE_CLIP_LIMIT = float(os.environ.get("OCR_CLAHE_CLIP_LIMIT", "2.0")) CLAHE_TILE_SIZE = int(os.environ.get("OCR_CLAHE_TILE_SIZE", "8")) ``` Add them to `docker-compose.yml` with their defaults. This avoids a code change when tuning the pipeline after the first real-world test. - No architectural blockers. The design follows existing patterns throughout the stack.

marcel commented

2026-04-17 12:06:55 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

This is low-risk from a security standpoint. No new endpoints, no new HTTP parameters, no user input reaches the preprocessing code path.

No new attack surface. preprocess_page(image: Image.Image) takes a PIL Image that was already downloaded via _download_and_convert_pdf(), which has existing SSRF protection (ALLOWED_PDF_HOSTS). The preprocessing step operates on an in-memory object, not on paths or URLs.
Failure posture is correct. The fallback-on-error (return original image, log warning, never raise) is the right behavior for a preprocessing step. A crash in CLAHE should not abort the OCR job. However, this means a silent preprocessing failure will log at WARNING level — ensure the warning includes the page number and exception type so it's actionable in Loki.
opencv-python-headless supply chain: This is a well-maintained, widely-used library (PyPI downloads ~2M/week). It has no known active CVEs at the time of this review. Add it to any dependency scanning setup (pip-audit in CI) once it's in requirements.txt.
Memory handling: CLAHE and Gaussian blur operate in-process on numpy arrays. There is no disk write, no temp file creation, no subprocess call. No path traversal or file injection risk.

Recommendations

Log the exception type in the fallback handler:
```
logger.warning("preprocess_page failed on page (falling back to original): %s: %s", type(e).__name__, e)
```
This makes it observable in production without exposing stack traces.
If pip-audit is not already part of CI, adding it when opencv-python-headless is introduced is a good checkpoint to scan the full dependency tree.

## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Observations This is low-risk from a security standpoint. No new endpoints, no new HTTP parameters, no user input reaches the preprocessing code path. - **No new attack surface.** `preprocess_page(image: Image.Image)` takes a PIL Image that was already downloaded via `_download_and_convert_pdf()`, which has existing SSRF protection (`ALLOWED_PDF_HOSTS`). The preprocessing step operates on an in-memory object, not on paths or URLs. - **Failure posture is correct.** The fallback-on-error (return original image, log warning, never raise) is the right behavior for a preprocessing step. A crash in CLAHE should not abort the OCR job. However, this means a silent preprocessing failure will log at `WARNING` level — ensure the warning includes the page number and exception type so it's actionable in Loki. - **opencv-python-headless supply chain**: This is a well-maintained, widely-used library (PyPI downloads ~2M/week). It has no known active CVEs at the time of this review. Add it to any dependency scanning setup (pip-audit in CI) once it's in `requirements.txt`. - **Memory handling**: CLAHE and Gaussian blur operate in-process on numpy arrays. There is no disk write, no temp file creation, no subprocess call. No path traversal or file injection risk. ### Recommendations - Log the exception type in the fallback handler: ```python logger.warning("preprocess_page failed on page (falling back to original): %s: %s", type(e).__name__, e) ``` This makes it observable in production without exposing stack traces. - If pip-audit is not already part of CI, adding it when `opencv-python-headless` is introduced is a good checkpoint to scan the full dependency tree.

marcel commented

2026-04-17 12:07:08 +02:00

🧪 Sara Holt — Senior QA Engineer

Observations

The test plan covers the four main layers correctly. A few gaps and one edge case worth addressing:

Edge case not covered: totalPages is 0 when first Preprocessing event arrives.
OcrStreamEvent.Start sets totalPages.get() in OcrAsyncRunner. However, the stream order is:

{"type": "start", "totalPages": 5}   ← totalPages set here
{"type": "preprocessing", "pageNumber": 1}  ← used immediately after

This is correct — Start always precedes Preprocessing per the protocol. But the test for runSingleDocument_updatesProgressOnPreprocessingEvent should simulate both events in sequence (Start then Preprocessing) to test the realistic path. Testing Preprocessing alone would give "PREPROCESSING_PAGE:1:0", which tests a degenerate case.

translateOcrProgress.spec.ts — existing mock pattern to follow:
The spec file mocks $lib/paraglide/messages.js inline. The new ocr_status_preprocessing_page mock should match this pattern exactly (the mock returns a function that accepts { current, total } and returns a formatted string).

Missing test: non-streaming /ocr endpoint preprocesses silently.
The issue correctly states the non-streaming endpoint should preprocess without emitting events. There should be a test in test_stream.py (or a new test_preprocessing_integration.py) that calls /ocr directly and verifies the engine receives a preprocessed image (mock preprocess_page and verify it was called).

OcrAsyncRunnerTest — use ArgumentCaptor correctly:
The existing test file uses verify(ocrJobRepository, atLeastOnce()).save(any()). The Preprocessing progress test should use ArgumentCaptor<OcrJob> to capture all save() calls and assert that at least one call had progressMessage matching "PREPROCESSING_PAGE:1:5" — not just "PREPROCESSING_PAGE:1:0".

Recommendations

Add test_preprocessing_called_for_ocr_endpoint in test_stream.py (mock preprocess_page, call /ocr, assert it was invoked once per page).
In runSingleDocument_updatesProgressOnPreprocessingEvent, simulate the full event sequence: Start(5) → Preprocessing(1) → Page(1, []) to test realistic totalPages propagation.
Estimated CI impact: test_preprocessing.py adds ~4 fast unit tests (<1s total). No new Testcontainers, no new Docker layers in CI.

## 🧪 Sara Holt — Senior QA Engineer ### Observations The test plan covers the four main layers correctly. A few gaps and one edge case worth addressing: **Edge case not covered: `totalPages` is 0 when first `Preprocessing` event arrives.** `OcrStreamEvent.Start` sets `totalPages.get()` in `OcrAsyncRunner`. However, the stream order is: ``` {"type": "start", "totalPages": 5} ← totalPages set here {"type": "preprocessing", "pageNumber": 1} ← used immediately after ``` This is correct — Start always precedes Preprocessing per the protocol. But the test for `runSingleDocument_updatesProgressOnPreprocessingEvent` should simulate both events in sequence (Start then Preprocessing) to test the realistic path. Testing Preprocessing alone would give `"PREPROCESSING_PAGE:1:0"`, which tests a degenerate case. **`translateOcrProgress.spec.ts` — existing mock pattern to follow:** The spec file mocks `$lib/paraglide/messages.js` inline. The new `ocr_status_preprocessing_page` mock should match this pattern exactly (the mock returns a function that accepts `{ current, total }` and returns a formatted string). **Missing test: non-streaming `/ocr` endpoint preprocesses silently.** The issue correctly states the non-streaming endpoint should preprocess without emitting events. There should be a test in `test_stream.py` (or a new `test_preprocessing_integration.py`) that calls `/ocr` directly and verifies the engine receives a preprocessed image (mock `preprocess_page` and verify it was called). **`OcrAsyncRunnerTest` — use `ArgumentCaptor` correctly:** The existing test file uses `verify(ocrJobRepository, atLeastOnce()).save(any())`. The Preprocessing progress test should use `ArgumentCaptor<OcrJob>` to capture all `save()` calls and assert that at least one call had `progressMessage` matching `"PREPROCESSING_PAGE:1:5"` — not just `"PREPROCESSING_PAGE:1:0"`. ### Recommendations - Add `test_preprocessing_called_for_ocr_endpoint` in `test_stream.py` (mock `preprocess_page`, call `/ocr`, assert it was invoked once per page). - In `runSingleDocument_updatesProgressOnPreprocessingEvent`, simulate the full event sequence: `Start(5)` → `Preprocessing(1)` → `Page(1, [])` to test realistic `totalPages` propagation. - Estimated CI impact: `test_preprocessing.py` adds ~4 fast unit tests (<1s total). No new Testcontainers, no new Docker layers in CI.

marcel commented

2026-04-17 12:07:22 +02:00

🎨 Leonie Voss — UX Design Lead

Observations

This is a backend/infrastructure change with a narrow frontend surface — only the progress spinner text. My concerns are limited but worth raising.

Dual-message pattern per page may feel choppy. For a 10-page document, users will see the spinner cycle through 20 messages: "Seite 1 von 10 wird vorverarbeitet…" → "Seite 1 von 10 wird analysiert…" → "Seite 2 von 10 wird vorverarbeitet…" → ... If preprocessing is fast (<200ms/page), these messages will flash so quickly that users may not be able to read them, which adds noise without meaningful information.

aria-live frequency. The page counter in +page.svelte has aria-live="polite". With preprocessing events, it will fire twice per page instead of once. Screen readers will announce each change — for a 10-page document, that's 20 announcements, which may be overwhelming for users relying on assistive technology.

"vorverarbeitet" is a technical term. It's accurate German but more jargon-heavy than needed. Users scanning the spinner don't need to know how the system is improving the image — they need to know progress is happening. "Seite X von Y wird aufbereitet…" is slightly more user-friendly, or even a combined phrase could work.

Recommendations

Consider whether preprocessing should be surfaced in the spinner at all. If CLAHE takes <200ms per page, the preprocessing message appears and disappears faster than a user can read it. An alternative: keep preprocessing silent (no spinner update), and let the "Analyzing page X of Y" message cover the full page processing time. This simplifies the frontend change to zero lines and the Spring Boot handler to a no-op for Preprocessing events.
If the message is surfaced, use a less technical label. Suggested: ocr_status_preprocessing_page → "Seite {current} von {total} wird aufbereitet…" (de), "Preparing page {current} of {total}…" (en).
Limit aria-live polling frequency or debounce updates to avoid announcing every intermediate state.

Open Decisions

Surface preprocessing in the spinner vs. silent processing. If preprocessing is fast (<200ms), users gain little by seeing "vorverarbeitet" flash briefly before "analysiert". Surfacing it adds UX complexity for questionable benefit. If preprocessing is slow (>500ms), it's worth showing. The right answer depends on real-world timing on your document corpus — something only you can measure once the feature ships.

## 🎨 Leonie Voss — UX Design Lead ### Observations This is a backend/infrastructure change with a narrow frontend surface — only the progress spinner text. My concerns are limited but worth raising. **Dual-message pattern per page may feel choppy.** For a 10-page document, users will see the spinner cycle through 20 messages: "Seite 1 von 10 wird vorverarbeitet…" → "Seite 1 von 10 wird analysiert…" → "Seite 2 von 10 wird vorverarbeitet…" → ... If preprocessing is fast (<200ms/page), these messages will flash so quickly that users may not be able to read them, which adds noise without meaningful information. **`aria-live` frequency.** The page counter in `+page.svelte` has `aria-live="polite"`. With preprocessing events, it will fire twice per page instead of once. Screen readers will announce each change — for a 10-page document, that's 20 announcements, which may be overwhelming for users relying on assistive technology. **"vorverarbeitet" is a technical term.** It's accurate German but more jargon-heavy than needed. Users scanning the spinner don't need to know *how* the system is improving the image — they need to know progress is happening. "Seite X von Y wird aufbereitet…" is slightly more user-friendly, or even a combined phrase could work. ### Recommendations - **Consider whether preprocessing should be surfaced in the spinner at all.** If CLAHE takes <200ms per page, the preprocessing message appears and disappears faster than a user can read it. An alternative: keep preprocessing silent (no spinner update), and let the "Analyzing page X of Y" message cover the full page processing time. This simplifies the frontend change to zero lines and the Spring Boot handler to a no-op for Preprocessing events. - If the message *is* surfaced, use a less technical label. Suggested: `ocr_status_preprocessing_page` → `"Seite {current} von {total} wird aufbereitet…"` (de), `"Preparing page {current} of {total}…"` (en). - Limit `aria-live` polling frequency or debounce updates to avoid announcing every intermediate state. ### Open Decisions - **Surface preprocessing in the spinner vs. silent processing.** If preprocessing is fast (<200ms), users gain little by seeing "vorverarbeitet" flash briefly before "analysiert". Surfacing it adds UX complexity for questionable benefit. If preprocessing is slow (>500ms), it's worth showing. The right answer depends on real-world timing on your document corpus — something only you can measure once the feature ships.

marcel commented

2026-04-17 12:07:33 +02:00

⚙️ Tobias Wendt — DevOps & Platform Engineer

Observations

Dockerfile: opencv-python-headless needs a system library. The current base image (python:3.11.9-slim) does not ship libglib2.0-0, which opencv-python-headless requires on Debian slim. The apt-get block in the Dockerfile needs to be extended:
```
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    libgomp1 \
    libvips42 \
    libglib2.0-0 \      # required by opencv-python-headless on slim
    && rm -rf /var/lib/apt/lists/*
```
Without this, import cv2 will raise ImportError: libglib-2.0.so.0: cannot open shared object file at container startup and models will never load.
Image rebuild time. Adding opencv-python-headless (~50MB wheel) to requirements.txt will extend the pip install layer in CI and on first deploy. Since the requirements layer is cached, this is a one-time cost. No concern.
Memory impact is negligible. CLAHE on a 200 DPI page image (roughly 1600×2000px) allocates ~10MB of numpy arrays temporarily. The container already has a 12GB hard limit. No change needed to mem_limit.
Health check timing is unaffected. cv2 imports at module load time (fast, <1s), not at startup model loading. The existing start_period: 120s is driven by PyTorch/Surya model loading, which is unchanged.
Rolling deployment is safe. Python emitting preprocessing events before Java is updated: hits the existing default -> log.debug(...) branch in parseNdjsonStream(). Java updated before Python: Python doesn't emit the new events yet. Both orderings are safe.
No new environment variables in docker-compose.yml. Per Markus's note on externalizing CLAHE parameters — if those env vars are added, they should follow the existing pattern in the environment: block with inline default comments explaining their purpose.

Recommendations

Add libglib2.0-0 to the Dockerfile apt-get block before the PR lands — this will otherwise cause a silent container startup failure that's annoying to debug in production.
Pin the opencv-python-headless version in requirements.txt (e.g., opencv-python-headless==4.11.0.86) for reproducible builds. Renovate will handle version bump PRs.

## ⚙️ Tobias Wendt — DevOps & Platform Engineer ### Observations - **Dockerfile: `opencv-python-headless` needs a system library.** The current base image (`python:3.11.9-slim`) does not ship `libglib2.0-0`, which `opencv-python-headless` requires on Debian slim. The `apt-get` block in the Dockerfile needs to be extended: ```dockerfile RUN apt-get update && apt-get install -y --no-install-recommends \ curl \ libgomp1 \ libvips42 \ libglib2.0-0 \ # required by opencv-python-headless on slim && rm -rf /var/lib/apt/lists/* ``` Without this, `import cv2` will raise `ImportError: libglib-2.0.so.0: cannot open shared object file` at container startup and models will never load. - **Image rebuild time.** Adding `opencv-python-headless` (~50MB wheel) to `requirements.txt` will extend the `pip install` layer in CI and on first deploy. Since the requirements layer is cached, this is a one-time cost. No concern. - **Memory impact is negligible.** CLAHE on a 200 DPI page image (roughly 1600×2000px) allocates ~10MB of numpy arrays temporarily. The container already has a 12GB hard limit. No change needed to `mem_limit`. - **Health check timing is unaffected.** `cv2` imports at module load time (fast, <1s), not at startup model loading. The existing `start_period: 120s` is driven by PyTorch/Surya model loading, which is unchanged. - **Rolling deployment is safe.** Python emitting `preprocessing` events before Java is updated: hits the existing `default -> log.debug(...)` branch in `parseNdjsonStream()`. Java updated before Python: Python doesn't emit the new events yet. Both orderings are safe. - **No new environment variables in `docker-compose.yml`.** Per Markus's note on externalizing CLAHE parameters — if those env vars are added, they should follow the existing pattern in the `environment:` block with inline default comments explaining their purpose. ### Recommendations - Add `libglib2.0-0` to the Dockerfile `apt-get` block before the PR lands — this will otherwise cause a silent container startup failure that's annoying to debug in production. - Pin the `opencv-python-headless` version in `requirements.txt` (e.g., `opencv-python-headless==4.11.0.86`) for reproducible builds. Renovate will handle version bump PRs.

marcel commented

2026-04-17 12:07:43 +02:00

🗳️ Decision Queue — Action Required

1 decision needs your input before implementation starts.

UX

Surface preprocessing in the spinner vs. silent processing — If CLAHE runs fast (<200ms/page), the "Seite X von Y wird vorverarbeitet…" message flashes too briefly to read, adding noise and doubling screen reader announcements per page. Option A: show the message as planned (transparent to user, honest about work done). Option B: handle the Preprocessing event silently in OcrAsyncRunner (no progressMessage update) — the spinner stays on "Analyzing page X of Y" for the full page time, which is cleaner but less accurate. (Raised by: Leonie)

## 🗳️ Decision Queue — Action Required _1 decision needs your input before implementation starts._ ### UX - **Surface preprocessing in the spinner vs. silent processing** — If CLAHE runs fast (<200ms/page), the "Seite X von Y wird vorverarbeitet…" message flashes too briefly to read, adding noise and doubling screen reader announcements per page. Option A: show the message as planned (transparent to user, honest about work done). Option B: handle the `Preprocessing` event silently in `OcrAsyncRunner` (no `progressMessage` update) — the spinner stays on "Analyzing page X of Y" for the full page time, which is cleaner but less accurate. _(Raised by: Leonie)_

marcel commented

2026-04-17 12:25:52 +02:00

🎨 Leonie Voss — UX Design Lead (follow-up discussion)

Worked through the four open UX items with Marcel. All resolved or delegated.

Resolved

Surface preprocessing in spinner? → Option A: show it. Ship the preprocessing message transparently; if real-world CLAHE timing turns out to be too fast to read (<200ms), remove the spinner update in a follow-up. No premature optimisation.
Word choice → Use "aufbereitet" (not "vorverarbeitet") for all three locales:
- de: "Seite {current} von {total} wird aufbereitet…"
- en: "Preparing page {current} of {total}…"
- es: "Preparando página {current} de {total}…"
aria-live frequency → The preprocessing spinner update should live outside the aria-live="polite" region. Sighted users see both messages; screen readers only announce the OCR phase ("Analysiere Seite X von Y") — keeping announcements at one per page instead of two.

Delegated to implementation

320px wrap check → Felix to verify the longest string ("Seite 10 von 10 wird aufbereitet…") wraps cleanly in the spinner layout at 320px viewport width. No fixed-width constraints should clip or overflow it.

Overall this is a well-scoped change with a minimal frontend surface. The word choice and aria-live boundary are the only things that needed explicit decisions — both are now settled.

## 🎨 Leonie Voss — UX Design Lead (follow-up discussion) Worked through the four open UX items with Marcel. All resolved or delegated. ### Resolved - **Surface preprocessing in spinner?** → **Option A: show it.** Ship the preprocessing message transparently; if real-world CLAHE timing turns out to be too fast to read (<200ms), remove the spinner update in a follow-up. No premature optimisation. - **Word choice** → Use **"aufbereitet"** (not "vorverarbeitet") for all three locales: - `de`: `"Seite {current} von {total} wird aufbereitet…"` - `en`: `"Preparing page {current} of {total}…"` - `es`: `"Preparando página {current} de {total}…"` - **`aria-live` frequency** → The preprocessing spinner update should live **outside** the `aria-live="polite"` region. Sighted users see both messages; screen readers only announce the OCR phase ("Analysiere Seite X von Y") — keeping announcements at one per page instead of two. ### Delegated to implementation - **320px wrap check** → Felix to verify the longest string ("Seite 10 von 10 wird aufbereitet…") wraps cleanly in the spinner layout at 320px viewport width. No fixed-width constraints should clip or overflow it. --- Overall this is a well-scoped change with a minimal frontend surface. The word choice and `aria-live` boundary are the only things that needed explicit decisions — both are now settled.

marcel referenced this issue

2026-04-17 12:29:00 +02:00

feat(ocr): per-sender specialized Kurrent models with automatic active-learning retraining #253

marcel commented

2026-04-17 14:29:04 +02:00

✅ Implementation complete — branch `feat/issue-252-ocr-preprocessing`

What was implemented

5 commits across the full stack:

Python (`ocr-service/`)

preprocessing.py — preprocess_page(image): CLAHE on L channel (LAB), grayscale, Gaussian blur (3×3). Explicit del of numpy intermediates to keep peak memory at ~10MB/page. Falls back to original image on any OpenCV error and logs type(e).__name__ + message at WARNING. CLAHE parameters externalised as OCR_CLAHE_CLIP_LIMIT / OCR_CLAHE_TILE_SIZE env vars.
test_preprocessing.py — 3 unit tests: output dimensions, L-channel mean increase on a dark synthetic yellowed image, pixel-identical fallback on cv2 error.
main.py — generate(): emits preprocessing event then preprocesses each page before OCR. generate_guided(): preprocesses once per page before the region loop (not per region). /ocr endpoint: preprocesses each image in-place, no event emitted.
test_stream.py — 3 integration tests covering all three paths above.
requirements.txt — opencv-python-headless==4.11.0.86 (pinned for reproducible builds).
Dockerfile — libglib2.0-0 added to apt-get block (required by opencv on Debian slim).

Infrastructure

docker-compose.yml — OCR_CLAHE_CLIP_LIMIT: "2.0" and OCR_CLAHE_TILE_SIZE: "8" with inline comments explaining their purpose.

Java (`backend/`)

OcrStreamEvent.java — record Preprocessing(int pageNumber) added to sealed interface.
RestClientOcrClient.java — "preprocessing" NDJSON type dispatches OcrStreamEvent.Preprocessing.
OcrAsyncRunner.java — Preprocessing event → updateProgress("PREPROCESSING_PAGE:X:Y") using totalPages already set by the preceding Start event (so totalPages is never 0).
Tests: RestClientOcrClientStreamTest, OcrStreamEventTest, OcrAsyncRunnerTest — 4 new test cases.

Frontend

translateOcrProgress.ts — PREPROCESSING_PAGE:X:Y case returns { message, currentPage, total }.
translateOcrProgress.spec.ts — new test: PREPROCESSING_PAGE:3:10 → { currentPage: 3, totalPages: 10 }.
messages/de.json — "ocr_status_preprocessing_page": "Seite {current} von {total} wird aufbereitet…" ("aufbereitet" per Leonie's word choice decision).
messages/en.json — "Preparing page {current} of {total}…"
messages/es.json — "Preparando página {current} de {total}…"

The ocrProgressMessage in +page.svelte is rendered in a plain <p> tag outside any aria-live region, so the preprocessing message does not double screen reader announcements — matching Leonie's accessibility requirement.

320px wrap check: "Seite 10 von 10 wird aufbereitet…" (37 chars) in text-sm with text-center wraps cleanly at 320px — no fixed-width constraint clips it.

Test results

Python: 40 passed (40/40)
Java: 1016 passed (1016/1016)
Frontend: all translateOcrProgress.spec.ts tests green

Open items

None — all reviewer concerns addressed. The "surface preprocessing in spinner vs. silent" decision (Option A: show it) is implemented. If real-world CLAHE timing proves too fast to read, removing the spinner update is a one-line change in OcrAsyncRunner.

## ✅ Implementation complete — branch `feat/issue-252-ocr-preprocessing` ### What was implemented **5 commits across the full stack:** #### Python (`ocr-service/`) - `preprocessing.py` — `preprocess_page(image)`: CLAHE on L channel (LAB), grayscale, Gaussian blur (3×3). Explicit `del` of numpy intermediates to keep peak memory at ~10MB/page. Falls back to original image on any OpenCV error and logs `type(e).__name__` + message at WARNING. CLAHE parameters externalised as `OCR_CLAHE_CLIP_LIMIT` / `OCR_CLAHE_TILE_SIZE` env vars. - `test_preprocessing.py` — 3 unit tests: output dimensions, L-channel mean increase on a dark synthetic yellowed image, pixel-identical fallback on `cv2` error. - `main.py` — `generate()`: emits `preprocessing` event then preprocesses each page before OCR. `generate_guided()`: preprocesses once per page before the region loop (not per region). `/ocr` endpoint: preprocesses each image in-place, no event emitted. - `test_stream.py` — 3 integration tests covering all three paths above. - `requirements.txt` — `opencv-python-headless==4.11.0.86` (pinned for reproducible builds). - `Dockerfile` — `libglib2.0-0` added to `apt-get` block (required by opencv on Debian slim). #### Infrastructure - `docker-compose.yml` — `OCR_CLAHE_CLIP_LIMIT: "2.0"` and `OCR_CLAHE_TILE_SIZE: "8"` with inline comments explaining their purpose. #### Java (`backend/`) - `OcrStreamEvent.java` — `record Preprocessing(int pageNumber)` added to sealed interface. - `RestClientOcrClient.java` — `"preprocessing"` NDJSON type dispatches `OcrStreamEvent.Preprocessing`. - `OcrAsyncRunner.java` — `Preprocessing` event → `updateProgress("PREPROCESSING_PAGE:X:Y")` using `totalPages` already set by the preceding `Start` event (so totalPages is never 0). - Tests: `RestClientOcrClientStreamTest`, `OcrStreamEventTest`, `OcrAsyncRunnerTest` — 4 new test cases. #### Frontend - `translateOcrProgress.ts` — `PREPROCESSING_PAGE:X:Y` case returns `{ message, currentPage, total }`. - `translateOcrProgress.spec.ts` — new test: `PREPROCESSING_PAGE:3:10` → `{ currentPage: 3, totalPages: 10 }`. - `messages/de.json` — `"ocr_status_preprocessing_page": "Seite {current} von {total} wird aufbereitet…"` ("aufbereitet" per Leonie's word choice decision). - `messages/en.json` — `"Preparing page {current} of {total}…"` - `messages/es.json` — `"Preparando página {current} de {total}…"` The `ocrProgressMessage` in `+page.svelte` is rendered in a plain `<p>` tag outside any `aria-live` region, so the preprocessing message does not double screen reader announcements — matching Leonie's accessibility requirement. **320px wrap check:** "Seite 10 von 10 wird aufbereitet…" (37 chars) in `text-sm` with `text-center` wraps cleanly at 320px — no fixed-width constraint clips it. ### Test results - Python: **40 passed** (40/40) - Java: **1016 passed** (1016/1016) - Frontend: all `translateOcrProgress.spec.ts` tests green ### Open items None — all reviewer concerns addressed. The "surface preprocessing in spinner vs. silent" decision (Option A: show it) is implemented. If real-world CLAHE timing proves too fast to read, removing the spinner update is a one-line change in `OcrAsyncRunner`.

marcel referenced a pull request that will close this issue

2026-04-17 15:04:32 +02:00

feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #255

marcel closed this issue

2026-04-17 15:50:38 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#252

feat(ocr): add image preprocessing pipeline to improve transcription quality on aged documents #252

Problem

Solution

Progress Spinner

Changes

Tests

No API Changes

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

Recommendations

🏗️ Markus Keller — Senior Application Architect

Observations

Recommendations

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

Recommendations

🧪 Sara Holt — Senior QA Engineer

Observations

Recommendations

🎨 Leonie Voss — UX Design Lead

Observations

Recommendations

Open Decisions

⚙️ Tobias Wendt — DevOps & Platform Engineer

Observations

Recommendations

🗳️ Decision Queue — Action Required

UX

🎨 Leonie Voss — UX Design Lead (follow-up discussion)

Resolved

Delegated to implementation

✅ Implementation complete — branch feat/issue-252-ocr-preprocessing

What was implemented

Python (ocr-service/)

Infrastructure

Java (backend/)

Frontend

Test results

Open items

✅ Implementation complete — branch `feat/issue-252-ocr-preprocessing`

Python (`ocr-service/`)

Java (`backend/`)