test(ocr): add integration tests for spell-check routing in main.py #262

Open
opened 2026-04-17 17:24:01 +02:00 by marcel · 0 comments
Owner

Background

Deferred during PR #260 review cycle 1.

Concern

The spell-check post-processing is wired into three code paths in main.py:

  • Block mode (run_ocr)
  • Stream mode full-page (generate() inside run_ocr_stream)
  • Guided mode (generate_guided())

None of these wiring points currently has test coverage verifying that correct_text() is called when scriptType is HANDWRITING_KURRENT or HANDWRITING_LATIN and that it is not called for TYPEWRITER.

Why deferred

The full ML stack (Kraken, Surya, model files) is not available in CI, or locally, without GPU provisioning. A minimal smoke test could mock out the OCR engines with unittest.mock.patch and drive the app through httpx.AsyncClient with ASGITransport, but this was out of scope for the initial feature PR.

Suggested approach

Write three parametrized tests (block / stream / guided mode) using ASGITransport and patching:

  • main._download_and_convert_pdf → returns a fake PIL image list
  • main.kraken_engine.extract_blocks → returns a fake block with text
  • main.correct_text → verify it is called once per block for handwriting types, zero times for typewriter
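
The patching scheme above could look roughly like the sketch below. It is stdlib-only so it runs without the ML stack, and it is not the real main.py: the `main` namespace, the body of `run_ocr`, the dict-shaped blocks, the flat `extract_blocks` name (standing in for `main.kraken_engine.extract_blocks`), and the helper `count_correct_text_calls` are all assumptions made for illustration. Only the patch targets and the expected call counts come from this issue.

```python
import types
from unittest import mock

# Script types that should be routed through the spell checker (from this issue).
HANDWRITING_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}

# Hypothetical stand-in for the real `main` module, so the sketch is self-contained.
main = types.SimpleNamespace()
main._download_and_convert_pdf = lambda url: []   # real one returns PIL images
main.extract_blocks = lambda image: []            # stands in for kraken_engine.extract_blocks
main.correct_text = lambda text: text             # real one performs spell correction


def run_ocr(pdf_url, script_type):
    """Assumed shape of the block-mode routing: spell-check handwriting only."""
    texts = []
    for image in main._download_and_convert_pdf(pdf_url):
        for block in main.extract_blocks(image):
            text = block["text"]
            if script_type in HANDWRITING_TYPES:
                text = main.correct_text(text)
            texts.append(text)
    return texts


main.run_ocr = run_ocr


def count_correct_text_calls(script_type):
    """Patch the three seams named in the issue and count correct_text() calls."""
    fake_blocks = [{"text": "Liebe Mutter"}, {"text": "Dein Sohn"}]
    with mock.patch.object(main, "_download_and_convert_pdf", return_value=["fake-page"]), \
         mock.patch.object(main, "extract_blocks", return_value=fake_blocks), \
         mock.patch.object(main, "correct_text", side_effect=lambda t: t) as spy:
        main.run_ocr("http://example.invalid/doc.pdf", script_type)
        return spy.call_count


# One fake page with two fake blocks: expect one call per block for handwriting.
for script_type in ("HANDWRITING_KURRENT", "HANDWRITING_LATIN", "TYPEWRITER"):
    print(script_type, count_correct_text_calls(script_type))
```

In the real tests the direct `main.run_ocr(...)` call would presumably be replaced by a request sent through `httpx.AsyncClient` with `ASGITransport` against the application object in main.py (its name is not given in this issue), and the loop over script types would become the parametrization. The stream and guided variants would additionally need to consume the streamed response before asserting on the call count.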

Reference

PR: http://heim-nas:3005/marcel/familienarchiv/pulls/260
Raised by: @saraholt in PR review

Reference: marcel/familienarchiv#262