Commit Graph

4 Commits

Author SHA1 Message Date
Marcel
1f7b712dd0 fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m36s
CI / OCR Service Tests (push) Successful in 33s
CI / Backend Unit Tests (push) Has started running
main.py unifies the call to both engines and always passes
`sender_model_path` (None for non-Kurrent scripts). Surya's
extract_region_text / extract_page_blocks accepted one fewer positional
arg than Kraken's, so every guided-OCR run on a TYPEWRITER or
HANDWRITING_LATIN document raised "takes 5 positional arguments but 6
were given" and the stream returned 0 blocks / 1 skipped page.

Add an ignored `sender_model_path` kwarg to both Surya functions so the
signatures match Kraken's, and guard the regression with two signature
tests in test_engines.py that compare both engines' parameter lists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 09:28:25 +02:00
Marcel
69768a104d test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:34:23 +02:00
Marcel
97e5138934 fix(ocr): use 1-based page numbers to match frontend PDF viewer
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:32:08 +02:00
Marcel
b7d5f71ef7 refactor(ocr): extract extract_page_blocks() from both OCR engines
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:56:34 +02:00