fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works

main.py unifies the call to both engines and always passes
`sender_model_path` (None for non-Kurrent scripts). Surya's
extract_region_text / extract_page_blocks accepted one fewer positional
arg than Kraken's, so every guided-OCR run on a TYPEWRITER or
HANDWRITING_LATIN document raised "takes 5 positional arguments but 6
were given" and the stream returned 0 blocks / 1 skipped page.
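The failure mode described above can be reproduced in isolation. The sketch below is illustrative only: the two `*_extract_region_text` functions and `run_guided_ocr` are hypothetical stand-ins with placeholder bodies, not the real engine modules or the real call site in main.py.

```python
# Hypothetical stand-ins for the pre-fix signatures; bodies are placeholders.
def kraken_extract_region_text(image, x, y, w, h, sender_model_path=None):
    return "kraken"


def surya_extract_region_text_old(image, x, y, w, h):
    return "surya"


def run_guided_ocr(extract, image, region, sender_model_path=None):
    """Mimic a unified call site that always forwards sender_model_path."""
    x, y, w, h = region
    # The sixth positional argument is passed even when it is None.
    return extract(image, x, y, w, h, sender_model_path)


def try_call(extract):
    """Return the engine result, or the TypeError text if the call fails."""
    try:
        return run_guided_ocr(extract, image=None, region=(0.1, 0.2, 0.3, 0.4))
    except TypeError as exc:
        return f"TypeError: {exc}"


kraken_result = try_call(kraken_extract_region_text)
surya_result = try_call(surya_extract_region_text_old)
```

Here `kraken_result` is the placeholder string, while `surya_result` carries the "takes 5 positional arguments but 6 were given" message, because passing `None` positionally still counts as an argument.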

Add an optional, ignored `sender_model_path` parameter (default None) to
both Surya functions so the signatures match Kraken's, and guard against
regressions with two signature tests in test_engines.py that compare both
engines' parameter lists.
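A signature-parity test of this kind could look roughly like the following. This is a sketch under assumptions, not the contents of test_engines.py: the two functions are hypothetical stand-ins defined inline, and `parameter_names` is a helper introduced here for illustration.

```python
import inspect


# Hypothetical stand-ins for the two engines' post-fix functions.
def kraken_extract_region_text(image, x, y, w, h, sender_model_path=None):
    return ""


def surya_extract_region_text(image, x, y, w, h, sender_model_path=None):
    del sender_model_path  # accepted for parity, unused
    return ""


def parameter_names(func):
    """Return the function's parameter names in declaration order."""
    return list(inspect.signature(func).parameters)


def test_extract_region_text_signatures_match():
    # Comparing ordered names catches a missing or reordered parameter.
    assert parameter_names(surya_extract_region_text) == parameter_names(
        kraken_extract_region_text
    )
```

Comparing names only is the loosest useful check; it does not compare defaults or annotations, which a stricter test could also compare via `inspect.Parameter`.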

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-04-23 09:28:25 +02:00
parent 90f111fcb1
commit 1f7b712dd0
2 changed files with 40 additions and 3 deletions


@@ -33,11 +33,16 @@ def load_models():
     logger.info("Surya models loaded successfully")


-def extract_page_blocks(image, page_idx: int, language: str = "de") -> list[dict]:
+def extract_page_blocks(
+    image, page_idx: int, language: str = "de", sender_model_path: str | None = None
+) -> list[dict]:
     """Run Surya OCR on a single PIL image and return block dicts for that page.

+    `sender_model_path` is accepted for signature parity with the Kraken engine
+    (which uses it to select a fine-tuned HTR model) and is ignored here.
     Coordinates are normalized to [0, 1].
     """
+    del sender_model_path
     load_models()

     page_w, page_h = image.size
@@ -81,12 +86,22 @@ def extract_page_blocks(image, page_idx: int, language: str = "de") -> list[dict
     return blocks


-def extract_region_text(image, x: float, y: float, w: float, h: float) -> str:
+def extract_region_text(
+    image,
+    x: float,
+    y: float,
+    w: float,
+    h: float,
+    sender_model_path: str | None = None,
+) -> str:
     """Crop image to a normalized region and run Surya recognition on the crop.

     Used for guided OCR — skips full-page layout detection and only processes
-    the given bounding box. Coordinates are normalized to [0, 1].
+    the given bounding box. `sender_model_path` is accepted for signature
+    parity with the Kraken engine and is ignored here. Coordinates are
+    normalized to [0, 1].
     """
+    del sender_model_path
     load_models()

     pw, ph = image.size