fix(ocr): guard Kraken block extraction against missing boundary/baseline

extract_page_blocks() walked `record.boundary` and `record.baseline`
unconditionally, so a record that arrived without either (malformed
kraken output, or a MagicMock in tests that iterates to nothing)
crashed with "min() arg is an empty sequence".
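
The failure mode is easy to reproduce in isolation. The sketch below is not the project's code: `bounding_box_unguarded` and `crashes_on` are hypothetical names that only mimic an unguarded min()/max() walk over `record.boundary`. It relies on the documented behaviour that a MagicMock iterates as an empty sequence by default.

```python
from unittest.mock import MagicMock


def bounding_box_unguarded(record):
    """Hypothetical stand-in for an unguarded walk over record.boundary."""
    xs = [x for x, _ in record.boundary]
    ys = [y for _, y in record.boundary]
    # min()/max() raise ValueError when the lists are empty.
    return min(xs), min(ys), max(xs), max(ys)


def crashes_on(record) -> bool:
    """Return True when the unguarded walk raises ValueError."""
    try:
        bounding_box_unguarded(record)
    except ValueError:
        return True
    return False


mock_record = MagicMock()
# A MagicMock attribute is itself a MagicMock, which iterates to nothing.
print(list(mock_record.boundary))  # []
print(crashes_on(mock_record))     # True
```

The exact wording of the ValueError message differs between Python versions, so the sketch checks the exception type rather than the text.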

Coerce both attributes through list(), require at least 3 points for
the polygon path, fall back to the baseline path when the polygon is
missing, and skip the record entirely when neither is usable:
emitting no block is safer than emitting one with garbage coordinates.
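
The decision order described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual body of extract_page_blocks(): the helper names `_as_points` and `bbox_for_record` are hypothetical, the return shape (an x0, y0, x1, y1 tuple) is assumed, and the sketch accepts any non-empty baseline because the commit message does not state a minimum point count for that path.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def _as_points(value) -> List[Point]:
    """Coerce an attribute through list(); unusable input becomes []."""
    try:
        return [(p[0], p[1]) for p in list(value)]
    except (TypeError, IndexError):
        # None, non-iterables, or malformed points count as missing data.
        return []


def bbox_for_record(boundary, baseline) -> Optional[BBox]:
    """Pick polygon, then baseline, else signal 'skip this record' with None."""
    polygon = _as_points(boundary)
    line = _as_points(baseline)
    if len(polygon) >= 3:      # a polygon needs at least 3 points
        points = polygon
    elif line:                 # assumption: any non-empty baseline is usable
        points = line
    else:
        return None            # caller drops the record instead of crashing
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


print(bbox_for_record([(0, 0), (10, 0), (10, 5)], []))  # (0, 0, 10, 5)
print(bbox_for_record([], [(2, 3), (8, 4)]))            # (2, 3, 8, 4)
print(bbox_for_record([], []))                          # None
```

Returning None keeps the skip decision with the caller, which can then omit the block rather than emit placeholder coordinates.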

The test helper now sets `boundary` and `baseline` explicitly to
mirror real Kraken 7.0 records (and so the happy-path test exercises
the polygon branch). A new regression test covers the skip path.
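
The regression test swaps entries in sys.modules by hand and restores them in a finally block. As a possible alternative (not what this commit does), `unittest.mock.patch.dict` can restore the dictionary automatically on exit. The module name `fake_ocr_backend` and the helper `run_with_stub` below are hypothetical and exist only for the illustration.

```python
import sys
from unittest.mock import MagicMock, patch

FAKE_NAME = "fake_ocr_backend"  # hypothetical module name, not a real package


def run_with_stub():
    """Install a stub module for the duration of the with-block only."""
    stub = MagicMock()
    with patch.dict(sys.modules, {FAKE_NAME: stub}):
        # Inside the block the stub is what an import of FAKE_NAME would find.
        installed = sys.modules[FAKE_NAME] is stub
    # On exit patch.dict restores sys.modules to its previous contents.
    still_there = FAKE_NAME in sys.modules
    return installed, still_there


print(run_with_stub())  # (True, False)
```

The restore also happens when the body raises, which is the same guarantee the try/finally in the test provides.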

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-04-23 09:33:03 +02:00
parent 1f7b712dd0
commit 23cf88856e
2 changed files with 48 additions and 5 deletions


@@ -115,11 +115,16 @@ def test_surya_extract_blocks_delegates_to_extract_page_blocks():
 # ─── Kraken extract_page_blocks ──────────────────────────────────────────────
-def _make_kraken_record(text, cuts, confidences=None):
+def _make_kraken_record(text, cuts, confidences=None, boundary=None, baseline=None):
     record = MagicMock()
     record.prediction = text
     record.cuts = cuts
     record.line = cuts
+    # Real kraken records expose `boundary` (polygon) and `baseline` lists.
+    # Mirror that here so the extract path doesn't take the "missing data"
+    # branch during normal tests.
+    record.boundary = boundary if boundary is not None else cuts
+    record.baseline = baseline if baseline is not None else cuts
     record.confidences = confidences or [0.9] * len(text)
     return record
@@ -179,6 +184,33 @@ def test_kraken_extract_blocks_delegates_to_extract_page_blocks():
     assert blocks[1]["pageNumber"] == 2
+def test_kraken_extract_page_blocks_skips_records_without_positional_data():
+    """Records that arrive without a usable boundary polygon OR baseline must
+    be dropped rather than crash min() with an empty sequence."""
+    import sys
+    image = Image.new("RGB", (100, 200))
+    mock_blla = MagicMock()
+    mock_blla.segment.return_value = MagicMock()
+    mock_rpred = MagicMock()
+    malformed = _make_kraken_record("Noise", [], boundary=[], baseline=[])
+    mock_rpred.rpred.return_value = [malformed]
+    sys.modules["kraken"] = MagicMock(blla=mock_blla, rpred=mock_rpred)
+    sys.modules["kraken.blla"] = mock_blla
+    sys.modules["kraken.rpred"] = mock_rpred
+    try:
+        with patch.object(kraken, "_model", MagicMock()):
+            blocks = kraken.extract_page_blocks(image, page_idx=1, language="de")
+    finally:
+        sys.modules.pop("kraken", None)
+        sys.modules.pop("kraken.blla", None)
+        sys.modules.pop("kraken.rpred", None)
+    assert blocks == []
 # ─── Engine signatures must match ─────────────────────────────────────────────
 #
 # main.py resolves `engine = kraken_engine if use_kraken else surya_engine` and