fix(ocr): guard Kraken block extraction against missing boundary/baseline

extract_page_blocks() walked `record.boundary` and `record.baseline`
unconditionally, so a record that arrived without either (malformed
kraken output, or a MagicMock in tests that iterates to nothing)
crashed with "min() arg is an empty sequence".

Coerce both attributes through list(), require at least 3 points for
the polygon path, fall back to the baseline path when the polygon is
missing, and skip the record entirely when neither is usable —
emitting no block is safer than emitting one with garbage coordinates.

The test helper now sets `boundary` and `baseline` explicitly to
mirror real Kraken 7.0 records (and so the happy-path test exercises
the polygon branch). A new regression test covers the skip path.
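
To illustrate why the helper has to set both attributes explicitly, the sketch below reproduces the failure mode with a bare `MagicMock`. The names `make_record` and `old_style_xs` are hypothetical and are not taken from the test suite; `old_style_xs` only mirrors the pre-fix truthiness check.

```python
from unittest.mock import MagicMock


def make_record(boundary=None, baseline=None):
    """Hypothetical test helper: a mock record with explicit geometry."""
    record = MagicMock()
    record.boundary = boundary
    record.baseline = baseline
    return record


def old_style_xs(record):
    """Pre-fix pattern: trust truthiness, then iterate the attribute directly."""
    pts = record.boundary if hasattr(record, "boundary") and record.boundary else []
    return [p[0] for p in pts]


# A bare MagicMock auto-creates `boundary` as a truthy child mock that
# iterates to nothing, so the old check passes but no points come out.
bare = MagicMock()
xs = old_style_xs(bare)

try:
    min(xs)
    crashed = False
except ValueError:
    # min() on an empty list raises ValueError (exact wording varies by Python version).
    crashed = True

explicit = make_record(
    boundary=[(0, 0), (10, 0), (10, 5)], baseline=[(0, 4), (10, 4)]
)
print(crashed, old_style_xs(explicit))
```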

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-04-23 09:33:03 +02:00
parent 1f7b712dd0
commit 23cf88856e
2 changed files with 48 additions and 5 deletions

@@ -122,16 +122,27 @@ def extract_page_blocks(image: Image, page_idx: int, language: str = "de",
     pred_it = rpred.rpred(active_model, image, baseline_seg)
     for record in pred_it:
-        polygon_pts = record.boundary if hasattr(record, "boundary") and record.boundary else []
+        # Coerce via list() so unexpected shapes (None, truthy mocks that
+        # iterate to nothing, empty lists) collapse to [] and can't blow up
+        # the min/max calls below.
+        boundary_attr = getattr(record, "boundary", None)
+        polygon_pts = list(boundary_attr) if boundary_attr else []
-        if polygon_pts:
+        if len(polygon_pts) >= 3:
             xs = [p[0] for p in polygon_pts]
             ys = [p[1] for p in polygon_pts]
             x1, y1 = min(xs), min(ys)
             x2, y2 = max(xs), max(ys)
         else:
-            xs = [p[0] for p in record.baseline]
-            ys = [p[1] for p in record.baseline]
+            baseline_attr = getattr(record, "baseline", None)
+            baseline_pts = list(baseline_attr) if baseline_attr else []
+            if not baseline_pts:
+                # No polygon and no baseline — we have no way to place the
+                # block on the page, so drop it rather than emit garbage
+                # coordinates.
+                continue
+            xs = [p[0] for p in baseline_pts]
+            ys = [p[1] for p in baseline_pts]
             x1, y1 = min(xs), min(ys) - 5
             x2, y2 = max(xs), max(ys) + 5