familienarchiv

Author	SHA1	Message	Date
Marcel	23cf88856e	fix(ocr): guard Kraken block extraction against missing boundary/baseline Some checks failed CI / Unit & Component Tests (push) Failing after 2m37s Details CI / OCR Service Tests (push) Successful in 32s Details CI / Backend Unit Tests (push) Failing after 2m51s Details extract_page_blocks() walked `record.boundary` and `record.baseline` unconditionally, so a record that arrived without either (malformed kraken output, or a MagicMock in tests that iterates to nothing) crashed with "min() arg is an empty sequence". Coerce both attributes through list(), require at least 3 points for the polygon path, fall back to the baseline path when the polygon is missing, and skip the record entirely when neither is usable — emitting no block is safer than emitting one with garbage coordinates. The test helper now sets `boundary` and `baseline` explicitly to mirror real Kraken 7.0 records (and so the happy-path test exercises the polygon branch). A new regression test covers the skip path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:33:03 +02:00
Marcel	1f7b712dd0	fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works Some checks failed CI / Unit & Component Tests (push) Failing after 2m36s Details CI / OCR Service Tests (push) Successful in 33s Details CI / Backend Unit Tests (push) Has started running Details main.py unifies the call to both engines and always passes `sender_model_path` (None for non-Kurrent scripts). Surya's extract_region_text / extract_page_blocks accepted one fewer positional arg than Kraken's, so every guided-OCR run on a TYPEWRITER or HANDWRITING_LATIN document raised "takes 5 positional arguments but 6 were given" and the stream returned 0 blocks / 1 skipped page. Add an ignored `sender_model_path` kwarg to both Surya functions so the signatures match Kraken's, and guard the regression with two signature tests in test_engines.py that compare both engines' parameter lists. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:28:25 +02:00
Marcel	69768a104d	test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers Cover Surya polygon/word-level extraction, health endpoint states, Kraken script-type routing, 503 when models not ready, 400 when Kraken unavailable for Kurrent, and confidence marker application during streaming. Production code coverage: 88%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:34:23 +02:00
Marcel	97e5138934	fix(ocr): use 1-based page numbers to match frontend PDF viewer The PDF viewer uses 1-based currentPage (starting at 1) but the OCR engines produced 0-based pageNumber from enumerate(). Annotations created by OCR were assigned to page 0, which doesn't exist in the viewer. Change enumerate() to start=1 in both engines and the streaming endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:32:08 +02:00
Marcel	b7d5f71ef7	refactor(ocr): extract extract_page_blocks() from both OCR engines Enable per-page processing by extracting the inner loop body of extract_blocks() into extract_page_blocks(image, page_idx, language). The original extract_blocks() now delegates to the new function, preserving backward compatibility for the batch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:56:34 +02:00

5 Commits