familienarchiv/ocr-service/engines/surya.py
Marcel c74539b04b
feat(ocr): auto-insert [unleserlich] markers for low-confidence words
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data
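
A minimal sketch of the marker step described above, not the actual contents of confidence.py. The word dict shape (`text`, `confidence`) matches what the engine wrapper emits; the space-joined return value and the handling of a missing confidence are assumptions.

```python
MARKER = "[unleserlich]"


def apply_confidence_markers(words, threshold=0.3):
    """Replace low-confidence words with MARKER and collapse adjacent markers.

    `words` is a list of {"text": str, "confidence": float | None} dicts.
    Assumption: a missing confidence (None) is treated as trustworthy.
    """
    tokens = []
    for word in words:
        confidence = word.get("confidence")
        is_low = confidence is not None and confidence < threshold
        token = MARKER if is_low else word["text"]
        # Collapse runs of markers so several unreadable words yield one marker.
        if token == MARKER and tokens and tokens[-1] == MARKER:
            continue
        tokens.append(token)
    # Assumption: words are re-joined with single spaces.
    return " ".join(tokens)
```

With the default threshold, two adjacent low-confidence words between two readable ones collapse into a single marker.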

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.
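
The character-to-word reconstruction for Kraken could look roughly like the sketch below. It takes the line text and a parallel list of per-character confidences as plain arguments rather than a Kraken record. Splitting on whitespace and averaging the character scores are assumptions; the real module may aggregate differently (for example by minimum).

```python
def words_from_characters(text, confidences):
    """Group per-character confidences into word-level entries.

    Assumptions: `text` and `confidences` have one entry per character,
    words are separated by whitespace, and a word's confidence is the
    mean of its characters' confidences.
    """
    words = []
    chars, scores = [], []

    def flush():
        # Emit the word collected so far, if any.
        if chars:
            words.append({
                "text": "".join(chars),
                "confidence": sum(scores) / len(scores),
            })
            chars.clear()
            scores.clear()

    for ch, conf in zip(text, confidences):
        if ch.isspace():
            flush()
        else:
            chars.append(ch)
            scores.append(conf)
    flush()
    return words
```

The result has the same `text`/`confidence` shape as the Surya word list, so both engines can feed the same post-processing step.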

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.
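
One way the threshold could be read from the environment is sketched below. Only the variable name and the 0.3 default come from the commit message; the helper name `read_confidence_threshold` and the fallback and clamping behaviour are hypothetical.

```python
import math
import os


def read_confidence_threshold(default=0.3):
    """Read OCR_CONFIDENCE_THRESHOLD, falling back to `default` on bad input.

    Hypothetical helper: unset, empty, non-numeric, or NaN values use the
    default, and the result is clamped to [0, 1].
    """
    raw = os.environ.get("OCR_CONFIDENCE_THRESHOLD")
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    if math.isnan(value):
        return default
    return min(max(value, 0.0), 1.0)
```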

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:16:17 +02:00


"""Surya OCR engine wrapper — transformer-based, handles typewritten and modern Latin handwriting."""

import logging

logger = logging.getLogger(__name__)

_recognition_predictor = None
_detection_predictor = None


def load_models():
    """Eagerly load Surya models into memory. Called once at container startup."""
    global _recognition_predictor, _detection_predictor

    logger.info("Loading Surya models...")

    from surya.foundation import FoundationPredictor
    from surya.recognition import RecognitionPredictor
    from surya.detection import DetectionPredictor

    foundation_predictor = FoundationPredictor()
    _recognition_predictor = RecognitionPredictor(foundation_predictor)
    _detection_predictor = DetectionPredictor()

    logger.info("Surya models loaded successfully")


def extract_blocks(images: list, language: str = "de") -> list[dict]:
    """Run Surya OCR on a list of PIL images (one per page).

    Returns a flat list of block dicts with pageNumber, x, y, width, height,
    polygon, text, words. Coordinates are normalized to [0, 1] relative to
    page dimensions.

    Surya 0.17+ returns polygon (4-point) natively on each text line.
    """
    all_blocks = []

    predictions = _recognition_predictor(images, det_predictor=_detection_predictor)

    for page_idx, page_pred in enumerate(predictions):
        page_w, page_h = images[page_idx].size

        for line in page_pred.text_lines:
            bbox = line.bbox  # [x1, y1, x2, y2] in pixel coordinates
            x1, y1, x2, y2 = bbox

            # Surya 0.17 provides polygon as list of (x, y) tuples (4 points, clockwise)
            polygon = None
            if hasattr(line, "polygon") and line.polygon and len(line.polygon) == 4:
                polygon = [
                    [p[0] / page_w, p[1] / page_h]
                    for p in line.polygon
                ]

            # Extract word-level confidence for [unleserlich] marking
            words = []
            if hasattr(line, "words") and line.words:
                for word in line.words:
                    words.append({
                        "text": word.text,
                        "confidence": word.confidence,
                    })
            else:
                # No word-level data: treat the whole line as a single word
                words = [{"text": line.text, "confidence": getattr(line, "confidence", 1.0)}]

            all_blocks.append({
                "pageNumber": page_idx,
                "x": x1 / page_w,
                "y": y1 / page_h,
                "width": (x2 - x1) / page_w,
                "height": (y2 - y1) / page_h,
                "polygon": polygon,
                "text": line.text,
                "words": words,
            })

    return all_blocks
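
Because the blocks carry coordinates normalized to [0, 1], a consumer has to scale them back to draw boxes on a rendered page. The helper below is a hypothetical sketch of that conversion and is not part of this module; the name `block_to_pixel_box` and the rounding to whole pixels are assumptions.

```python
def block_to_pixel_box(block, page_width, page_height):
    """Convert a normalized block dict to a (left, top, right, bottom) pixel box.

    Hypothetical helper: expects the x, y, width, height keys produced by
    extract_blocks and rounds each edge to the nearest pixel.
    """
    left = round(block["x"] * page_width)
    top = round(block["y"] * page_height)
    right = round((block["x"] + block["width"]) * page_width)
    bottom = round((block["y"] + block["height"]) * page_height)
    return left, top, right, bottom
```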