feat(ocr): auto-insert [unleserlich] markers for low-confidence words
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s

New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-04-12 19:16:17 +02:00
parent 49975154d9
commit c74539b04b
6 changed files with 257 additions and 0 deletions

View File

@@ -9,6 +9,7 @@ import pypdfium2 as pdfium
from fastapi import FastAPI, HTTPException
from PIL import Image
from confidence import apply_confidence_markers
from engines import kraken as kraken_engine
from engines import surya as surya_engine
from models import OcrBlock, OcrRequest
@@ -71,6 +72,11 @@ async def run_ocr(request: OcrRequest):
# TYPEWRITER, HANDWRITING_LATIN, UNKNOWN — all use Surya
blocks = surya_engine.extract_blocks(images, request.language)
for block in blocks:
if block.get("words"):
block["text"] = apply_confidence_markers(block["words"])
block.pop("words", None)
return [OcrBlock(**b) for b in blocks]