feat(ocr): auto-insert [unleserlich] markers for low-confidence words

New confidence.py module with two functions:

- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
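
The character-to-word reconstruction described in the message could look roughly like the sketch below. It is not the code from confidence.py: the signature, the whitespace-based word splitting, the use of the mean as the word score, and the `{"text", "confidence"}` dict shape are all assumptions made for illustration.

```python
from statistics import fmean


def words_from_characters(text: str, confidences: list[float]) -> list[dict]:
    """Group per-character confidences into per-word entries.

    Hypothetical sketch: assumes `text` and `confidences` are aligned
    one-to-one, and scores a word by the mean of its character scores.
    """
    words: list[dict] = []
    chars: list[str] = []
    scores: list[float] = []
    for ch, conf in zip(text, confidences):
        if ch.isspace():
            # Whitespace closes the current word, if one is open.
            if chars:
                words.append({"text": "".join(chars), "confidence": fmean(scores)})
                chars, scores = [], []
        else:
            chars.append(ch)
            scores.append(conf)
    if chars:
        # Flush the last word of the line.
        words.append({"text": "".join(chars), "confidence": fmean(scores)})
    return words


line = "Sehr geehrter"
# 4 confident characters, one space, 8 low-confidence characters
char_confidences = [0.9] * 4 + [1.0] + [0.2] * 8
words = words_from_characters(line, char_confidences)
```

Using the minimum instead of the mean would flag a word as soon as any single character is doubtful; which aggregate the actual module uses is not stated in the commit.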
@@ -9,6 +9,7 @@ import pypdfium2 as pdfium
 from fastapi import FastAPI, HTTPException
 from PIL import Image
 
+from confidence import apply_confidence_markers
 from engines import kraken as kraken_engine
 from engines import surya as surya_engine
 from models import OcrBlock, OcrRequest
@@ -71,6 +72,11 @@ async def run_ocr(request: OcrRequest):
     # TYPEWRITER, HANDWRITING_LATIN, UNKNOWN — all use Surya
     blocks = surya_engine.extract_blocks(images, request.language)
 
+    for block in blocks:
+        if block.get("words"):
+            block["text"] = apply_confidence_markers(block["words"])
+        block.pop("words", None)
+
     return [OcrBlock(**b) for b in blocks]
 
 
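
For readers without confidence.py at hand, here is a minimal sketch of what the `apply_confidence_markers()` call in the hunk above might do, based only on the commit message. The word dict shape (`text`, `confidence`), the optional `threshold` parameter, and joining words with single spaces are assumptions; only the marker text, the env var name, and the 0.3 default come from the commit.

```python
import os

MARKER = "[unleserlich]"
# Env var name and default come from the commit message.
THRESHOLD = float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))


def apply_confidence_markers(words, threshold=THRESHOLD):
    """Replace low-confidence words with one marker per run (hypothetical sketch)."""
    out = []
    for word in words:
        if word["confidence"] < threshold:
            # Collapse adjacent low-confidence words into a single marker.
            if not out or out[-1] != MARKER:
                out.append(MARKER)
        else:
            out.append(word["text"])
    return " ".join(out)


sample_words = [
    {"text": "Sehr", "confidence": 0.95},
    {"text": "gxehr", "confidence": 0.10},
    {"text": "tcr", "confidence": 0.20},
    {"text": "Herr", "confidence": 0.80},
]
marked = apply_confidence_markers(sample_words, threshold=0.3)
# marked == "Sehr [unleserlich] Herr"
```

Collapsing runs keeps the transcription readable when a whole phrase is illegible, at the cost of no longer showing how many words were dropped.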