feat(ocr): auto-insert [unleserlich] markers for low-confidence words

New confidence.py module with two functions:

- apply_confidence_markers(): replaces words below the threshold with
  [unleserlich] and collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
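
For illustration only, here is a minimal sketch of how the two functions described above might behave. The function names and the OCR_CONFIDENCE_THRESHOLD default come from this commit message; the signatures, the whitespace-based word splitting, the one-confidence-per-character alignment, and the use of the mean as the word-level score are assumptions, not the actual contents of confidence.py.

```python
import os
from typing import List, Sequence, Tuple

MARKER = "[unleserlich]"

# Threshold read from the environment, falling back to the documented default.
THRESHOLD = float(os.environ.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))


def words_from_characters(
    text: str, confidences: Sequence[float]
) -> List[Tuple[str, float]]:
    """Group characters into whitespace-separated words.

    Assumption: one confidence value per character of `text`, and a word's
    confidence is the mean of its character confidences.
    """
    words: List[Tuple[str, float]] = []
    chars: List[str] = []
    confs: List[float] = []
    for ch, conf in zip(text, confidences):
        if ch.isspace():
            # Whitespace closes the current word, if any.
            if chars:
                words.append(("".join(chars), sum(confs) / len(confs)))
                chars, confs = [], []
        else:
            chars.append(ch)
            confs.append(conf)
    if chars:
        words.append(("".join(chars), sum(confs) / len(confs)))
    return words


def apply_confidence_markers(
    words: Sequence[Tuple[str, float]], threshold: float = THRESHOLD
) -> str:
    """Replace words below `threshold` with MARKER, collapsing adjacent markers."""
    out: List[str] = []
    last_was_marker = False
    for word, conf in words:
        if conf < threshold:
            # Emit only one marker for a run of low-confidence words.
            if not last_was_marker:
                out.append(MARKER)
            last_was_marker = True
        else:
            out.append(word)
            last_was_marker = False
    return " ".join(out)
```

Under these assumptions, a line such as [("Sehr", 0.9), ("gee", 0.1), ("hrter", 0.2), ("Herr", 0.8)] with a threshold of 0.3 would come out as "Sehr [unleserlich] Herr": the two adjacent low-confidence words share a single marker.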
2026-04-12 19:16:17 +02:00
parent 49975154d9
commit c74539b04b
6 changed files with 257 additions and 0 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -84,6 +84,7 @@ services:
      - ocr_models:/app/models
    environment:
      KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel
+      OCR_CONFIDENCE_THRESHOLD: "0.3"
    networks:
      - archive-net
    healthcheck: