feat(ocr): per-sender model registry and /train-sender endpoint

engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs
  ketos train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 18:05:39 +02:00
parent 7a342a07cf
commit 64d27d6d61
5 changed files with 234 additions and 9 deletions
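
The registry described in the commit message (LRU eviction, double-checked locking, invalidate(), and a /app/models/ whitelist) could look roughly like the sketch below. This is not the code from engines/kraken.py, which is not shown in this excerpt: the class name SenderModelRegistrySketch, the injected loader callable, and the get() method are hypothetical names chosen for illustration. Only OCR_MAX_CACHED_MODELS, its default of "5", and the /app/models/ root come from the commit.

```python
import os
import posixpath
import threading
from collections import OrderedDict

ALLOWED_ROOT = "/app/models"  # whitelist root named in the commit message


class SenderModelRegistrySketch:
    """Hypothetical LRU cache of loaded models, keyed by normalized path."""

    def __init__(self, loader, max_models=None):
        if max_models is None:
            # Env var name and default "5" are taken from docker-compose.yml.
            max_models = int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))
        self._loader = loader          # assumed callable: path -> model object
        self._max = max_models
        self._cache = OrderedDict()    # oldest entry first, newest last
        self._lock = threading.Lock()

    @staticmethod
    def _validate(path):
        # normpath collapses ".." segments lexically; it does not resolve
        # symlinks, so a real guard may need more than this.
        normalized = posixpath.normpath(path)
        if not normalized.startswith(ALLOWED_ROOT + "/"):
            raise ValueError(f"model path outside {ALLOWED_ROOT}/: {path!r}")
        return normalized

    def get(self, path):
        key = self._validate(path)
        model = self._cache.get(key)       # first check, without the lock
        if model is None:
            with self._lock:
                model = self._cache.get(key)   # second check, under the lock
                if model is None:
                    model = self._loader(key)
                    self._cache[key] = model
                    while len(self._cache) > self._max:
                        self._cache.popitem(last=False)  # evict least recent
                    return model
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)   # mark as most recently used
        return model

    def invalidate(self, path):
        key = self._validate(path)
        with self._lock:
            self._cache.pop(key, None)     # next get() reloads from disk
```

In this sketch the lock is held while the loader runs, which keeps two threads from loading the same model twice but also serializes loads of different models. Whether the actual implementation makes the same trade-off cannot be told from the commit message.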
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -97,6 +97,7 @@ services:
      DETECTOR_BATCH_SIZE: "8"
      OCR_CLAHE_CLIP_LIMIT: "2.0"   # CLAHE contrast limit (multiplier of average histogram frequency)
      OCR_CLAHE_TILE_SIZE: "8"      # CLAHE tile grid size (NxN tiles per page)
+      OCR_MAX_CACHED_MODELS: "5"    # LRU cache size for per-sender Kraken models
    networks:
      - archive-net
    healthcheck: