feat(ocr): per-sender model registry and /train-sender endpoint

engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided
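
A minimal sketch of the registry behaviour described above, for orientation only; it is not the code in engines/kraken.py. The class name, the loader stub, and the method signatures are assumptions. Only the OCR_MAX_CACHED_MODELS variable, the /app/models/ restriction, LRU eviction, double-checked locking, and invalidate() come from this commit message.

```python
import os
import threading
from collections import OrderedDict
from pathlib import Path

ALLOWED_ROOT = Path("/app/models")  # the only directory the commit allows


def _load_sender_model(path):
    """Hypothetical stand-in; the real helper would load a Kraken model."""
    return {"path": path}


class SenderModelRegistry:
    """LRU cache of per-sender models, keyed by resolved model path."""

    def __init__(self, max_models=None):
        if max_models is None:
            # Default of 5 mirrors the docker-compose.yml value in this commit.
            max_models = int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))
        self._max = max_models
        self._cache = OrderedDict()
        self._lock = threading.Lock()

    @staticmethod
    def _check_path(path):
        # resolve() collapses ".." segments, so traversal attempts end up
        # outside ALLOWED_ROOT and are rejected.
        resolved = Path(path).resolve()
        if not resolved.is_relative_to(ALLOWED_ROOT):
            raise ValueError(f"model path outside {ALLOWED_ROOT}: {path}")
        return str(resolved)

    def get(self, path, loader=_load_sender_model):
        key = self._check_path(path)
        model = self._cache.get(key)  # first check, without the lock
        if model is None:
            with self._lock:
                model = self._cache.get(key)  # second check, under the lock
                if model is None:
                    model = loader(key)
                    self._cache[key] = model
                    # Evict least recently used entries beyond the limit.
                    while len(self._cache) > self._max:
                        self._cache.popitem(last=False)
                    return model
        with self._lock:
            # Mark as most recently used, unless it was evicted meanwhile.
            if key in self._cache:
                self._cache.move_to_end(key)
        return model

    def invalidate(self, path):
        """Drop a cached model so the next get() reloads it from disk."""
        key = str(Path(path).resolve())
        with self._lock:
            self._cache.pop(key, None)
```

Passing the loader as a parameter is one way to keep the cache testable without a real model file, which is presumably the motivation for the separate _load_sender_model() helper.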

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos
  train with base model as starting point, invalidates sender cache
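
The order of operations in /train-sender (validate, train, invalidate) could look roughly like the sketch below. The function names, parameters, and the injected run/invalidate callables are hypothetical, and the ketos flags (-i to load the base model, -o for the output) are an assumption that should be checked against `ketos train --help` for the installed Kraken version.

```python
import subprocess
from pathlib import Path

MODELS_ROOT = Path("/app/models")  # same restriction as the sender registry


def validate_output_path(output_model_path):
    """Reject any output path that resolves outside MODELS_ROOT."""
    resolved = Path(output_model_path).resolve()
    if not resolved.is_relative_to(MODELS_ROOT):
        raise ValueError(f"output path outside {MODELS_ROOT}: {output_model_path}")
    return resolved


def train_sender(output_model_path, base_model_path, ground_truth,
                 invalidate, run=subprocess.run):
    """Fine-tune from the base model, then drop the stale cached model.

    `invalidate` is a callable such as the registry's invalidate();
    `run` is injectable so the flow can be tested without ketos installed.
    """
    target = validate_output_path(output_model_path)  # 1. validate first
    cmd = [
        "ketos", "train",
        "-i", str(base_model_path),  # assumed flag: start from the base model
        "-o", str(target),           # assumed flag: where the result is written
        *ground_truth,
    ]
    run(cmd, check=True)             # 2. train; raises if ketos fails
    invalidate(str(target))          # 3. the next OCR request reloads the model
    return cmd
```

Validating before anything else means a rejected path never reaches the training subprocess, and invalidating only after a successful run keeps the previously cached model in use if training fails.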

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-04-17 18:05:39 +02:00
committed by marcel
parent 548ad0fa68
commit a146a2ec3c
5 changed files with 234 additions and 9 deletions


@@ -19,6 +19,7 @@ class OcrRequest(BaseModel):
     scriptType: str = "UNKNOWN"
     language: str = "de"
     regions: list[OcrRegion] | None = None
+    senderModelPath: str | None = None
 
 
 class OcrBlock(BaseModel):