engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(), and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
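The registry described in the commit message (LRU eviction, double-checked locking, invalidate(), path whitelist) could look roughly like the sketch below. This is not the code from engines/kraken.py, which is not shown here: the class name `SenderModelRegistry`, the `loader` parameter, and the `.mlmodel` file names are assumptions made for illustration. Only the `/app/models/` root and the `OCR_MAX_CACHED_MODELS` variable come from the commit message.

```python
from __future__ import annotations

import os
import threading
from collections import OrderedDict
from pathlib import Path
from typing import Any, Callable

# From the commit message: only paths under /app/models/ are accepted.
ALLOWED_ROOT = Path("/app/models").resolve()


class SenderModelRegistry:
    """Hypothetical LRU cache of loaded models, keyed by resolved path."""

    def __init__(self, loader: Callable[[str], Any], max_models: int | None = None):
        if max_models is None:
            # Env var name from the commit message; the default of 5 mirrors docker-compose.yml.
            max_models = int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))
        self._loader = loader  # assumed to return a non-None model object
        self._max = max(1, max_models)
        self._cache: OrderedDict[str, Any] = OrderedDict()
        self._lock = threading.Lock()

    @staticmethod
    def _validate(path: str) -> str:
        # resolve() collapses ".." segments, so traversal attempts are caught here.
        resolved = Path(path).resolve()
        if not resolved.is_relative_to(ALLOWED_ROOT):
            raise ValueError(f"model path outside {ALLOWED_ROOT}: {path}")
        return str(resolved)

    def get(self, path: str) -> Any:
        key = self._validate(path)
        model = self._cache.get(key)  # first check, without the lock
        if model is None:
            with self._lock:
                model = self._cache.get(key)  # second check, under the lock
                if model is None:
                    model = self._loader(key)  # slow load runs once per key
                    self._cache[key] = model
                    while len(self._cache) > self._max:
                        self._cache.popitem(last=False)  # drop least recently used
                return model
        # Cache hit: take the lock briefly to mark the entry as most recently used.
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)
        return model

    def invalidate(self, path: str) -> None:
        key = self._validate(path)
        with self._lock:
            self._cache.pop(key, None)


# Usage with a stand-in loader; a real loader would read the model file from disk.
registry = SenderModelRegistry(loader=lambda p: object(), max_models=2)
model = registry.get("/app/models/sender-a.mlmodel")  # hypothetical file name
```

Passing the loader in as a parameter is one way to get the testability the commit attributes to _load_sender_model(): tests can substitute a cheap fake instead of loading a real model.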
36 lines
729 B
Python
from pydantic import BaseModel, ConfigDict


class OcrRegion(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    annotationId: str
    pageNumber: int
    x: float
    y: float
    width: float
    height: float


class OcrRequest(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    pdfUrl: str
    scriptType: str = "UNKNOWN"
    language: str = "de"
    regions: list[OcrRegion] | None = None
    senderModelPath: str | None = None


class OcrBlock(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    pageNumber: int
    x: float
    y: float
    width: float
    height: float
    polygon: list[list[float]] | None = None
    text: str
    annotationId: str | None = None