blla.segment() is a full-page layout detection model that kills the worker
process when called on tiny annotation crops (e.g. 597x89 px). For guided
OCR the annotation region IS already the text line, so segmentation is
unnecessary. Replace the blla call with a single synthetic BaselineLine that
spans the full crop width — rpred then runs recognition on the whole crop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.
- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
upsertGuidedBlock when annotationId is present; OcrService threads
the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
skips destructive-replace confirmation in guided mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Surya models lazy-load on first OCR request instead of at startup
(saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200
Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
[unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
Kraken's character-level data
Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.
Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI
Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback
Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health
Refs #226, #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>