Commit Graph

17 Commits

Author SHA1 Message Date
Marcel
e85057bed2 refactor(document): move document domain core to document/ package
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:39:20 +02:00
Marcel
23cf88856e fix(ocr): guard Kraken block extraction against missing boundary/baseline
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m37s
CI / OCR Service Tests (push) Successful in 32s
CI / Backend Unit Tests (push) Failing after 2m51s
extract_page_blocks() walked `record.boundary` and `record.baseline`
unconditionally, so a record that arrived without either (malformed
kraken output, or a MagicMock in tests that iterates to nothing)
crashed with "min() arg is an empty sequence".

Coerce both attributes through list(), require at least 3 points for
the polygon path, fall back to the baseline path when the polygon is
missing, and skip the record entirely when neither is usable —
emitting no block is safer than emitting one with garbage coordinates.

The test helper now sets `boundary` and `baseline` explicitly to
mirror real Kraken 7.0 records (and so the happy-path test exercises
the polygon branch). A new regression test covers the skip path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 09:33:03 +02:00
Marcel
1f7b712dd0 fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m36s
CI / OCR Service Tests (push) Successful in 33s
CI / Backend Unit Tests (push) Has started running
main.py unifies the call to both engines and always passes
`sender_model_path` (None for non-Kurrent scripts). Surya's
extract_region_text / extract_page_blocks accepted one fewer positional
arg than Kraken's, so every guided-OCR run on a TYPEWRITER or
HANDWRITING_LATIN document raised "takes 5 positional arguments but 6
were given" and the stream returned 0 blocks / 1 skipped page.

Add an ignored `sender_model_path` kwarg to both Surya functions so the
signatures match Kraken's, and guard the regression with two signature
tests in test_engines.py that compare both engines' parameter lists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 09:28:25 +02:00
Marcel
64a854aad6 refactor(ocr): mark _SenderModelRegistry.contains as private (_contains)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:26:46 +02:00
Marcel
000079fd50 refactor(ocr): rename _contains to contains in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:53:16 +02:00
Marcel
07035b9fa9 style(ocr): add Image type hints to extract_page_blocks and extract_region_text
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:22:34 +02:00
Marcel
64d27d6d61 feat(ocr): per-sender model registry and /train-sender endpoint
engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos
  train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 18:05:39 +02:00
Marcel
3e34366702 fix(ocr): use cw-1/ch-1 for synthetic baseline bounds to pass Kraken's >= check
Kraken's segmentation bounds check rejects coordinates where any point
satisfies x >= im.width or y >= im.height (strictly >=, not >). Using
(cw, ch) as the boundary corner was triggering this for every crop.
Changed to (cw-1, ch-1) so all coordinates are strictly inside the image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:21:00 +02:00
Marcel
051c43f088 fix(ocr): use synthetic baseline in guided OCR to avoid blla crash on small crops
blla.segment() is a full-page layout detection model that kills the worker
process when called on tiny annotation crops (e.g. 597x89 px). For guided
OCR the annotation region IS already the text line, so segmentation is
unnecessary. Replace the blla call with a single synthetic BaselineLine that
spans the full crop width — rpred then runs recognition on the whole crop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:09:35 +02:00
Marcel
ee58b63517 feat(ocr): add guided OCR mode using existing annotation regions
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.

- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
  branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
  TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
  upsertGuidedBlock when annotationId is present; OcrService threads
  the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
  or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
  skips destructive-replace confirmation in guided mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:57:54 +02:00
Marcel
33dc4654e5 fix(ocr): use correct Kraken record attributes for line geometry
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 13:16:25 +02:00
Marcel
97e5138934 fix(ocr): use 1-based page numbers to match frontend PDF viewer
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:32:08 +02:00
Marcel
b7d5f71ef7 refactor(ocr): extract extract_page_blocks() from both OCR engines
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:56:34 +02:00
Marcel
902d423f3c fix(ocr): reduce memory usage for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200

Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:26:50 +02:00
Marcel
c74539b04b feat(ocr): auto-insert [unleserlich] markers for low-confidence words
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:16:17 +02:00
Marcel
49975154d9 feat(ocr): bump to latest surya 0.17.1, kraken 7.0, torch 2.7.1
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- surya-ocr 0.6.3 → 0.17.1: new predictor API (FoundationPredictor,
  RecognitionPredictor, DetectionPredictor), native polygon output
  on text lines (4-point clockwise)
- kraken 5.2.9 → 7.0: wider torch range (>=2.4,<=2.10), unpinned numpy
- torch 2.5.1 → 2.7.1: satisfies surya's >=2.7.0 requirement
- Rewrite engines/surya.py for the 0.17 predictor class API
- Surya now outputs polygons natively — no longer rectangle-only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:53:14 +02:00
Marcel
6737bd6db5 feat(ocr): add Python OCR microservice, RestClientOcrClient, Docker Compose
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
  pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI

Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback

Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:26:40 +02:00