Commit Graph

54 Commits

Author SHA1 Message Date
Marcel
eab37b9ac9 test(ocr): verify load failure does not cache broken entry in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:19:40 +02:00
Marcel
64d27d6d61 feat(ocr): per-sender model registry and /train-sender endpoint
engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos
  train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 18:05:39 +02:00
Marcel
c5e6ed922b test(ocr): decouple correction tests from exact library dictionary state
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 36s
CI / Backend Unit Tests (pull_request) Failing after 2m47s
CI / Unit & Component Tests (push) Failing after 2m33s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m41s
Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:23:09 +02:00
Marcel
ec85f228c1 refactor(ocr): document > 50 frequency threshold rationale
Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:21:37 +02:00
Marcel
fea24aee25 refactor(ocr): make collapse_adjacent_markers a public function
Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.

Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:20:31 +02:00
Marcel
77100ab1e6 feat(ocr): integrate spell-check post-processing for handwriting script types
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:54:17 +02:00
Marcel
092131930c feat(ocr): add spell_check module with German spellchecker and historical wordlist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:52:50 +02:00
Marcel
47f9a0bf73 test(ocr): add failing tests for spell_check module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:51:38 +02:00
Marcel
30a6cbeb7f feat(ocr): add DTA-derived historical German wordlist and generation script
153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:48:26 +02:00
Marcel
6faaa3b7d6 feat(ocr): add pyspellchecker dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:41:24 +02:00
Marcel
77747aa556 refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:40:39 +02:00
Marcel
4cb7c975f5 test(ocr): add resilience tests for tiny image and unexpected exception propagation
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m27s
CI / Backend Unit Tests (pull_request) Failing after 2m37s
CI / Unit & Component Tests (push) Failing after 3m14s
CI / Backend Unit Tests (push) Has been cancelled
Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page
fallback from except Exception to (cv2.error, ValueError, MemoryError) so
programming errors propagate instead of being silently swallowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:16:17 +02:00
Marcel
b310caaeeb feat(ocr): integrate preprocessing into stream and batch endpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:16:47 +02:00
Marcel
615d404ba9 chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:14:47 +02:00
Marcel
7183fc4428 feat(ocr): add image preprocessing module with CLAHE + grayscale + blur
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:13:42 +02:00
Marcel
5c7efef307 fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
74c9046745 fix(ocr): narrow exception handling and add unit tests for ensure_blla_model
- _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError)
  with DEBUG-level fallback for unexpected exceptions — prevents silent masking
  of missing kraken install or AttributeError on vgsl
- _run_segtrain: replace bare except:pass with log.warning so height-check
  fallback is visible in container logs
- New test_ensure_blla_model.py: covers model-OK early return, incompatible
  model rename+replace, and missing model download paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
a5979c4069 fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM
Three issues fixed:

1. --resize both was removed in ketos 7; replaced with --resize union
   which extends the model's class mapping to include training data classes.

2. ketos ignores -s when -i is present, so the 1800px blla model caused
   7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free).
   Now checks the loaded model's input height: only uses the base model
   when it was already fine-tuned at 800px; otherwise trains from scratch
   at 800px (~200 MB peak). After the first run the trained 800px model
   becomes the base for all subsequent fine-tuning runs.

3. segtrain now computes and returns cer = 1 - accuracy, matching the
   recognition training path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
e8375d6c72 fix(ocr-service): add entrypoint that validates blla model format on startup
Adds ensure_blla_model.py which loads the blla segmentation model with
ketos on every container start. If the model is missing or in the legacy
PyTorch ZIP format (incompatible with ketos 7), it re-downloads the
correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569).
The Dockerfile now uses entrypoint.sh which runs this check before
starting uvicorn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
30a17c97e8 fix(ocr): fail closed when TRAINING_TOKEN is not configured
_check_training_token previously skipped auth when TRAINING_TOKEN was
empty, allowing unauthenticated requests to reach /train and /segtrain.
Now returns 503 ("Training not configured on this node") when the token
is absent, so missing configuration fails closed rather than open.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:02:13 +02:00
Marcel
62be895b9e fix(ocr): drop uvicorn workers from 2 to 1
Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the
8 GB memory cap and causing OOM on the first /train call. Two OS
processes also cause model-state divergence after training, contradicting
the single-node constraint documented in ADR-001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 09:55:55 +02:00
Marcel
669f2f8b98 fix(training): output CoreML format and fix best-model finder
ketos 7 defaults to safetensors output, but kraken's load_any() only
handles CoreML (.mlmodel). Adding --weights-format coreml ensures the
hot-swap after training produces a file that load_any() can parse.

Also fixed _find_best_model to look for best_<score>.mlmodel (produced
by --weights-format coreml) in addition to the previous checkpoint_*
pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:57:42 +02:00
Marcel
49c9022285 fix(training): switch to PAGE XML format for kurrent recognition training
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.

- Java export now writes .xml alongside .png (PAGE XML with TextLine,
  Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (&amp; &lt; &gt;)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
  OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:45:08 +02:00
Marcel
94b9c56527 fix(segtrain): reduce input height to 800px on first run to avoid OOM
ketos segtrain has no batch-size flag (-B), so with the default 1800px
input height the intermediate CNN feature maps consume ~500 MB+ per
image, causing the kernel OOM-killer (exit -9) to terminate the process.

On first run (no existing blla.mlmodel), override the VGSL spec to use
800px height instead. Subsequent runs load the saved model with
--resize both, preserving incremental fine-tuning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:37:24 +02:00
Marcel
89a18c430e fix(training): limit CPU threads and epochs to prevent RAM exhaustion
Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2
(--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a
laptop OOM-killed the container. 10 epochs is sufficient for incremental
fine-tuning runs; more data is added over time and training re-run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:09:13 +02:00
Marcel
8dec5b5976 fix(training): disable DataLoader workers in subprocess training
DataLoader worker subprocesses crash inside Docker due to multiprocessing
fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain
so data loading runs in the main process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 20:58:32 +02:00
Marcel
e33164c4aa fix(training): use ketos CLI subprocess instead of missing Python API
kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are
only exposed as CLI commands. Rewrites both training functions to invoke
`ketos train` / `ketos segtrain` via subprocess and parse the best
val_metric from checkpoint filenames.

Also fixes the OcrTrainingCard history so it only shows non-blla runs
(recognition model), matching SegmentationTrainingCard which already
filtered to blla-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 20:50:21 +02:00
Marcel
22954f348a feat(training): track and display CER per training run
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.

Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 19:01:10 +02:00
Marcel
3e34366702 fix(ocr): use cw-1/ch-1 for synthetic baseline bounds to pass Kraken's >= check
Kraken's segmentation bounds check rejects coordinates where any point
satisfies x >= im.width or y >= im.height (strictly >=, not >). Using
(cw, ch) as the boundary corner was triggering this for every crop.
Changed to (cw-1, ch-1) so all coordinates are strictly inside the image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:21:00 +02:00
Marcel
051c43f088 fix(ocr): use synthetic baseline in guided OCR to avoid blla crash on small crops
blla.segment() is a full-page layout detection model that kills the worker
process when called on tiny annotation crops (e.g. 597x89 px). For guided
OCR the annotation region IS already the text line, so segmentation is
unnecessary. Replace the blla call with a single synthetic BaselineLine that
spans the full crop width — rpred then runs recognition on the whole crop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:09:35 +02:00
Marcel
ee58b63517 feat(ocr): add guided OCR mode using existing annotation regions
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.

- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
  branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
  TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
  upsertGuidedBlock when annotationId is present; OcrService threads
  the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
  or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
  skips destructive-replace confirmation in guided mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:57:54 +02:00
Marcel
9b2f91ee59 feat(training): add segmentation training pipeline and complete Part 6
- Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain,
  backup rotation, in-process model reload)
- Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout,
  X-Training-Token header)
- Add SegmentationTrainingExportService: PAGE XML export with polygon
  de-normalization and per-page PNG rendering via PDFBox
- Add GET /api/ocr/segmentation-training-data/export endpoint
- Make TranscriptionBlock.text nullable for segmentation-only blocks
  (V31 migration)
- Add Paraglide i18n translation keys for all training UI strings (de/en/es)
- Pass source prop from TranscriptionEditView to TranscriptionBlock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:15:17 +02:00
Marcel
bc97a2dade feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel()
- POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory,
  ketos transfer learning, timestamped backups (keep last 3), in-process reload
- X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty)
- trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout,
  multipart upload, forwards X-Training-Token when configured)
- TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile
  so /health stays responsive during synchronous training

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:40:53 +02:00
Marcel
33dc4654e5 fix(ocr): use correct Kraken record attributes for line geometry
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 13:16:25 +02:00
Marcel
70689b8f7b feat(ocr): add SSRF protection for PDF URL downloads
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist
(default: minio,localhost,127.0.0.1) and disables redirect following
to prevent redirect-based SSRF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:42 +02:00
Marcel
0beaf351f0 fix(docker): soften ocr-service dependency and clean up compose
Changed ocr-service dependency from service_healthy to service_started
since the backend already handles OCR unavailability gracefully. Removed
unused APP_S3_INTERNAL_URL env var. Added expose directive and
.dockerignore for ocr-service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:21 +02:00
Marcel
69768a104d test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:34:23 +02:00
Marcel
97e5138934 fix(ocr): use 1-based page numbers to match frontend PDF viewer
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:32:08 +02:00
Marcel
97c6cf6a65 feat(ocr): add NDJSON streaming endpoint POST /ocr/stream
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:57:57 +02:00
Marcel
b7d5f71ef7 refactor(ocr): extract extract_page_blocks() from both OCR engines
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:56:34 +02:00
Marcel
d8dcba1a71 fix(ocr): unblock event loop during OCR and show errors in UI
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:50:39 +02:00
Marcel
838330b405 fix(ocr): use camelCase field names in Pydantic models
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:04:42 +02:00
Marcel
902d423f3c fix(ocr): reduce memory usage for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200

Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:26:50 +02:00
Marcel
7f78bc9cf4 fix(ocr): increase memory limit to 10GB, reduce render DPI to 200
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR, uses ~44% less memory).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:20:36 +02:00
Marcel
4500c99e40 fix(ocr): use presigned URLs for MinIO access from OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.

- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:16:52 +02:00
Marcel
f064b27439 feat(ocr): per-script-type confidence thresholds
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kurrent OCR produces much lower confidence than typewriter/Latin.
Separate thresholds allow aggressive filtering for Kurrent (0.5)
while keeping typewriter lenient (0.3).

- OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3)
- OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5)
- apply_confidence_markers() now accepts threshold parameter
- get_threshold(script_type) selects the right threshold

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:50:59 +02:00
Marcel
31519af1a4 fix(ocr): add pyvips for kraken PDF input support
Some checks failed
CI / Unit & Component Tests (push) Failing after 0s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kraken 7 requires pyvips (optional dep) for -f pdf mode.
Added libvips42 system package and pyvips Python package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:11:14 +02:00
Marcel
37abc376ec fix(ocr): install torchvision from CPU index alongside torch
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
torchvision installed from PyPI expects CUDA torch operator
registrations. Installing from the CPU whl index ensures torchvision
matches the CPU-only torch build. Fixes 'torchvision::nms does not
exist' RuntimeError on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:46:37 +02:00
Marcel
6669fffead fix(ocr): pin transformers<5.0 and torch==2.7.1 in requirements.txt
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing
pad_token_id. Pin to transformers>=4.56.1,<5.0.0.

Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading
it past the CPU-only build installed in the Dockerfile layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:34:03 +02:00
Marcel
c74539b04b feat(ocr): auto-insert [unleserlich] markers for low-confidence words
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:16:17 +02:00