familienarchiv

Author	SHA1	Message	Date
Marcel	eab37b9ac9	test(ocr): verify load failure does not cache broken entry in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:19:40 +02:00
Marcel	64d27d6d61	feat(ocr): per-sender model registry and /train-sender endpoint engines/kraken.py: - Add _SenderModelRegistry with LRU eviction (max configurable via OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(), and path whitelist (/app/models/ only) - Add _load_sender_model() helper for testability - extract_page_blocks() and extract_region_text() accept optional sender_model_path; route to sender registry when provided models.py: - OcrRequest gains senderModelPath: str \| None = None field main.py: - /ocr and /ocr/stream pass request.senderModelPath to Kraken engine - New /train-sender endpoint: validates output_model_path, runs ketos train with base model as starting point, invalidates sender cache docker-compose.yml: - Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment test_sender_registry.py: - 4 tests: cache hit, LRU eviction, invalidate, path traversal guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:05:39 +02:00
Marcel	c5e6ed922b	test(ocr): decouple correction tests from exact library dictionary state Some checks failed CI / Unit & Component Tests (pull_request) Successful in 3m35s Details CI / OCR Service Tests (pull_request) Successful in 36s Details CI / Backend Unit Tests (pull_request) Failing after 2m47s Details CI / Unit & Component Tests (push) Failing after 2m33s Details CI / OCR Service Tests (push) Successful in 34s Details CI / Backend Unit Tests (push) Failing after 2m41s Details Replace exact-string assertions in test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections with structural assertions that verify behavior (correction attempted, marker present, expected stem) without coupling to a specific pyspellchecker version's frequency weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:23:09 +02:00
Marcel	ec85f228c1	refactor(ocr): document > 50 frequency threshold rationale Strict greater-than avoids non-determinism: if multiple candidates share the minimum frequency value, pyspellchecker's ranking is undefined. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:21:37 +02:00
Marcel	fea24aee25	refactor(ocr): make collapse_adjacent_markers a public function Drop underscore prefix — the helper is part of confidence.py's effective public API since spell_check.py imports and calls it directly. Fixes reviewer concern: importing a _-prefixed name across module boundaries contradicts Python's private-by-convention signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:20:31 +02:00
Marcel	77100ab1e6	feat(ocr): integrate spell-check post-processing for handwriting script types Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:54:17 +02:00
Marcel	092131930c	feat(ocr): add spell_check module with German spellchecker and historical wordlist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:52:50 +02:00
Marcel	47f9a0bf73	test(ocr): add failing tests for spell_check module Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:51:38 +02:00
Marcel	30a6cbeb7f	feat(ocr): add DTA-derived historical German wordlist and generation script 153K words from dtak+dtae 1800-1899 corpora (min_freq=20), covering pre-reform spellings common in Kurrent/Süterlin documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:48:26 +02:00
Marcel	6faaa3b7d6	feat(ocr): add pyspellchecker dependency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:41:24 +02:00
Marcel	77747aa556	refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:40:39 +02:00
Marcel	4cb7c975f5	test(ocr): add resilience tests for tiny image and unexpected exception propagation Some checks failed CI / Unit & Component Tests (pull_request) Failing after 2m27s Details CI / Backend Unit Tests (pull_request) Failing after 2m37s Details CI / Unit & Component Tests (push) Failing after 3m14s Details CI / Backend Unit Tests (push) Has been cancelled Details Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page fallback from except Exception to (cv2.error, ValueError, MemoryError) so programming errors propagate instead of being silently swallowed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 15:16:17 +02:00
Marcel	b310caaeeb	feat(ocr): integrate preprocessing into stream and batch endpoints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:16:47 +02:00
Marcel	615d404ba9	chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:14:47 +02:00
Marcel	7183fc4428	feat(ocr): add image preprocessing module with CLAHE + grayscale + blur Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:13:42 +02:00
Marcel	5c7efef307	fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	74c9046745	fix(ocr): narrow exception handling and add unit tests for ensure_blla_model - _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError) with DEBUG-level fallback for unexpected exceptions — prevents silent masking of missing kraken install or AttributeError on vgsl - _run_segtrain: replace bare except:pass with log.warning so height-check fallback is visible in container logs - New test_ensure_blla_model.py: covers model-OK early return, incompatible model rename+replace, and missing model download paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	a5979c4069	fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM Three issues fixed: 1. --resize both was removed in ketos 7; replaced with --resize union which extends the model's class mapping to include training data classes. 2. ketos ignores -s when -i is present, so the 1800px blla model caused 7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free). Now checks the loaded model's input height: only uses the base model when it was already fine-tuned at 800px; otherwise trains from scratch at 800px (~200 MB peak). After the first run the trained 800px model becomes the base for all subsequent fine-tuning runs. 3. segtrain now computes and returns cer = 1 - accuracy, matching the recognition training path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	e8375d6c72	fix(ocr-service): add entrypoint that validates blla model format on startup Adds ensure_blla_model.py which loads the blla segmentation model with ketos on every container start. If the model is missing or in the legacy PyTorch ZIP format (incompatible with ketos 7), it re-downloads the correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569). The Dockerfile now uses entrypoint.sh which runs this check before starting uvicorn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	30a17c97e8	fix(ocr): fail closed when TRAINING_TOKEN is not configured _check_training_token previously skipped auth when TRAINING_TOKEN was empty, allowing unauthenticated requests to reach /train and /segtrain. Now returns 503 ("Training not configured on this node") when the token is absent, so missing configuration fails closed rather than open. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:02:13 +02:00
Marcel	62be895b9e	fix(ocr): drop uvicorn workers from 2 to 1 Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the 8 GB memory cap and causing OOM on the first /train call. Two OS processes also cause model-state divergence after training, contradicting the single-node constraint documented in ADR-001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 09:55:55 +02:00
Marcel	669f2f8b98	fix(training): output CoreML format and fix best-model finder ketos 7 defaults to safetensors output, but kraken's load_any() only handles CoreML (.mlmodel). Adding --weights-format coreml ensures the hot-swap after training produces a file that load_any() can parse. Also fixed _find_best_model to look for best_<score>.mlmodel (produced by --weights-format coreml) in addition to the previous checkpoint_* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:57:42 +02:00
Marcel	49c9022285	fix(training): switch to PAGE XML format for kurrent recognition training Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard. - Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription) - XML special characters in transcription text are escaped (& < >) - Python trainer globs *.xml and passes -f page to ketos train - Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history) - Updated and extended TrainingDataExportServiceTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:45:08 +02:00
Marcel	94b9c56527	fix(segtrain): reduce input height to 800px on first run to avoid OOM ketos segtrain has no batch-size flag (-B), so with the default 1800px input height the intermediate CNN feature maps consume ~500 MB+ per image, causing the kernel OOM-killer (exit -9) to terminate the process. On first run (no existing blla.mlmodel), override the VGSL spec to use 800px height instead. Subsequent runs load the saved model with --resize both, preserving incremental fine-tuning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:37:24 +02:00
Marcel	89a18c430e	fix(training): limit CPU threads and epochs to prevent RAM exhaustion Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2 (--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a laptop OOM-killed the container. 10 epochs is sufficient for incremental fine-tuning runs; more data is added over time and training re-run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:09:13 +02:00
Marcel	8dec5b5976	fix(training): disable DataLoader workers in subprocess training DataLoader worker subprocesses crash inside Docker due to multiprocessing fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain so data loading runs in the main process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:58:32 +02:00
Marcel	e33164c4aa	fix(training): use ketos CLI subprocess instead of missing Python API kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are only exposed as CLI commands. Rewrites both training functions to invoke `ketos train` / `ketos segtrain` via subprocess and parse the best val_metric from checkpoint filenames. Also fixes the OcrTrainingCard history so it only shows non-blla runs (recognition model), matching SegmentationTrainingCard which already filtered to blla-only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:50:21 +02:00
Marcel	22954f348a	feat(training): track and display CER per training run After each training run, the Character Error Rate (CER = 1 - accuracy), loss, accuracy, and epoch count are now stored on the OcrTrainingRun record and shown in the training history table. Also adds the missing POST /api/ocr/segtrain endpoint and the triggerSegTraining service method so the segmentation training card can actually trigger training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 19:01:10 +02:00
Marcel	3e34366702	fix(ocr): use cw-1/ch-1 for synthetic baseline bounds to pass Kraken's >= check Kraken's segmentation bounds check rejects coordinates where any point satisfies x >= im.width or y >= im.height (strictly >=, not >). Using (cw, ch) as the boundary corner was triggering this for every crop. Changed to (cw-1, ch-1) so all coordinates are strictly inside the image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:21:00 +02:00
Marcel	051c43f088	fix(ocr): use synthetic baseline in guided OCR to avoid blla crash on small crops blla.segment() is a full-page layout detection model that kills the worker process when called on tiny annotation crops (e.g. 597x89 px). For guided OCR the annotation region IS already the text line, so segmentation is unnecessary. Replace the blla call with a single synthetic BaselineLine that spans the full crop width — rpred then runs recognition on the whole crop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:09:35 +02:00
Marcel	ee58b63517	feat(ocr): add guided OCR mode using existing annotation regions When a document has manually drawn annotation boxes, the user can now enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine skips layout detection entirely and runs recognition only within the pre-drawn bounding boxes, preserving manual transcription blocks. - Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided branch in /ocr/stream groups by page and crops each region - Engines: add extract_region_text() to both Kraken and Surya - Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion, TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to upsertGuidedBlock when annotationId is present; OcrService threads the flag through to runSingleDocument - TranscriptionService: adds upsertGuidedBlock (creates, updates OCR, or preserves MANUAL blocks) - Frontend: guided OCR toggle in OcrTrigger shown when blocks exist; skips destructive-replace confirmation in guided mode Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:57:54 +02:00
Marcel	9b2f91ee59	feat(training): add segmentation training pipeline and complete Part 6 - Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain, backup rotation, in-process model reload) - Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout, X-Training-Token header) - Add SegmentationTrainingExportService: PAGE XML export with polygon de-normalization and per-page PNG rendering via PDFBox - Add GET /api/ocr/segmentation-training-data/export endpoint - Make TranscriptionBlock.text nullable for segmentation-only blocks (V31 migration) - Add Paraglide i18n translation keys for all training UI strings (de/en/es) - Pass source prop from TranscriptionEditView to TranscriptionBlock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:15:17 +02:00
Marcel	bc97a2dade	feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel() - POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory, ketos transfer learning, timestamped backups (keep last 3), in-process reload - X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty) - trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout, multipart upload, forwards X-Training-Token when configured) - TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile so /health stays responsive during synchronous training Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:40:53 +02:00
Marcel	33dc4654e5	fix(ocr): use correct Kraken record attributes for line geometry Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 1s Details BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line' and 'cuts'. The fallback used record.line which doesn't exist, causing AttributeError on every Kurrent OCR page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:16:25 +02:00
Marcel	70689b8f7b	feat(ocr): add SSRF protection for PDF URL downloads Some checks failed CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details CI / Unit & Component Tests (push) Failing after 2s Details CI / Backend Unit Tests (push) Failing after 0s Details Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist (default: minio,localhost,127.0.0.1) and disables redirect following to prevent redirect-based SSRF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:42 +02:00
Marcel	0beaf351f0	fix(docker): soften ocr-service dependency and clean up compose Changed ocr-service dependency from service_healthy to service_started since the backend already handles OCR unavailability gracefully. Removed unused APP_S3_INTERNAL_URL env var. Added expose directive and .dockerignore for ocr-service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:21 +02:00
Marcel	69768a104d	test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers Cover Surya polygon/word-level extraction, health endpoint states, Kraken script-type routing, 503 when models not ready, 400 when Kraken unavailable for Kurrent, and confidence marker application during streaming. Production code coverage: 88%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:34:23 +02:00
Marcel	97e5138934	fix(ocr): use 1-based page numbers to match frontend PDF viewer The PDF viewer uses 1-based currentPage (starting at 1) but the OCR engines produced 0-based pageNumber from enumerate(). Annotations created by OCR were assigned to page 0, which doesn't exist in the viewer. Change enumerate() to start=1 in both engines and the streaming endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:32:08 +02:00
Marcel	97c6cf6a65	feat(ocr): add NDJSON streaming endpoint POST /ocr/stream Streams one JSON line per completed page instead of buffering the entire result. Emits start/page/error/done events. On per-page failure, logs the traceback but yields a generic error message and continues with the next page. Adds X-Accel-Buffering: no and Cache-Control: no-cache headers for reverse-proxy compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:57:57 +02:00
Marcel	b7d5f71ef7	refactor(ocr): extract extract_page_blocks() from both OCR engines Enable per-page processing by extracting the inner loop body of extract_blocks() into extract_page_blocks(image, page_idx, language). The original extract_blocks() now delegates to the new function, preserving backward compatibility for the batch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:56:34 +02:00
Marcel	d8dcba1a71	fix(ocr): unblock event loop during OCR and show errors in UI OCR engines are CPU-bound and were blocking Uvicorn's single async event loop, making /health unresponsive during processing. This caused new OCR requests to fail silently (health check failure → no DB record → UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the event loop free. Also surface OCR trigger errors in the frontend instead of silently resetting the spinner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:50:39 +02:00
Marcel	838330b405	fix(ocr): use camelCase field names in Pydantic models Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected. The Java client sends camelCase (pdfUrl, scriptType, pageNumber). Use camelCase field names directly instead of aliases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:04:42 +02:00
Marcel	902d423f3c	fix(ocr): reduce memory usage for 16GB dev machines Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details - Surya models lazy-load on first OCR request instead of at startup (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB) - Process one page at a time in Surya engine (limits peak memory) - RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM) - Revert mem_limit back to 6GB (sufficient with these optimizations) - Render DPI stays at 200 Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:26:50 +02:00
Marcel	7f78bc9cf4	fix(ocr): increase memory limit to 10GB, reduce render DPI to 200 Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 0s Details CI / Unit & Component Tests (pull_request) Failing after 0s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF, page images + inference tensors push past the 6GB limit, causing OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced render DPI to 200 (still sufficient for OCR, uses ~44% less memory). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:20:36 +02:00
Marcel	4500c99e40	fix(ocr): use presigned URLs for MinIO access from OCR service Some checks failed CI / Unit & Component Tests (push) Failing after 2s Details CI / Backend Unit Tests (push) Failing after 0s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details The OCR service was getting 403 Forbidden because it tried to download PDFs from MinIO using plain internal URLs without authentication. MinIO buckets are private. - Add S3Presigner bean to MinioConfig - FileService.generatePresignedUrl(): generates 15-min presigned URLs - OcrService uses presigned URLs instead of plain internal URLs - Remove unused s3InternalUrl / bucketName @Value fields from OcrService Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:16:52 +02:00
Marcel	f064b27439	feat(ocr): per-script-type confidence thresholds Some checks failed CI / Unit & Component Tests (push) Failing after 2s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details Kurrent OCR produces much lower confidence than typewriter/Latin. Separate thresholds allow aggressive filtering for Kurrent (0.5) while keeping typewriter lenient (0.3). - OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3) - OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5) - apply_confidence_markers() now accepts threshold parameter - get_threshold(script_type) selects the right threshold Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:50:59 +02:00
Marcel	31519af1a4	fix(ocr): add pyvips for kraken PDF input support Some checks failed CI / Unit & Component Tests (push) Failing after 0s Details CI / Backend Unit Tests (push) Failing after 0s Details CI / Unit & Component Tests (pull_request) Failing after 0s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details Kraken 7 requires pyvips (optional dep) for -f pdf mode. Added libvips42 system package and pyvips Python package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:11:14 +02:00
Marcel	37abc376ec	fix(ocr): install torchvision from CPU index alongside torch Some checks failed CI / Unit & Component Tests (push) Failing after 3s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 2s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details torchvision installed from PyPI expects CUDA torch operator registrations. Installing from the CPU whl index ensures torchvision matches the CPU-only torch build. Fixes 'torchvision::nms does not exist' RuntimeError on startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:46:37 +02:00
Marcel	6669fffead	fix(ocr): pin transformers<5.0 and torch==2.7.1 in requirements.txt Some checks failed CI / Unit & Component Tests (push) Failing after 3s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing pad_token_id. Pin to transformers>=4.56.1,<5.0.0. Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading it past the CPU-only build installed in the Dockerfile layer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:34:03 +02:00
Marcel	c74539b04b	feat(ocr): auto-insert [unleserlich] markers for low-confidence words Some checks failed CI / Unit & Component Tests (push) Failing after 2s Details CI / Backend Unit Tests (push) Failing after 2s Details CI / Unit & Component Tests (pull_request) Failing after 2s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details New confidence.py module with two functions: - apply_confidence_markers(): replaces words below threshold with [unleserlich], collapses adjacent markers into one - words_from_characters(): reconstructs word-level confidence from Kraken's character-level data Surya 0.17 provides native word-level confidence via line.words. Kraken 7.0 provides per-character confidences via record.confidences. Both engines now pass word+confidence data through main.py, which applies the marker post-processing before returning the API response. Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3). Frontend already renders [unleserlich] markers via transcriptionMarkers.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:16:17 +02:00

1 2

54 Commits