familienarchiv

Author	SHA1	Message	Date
Marcel	669f2f8b98	fix(training): output CoreML format and fix best-model finder ketos 7 defaults to safetensors output, but kraken's load_any() only handles CoreML (.mlmodel). Adding --weights-format coreml ensures the hot-swap after training produces a file that load_any() can parse. Also fixed _find_best_model to look for best_<score>.mlmodel (produced by --weights-format coreml) in addition to the previous checkpoint_* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:57:42 +02:00
Marcel	49c9022285	fix(training): switch to PAGE XML format for kurrent recognition training Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard. - Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription) - XML special characters in transcription text are escaped (& < >) - Python trainer globs *.xml and passes -f page to ketos train - Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history) - Updated and extended TrainingDataExportServiceTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:45:08 +02:00
Marcel	94b9c56527	fix(segtrain): reduce input height to 800px on first run to avoid OOM ketos segtrain has no batch-size flag (-B), so with the default 1800px input height the intermediate CNN feature maps consume ~500 MB+ per image, causing the kernel OOM-killer (exit -9) to terminate the process. On first run (no existing blla.mlmodel), override the VGSL spec to use 800px height instead. Subsequent runs load the saved model with --resize both, preserving incremental fine-tuning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:37:24 +02:00
Marcel	89a18c430e	fix(training): limit CPU threads and epochs to prevent RAM exhaustion Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2 (--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a laptop OOM-killed the container. 10 epochs is sufficient for incremental fine-tuning runs; more data is added over time and training re-run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:09:13 +02:00
Marcel	8dec5b5976	fix(training): disable DataLoader workers in subprocess training DataLoader worker subprocesses crash inside Docker due to multiprocessing fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain so data loading runs in the main process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:58:32 +02:00
Marcel	e33164c4aa	fix(training): use ketos CLI subprocess instead of missing Python API kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are only exposed as CLI commands. Rewrites both training functions to invoke `ketos train` / `ketos segtrain` via subprocess and parse the best val_metric from checkpoint filenames. Also fixes the OcrTrainingCard history so it only shows non-blla runs (recognition model), matching SegmentationTrainingCard which already filtered to blla-only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:50:21 +02:00
Marcel	22954f348a	feat(training): track and display CER per training run After each training run, the Character Error Rate (CER = 1 - accuracy), loss, accuracy, and epoch count are now stored on the OcrTrainingRun record and shown in the training history table. Also adds the missing POST /api/ocr/segtrain endpoint and the triggerSegTraining service method so the segmentation training card can actually trigger training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 19:01:10 +02:00
Marcel	ee58b63517	feat(ocr): add guided OCR mode using existing annotation regions When a document has manually drawn annotation boxes, the user can now enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine skips layout detection entirely and runs recognition only within the pre-drawn bounding boxes, preserving manual transcription blocks. - Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided branch in /ocr/stream groups by page and crops each region - Engines: add extract_region_text() to both Kraken and Surya - Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion, TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to upsertGuidedBlock when annotationId is present; OcrService threads the flag through to runSingleDocument - TranscriptionService: adds upsertGuidedBlock (creates, updates OCR, or preserves MANUAL blocks) - Frontend: guided OCR toggle in OcrTrigger shown when blocks exist; skips destructive-replace confirmation in guided mode Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:57:54 +02:00
Marcel	9b2f91ee59	feat(training): add segmentation training pipeline and complete Part 6 - Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain, backup rotation, in-process model reload) - Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout, X-Training-Token header) - Add SegmentationTrainingExportService: PAGE XML export with polygon de-normalization and per-page PNG rendering via PDFBox - Add GET /api/ocr/segmentation-training-data/export endpoint - Make TranscriptionBlock.text nullable for segmentation-only blocks (V31 migration) - Add Paraglide i18n translation keys for all training UI strings (de/en/es) - Pass source prop from TranscriptionEditView to TranscriptionBlock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:15:17 +02:00
Marcel	bc97a2dade	feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel() - POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory, ketos transfer learning, timestamped backups (keep last 3), in-process reload - X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty) - trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout, multipart upload, forwards X-Training-Token when configured) - TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile so /health stays responsive during synchronous training Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:40:53 +02:00
Marcel	70689b8f7b	feat(ocr): add SSRF protection for PDF URL downloads Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist (default: minio,localhost,127.0.0.1) and disables redirect following to prevent redirect-based SSRF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:42 +02:00
Marcel	97e5138934	fix(ocr): use 1-based page numbers to match frontend PDF viewer The PDF viewer uses 1-based currentPage (starting at 1) but the OCR engines produced 0-based pageNumber from enumerate(). Annotations created by OCR were assigned to page 0, which doesn't exist in the viewer. Change enumerate() to start=1 in both engines and the streaming endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:32:08 +02:00
Marcel	97c6cf6a65	feat(ocr): add NDJSON streaming endpoint POST /ocr/stream Streams one JSON line per completed page instead of buffering the entire result. Emits start/page/error/done events. On per-page failure, logs the traceback but yields a generic error message and continues with the next page. Adds X-Accel-Buffering: no and Cache-Control: no-cache headers for reverse-proxy compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:57:57 +02:00
Marcel	d8dcba1a71	fix(ocr): unblock event loop during OCR and show errors in UI OCR engines are CPU-bound and were blocking Uvicorn's single async event loop, making /health unresponsive during processing. This caused new OCR requests to fail silently (health check failure → no DB record → UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the event loop free. Also surface OCR trigger errors in the frontend instead of silently resetting the spinner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:50:39 +02:00
Marcel	838330b405	fix(ocr): use camelCase field names in Pydantic models Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected. The Java client sends camelCase (pdfUrl, scriptType, pageNumber). Use camelCase field names directly instead of aliases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:04:42 +02:00
Marcel	902d423f3c	fix(ocr): reduce memory usage for 16GB dev machines - Surya models lazy-load on first OCR request instead of at startup (saves ~3-4GB idle RAM; Kraken stays eager at ~16MB) - Process one page at a time in Surya engine (limits peak memory) - RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM) - Revert mem_limit back to 6GB (sufficient with these optimizations) - Render DPI stays at 200 Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:26:50 +02:00
Marcel	7f78bc9cf4	fix(ocr): increase memory limit to 10GB, reduce render DPI to 200 Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF, page images + inference tensors push past the 6GB limit, causing OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced render DPI to 200 (still sufficient for OCR, uses ~56% less memory for page images, since pixel count scales with DPI squared). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:20:36 +02:00
Marcel	4500c99e40	fix(ocr): use presigned URLs for MinIO access from OCR service The OCR service was getting 403 Forbidden because it tried to download PDFs from MinIO using plain internal URLs without authentication. MinIO buckets are private. - Add S3Presigner bean to MinioConfig - FileService.generatePresignedUrl(): generates 15-min presigned URLs - OcrService uses presigned URLs instead of plain internal URLs - Remove unused s3InternalUrl / bucketName @Value fields from OcrService Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:16:52 +02:00
Marcel	f064b27439	feat(ocr): per-script-type confidence thresholds Kurrent OCR produces much lower confidence than typewriter/Latin. Separate thresholds allow aggressive filtering for Kurrent (0.5) while keeping typewriter lenient (0.3). - OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3) - OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5) - apply_confidence_markers() now accepts threshold parameter - get_threshold(script_type) selects the right threshold Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:50:59 +02:00
Marcel	c74539b04b	feat(ocr): auto-insert [unleserlich] markers for low-confidence words New confidence.py module with two functions: - apply_confidence_markers(): replaces words below threshold with [unleserlich], collapses adjacent markers into one - words_from_characters(): reconstructs word-level confidence from Kraken's character-level data Surya 0.17 provides native word-level confidence via line.words. Kraken 7.0 provides per-character confidences via record.confidences. Both engines now pass word+confidence data through main.py, which applies the marker post-processing before returning the API response. Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3). Frontend already renders [unleserlich] markers via transcriptionMarkers.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:16:17 +02:00
Marcel	6737bd6db5	feat(ocr): add Python OCR microservice, RestClientOcrClient, Docker Compose Python microservice (ocr-service/): - FastAPI app with /ocr and /health endpoints - Surya engine: transformer-based OCR for typewritten/modern handwriting - Kraken engine: historical HTR for Kurrent/Suetterlin with pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers) - Eager model loading at startup via lifespan context manager - PDF download via httpx, page rendering via pypdfium2 at 300 DPI Java RestClientOcrClient: - Implements OcrClient + OcrHealthClient interfaces - Calls Python service via Spring RestClient - Health check with graceful fallback Docker Compose: - New ocr-service container (mem_limit 6g, no host ports) - Health check with start_period 60s for model loading - ocr_models volume for Kraken model files - Backend depends on ocr-service health Refs #226, #227 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:26:40 +02:00
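
The `[unleserlich]` post-processing described in c74539b04b and f064b27439 might look roughly like the sketch below. The function names `apply_confidence_markers` and `get_threshold` and the default thresholds (0.3 and 0.5) come from the commit messages; the input shape (a list of `(word, confidence)` pairs), the `"kurrent"` script-type string, and the keyword parameters are assumptions made for illustration, not the repository's actual signatures.

```python
UNREADABLE = "[unleserlich]"


def get_threshold(script_type, default=0.3, kurrent=0.5):
    """Pick a confidence threshold per script type.

    The "kurrent" string and the keyword parameters are assumptions;
    the commits only state that Kurrent uses 0.5 and other paths use 0.3.
    """
    return kurrent if script_type.lower() == "kurrent" else default


def apply_confidence_markers(words, threshold):
    """Replace low-confidence words with a marker and collapse adjacent markers.

    `words` is assumed to be an iterable of (text, confidence) pairs.
    """
    out = []
    previous_was_marker = False
    for text, confidence in words:
        if confidence < threshold:
            # Emit only one marker for a run of unreadable words.
            if not previous_was_marker:
                out.append(UNREADABLE)
            previous_was_marker = True
        else:
            out.append(text)
            previous_was_marker = False
    return " ".join(out)
```

With these assumptions, `apply_confidence_markers(words, get_threshold("kurrent"))` filters more aggressively than the default path, which matches the intent stated in f064b27439.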

21 Commits
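
The host allowlist check from 70689b8f7b could be sketched as follows, using only the Python standard library. The default host list is taken from the commit message; the helper names `parse_allowed_hosts` and `is_allowed_pdf_url` and the restriction to `http`/`https` schemes are hypothetical and not confirmed by the log. The commit also disables redirect following in the download client, which this sketch does not cover.

```python
from urllib.parse import urlparse

# Default taken from the commit message for ALLOWED_PDF_HOSTS.
DEFAULT_ALLOWED_PDF_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw):
    """Turn a comma-separated host list into a normalized set (hypothetical helper)."""
    return {host.strip().lower() for host in raw.split(",") if host.strip()}


def is_allowed_pdf_url(url, allowed_hosts):
    """Return True only if the URL's host is on the allowlist (hypothetical helper).

    Limiting schemes to http/https is an assumption of this sketch.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname strips port and userinfo and is already lowercased.
    host = parsed.hostname
    return host is not None and host in allowed_hosts
```

Comparing the parsed hostname rather than searching the raw URL string avoids being fooled by URLs such as `http://minio@evil.example/x`, where the allowed name appears only in the userinfo part.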