familienarchiv

Author	SHA1	Message	Date
Marcel	ddec64fc79	feat(ocr): extract translateOcrProgress with ANALYZING_PAGE and DONE:skipped support Move translateOcrProgress from page.svelte to a testable module. Return structured result with currentPage/totalPages/skippedPages for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages parsing. Add i18n keys for de/en/es. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:09:29 +02:00
Marcel	292dc66f3c	feat(ocr): rewrite runSingleDocument to use streamBlocks with per-page progress Replace the single extractBlocks() call with streamBlocks() that processes pages incrementally. Each page's blocks are persisted immediately via createSingleBlock(). Progress updates use the ANALYZING_PAGE:current:total:blocks format. Per-page errors are logged at WARN level without failing the entire job. The batch path (processDocument) remains on the old extractBlocks() path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:07:06 +02:00
Marcel	6823973429	refactor(ocr): extract createSingleBlock from createTranscriptionBlocks Enable per-page block creation during streaming by extracting the loop body into a package-private createSingleBlock() method with an explicit sortOrder parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:04:02 +02:00
Marcel	93c3154b3c	feat(ocr): implement NDJSON streaming in RestClientOcrClient Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON response line by line with a dedicated ObjectMapper. Falls back to the old /ocr endpoint via the default method when /ocr/stream returns 404. Uses a separate HttpClient with 5-minute request timeout for streaming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:03:12 +02:00
Marcel	641e91d5a3	feat(ocr): add default streamBlocks method to OcrClient interface The default method synthesizes Start/Page/Done events from extractBlocks() results, providing backward compatibility for implementations that don't support streaming natively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:01:26 +02:00
Marcel	e21d01e10b	feat(ocr): add OcrStreamEvent sealed interface with Start/Page/Error/Done records Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed interface with record subtypes for exhaustive pattern matching in the consumer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:00:02 +02:00
Marcel	97c6cf6a65	feat(ocr): add NDJSON streaming endpoint POST /ocr/stream Streams one JSON line per completed page instead of buffering the entire result. Emits start/page/error/done events. On per-page failure, logs the traceback but yields a generic error message and continues with the next page. Adds X-Accel-Buffering: no and Cache-Control: no-cache headers for reverse-proxy compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:57:57 +02:00
Marcel	b7d5f71ef7	refactor(ocr): extract extract_page_blocks() from both OCR engines Enable per-page processing by extracting the inner loop body of extract_blocks() into extract_page_blocks(image, page_idx, language). The original extract_blocks() now delegates to the new function, preserving backward compatibility for the batch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:56:34 +02:00
Marcel	d8dcba1a71	fix(ocr): unblock event loop during OCR and show errors in UI OCR engines are CPU-bound and were blocking Uvicorn's single async event loop, making /health unresponsive during processing. This caused new OCR requests to fail silently (health check failure → no DB record → UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the event loop free. Also surface OCR trigger errors in the frontend instead of silently resetting the spinner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:50:39 +02:00
Marcel	ef11e4af09	fix(ocr): disable manual annotation drawing while OCR is running Prevents users from drawing annotations that would be cleared when the OCR job finishes. transcribeMode is set to false for the PDF viewer while ocrRunning is true. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:32:55 +02:00
Marcel	971527a50e	feat(ocr): show translated progress messages during OCR processing Backend sends progress codes (PREPARING, LOADING, ANALYZING, CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage. Frontend translates them via Paraglide (de/en/es) and displays below the spinner. - V27 migration: adds progress_message column to ocr_jobs - OcrAsyncRunner updates progress at each phase - Poll interval reduced to 2s for snappier updates Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:31:23 +02:00
Marcel	0b0d4a7d5e	perf(ocr): double batch sizes (detector=8, recognition=16) 4GB headroom in the container. Doubling batches should use ~2GB more RAM but significantly speed up inference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:23:13 +02:00
Marcel	1b7540143e	fix(ocr): persist model cache across container restarts Surya downloads models from HuggingFace to /root/.cache on first use. Without a volume, every container restart re-downloads ~73MB+. Added ocr_cache volume to persist the cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:21:51 +02:00
Marcel	2cc7dcd5e3	perf(ocr): increase batch sizes (detector=4, recognition=8) 5GB free on host during OCR, container at 3.8/8GB. Larger batches use more memory but process faster on CPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:19:22 +02:00
Marcel	c1befd3fa3	fix(ocr): resume polling on page reload + track single-doc job status Single-document OCR now creates an OcrJobDocument row so GET /api/documents/{id}/ocr-status can find running jobs. OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED). Frontend checks OCR status when entering transcription mode — if a job is running, resumes polling and shows the spinner. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:16:59 +02:00
Marcel	2db1b73d5d	fix(ocr): force HTTP/1.1 on RestClient to OCR service JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn rejects the upgrade ('Unsupported upgrade request'), causing the request body to be lost and a 422 'Field required' from FastAPI. Force HTTP/1.1 since the OCR service is internal and doesn't need h2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:08:11 +02:00
Marcel	838330b405	fix(ocr): use camelCase field names in Pydantic models Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected. The Java client sends camelCase (pdfUrl, scriptType, pageNumber). Use camelCase field names directly instead of aliases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:04:42 +02:00
Marcel	9e01009e3d	fix(async): revert to AbortPolicy — CallerRunsPolicy blocks requests CallerRunsPolicy would cause the HTTP request to hang for minutes if the queue is full. AbortPolicy with queue=100 is safe — the queue will never realistically fill for a family archive. If it somehow does, a clear error is better than a silent multi-minute hang. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:02:58 +02:00
Marcel	0bfaa7540b	fix(async): queue 100 tasks + CallerRunsPolicy instead of abort Better to wait than to error. Queue capacity 100 holds plenty of OCR jobs. CallerRunsPolicy means if the queue is somehow full, the request blocks instead of getting rejected with an exception. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:01:37 +02:00
Marcel	b6d928e1c5	fix(async): increase thread pool to 2 threads + queue of 10 The old pool (1 thread, queue=1) meant OCR blocked all other async tasks (imports). Now 2 concurrent async tasks with a queue of 10 — enough for OCR + import to run in parallel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:59:31 +02:00
Marcel	aa50951320	fix(ocr): set 10-minute read timeout on RestClientOcrClient Default RestClient timeout was 10 seconds — OCR on CPU takes minutes. Set connect timeout to 10s, read timeout to 10 minutes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:58:00 +02:00
Marcel	dd175d09e2	refactor(ocr): make single-document OCR async, fix circular dependency OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR processing logic (processDocument, clearExistingBlocks, createBlocks) into OcrAsyncRunner. OcrService is now a thin entry point that validates, creates the job, and dispatches to OcrAsyncRunner. Architecture: - OcrService: validates document, checks health, creates OcrJob, delegates - OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch - OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner - No circular dependencies Single-document OCR is now async (returns jobId immediately). Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED. 816 backend tests pass, 687 frontend tests pass. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:55:52 +02:00
Marcel	741979304c	fix(ocr): increase to 8g mem_limit and larger batch sizes 5GB free on host while OCR runs — give the container more room. Bump batch sizes (detector=2, recognition=4) so it processes faster with the available memory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:35:34 +02:00
Marcel	e9cf2998fe	fix(ocr): reduce mem_limit to 4g, allow 4g swap for 16GB dev machines mem_limit 4g keeps more RAM free for the host. memswap_limit 8g (= 4g swap) lets peaks spill to disk instead of OOM-killing. Slower during peak inference but won't starve the dev machine. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:33:05 +02:00
Marcel	902d423f3c	fix(ocr): reduce memory usage for 16GB dev machines - Surya models lazy-load on first OCR request instead of at startup (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB) - Process one page at a time in Surya engine (limits peak memory) - RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM) - Revert mem_limit back to 6GB (sufficient with these optimizations) - Render DPI stays at 200 Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:26:50 +02:00
Marcel	7f78bc9cf4	fix(ocr): increase memory limit to 10GB, reduce render DPI to 200 Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF, page images + inference tensors push past the 6GB limit, causing OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced render DPI to 200 (still sufficient for OCR, uses ~44% less memory). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:20:36 +02:00
Marcel	4500c99e40	fix(ocr): use presigned URLs for MinIO access from OCR service The OCR service was getting 403 Forbidden because it tried to download PDFs from MinIO using plain internal URLs without authentication. MinIO buckets are private. - Add S3Presigner bean to MinioConfig - FileService.generatePresignedUrl(): generates 15-min presigned URLs - OcrService uses presigned URLs instead of plain internal URLs - Remove unused s3InternalUrl / bucketName @Value fields from OcrService Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:16:52 +02:00
Marcel	7a4da7cb98	fix(pdf): guard against null textLayerEl in renderPage Prevents 'can't access property innerHTML, textDiv is null' when the component unmounts while a render is in flight (e.g. switching to OCR progress view tears down the panel content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:10:33 +02:00
Marcel	f6667e0e15	feat(frontend): show OcrProgress during OCR job + check status on load - triggerOcr captures jobId from POST response and shows OcrProgress - OcrProgress rendered in the transcription panel when ocrJobId is set - handleOcrDone reloads blocks and annotations when OCR completes - checkOcrStatus called when entering transcription mode — resumes progress display if a job is already running for this document Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:09:24 +02:00
Marcel	8dc9243add	feat(frontend): wire OCR trigger + review toggle into transcription panel - OcrTrigger component rendered in the transcription empty state when the document has a file and user has write permission - Review checkmark toggle on each TranscriptionBlock (turquoise when reviewed, muted outline when not). Calls PUT .../review to toggle. - TranscriptionBlockData type: added source + reviewed fields - +page.svelte: triggerOcr() and reviewToggle() functions wired up - Paraglide translations (de/en/es) for review toggle + reviewed count All 687 frontend tests pass. Refs #226, #230 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:02:56 +02:00
Marcel	3aaec01421	feat(transcription): add source/reviewed fields for training pipeline - BlockSource enum: MANUAL, OCR - V26 migration adds source + reviewed columns to transcription_blocks - OcrService sets source=OCR when creating blocks - TranscriptionService.reviewBlock() toggles the reviewed flag - PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint - 5 new tests: reviewBlock toggle/untoggle/notfound, controller, OcrService source=OCR verification The reviewed flag enables the Kraken fine-tuning pipeline: only blocks marked as reviewed by a human are exported as training data. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 21:44:51 +02:00
Marcel	f064b27439	feat(ocr): per-script-type confidence thresholds Kurrent OCR produces much lower confidence than typewriter/Latin. Separate thresholds allow aggressive filtering for Kurrent (0.5) while keeping typewriter lenient (0.3). - OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3) - OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5) - apply_confidence_markers() now accepts threshold parameter - get_threshold(script_type) selects the right threshold Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:50:59 +02:00
Marcel	dd078d50da	fix(ocr): extract PDF pages as PNGs before running kraken OCR Kraken's -f pdf mode tries to write output next to the input file, which fails on read-only mounts. Instead, extract pages as PNGs via pypdfium2 (already installed), then run kraken on each image. Both models run in a single container per PDF to avoid overhead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:37:29 +02:00
Marcel	31519af1a4	fix(ocr): add pyvips for kraken PDF input support Kraken 7 requires pyvips (optional dep) for -f pdf mode. Added libvips42 system package and pyvips Python package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:11:14 +02:00
Marcel	c0004f5e6f	fix(ocr): parse kraken 'Model dir' output to locate downloaded model The previous approach used find across the htrmopo cache which failed because -newer /tmp ran in a separate container. Now parses the 'Model dir: <path>' line from kraken get output directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:09:23 +02:00
Marcel	f12b41161e	fix(ocr): update model script for kraken 7 DOI-based downloads Kraken 7 uses DOIs (not short names) to identify models from Zenodo. Updated to use actual DOIs: - 10.5281/zenodo.7933463 — German handwriting HTR - 10.5281/zenodo.13788177 — McCATMuS generic handwritten/printed/typed Added -f pdf flag for PDF input, volume mounts for import dir, and post-download copy from htrmopo cache to the models volume. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:05:29 +02:00
Marcel	37abc376ec	fix(ocr): install torchvision from CPU index alongside torch torchvision installed from PyPI expects CUDA torch operator registrations. Installing from the CPU whl index ensures torchvision matches the CPU-only torch build. Fixes 'torchvision::nms does not exist' RuntimeError on startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:46:37 +02:00
Marcel	0af4749677	feat(ocr): extend model script with automatic OCR evaluation Downloads both Kraken models, then runs each against 4 sample PDFs from the import folder (Eu-0693, Eu-0692, W-0150, W-0575). Output goes to ocr-model-evaluation/<model-name>/<doc>.txt for side-by-side comparison. Usage: ./scripts/download-kraken-models.sh # download + evaluate ./scripts/download-kraken-models.sh --eval-only # re-run evaluation ./scripts/download-kraken-models.sh --activate 1 # pick winner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:41:59 +02:00
Marcel	6669fffead	fix(ocr): pin transformers<5.0 and torch==2.7.1 in requirements.txt transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing pad_token_id. Pin to transformers>=4.56.1,<5.0.0. Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading it past the CPU-only build installed in the Dockerfile layer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:34:03 +02:00
Marcel	41f9262238	feat(ocr): add Kraken model download and evaluation script Runbook script to download both HTR-United Kurrent model candidates (german_kurrent_manu_9, kurrent-de) into the ocr_models Docker volume, test them against sample documents, and activate the winner. Usage: ./scripts/download-kraken-models.sh # download both ./scripts/download-kraken-models.sh --activate 1 # pick model 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:19:39 +02:00
Marcel	c74539b04b	feat(ocr): auto-insert [unleserlich] markers for low-confidence words New confidence.py module with two functions: - apply_confidence_markers(): replaces words below threshold with [unleserlich], collapses adjacent markers into one - words_from_characters(): reconstructs word-level confidence from Kraken's character-level data Surya 0.17 provides native word-level confidence via line.words. Kraken 7.0 provides per-character confidences via record.confidences. Both engines now pass word+confidence data through main.py, which applies the marker post-processing before returning the API response. Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3). Frontend already renders [unleserlich] markers via transcriptionMarkers.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:16:17 +02:00
Marcel	49975154d9	feat(ocr): bump to latest surya 0.17.1, kraken 7.0, torch 2.7.1 - surya-ocr 0.6.3 → 0.17.1: new predictor API (FoundationPredictor, RecognitionPredictor, DetectionPredictor), native polygon output on text lines (4-point clockwise) - kraken 5.2.9 → 7.0: wider torch range (>=2.4,<=2.10), unpinned numpy - torch 2.5.1 → 2.7.1: satisfies surya's >=2.7.0 requirement - Rewrite engines/surya.py for the 0.17 predictor class API - Surya now outputs polygons natively — no longer rectangle-only Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:53:14 +02:00
Marcel	e29c865016	fix(ocr): upgrade kraken to 6.0.3 for torch>=2.4 compatibility kraken 5.2.9 required torch~=2.1.0, incompatible with surya-ocr's torch>=2.3.0. kraken 6.0.3 requires torch>=2.4.0,<=2.9 which overlaps with surya and our pinned torch==2.5.1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:48:14 +02:00
Marcel	d49010cd7b	fix(ocr): relax pillow version to match surya-ocr constraint surya-ocr 0.6.3 requires pillow<11.0.0,>=10.2.0. The previous pin at 11.1.0 caused a dependency resolution failure during Docker build. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:40:46 +02:00
Marcel	931fbc28e5	fix(annotations): use @JdbcTypeCode(JSON) for polygon JSONB column Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON) to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar. The PolygonConverter is retained as a standalone utility but no longer used on the entity. Hibernate 6 natively handles List<List<Double>> serialization to JSONB. Refs #227 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:39:54 +02:00
Marcel	a4651aa317	feat(frontend): add OCR UI components and translations - ScriptTypeSelect: native select for TYPEWRITER/HANDWRITING_LATIN/KURRENT - OcrTrigger: wraps script type select + start button + confirmation dialog - OcrProgress: SSE-based progress display with page counter and progress bar - Paraglide translations for OCR (de/en/es): script types, trigger labels, confirmation dialog, progress messages, error messages - ErrorCode type + getErrorMessage: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND, OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED All 687 frontend tests pass. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:36:00 +02:00
Marcel	cf8dc3559f	feat(frontend): extract AnnotationShape component with polygon support - AnnotationShape.svelte: renders a single annotation as either a rectangle or a polygon-clipped div (via CSS clip-path: polygon()) - AnnotationLayer.svelte: refactored to delegate rendering to AnnotationShape, keeping draw logic and hover state management - Annotation type: added optional polygon field ([number, number][] | null) - Polygon coordinates are converted from page-normalized to bounding-box-relative percentages for clip-path All 687 existing frontend tests pass. Refs #227 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:30:27 +02:00
Marcel	6737bd6db5	feat(ocr): add Python OCR microservice, RestClientOcrClient, Docker Compose Python microservice (ocr-service/): - FastAPI app with /ocr and /health endpoints - Surya engine: transformer-based OCR for typewritten/modern handwriting - Kraken engine: historical HTR for Kurrent/Suetterlin with pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers) - Eager model loading at startup via lifespan context manager - PDF download via httpx, page rendering via pypdfium2 at 300 DPI Java RestClientOcrClient: - Implements OcrClient + OcrHealthClient interfaces - Calls Python service via Spring RestClient - Health check with graceful fallback Docker Compose: - New ocr-service container (mem_limit 6g, no host ports) - Health check with start_period 60s for model loading - ocr_models volume for Kraken model files - Backend depends on ocr-service health Refs #226, #227 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:26:40 +02:00
Marcel	aea46c5fd0	feat(ocr): add OcrService, OcrBatchService, OcrProgressService, OcrController - OcrService: single-document OCR (health check, block clearing, presigned URL, annotation + block creation) - OcrBatchService: batch processing with @Async, per-document status tracking, SKIPPED for PLACEHOLDER documents, failure isolation - OcrProgressService: SSE emitter registry per job ID with 5-min timeout - OcrController: POST /api/documents/{id}/ocr (WRITE_ALL), POST /api/ocr/batch (ADMIN), GET /api/ocr/jobs/{id} (READ_ALL), GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status 19 tests: 6 OcrService, 4 OcrBatchService, 3 OcrProgressService, 6 OcrController Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:24:15 +02:00
Marcel	ff3990710e	feat(ocr): add OCR infrastructure (interfaces, entities, migrations, DTOs) - OcrClient + OcrHealthClient interfaces for testable OCR integration - OcrBlockResult record for OCR engine response mapping - OcrJob + OcrJobDocument entities with status enums - V25 migration creates ocr_jobs and ocr_job_documents tables - Repositories for job and job-document queries - TriggerOcrDTO, BatchOcrDTO (@Size max=500), OcrStatusDTO - ErrorCodes: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND, OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 15:15:16 +02:00
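
The confidence-marker behaviour described in commits c74539b04b and f064b27439 can be pictured with a minimal sketch. This is not the repository's confidence.py: the (word, confidence) tuple input, the string return value and the script-type strings are assumptions made for illustration. Only the function names, the [unleserlich] marker and the 0.3 / 0.5 defaults come from the commit messages.

```python
from typing import List, Tuple

MARKER = "[unleserlich]"

# Defaults quoted in the commit messages. The service reads them from the
# OCR_CONFIDENCE_THRESHOLD and OCR_CONFIDENCE_THRESHOLD_KURRENT env vars;
# they are hard-coded here to keep the sketch self-contained.
DEFAULT_THRESHOLD = 0.3
KURRENT_THRESHOLD = 0.5


def get_threshold(script_type: str) -> float:
    """Pick the stricter threshold for Kurrent, the default otherwise.

    The "KURRENT" string mirrors the script types listed in commit
    a4651aa317; how the real function compares them is an assumption.
    """
    return KURRENT_THRESHOLD if script_type == "KURRENT" else DEFAULT_THRESHOLD


def apply_confidence_markers(words: List[Tuple[str, float]], threshold: float) -> str:
    """Replace low-confidence words with a marker and collapse marker runs.

    `words` is a hypothetical list of (text, confidence) pairs.
    """
    out: List[str] = []
    for text, confidence in words:
        if confidence < threshold:
            # Emit at most one marker for a run of unreadable words.
            if not out or out[-1] != MARKER:
                out.append(MARKER)
        else:
            out.append(text)
    return " ".join(out)


line = [("Lieber", 0.92), ("Va", 0.12), ("tr", 0.20), ("Gruss", 0.81)]
print(apply_confidence_markers(line, get_threshold("TYPEWRITER")))
```

With the Kurrent threshold of 0.5, a word scored 0.4 would be masked while the same word on the typewriter path would be kept, which is the asymmetry commit f064b27439 describes.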

763 Commits