familienarchiv

Author	SHA1	Message	Date
Marcel	dd47a48d90	feat(ocr): add unique constraint on (job_id, document_id) Prevents the same document from being added to an OCR job twice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:28:18 +02:00
Marcel	08b1cd5dac	fix(ocr): reduce async queue capacity from 100 to 10 Queue capacity of 100 is disproportionate for 2 worker threads — a backed-up queue would represent hours of unprocessed OCR jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:58 +02:00
Marcel	5a97316940	fix(ocr): log warning when user ID resolution fails The resolveUserId() catch block was silently swallowing exceptions, making auth failures invisible in logs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:39 +02:00
Marcel	9282e46a02	fix(ocr): handle unknown NDJSON fields with @JsonIgnoreProperties Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so new fields from the Python OCR service don't crash the Java parser, while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:20 +02:00
Marcel	caae2ead81	refactor(ocr): route block lifecycle through TranscriptionService OcrAsyncRunner was bypassing TranscriptionService — building blocks directly and calling blockRepository.save(), skipping sanitizeText() and saveVersion(). Also replaced N individual deleteBlock() calls with a single bulk deleteAllBlocksByDocument() for OCR re-runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:01 +02:00
Marcel	6a0fd25662	fix(ocr): persist scriptType override via DocumentService transaction OcrService.startOcr() was setting scriptType on a detached entity, silently losing the mutation. Added DocumentService.updateScriptType() with @Transactional to persist the change properly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:37 +02:00
Marcel	2d43f09172	refactor(ocr): move repository access from OcrController into OcrService OcrController was injecting OcrJobRepository and OcrJobDocumentRepository directly, violating the Controller → Service → Repository layering rule. Moved getJob() and getDocumentOcrStatus() logic into OcrService. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:14 +02:00
Marcel	410ef88e1a	refactor(ocr): delete unused OcrProgressBar component The skipped-pages warning is inlined directly in +page.svelte. The component and its tests are no longer needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:53:10 +02:00
Marcel	6b94882409	fix(ocr): remove redundant page counter from progress display The progress message already says "Seite 3 von 7 wird analysiert…" so the separate "3 / 7" counter was redundant. Remove the OcrProgressBar from the page and inline only the skipped-pages warning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:50:05 +02:00
Marcel	b868da07cd	fix(ocr): remove progress bar, keep text-only page counter The thin bar without a border looked broken at low progress values. The text counter (e.g. "1 / 6") already communicates progress clearly so the bar is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:46:29 +02:00
Marcel	84aca240ea	fix(ocr): remove misleading ANALYZING progress before streaming starts The ANALYZING message appeared while the Python service was still downloading the PDF and loading models. Remove it so the LOADING message ("Lade Modell und Dokument…") stays visible until the first ANALYZING_PAGE event arrives from the stream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:40:54 +02:00
Marcel	3fe6eedffb	feat(ocr): allow re-running OCR when transcription blocks already exist Add a collapsible OCR trigger below the block list in edit mode. Uses a <details> element so it's unobtrusive — the primary workflow is editing existing blocks, but users can expand to re-run OCR with a confirmation dialog that warns about replacing existing blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:37:51 +02:00
Marcel	69768a104d	test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers Cover Surya polygon/word-level extraction, health endpoint states, Kraken script-type routing, 503 when models not ready, 400 when Kraken unavailable for Kurrent, and confidence marker application during streaming. Production code coverage: 88%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:34:23 +02:00
Marcel	97e5138934	fix(ocr): use 1-based page numbers to match frontend PDF viewer The PDF viewer uses 1-based currentPage (starting at 1) but the OCR engines produced 0-based pageNumber from enumerate(). Annotations created by OCR were assigned to page 0, which doesn't exist in the viewer. Change enumerate() to start=1 in both engines and the streaming endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:32:08 +02:00
Marcel	bac67706b9	feat(ocr): integrate progress bar and streaming progress into document page Replace inline translateOcrProgress with the extracted module. Add OcrProgressBar below the spinner during OCR. Parse page numbers from ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill bar to 100% briefly before clearing the overlay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:15:55 +02:00
Marcel	035f9768bd	feat(ocr): add OcrProgressBar component with page-based ARIA semantics Progress bar shows brand-mint fill on brand-sand background with smooth transition. Displays page counter with tabular-nums and skipped-pages warning in amber when applicable. Only renders when totalPages > 0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:13:57 +02:00
Marcel	ddec64fc79	feat(ocr): extract translateOcrProgress with ANALYZING_PAGE and DONE:skipped support Move translateOcrProgress from page.svelte to a testable module. Return structured result with currentPage/totalPages/skippedPages for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages parsing. Add i18n keys for de/en/es. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:09:29 +02:00
Marcel	292dc66f3c	feat(ocr): rewrite runSingleDocument to use streamBlocks with per-page progress Replace the single extractBlocks() call with streamBlocks() that processes pages incrementally. Each page's blocks are persisted immediately via createSingleBlock(). Progress updates use the ANALYZING_PAGE:current:total:blocks format. Per-page errors are logged at WARN level without failing the entire job. The batch path (processDocument) remains on the old extractBlocks() path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:07:06 +02:00
Marcel	6823973429	refactor(ocr): extract createSingleBlock from createTranscriptionBlocks Enable per-page block creation during streaming by extracting the loop body into a package-private createSingleBlock() method with an explicit sortOrder parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:04:02 +02:00
Marcel	93c3154b3c	feat(ocr): implement NDJSON streaming in RestClientOcrClient Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON response line by line with a dedicated ObjectMapper. Falls back to the old /ocr endpoint via the default method when /ocr/stream returns 404. Uses a separate HttpClient with 5-minute request timeout for streaming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:03:12 +02:00
Marcel	641e91d5a3	feat(ocr): add default streamBlocks method to OcrClient interface The default method synthesizes Start/Page/Done events from extractBlocks() results, providing backward compatibility for implementations that don't support streaming natively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:01:26 +02:00
Marcel	e21d01e10b	feat(ocr): add OcrStreamEvent sealed interface with Start/Page/Error/Done records Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed interface with record subtypes for exhaustive pattern matching in the consumer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:00:02 +02:00
Marcel	97c6cf6a65	feat(ocr): add NDJSON streaming endpoint POST /ocr/stream Streams one JSON line per completed page instead of buffering the entire result. Emits start/page/error/done events. On per-page failure, logs the traceback but yields a generic error message and continues with the next page. Adds X-Accel-Buffering: no and Cache-Control: no-cache headers for reverse-proxy compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:57:57 +02:00
Marcel	b7d5f71ef7	refactor(ocr): extract extract_page_blocks() from both OCR engines Enable per-page processing by extracting the inner loop body of extract_blocks() into extract_page_blocks(image, page_idx, language). The original extract_blocks() now delegates to the new function, preserving backward compatibility for the batch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:56:34 +02:00
Marcel	d8dcba1a71	fix(ocr): unblock event loop during OCR and show errors in UI OCR engines are CPU-bound and were blocking Uvicorn's single async event loop, making /health unresponsive during processing. This caused new OCR requests to fail silently (health check failure → no DB record → UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the event loop free. Also surface OCR trigger errors in the frontend instead of silently resetting the spinner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:50:39 +02:00
Marcel	ef11e4af09	fix(ocr): disable manual annotation drawing while OCR is running Prevents users from drawing annotations that would be cleared when the OCR job finishes. transcribeMode is set to false for the PDF viewer while ocrRunning is true. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:32:55 +02:00
Marcel	971527a50e	feat(ocr): show translated progress messages during OCR processing Backend sends progress codes (PREPARING, LOADING, ANALYZING, CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage. Frontend translates them via Paraglide (de/en/es) and displays below the spinner. - V27 migration: adds progress_message column to ocr_jobs - OcrAsyncRunner updates progress at each phase - Poll interval reduced to 2s for snappier updates Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:31:23 +02:00
Marcel	0b0d4a7d5e	perf(ocr): double batch sizes (detector=8, recognition=16) 4GB headroom in the container. Doubling batches should use ~2GB more RAM but significantly speed up inference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:23:13 +02:00
Marcel	1b7540143e	fix(ocr): persist model cache across container restarts Surya downloads models from HuggingFace to /root/.cache on first use. Without a volume, every container restart re-downloads ~73MB+. Added ocr_cache volume to persist the cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:21:51 +02:00
Marcel	2cc7dcd5e3	perf(ocr): increase batch sizes (detector=4, recognition=8) 5GB free on host during OCR, container at 3.8/8GB. Larger batches use more memory but process faster on CPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:19:22 +02:00
Marcel	c1befd3fa3	fix(ocr): resume polling on page reload + track single-doc job status Single-document OCR now creates an OcrJobDocument row so GET /api/documents/{id}/ocr-status can find running jobs. OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED). Frontend checks OCR status when entering transcription mode — if a job is running, resumes polling and shows the spinner. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:16:59 +02:00
Marcel	2db1b73d5d	fix(ocr): force HTTP/1.1 on RestClient to OCR service JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn rejects the upgrade ('Unsupported upgrade request'), causing the request body to be lost and a 422 'Field required' from FastAPI. Force HTTP/1.1 since the OCR service is internal and doesn't need h2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:08:11 +02:00
Marcel	838330b405	fix(ocr): use camelCase field names in Pydantic models Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected. The Java client sends camelCase (pdfUrl, scriptType, pageNumber). Use camelCase field names directly instead of aliases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:04:42 +02:00
Marcel	9e01009e3d	fix(async): revert to AbortPolicy — CallerRunsPolicy blocks requests CallerRunsPolicy would cause the HTTP request to hang for minutes if the queue is full. AbortPolicy with queue=100 is safe — the queue will never realistically fill for a family archive. If it somehow does, a clear error is better than a silent multi-minute hang. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:02:58 +02:00
Marcel	0bfaa7540b	fix(async): queue 100 tasks + CallerRunsPolicy instead of abort Better to wait than to error. Queue capacity 100 holds plenty of OCR jobs. CallerRunsPolicy means if the queue is somehow full, the request blocks instead of getting rejected with an exception. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 23:01:37 +02:00
Marcel	b6d928e1c5	fix(async): increase thread pool to 2 threads + queue of 10 The old pool (1 thread, queue=1) meant OCR blocked all other async tasks (imports). Now 2 concurrent async tasks with a queue of 10 — enough for OCR + import to run in parallel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:59:31 +02:00
Marcel	aa50951320	fix(ocr): set 10-minute read timeout on RestClientOcrClient Default RestClient timeout was 10 seconds — OCR on CPU takes minutes. Set connect timeout to 10s, read timeout to 10 minutes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:58:00 +02:00
Marcel	dd175d09e2	refactor(ocr): make single-document OCR async, fix circular dependency OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR processing logic (processDocument, clearExistingBlocks, createBlocks) into OcrAsyncRunner. OcrService is now a thin entry point that validates, creates the job, and dispatches to OcrAsyncRunner. Architecture: - OcrService: validates document, checks health, creates OcrJob, delegates - OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch - OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner - No circular dependencies Single-document OCR is now async (returns jobId immediately). Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED. 816 backend tests pass, 687 frontend tests pass. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:55:52 +02:00
Marcel	741979304c	fix(ocr): increase to 8g mem_limit and larger batch sizes 5GB free on host while OCR runs — give the container more room. Bump batch sizes (detector=2, recognition=4) so it processes faster with the available memory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:35:34 +02:00
Marcel	e9cf2998fe	fix(ocr): reduce mem_limit to 4g, allow 4g swap for 16GB dev machines mem_limit 4g keeps more RAM free for the host. memswap_limit 8g (= 4g swap) lets peaks spill to disk instead of OOM-killing. Slower during peak inference but won't starve the dev machine. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:33:05 +02:00
Marcel	902d423f3c	fix(ocr): reduce memory usage for 16GB dev machines - Surya models lazy-load on first OCR request instead of at startup (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB) - Process one page at a time in Surya engine (limits peak memory) - RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM) - Revert mem_limit back to 6GB (sufficient with these optimizations) - Render DPI stays at 200 Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:26:50 +02:00
Marcel	7f78bc9cf4	fix(ocr): increase memory limit to 10GB, reduce render DPI to 200 Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF, page images + inference tensors push past the 6GB limit, causing OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced render DPI to 200 (still sufficient for OCR, uses ~44% less memory). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:20:36 +02:00
Marcel	4500c99e40	fix(ocr): use presigned URLs for MinIO access from OCR service The OCR service was getting 403 Forbidden because it tried to download PDFs from MinIO using plain internal URLs without authentication. MinIO buckets are private. - Add S3Presigner bean to MinioConfig - FileService.generatePresignedUrl(): generates 15-min presigned URLs - OcrService uses presigned URLs instead of plain internal URLs - Remove unused s3InternalUrl / bucketName @Value fields from OcrService Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:16:52 +02:00
Marcel	7a4da7cb98	fix(pdf): guard against null textLayerEl in renderPage Prevents 'can't access property innerHTML, textDiv is null' when the component unmounts while a render is in flight (e.g. switching to OCR progress view tears down the panel content). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:10:33 +02:00
Marcel	f6667e0e15	feat(frontend): show OcrProgress during OCR job + check status on load - triggerOcr captures jobId from POST response and shows OcrProgress - OcrProgress rendered in the transcription panel when ocrJobId is set - handleOcrDone reloads blocks and annotations when OCR completes - checkOcrStatus called when entering transcription mode — resumes progress display if a job is already running for this document Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:09:24 +02:00
Marcel	8dc9243add	feat(frontend): wire OCR trigger + review toggle into transcription panel - OcrTrigger component rendered in the transcription empty state when the document has a file and user has write permission - Review checkmark toggle on each TranscriptionBlock (turquoise when reviewed, muted outline when not). Calls PUT .../review to toggle. - TranscriptionBlockData type: added source + reviewed fields - +page.svelte: triggerOcr() and reviewToggle() functions wired up - Paraglide translations (de/en/es) for review toggle + reviewed count All 687 frontend tests pass. Refs #226, #230 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:02:56 +02:00
Marcel	3aaec01421	feat(transcription): add source/reviewed fields for training pipeline - BlockSource enum: MANUAL, OCR - V26 migration adds source + reviewed columns to transcription_blocks - OcrService sets source=OCR when creating blocks - TranscriptionService.reviewBlock() toggles the reviewed flag - PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint - 5 new tests: reviewBlock toggle/untoggle/notfound, controller, OcrService source=OCR verification The reviewed flag enables the Kraken fine-tuning pipeline: only blocks marked as reviewed by a human are exported as training data. Refs #226 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 21:44:51 +02:00
Marcel	f064b27439	feat(ocr): per-script-type confidence thresholds Kurrent OCR produces much lower confidence than typewriter/Latin. Separate thresholds allow aggressive filtering for Kurrent (0.5) while keeping typewriter lenient (0.3). - OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3) - OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5) - apply_confidence_markers() now accepts threshold parameter - get_threshold(script_type) selects the right threshold Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:50:59 +02:00
Marcel	dd078d50da	fix(ocr): extract PDF pages as PNGs before running kraken OCR Kraken's -f pdf mode tries to write output next to the input file, which fails on read-only mounts. Instead, extract pages as PNGs via pypdfium2 (already installed), then run kraken on each image. Both models run in a single container per PDF to avoid overhead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:37:29 +02:00
Marcel	31519af1a4	fix(ocr): add pyvips for kraken PDF input support Kraken 7 requires pyvips (optional dep) for -f pdf mode. Added libvips42 system package and pyvips Python package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:11:14 +02:00
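
Illustration for 97c6cf6a65 and 97e5138934 (NDJSON streaming with 1-based page numbers): the following is a minimal sketch, not the repository's code. It assumes a hypothetical `stream_ocr_events()` generator and a hypothetical two-argument `extract_page_blocks` callable; the JSON field names (`type`, `totalPages`, `pageNumber`, `blocks`, `skippedPages`) are also assumptions, since the log only names the start/page/error/done event kinds. It shows one JSON line per page, a generic error line on per-page failure, and `enumerate(..., start=1)`.

```python
import json
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


def stream_ocr_events(
    pages: Iterable[object],
    extract_page_blocks: Callable[[object, int], list],
) -> Iterator[str]:
    """Yield NDJSON lines: one start, one page or error per page, one done.

    Hypothetical helper; event field names are illustrative assumptions.
    """
    pages = list(pages)
    yield json.dumps({"type": "start", "totalPages": len(pages)}) + "\n"
    skipped = 0
    # start=1 so page numbers line up with a 1-based PDF viewer.
    for page_number, image in enumerate(pages, start=1):
        try:
            blocks = extract_page_blocks(image, page_number)
        except Exception:
            # Full traceback goes to the log; the client sees a generic message.
            logger.exception("OCR failed on page %d", page_number)
            skipped += 1
            yield json.dumps({
                "type": "error",
                "pageNumber": page_number,
                "message": "page could not be processed",
            }) + "\n"
            continue  # keep going with the next page
        yield json.dumps({
            "type": "page",
            "pageNumber": page_number,
            "blocks": blocks,
        }) + "\n"
    yield json.dumps({"type": "done", "skippedPages": skipped}) + "\n"
```

A consumer can read the stream line by line and call `json.loads()` on each line. In a web framework the generator would typically be wrapped in a streaming response, which is outside the scope of this sketch.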


779 Commits