Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The skipped-pages warning is inlined directly in +page.svelte.
The component and its tests are no longer needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The progress message already says "Seite 3 von 7 wird analysiert…"
so the separate "3 / 7" counter was redundant. Remove the
OcrProgressBar from the page and inline only the skipped-pages
warning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The thin bar without a border looked broken at low progress values.
The text counter (e.g. "1 / 6") already communicates progress clearly
so the bar is unnecessary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ANALYZING message appeared while the Python service was still
downloading the PDF and loading models. Remove it so the LOADING
message ("Lade Modell und Dokument…") stays visible until the first
ANALYZING_PAGE event arrives from the stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a collapsible OCR trigger below the block list in edit mode.
Uses a <details> element so it's unobtrusive — the primary workflow
is editing existing blocks, but users can expand to re-run OCR with
a confirmation dialog that warns about replacing existing blocks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.
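A minimal sketch of the off-by-one described above. The `pages` list is a stand-in for illustration, not the engines' actual data:

```python
# Stand-in for rendered page images; the real engines iterate over PDF pages.
pages = ["page-a", "page-b", "page-c"]

# Before: enumerate() defaults to start=0, so the first page got pageNumber 0.
zero_based = [page_number for page_number, _ in enumerate(pages)]

# After: start=1 lines up with the viewer's 1-based currentPage.
one_based = [page_number for page_number, _ in enumerate(pages, start=1)]
```

With `start=1`, the first page's annotations land on page 1, which the viewer can display.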
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace inline translateOcrProgress with the extracted module. Add
OcrProgressBar below the spinner during OCR. Parse page numbers from
ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill
bar to 100% briefly before clearing the overlay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Progress bar shows brand-mint fill on brand-sand background with
smooth transition. Displays page counter with tabular-nums and
skipped-pages warning in amber when applicable. Only renders when
totalPages > 0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move translateOcrProgress from page.svelte to a testable module.
Return structured result with currentPage/totalPages/skippedPages
for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages
parsing. Add i18n keys for de/en/es.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
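The progress format above could be parsed roughly as follows. This is a Python sketch for illustration only (the real producer is Java and the real consumer is the Svelte frontend); `PageProgress` and `parse_analyzing_page` are hypothetical names, and the field meanings are inferred from the names in the format string:

```python
from typing import NamedTuple, Optional


class PageProgress(NamedTuple):
    current: int
    total: int
    blocks: int


def parse_analyzing_page(code: str) -> Optional[PageProgress]:
    """Parse 'ANALYZING_PAGE:current:total:blocks'; return None for any other code."""
    parts = code.split(":")
    if len(parts) != 4 or parts[0] != "ANALYZING_PAGE":
        return None
    try:
        current, total, blocks = (int(p) for p in parts[1:])
    except ValueError:
        # Non-numeric fields: treat as an unknown code rather than raising.
        return None
    return PageProgress(current, total, blocks)
```

Returning `None` for unrecognized codes lets a caller fall back to plain codes such as LOADING without special-casing them.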
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.
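The event flow described above might look roughly like this. It is a sketch under stated assumptions: `stream_events`, the `extract_page` callback and the JSON field names (`type`, `pageNumber`, `totalPages`, `skippedPages`, `message`) are hypothetical, and only the two header names and values come from this commit:

```python
import json
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)

# Headers named in this commit; they ask reverse proxies not to buffer or cache.
STREAM_HEADERS = {"X-Accel-Buffering": "no", "Cache-Control": "no-cache"}


def stream_events(
    pages: Iterable[object],
    extract_page: Callable[[object, int], list],
) -> Iterator[str]:
    """Yield one JSON line per event: start, then page or error per page, then done."""
    pages = list(pages)
    yield json.dumps({"type": "start", "totalPages": len(pages)}) + "\n"
    skipped = 0
    for page_number, image in enumerate(pages, start=1):
        try:
            blocks = extract_page(image, page_number)
            yield json.dumps(
                {"type": "page", "pageNumber": page_number, "blocks": blocks}
            ) + "\n"
        except Exception:
            # The traceback goes to the log; the client only sees a generic message.
            logger.exception("OCR failed on page %d", page_number)
            skipped += 1
            yield json.dumps(
                {"type": "error", "pageNumber": page_number,
                 "message": "Page could not be processed"}
            ) + "\n"
    yield json.dumps({"type": "done", "skippedPages": skipped}) + "\n"
```

In a FastAPI service, a generator like this would typically be wrapped in a streaming response together with `STREAM_HEADERS`. A failing page produces an `error` line and the loop moves on to the next page.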
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.
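The delegation can be sketched as below. Only the two function names and the `(image, page_idx, language)` parameters come from this commit. The `extract_blocks()` signature, the block dictionaries and the placeholder body are assumptions for illustration:

```python
from typing import Any


def extract_page_blocks(image: Any, page_idx: int, language: str) -> list[dict]:
    """Process a single page. The body is a placeholder for the real OCR call."""
    return [{"pageNumber": page_idx, "language": language, "text": f"<text of {image}>"}]


def extract_blocks(images: list[Any], language: str) -> list[dict]:
    """Batch path: a thin loop that delegates each page to extract_page_blocks()."""
    blocks: list[dict] = []
    # The page index base is whatever the caller passes in.
    for page_idx, image in enumerate(images):
        blocks.extend(extract_page_blocks(image, page_idx, language))
    return blocks
```

Because the batch function only loops and concatenates, callers of `extract_blocks()` see the same result as before, while a streaming caller can invoke `extract_page_blocks()` one page at a time.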
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.
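A small sketch of the pattern, with `time.sleep()` standing in for the blocking engine call. `run_ocr_blocking` and `health` are hypothetical names, not the service's actual handlers:

```python
import asyncio
import time


def run_ocr_blocking(seconds: float) -> str:
    """Stand-in for a blocking OCR engine call."""
    time.sleep(seconds)
    return "blocks"


async def health() -> str:
    return "ok"


async def main() -> tuple[str, float, str]:
    # Offload the blocking call to a worker thread so the event loop stays free.
    ocr_task = asyncio.create_task(asyncio.to_thread(run_ocr_blocking, 0.5))
    await asyncio.sleep(0)  # give the task a chance to start
    started = time.perf_counter()
    health_result = await health()
    health_latency = time.perf_counter() - started
    return health_result, health_latency, await ocr_task


health_result, health_latency, ocr_result = asyncio.run(main())
```

Here `health()` answers while the OCR stand-in is still running. Note that a worker thread only helps for CPU-bound work if the underlying native code releases the GIL, as numeric libraries commonly do during inference.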
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents users from drawing annotations that would be cleared when
the OCR job finishes. transcribeMode is set to false for the PDF
viewer while ocrRunning is true.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.
- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The container has 4GB of headroom. Doubling the batch sizes should use
~2GB more RAM but speed up inference significantly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surya downloads models from HuggingFace to /root/.cache on first use.
Without a volume, every container restart re-downloads ~73MB+.
Added ocr_cache volume to persist the cache.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5GB free on host during OCR, container at 3.8/8GB. Larger batches
use more memory but process faster on CPU.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).
Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The dependency between OcrService and OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.
Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies
Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.
816 backend tests pass, 687 frontend tests pass.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5GB free on host while OCR runs — give the container more room.
Bump batch sizes (detector=2, recognition=4) so it processes
faster with the available memory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mem_limit 4g keeps more RAM free for the host. memswap_limit 8g
(= 4g swap) lets peaks spill to disk instead of OOM-killing.
Slower during peak inference but won't starve the dev machine.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM; Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit to 6GB (sufficient with these optimizations)
- Render DPI stays at 200
Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR; page images have ~44% of the
pixels, so they use ~56% less memory).
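The pixel count of a rendered page scales with the square of the DPI, which can be checked directly. The A4 page size below is an assumption for illustration, and `page_pixels` is a hypothetical helper:

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Pixel count of a page rendered at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


# A4 page size in inches (assumed; actual scans may differ).
A4 = (8.27, 11.69)

# Ratio of pixels at 200 DPI to pixels at 300 DPI: (200 / 300) ** 2 = 4 / 9.
ratio = page_pixels(*A4, dpi=200) / page_pixels(*A4, dpi=300)
```

This covers only the rendered page images. Model weights and inference tensors do not shrink by the same factor.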
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.
- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents 'can't access property innerHTML, textDiv is null' when
the component unmounts while a render is in flight (e.g. switching
to OCR progress view tears down the panel content).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- triggerOcr captures jobId from POST response and shows OcrProgress
- OcrProgress rendered in the transcription panel when ocrJobId is set
- handleOcrDone reloads blocks and annotations when OCR completes
- checkOcrStatus called when entering transcription mode; resumes
  progress display if a job is already running for this document
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OcrTrigger component rendered in the transcription empty state when
  the document has a file and user has write permission
- Review checkmark toggle on each TranscriptionBlock (turquoise when
  reviewed, muted outline when not). Calls PUT .../review to toggle.
- TranscriptionBlockData type: added source + reviewed fields
- +page.svelte: triggerOcr() and reviewToggle() functions wired up
- Paraglide translations (de/en/es) for review toggle + reviewed count
All 687 frontend tests pass.
Refs #226, #230
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
  OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken's -f pdf mode tries to write output next to the input file,
which fails on read-only mounts. Instead, extract pages as PNGs via
pypdfium2 (already installed), then run kraken on each image.
Both models run in a single container per PDF to avoid overhead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 requires pyvips (optional dep) for -f pdf mode.
Added libvips42 system package and pyvips Python package.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>