Move translateOcrProgress from page.svelte to a testable module.
Return structured result with currentPage/totalPages/skippedPages
for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages
parsing. Add i18n keys for de/en/es.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents users from drawing annotations that would be cleared when
the OCR job finishes. transcribeMode is set to false for the PDF
viewer while ocrRunning is true.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.
- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4GB headroom in the container. Doubling batches should use ~2GB
more RAM but significantly speed up inference.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surya downloads models from HuggingFace to /root/.cache on first use.
Without a volume, every container restart re-downloads ~73MB+.
Added ocr_cache volume to persist the cache.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5GB free on host during OCR, container at 3.8/8GB. Larger batches
use more memory but process faster on CPU.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).
Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.
Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies
Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.
816 backend tests pass, 687 frontend tests pass.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5GB free on host while OCR runs — give the container more room.
Bump batch sizes (detector=2, recognition=4) so it processes
faster with the available memory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mem_limit 4g keeps more RAM free for the host. memswap_limit 8g
(= 4g swap) lets peaks spill to disk instead of OOM-killing.
Slower during peak inference but won't starve the dev machine.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Surya models lazy-load on first OCR request instead of at startup
(saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200
Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR, uses ~44% less memory).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.
- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents 'can't access property innerHTML, textDiv is null' when
the component unmounts while a render is in flight (e.g. switching
to OCR progress view tears down the panel content).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- triggerOcr captures jobId from POST response and shows OcrProgress
- OcrProgress rendered in the transcription panel when ocrJobId is set
- handleOcrDone reloads blocks and annotations when OCR completes
- checkOcrStatus called when entering transcription mode — resumes
progress display if a job is already running for this document
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OcrTrigger component rendered in the transcription empty state when
the document has a file and user has write permission
- Review checkmark toggle on each TranscriptionBlock (turquoise when
reviewed, muted outline when not). Calls PUT .../review to toggle.
- TranscriptionBlockData type: added source + reviewed fields
- +page.svelte: triggerOcr() and reviewToggle() functions wired up
- Paraglide translations (de/en/es) for review toggle + reviewed count
All 687 frontend tests pass.
Refs #226, #230
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken's -f pdf mode tries to write output next to the input file,
which fails on read-only mounts. Instead, extract pages as PNGs via
pypdfium2 (already installed), then run kraken on each image.
Both models run in a single container per PDF to avoid overhead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 requires pyvips (optional dep) for -f pdf mode.
Added libvips42 system package and pyvips Python package.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous approach used find across the htrmopo cache which failed
because -newer /tmp ran in a separate container. Now parses the
'Model dir: <path>' line from kraken get output directly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 uses DOIs (not short names) to identify models from Zenodo.
Updated to use actual DOIs:
- 10.5281/zenodo.7933463 — German handwriting HTR
- 10.5281/zenodo.13788177 — McCATMuS generic handwritten/printed/typed
Added -f pdf flag for PDF input, volume mounts for import dir,
and post-download copy from htrmopo cache to the models volume.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torchvision installed from PyPI expects CUDA torch operator
registrations. Installing from the CPU whl index ensures torchvision
matches the CPU-only torch build. Fixes 'torchvision::nms does not
exist' RuntimeError on startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing
pad_token_id. Pin to transformers>=4.56.1,<5.0.0.
Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading
it past the CPU-only build installed in the Dockerfile layer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runbook script to download both HTR-United Kurrent model candidates
(german_kurrent_manu_9, kurrent-de) into the ocr_models Docker volume,
test them against sample documents, and activate the winner.
Usage:
./scripts/download-kraken-models.sh # download both
./scripts/download-kraken-models.sh --activate 1 # pick model 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
[unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
Kraken's character-level data
Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.
Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kraken 5.2.9 required torch~=2.1.0, incompatible with surya-ocr's
torch>=2.3.0. kraken 6.0.3 requires torch>=2.4.0,<=2.9 which
overlaps with surya and our pinned torch==2.5.1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
surya-ocr 0.6.3 requires pillow<11.0.0,>=10.2.0. The previous
pin at 11.1.0 caused a dependency resolution failure during
Docker build.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON)
to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar.
The PolygonConverter is retained as a standalone utility but no longer
used on the entity. Hibernate 6 natively handles List<List<Double>>
serialization to JSONB.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AnnotationShape.svelte: renders a single annotation as either a
rectangle or a polygon-clipped div (via CSS clip-path: polygon())
- AnnotationLayer.svelte: refactored to delegate rendering to
AnnotationShape, keeping draw logic and hover state management
- Annotation type: added optional polygon field ([number, number][] | null)
- Polygon coordinates are converted from page-normalized to
bounding-box-relative percentages for clip-path
All 687 existing frontend tests pass.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI
Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback
Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health
Refs #226, #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>