Commit Graph

755 Commits

Author SHA1 Message Date
Marcel
d8dcba1a71 fix(ocr): unblock event loop during OCR and show errors in UI
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:50:39 +02:00
Marcel
ef11e4af09 fix(ocr): disable manual annotation drawing while OCR is running
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Prevents users from drawing annotations that would be cleared when
the OCR job finishes. transcribeMode is set to false for the PDF
viewer while ocrRunning is true.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:32:55 +02:00
Marcel
971527a50e feat(ocr): show translated progress messages during OCR processing
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.

- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:31:23 +02:00
Marcel
0b0d4a7d5e perf(ocr): double batch sizes (detector=8, recognition=16)
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
4GB headroom in the container. Doubling batches should use ~2GB
more RAM but significantly speed up inference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:23:13 +02:00
Marcel
1b7540143e fix(ocr): persist model cache across container restarts
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Surya downloads models from HuggingFace to /root/.cache on first use.
Without a volume, every container restart re-downloads ~73MB+.
Added ocr_cache volume to persist the cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:21:51 +02:00
Marcel
2cc7dcd5e3 perf(ocr): increase batch sizes (detector=4, recognition=8)
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
5GB free on host during OCR, container at 3.8/8GB. Larger batches
use more memory but process faster on CPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:19:22 +02:00
Marcel
c1befd3fa3 fix(ocr): resume polling on page reload + track single-doc job status
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).

Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:16:59 +02:00
Marcel
2db1b73d5d fix(ocr): force HTTP/1.1 on RestClient to OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:08:11 +02:00
Marcel
838330b405 fix(ocr): use camelCase field names in Pydantic models
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:04:42 +02:00
Marcel
9e01009e3d fix(async): revert to AbortPolicy — CallerRunsPolicy blocks requests
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:02:58 +02:00
Marcel
0bfaa7540b fix(async): queue 100 tasks + CallerRunsPolicy instead of abort
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:01:37 +02:00
Marcel
b6d928e1c5 fix(async): increase thread pool to 2 threads + queue of 10
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:59:31 +02:00
Marcel
aa50951320 fix(ocr): set 10-minute read timeout on RestClientOcrClient
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:58:00 +02:00
Marcel
dd175d09e2 refactor(ocr): make single-document OCR async, fix circular dependency
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.

Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies

Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.

816 backend tests pass, 687 frontend tests pass.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:55:52 +02:00
Marcel
741979304c fix(ocr): increase to 8g mem_limit and larger batch sizes
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 0s
5GB free on host while OCR runs — give the container more room.
Bump batch sizes (detector=2, recognition=4) so it processes
faster with the available memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:35:34 +02:00
Marcel
e9cf2998fe fix(ocr): reduce mem_limit to 4g, allow 4g swap for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
mem_limit 4g keeps more RAM free for the host. memswap_limit 8g
(= 4g swap) lets peaks spill to disk instead of OOM-killing.
Slower during peak inference but won't starve the dev machine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:33:05 +02:00
Marcel
902d423f3c fix(ocr): reduce memory usage for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200

Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:26:50 +02:00
Marcel
7f78bc9cf4 fix(ocr): increase memory limit to 10GB, reduce render DPI to 200
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR, uses ~44% less memory).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:20:36 +02:00
Marcel
4500c99e40 fix(ocr): use presigned URLs for MinIO access from OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.

- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:16:52 +02:00
Marcel
7a4da7cb98 fix(pdf): guard against null textLayerEl in renderPage
Some checks failed
CI / Unit & Component Tests (push) Failing after 0s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Prevents 'can't access property innerHTML, textDiv is null' when
the component unmounts while a render is in flight (e.g. switching
to OCR progress view tears down the panel content).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:10:33 +02:00
Marcel
f6667e0e15 feat(frontend): show OcrProgress during OCR job + check status on load
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- triggerOcr captures jobId from POST response and shows OcrProgress
- OcrProgress rendered in the transcription panel when ocrJobId is set
- handleOcrDone reloads blocks and annotations when OCR completes
- checkOcrStatus called when entering transcription mode — resumes
  progress display if a job is already running for this document

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:09:24 +02:00
Marcel
8dc9243add feat(frontend): wire OCR trigger + review toggle into transcription panel
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- OcrTrigger component rendered in the transcription empty state when
  the document has a file and user has write permission
- Review checkmark toggle on each TranscriptionBlock (turquoise when
  reviewed, muted outline when not). Calls PUT .../review to toggle.
- TranscriptionBlockData type: added source + reviewed fields
- +page.svelte: triggerOcr() and reviewToggle() functions wired up
- Paraglide translations (de/en/es) for review toggle + reviewed count

All 687 frontend tests pass.

Refs #226, #230

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:02:56 +02:00
Marcel
3aaec01421 feat(transcription): add source/reviewed fields for training pipeline
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
  OcrService source=OCR verification

The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 21:44:51 +02:00
Marcel
f064b27439 feat(ocr): per-script-type confidence thresholds
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kurrent OCR produces much lower confidence than typewriter/Latin.
Separate thresholds allow aggressive filtering for Kurrent (0.5)
while keeping typewriter lenient (0.3).

- OCR_CONFIDENCE_THRESHOLD: default for Surya paths (0.3)
- OCR_CONFIDENCE_THRESHOLD_KURRENT: Kraken Kurrent path (0.5)
- apply_confidence_markers() now accepts threshold parameter
- get_threshold(script_type) selects the right threshold

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:50:59 +02:00
Marcel
dd078d50da fix(ocr): extract PDF pages as PNGs before running kraken OCR
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kraken's -f pdf mode tries to write output next to the input file,
which fails on read-only mounts. Instead, extract pages as PNGs via
pypdfium2 (already installed), then run kraken on each image.
Both models run in a single container per PDF to avoid overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:37:29 +02:00
Marcel
31519af1a4 fix(ocr): add pyvips for kraken PDF input support
Some checks failed
CI / Unit & Component Tests (push) Failing after 0s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kraken 7 requires pyvips (optional dep) for -f pdf mode.
Added libvips42 system package and pyvips Python package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:11:14 +02:00
Marcel
c0004f5e6f fix(ocr): parse kraken 'Model dir' output to locate downloaded model
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
The previous approach used find across the htrmopo cache which failed
because -newer /tmp ran in a separate container. Now parses the
'Model dir: <path>' line from kraken get output directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:09:23 +02:00
Marcel
f12b41161e fix(ocr): update model script for kraken 7 DOI-based downloads
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Kraken 7 uses DOIs (not short names) to identify models from Zenodo.
Updated to use actual DOIs:
- 10.5281/zenodo.7933463 — German handwriting HTR
- 10.5281/zenodo.13788177 — McCATMuS generic handwritten/printed/typed

Added -f pdf flag for PDF input, volume mounts for import dir,
and post-download copy from htrmopo cache to the models volume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:05:29 +02:00
Marcel
37abc376ec fix(ocr): install torchvision from CPU index alongside torch
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
torchvision installed from PyPI expects CUDA torch operator
registrations. Installing from the CPU whl index ensures torchvision
matches the CPU-only torch build. Fixes 'torchvision::nms does not
exist' RuntimeError on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:46:37 +02:00
Marcel
0af4749677 feat(ocr): extend model script with automatic OCR evaluation
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 2s
Downloads both Kraken models, then runs each against 4 sample PDFs
from the import folder (Eu-0693, Eu-0692, W-0150, W-0575). Output
goes to ocr-model-evaluation/<model-name>/<doc>.txt for side-by-side
comparison.

Usage:
  ./scripts/download-kraken-models.sh           # download + evaluate
  ./scripts/download-kraken-models.sh --eval-only  # re-run evaluation
  ./scripts/download-kraken-models.sh --activate 1  # pick winner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:41:59 +02:00
Marcel
6669fffead fix(ocr): pin transformers<5.0 and torch==2.7.1 in requirements.txt
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
transformers 5.x breaks surya 0.17.1 — SuryaDecoderConfig is missing
pad_token_id. Pin to transformers>=4.56.1,<5.0.0.

Also add torch==2.7.1 to requirements.txt to prevent pip from upgrading
it past the CPU-only build installed in the Dockerfile layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:34:03 +02:00
Marcel
41f9262238 feat(ocr): add Kraken model download and evaluation script
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 2s
Runbook script to download both HTR-United Kurrent model candidates
(german_kurrent_manu_9, kurrent-de) into the ocr_models Docker volume,
test them against sample documents, and activate the winner.

Usage:
  ./scripts/download-kraken-models.sh              # download both
  ./scripts/download-kraken-models.sh --activate 1  # pick model 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:19:39 +02:00
Marcel
c74539b04b feat(ocr): auto-insert [unleserlich] markers for low-confidence words
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
New confidence.py module with two functions:
- apply_confidence_markers(): replaces words below threshold with
  [unleserlich], collapses adjacent markers into one
- words_from_characters(): reconstructs word-level confidence from
  Kraken's character-level data

Surya 0.17 provides native word-level confidence via line.words.
Kraken 7.0 provides per-character confidences via record.confidences.
Both engines now pass word+confidence data through main.py, which
applies the marker post-processing before returning the API response.

Threshold configurable via OCR_CONFIDENCE_THRESHOLD env var (default 0.3).
Frontend already renders [unleserlich] markers via transcriptionMarkers.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:16:17 +02:00
Marcel
49975154d9 feat(ocr): bump to latest surya 0.17.1, kraken 7.0, torch 2.7.1
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- surya-ocr 0.6.3 → 0.17.1: new predictor API (FoundationPredictor,
  RecognitionPredictor, DetectionPredictor), native polygon output
  on text lines (4-point clockwise)
- kraken 5.2.9 → 7.0: wider torch range (>=2.4,<=2.10), unpinned numpy
- torch 2.5.1 → 2.7.1: satisfies surya's >=2.7.0 requirement
- Rewrite engines/surya.py for the 0.17 predictor class API
- Surya now outputs polygons natively — no longer rectangle-only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:53:14 +02:00
Marcel
e29c865016 fix(ocr): upgrade kraken to 6.0.3 for torch>=2.4 compatibility
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 3s
kraken 5.2.9 required torch~=2.1.0, incompatible with surya-ocr's
torch>=2.3.0. kraken 6.0.3 requires torch>=2.4.0,<=2.9 which
overlaps with surya and our pinned torch==2.5.1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:48:14 +02:00
Marcel
d49010cd7b fix(ocr): relax pillow version to match surya-ocr constraint
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
surya-ocr 0.6.3 requires pillow<11.0.0,>=10.2.0. The previous
pin at 11.1.0 caused a dependency resolution failure during
Docker build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:40:46 +02:00
Marcel
931fbc28e5 fix(annotations): use @JdbcTypeCode(JSON) for polygon JSONB column
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON)
to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar.

The PolygonConverter is retained as a standalone utility but no longer
used on the entity. Hibernate 6 natively handles List<List<Double>>
serialization to JSONB.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:39:54 +02:00
Marcel
a4651aa317 feat(frontend): add OCR UI components and translations
- ScriptTypeSelect: native select for TYPEWRITER/HANDWRITING_LATIN/KURRENT
- OcrTrigger: wraps script type select + start button + confirmation dialog
- OcrProgress: SSE-based progress display with page counter and progress bar
- Paraglide translations for OCR (de/en/es): script types, trigger labels,
  confirmation dialog, progress messages, error messages
- ErrorCode type + getErrorMessage: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND,
  OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED

All 687 frontend tests pass.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:36:00 +02:00
Marcel
cf8dc3559f feat(frontend): extract AnnotationShape component with polygon support
- AnnotationShape.svelte: renders a single annotation as either a
  rectangle or a polygon-clipped div (via CSS clip-path: polygon())
- AnnotationLayer.svelte: refactored to delegate rendering to
  AnnotationShape, keeping draw logic and hover state management
- Annotation type: added optional polygon field ([number, number][] | null)
- Polygon coordinates are converted from page-normalized to
  bounding-box-relative percentages for clip-path

All 687 existing frontend tests pass.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:30:27 +02:00
Marcel
6737bd6db5 feat(ocr): add Python OCR microservice, RestClientOcrClient, Docker Compose
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
  pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI

Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback

Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:26:40 +02:00
Marcel
aea46c5fd0 feat(ocr): add OcrService, OcrBatchService, OcrProgressService, OcrController
- OcrService: single-document OCR (health check, block clearing,
  presigned URL, annotation + block creation)
- OcrBatchService: batch processing with @Async, per-document status
  tracking, SKIPPED for PLACEHOLDER documents, failure isolation
- OcrProgressService: SSE emitter registry per job ID with 5-min timeout
- OcrController: POST /api/documents/{id}/ocr (WRITE_ALL),
  POST /api/ocr/batch (ADMIN), GET /api/ocr/jobs/{id} (READ_ALL),
  GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status

19 tests: 6 OcrService, 4 OcrBatchService, 3 OcrProgressService, 6 OcrController

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:24:15 +02:00
Marcel
ff3990710e feat(ocr): add OCR infrastructure (interfaces, entities, migrations, DTOs)
- OcrClient + OcrHealthClient interfaces for testable OCR integration
- OcrBlockResult record for OCR engine response mapping
- OcrJob + OcrJobDocument entities with status enums
- V25 migration creates ocr_jobs and ocr_job_documents tables
- Repositories for job and job-document queries
- TriggerOcrDTO, BatchOcrDTO (@Size max=500), OcrStatusDTO
- ErrorCodes: OCR_SERVICE_UNAVAILABLE, OCR_JOB_NOT_FOUND,
  OCR_DOCUMENT_NOT_UPLOADED, OCR_PROCESSING_FAILED

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:15:16 +02:00
Marcel
d194b6b225 feat(documents): add ScriptType enum and script_type column
- ScriptType enum: UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT
- V24 migration adds script_type VARCHAR(30) NOT NULL DEFAULT 'UNKNOWN'
- Document entity: scriptType field with @Builder.Default UNKNOWN
- DocumentUpdateDTO: optional scriptType field
- DocumentService: wires scriptType through update method

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:13:42 +02:00
Marcel
c19c41f812 feat(annotations): add createOcrAnnotation that skips overlap check
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:12:11 +02:00
Marcel
878a90a86d feat(annotations): add polygon JSONB support for quadrilateral shapes
- V23 migration adds polygon JSONB column with 4-point CHECK constraint
- PolygonConverter: AttributeConverter for List<List<Double>> <-> JSONB
- @UniquePoints custom validator rejects duplicate coordinates
- CreateAnnotationDTO: validated optional polygon field
- DocumentAnnotation entity: polygon field with converter

Refs #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:10:35 +02:00
Marcel
ec32d225b5 docs(adr): add ADR-001 (OCR microservice) and ADR-002 (polygon JSONB)
ADR-001 documents the decision to use a separate Python container for
OCR (Surya + Kraken), the interface contract, and why alternatives
like Tess4J were rejected.

ADR-002 documents the decision to store polygon annotations as JSONB
with a 4-point CHECK constraint, backed by an AttributeConverter.

Refs #226, #227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:07:46 +02:00
Marcel
11a35f2952 fix(tests): resolve all 4 pre-existing test failures
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
- CommentThread: add missing empty-state paragraph using comment_empty_hint
  i18n key (key existed but was never rendered in the template)
- TranscriptionBlock: add selectedQuote hint using transcription_block_quote_hint
  i18n key (key existed but was never rendered); fix test to use native DOM
  el.focus()/setSelectionRange()/dispatchEvent instead of locator.selectText()
  which is not available in this vitest-browser version
- TranscriptionEditView: fix test to use native el.dispatchEvent(FocusEvent)
  instead of locator.blur() which is not available
- Conversations: fix test expected text from stale "Korrespondenz durchsuchen"
  to match current conv_empty_heading() = "Wessen Briefe möchten Sie lesen?"

All 687 tests now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:55:34 +02:00
Marcel
d046c89631 test(confirm): add ConfirmDialog component spec (12 tests)
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 3s
CI / Backend Unit Tests (pull_request) Failing after 1s
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
Covers: title/body rendering, destructive vs primary button class,
custom labels, settle true/cancel, aria-labelledby, and hide-after-settle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:38:58 +02:00
Marcel
a2d078b8f9 refactor(persons): replace non-null assertion with null guard on removeFormEl
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:36:47 +02:00
Marcel
0b95c90e7a refactor(confirm): use import { m } instead of import * as m in ConfirmDialog
Consistent with every other component in the project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:35:42 +02:00