Commit Graph

786 Commits

Author SHA1 Message Date
Marcel
fdf1eb92ad feat(training): add document-level training enrollment
- V29 migration: document_training_labels join table
- TrainingLabel enum: KURRENT_RECOGNITION, KURRENT_SEGMENTATION
- Document.trainingLabels @ElementCollection
- DocumentService.addTrainingLabel / removeTrainingLabel
- PATCH /api/documents/{id}/training-labels (WRITE_ALL)
- Auto-enroll on Kurrent OCR trigger (OcrService.startOcr)
- TranscriptionEditView: enrollment chips in panel footer
- JPQL queries updated to use MEMBER OF trainingLabels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:30:51 +02:00
Marcel
73229077be feat(transcription): add sticky review progress counter to TranscriptionEditView
Shows 'X / Y geprüft' with a brand-mint progress bar at the top of the
transcription panel. Derived from the blocks prop — no extra state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 13:59:35 +02:00
Marcel
33dc4654e5 fix(ocr): use correct Kraken record attributes for line geometry
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 13:16:25 +02:00
Marcel
70689b8f7b feat(ocr): add SSRF protection for PDF URL downloads
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist
(default: minio,localhost,127.0.0.1) and disables redirect following
to prevent redirect-based SSRF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:42 +02:00
Marcel
0beaf351f0 fix(docker): soften ocr-service dependency and clean up compose
Changed ocr-service dependency from service_healthy to service_started
since the backend already handles OCR unavailability gracefully. Removed
unused APP_S3_INTERNAL_URL env var. Added expose directive and
.dockerignore for ocr-service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:21 +02:00
Marcel
b7fd4018c2 fix(frontend): normalize paraglide imports and improve accessibility
Changed OcrTrigger and ScriptTypeSelect from 'import * as m' to
'import { m }' to match the rest of the codebase. Increased
ScriptTypeSelect label to text-sm and annotation badge font to 12px
for better readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:00 +02:00
Marcel
8c07779a91 fix(ocr): fix SSE retry to actually reconnect EventSource
The retry button set status='running' but didn't re-trigger the $effect
because jobId hadn't changed. Added retryCount state so the effect
re-runs and creates a fresh EventSource on retry. Also added aria-label
to the progress bar for accessibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:28:40 +02:00
Marcel
dd47a48d90 feat(ocr): add unique constraint on (job_id, document_id)
Prevents the same document from being added to an OCR job twice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:28:18 +02:00
Marcel
08b1cd5dac fix(ocr): reduce async queue capacity from 100 to 10
Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:58 +02:00
Marcel
5a97316940 fix(ocr): log warning when user ID resolution fails
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:39 +02:00
Marcel
9282e46a02 fix(ocr): handle unknown NDJSON fields with @JsonIgnoreProperties
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:20 +02:00
Marcel
caae2ead81 refactor(ocr): route block lifecycle through TranscriptionService
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:01 +02:00
Marcel
6a0fd25662 fix(ocr): persist scriptType override via DocumentService transaction
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:26:37 +02:00
Marcel
2d43f09172 refactor(ocr): move repository access from OcrController into OcrService
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:26:14 +02:00
Marcel
410ef88e1a refactor(ocr): delete unused OcrProgressBar component
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
The skipped-pages warning is inlined directly in +page.svelte.
The component and its tests are no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:53:10 +02:00
Marcel
6b94882409 fix(ocr): remove redundant page counter from progress display
The progress message already says "Seite 3 von 7 wird analysiert…"
so the separate "3 / 7" counter was redundant. Remove the
OcrProgressBar from the page and inline only the skipped-pages
warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:50:05 +02:00
Marcel
b868da07cd fix(ocr): remove progress bar, keep text-only page counter
The thin bar without a border looked broken at low progress values.
The text counter (e.g. "1 / 6") already communicates progress clearly
so the bar is unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:46:29 +02:00
Marcel
84aca240ea fix(ocr): remove misleading ANALYZING progress before streaming starts
The ANALYZING message appeared while the Python service was still
downloading the PDF and loading models. Remove it so the LOADING
message ("Lade Modell und Dokument…") stays visible until the first
ANALYZING_PAGE event arrives from the stream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:40:54 +02:00
Marcel
3fe6eedffb feat(ocr): allow re-running OCR when transcription blocks already exist
Add a collapsible OCR trigger below the block list in edit mode.
Uses a <details> element so it's unobtrusive — the primary workflow
is editing existing blocks, but users can expand to re-run OCR with
a confirmation dialog that warns about replacing existing blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:37:51 +02:00
Marcel
69768a104d test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:34:23 +02:00
Marcel
97e5138934 fix(ocr): use 1-based page numbers to match frontend PDF viewer
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:32:08 +02:00
Marcel
bac67706b9 feat(ocr): integrate progress bar and streaming progress into document page
Replace inline translateOcrProgress with the extracted module. Add
OcrProgressBar below the spinner during OCR. Parse page numbers from
ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill
bar to 100% briefly before clearing the overlay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:15:55 +02:00
Marcel
035f9768bd feat(ocr): add OcrProgressBar component with page-based ARIA semantics
Progress bar shows brand-mint fill on brand-sand background with
smooth transition. Displays page counter with tabular-nums and
skipped-pages warning in amber when applicable. Only renders when
totalPages > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:13:57 +02:00
Marcel
ddec64fc79 feat(ocr): extract translateOcrProgress with ANALYZING_PAGE and DONE:skipped support
Move translateOcrProgress from page.svelte to a testable module.
Return structured result with currentPage/totalPages/skippedPages
for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages
parsing. Add i18n keys for de/en/es.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:09:29 +02:00
Marcel
292dc66f3c feat(ocr): rewrite runSingleDocument to use streamBlocks with per-page progress
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:07:06 +02:00
Marcel
6823973429 refactor(ocr): extract createSingleBlock from createTranscriptionBlocks
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:04:02 +02:00
Marcel
93c3154b3c feat(ocr): implement NDJSON streaming in RestClientOcrClient
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:03:12 +02:00
Marcel
641e91d5a3 feat(ocr): add default streamBlocks method to OcrClient interface
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:01:26 +02:00
Marcel
e21d01e10b feat(ocr): add OcrStreamEvent sealed interface with Start/Page/Error/Done records
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:00:02 +02:00
Marcel
97c6cf6a65 feat(ocr): add NDJSON streaming endpoint POST /ocr/stream
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:57:57 +02:00
Marcel
b7d5f71ef7 refactor(ocr): extract extract_page_blocks() from both OCR engines
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:56:34 +02:00
Marcel
d8dcba1a71 fix(ocr): unblock event loop during OCR and show errors in UI
OCR engines are CPU-bound and were blocking Uvicorn's single async
event loop, making /health unresponsive during processing. This caused
new OCR requests to fail silently (health check failure → no DB record
→ UI shows NONE). Wrap engine calls in asyncio.to_thread() to keep the
event loop free. Also surface OCR trigger errors in the frontend
instead of silently resetting the spinner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:50:39 +02:00
Marcel
ef11e4af09 fix(ocr): disable manual annotation drawing while OCR is running
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Prevents users from drawing annotations that would be cleared when
the OCR job finishes. transcribeMode is set to false for the PDF
viewer while ocrRunning is true.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:32:55 +02:00
Marcel
971527a50e feat(ocr): show translated progress messages during OCR processing
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.

- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:31:23 +02:00
Marcel
0b0d4a7d5e perf(ocr): double batch sizes (detector=8, recognition=16)
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
4GB headroom in the container. Doubling batches should use ~2GB
more RAM but significantly speed up inference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:23:13 +02:00
Marcel
1b7540143e fix(ocr): persist model cache across container restarts
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Surya downloads models from HuggingFace to /root/.cache on first use.
Without a volume, every container restart re-downloads ~73MB+.
Added ocr_cache volume to persist the cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:21:51 +02:00
Marcel
2cc7dcd5e3 perf(ocr): increase batch sizes (detector=4, recognition=8)
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
5GB free on host during OCR, container at 3.8/8GB. Larger batches
use more memory but process faster on CPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:19:22 +02:00
Marcel
c1befd3fa3 fix(ocr): resume polling on page reload + track single-doc job status
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).

Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:16:59 +02:00
Marcel
2db1b73d5d fix(ocr): force HTTP/1.1 on RestClient to OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:08:11 +02:00
Marcel
838330b405 fix(ocr): use camelCase field names in Pydantic models
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Pydantic v2 Field(alias=...) doesn't work with FastAPI as expected.
The Java client sends camelCase (pdfUrl, scriptType, pageNumber).
Use camelCase field names directly instead of aliases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:04:42 +02:00
Marcel
9e01009e3d fix(async): revert to AbortPolicy — CallerRunsPolicy blocks requests
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:02:58 +02:00
Marcel
0bfaa7540b fix(async): queue 100 tasks + CallerRunsPolicy instead of abort
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 23:01:37 +02:00
Marcel
b6d928e1c5 fix(async): increase thread pool to 2 threads + queue of 10
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:59:31 +02:00
Marcel
aa50951320 fix(ocr): set 10-minute read timeout on RestClientOcrClient
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:58:00 +02:00
Marcel
dd175d09e2 refactor(ocr): make single-document OCR async, fix circular dependency
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.

Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies

Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.

816 backend tests pass, 687 frontend tests pass.

Refs #226

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:55:52 +02:00
Marcel
741979304c fix(ocr): increase to 8g mem_limit and larger batch sizes
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 0s
5GB free on host while OCR runs — give the container more room.
Bump batch sizes (detector=2, recognition=4) so it processes
faster with the available memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:35:34 +02:00
Marcel
e9cf2998fe fix(ocr): reduce mem_limit to 4g, allow 4g swap for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
mem_limit 4g keeps more RAM free for the host. memswap_limit 8g
(= 4g swap) lets peaks spill to disk instead of OOM-killing.
Slower during peak inference but won't starve the dev machine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:33:05 +02:00
Marcel
902d423f3c fix(ocr): reduce memory usage for 16GB dev machines
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
- Surya models lazy-load on first OCR request instead of at startup
  (saves ~3-4GB idle RAM — Kraken stays eager at ~16MB)
- Process one page at a time in Surya engine (limits peak memory)
- RECOGNITION_BATCH_SIZE=1, DETECTOR_BATCH_SIZE=1 (slower but fits in RAM)
- Revert mem_limit back to 6GB (sufficient with these optimizations)
- Render DPI stays at 200

Idle memory: ~2GB (Kraken only). Peak during OCR: ~5-6GB (Surya loaded).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:26:50 +02:00
Marcel
7f78bc9cf4 fix(ocr): increase memory limit to 10GB, reduce render DPI to 200
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 0s
CI / Backend Unit Tests (pull_request) Failing after 1s
Surya 0.17 models use ~5GB idle. At 300 DPI on a multi-page PDF,
page images + inference tensors push past the 6GB limit, causing
OOM kills during 'Detecting bboxes'. Increased to 10GB and reduced
render DPI to 200 (still sufficient for OCR, uses ~44% less memory).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:20:36 +02:00
Marcel
4500c99e40 fix(ocr): use presigned URLs for MinIO access from OCR service
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.

- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:16:52 +02:00