Summary
`POST /ocr/stream` — Python sends one JSON line per completed page, Java consumes incrementally.

Closes #226, closes #227, closes #231
Changes by layer
Python OCR Service
- `extract_page_blocks()` — per-page processing for both engines
- `POST /ocr/stream` — NDJSON endpoint with `start`/`page`/`error`/`done` event protocol
- `[unleserlich]` marker insertion with per-script thresholds
- `X-Accel-Buffering: no` + `Cache-Control: no-cache` for reverse-proxy compatibility

Java Backend

- `OcrStreamEvent` sealed interface with `Start`/`Page`/`Error`/`Done` record subtypes
- `OcrClient.streamBlocks()` with default fallback that synthesizes events from `extractBlocks()`
- `RestClientOcrClient.streamBlocks()` parses NDJSON with a dedicated `ObjectMapper`, falls back to `/ocr` on 404
- `OcrAsyncRunner.runSingleDocument()` uses streaming with per-page block persistence
- Batch path (`runBatch`) unchanged — stays on the old `extractBlocks()`

Frontend

- `translateOcrProgress` extracted to a testable `$lib/ocr/` module
- `ANALYZING_PAGE:current:total:blocks` and `DONE:count:skipped` progress codes

Test plan
- `cd backend && ./mvnw test` — 831 tests pass
- `cd frontend && npm run test` — 698 tests pass
- `cd ocr-service && python -m pytest` — 38 tests pass

- OcrService: single-document OCR (health check, block clearing, presigned URL, annotation + block creation)
- OcrBatchService: batch processing with @Async, per-document status tracking, SKIPPED for PLACEHOLDER documents, failure isolation
- OcrProgressService: SSE emitter registry per job ID with 5-min timeout
- OcrController: POST /api/documents/{id}/ocr (WRITE_ALL), POST /api/ocr/batch (ADMIN), GET /api/ocr/jobs/{id} (READ_ALL), GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status

19 tests: 6 OcrService, 4 OcrBatchService, 3 OcrProgressService, 6 OcrController

Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Clean commit history, good TDD evidence (36 new backend tests), and the component extraction of
`AnnotationShape.svelte` is exactly right. Several items I'd want addressed before merge.

Blockers
- `OcrService.startOcr()` is `@Transactional` but calls an external HTTP service inside the transaction (OcrService.java:39-76). The `ocrClient.extractBlocks()` call at line 82 (via `processDocument`) makes an HTTP request to the Python service. If that call takes 30–60 seconds per page on CPU, the database transaction is held open for the entire duration. This will exhaust the connection pool under concurrent use. Fix: split into two methods — one `@Transactional` for the DB writes (clear blocks, create blocks), and a non-transactional orchestrator that calls the OCR service outside the transaction boundary.
- `OcrService.processDocument()` is package-private but called from `OcrBatchService` (OcrService.java:78). Both classes are in the same package so this compiles, but it's an implicit coupling. If anyone moves `OcrBatchService` to a different package, it breaks silently. Make `processDocument` explicitly `public` or extract it to an interface.
- `OcrService` bypasses `TranscriptionService.createBlock()` and writes directly to `TranscriptionBlockRepository` (OcrService.java:113). The existing `TranscriptionService.createBlock()` handles `sanitizeText()`, `saveVersion()`, and audit fields. The OCR path skips all of this — no version history entry, no text sanitization. Use `TranscriptionService` or extract the shared logic.

Suggestions
- `RestClientOcrClient` has no timeout configured (RestClientOcrClient.java:21). The `RestClient` uses default timeouts. For an OCR call that can take minutes on CPU, add explicit connect/read timeouts: `.requestFactory(new JdkClientHttpRequestFactory(HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build()))` and set a read timeout proportional to expected processing time.
- `PolygonConverter` is now unused on the entity but still in the codebase (PolygonConverter.java). The fix commit switched to `@JdbcTypeCode(SqlTypes.JSON)`. Either delete the converter or add a comment explaining it's retained as a standalone utility — currently it looks like dead code.
- `OcrBatchService.processBatchAsync()` does one `ocrJobRepository.save(job)` per document in the loop (OcrBatchService.java:95-96). For a 500-document batch, that's 1000+ individual saves. Consider batching the job state update less frequently (e.g. every 10 documents) or at least not saving the parent job on every iteration.
- No `@Valid` on the `TriggerOcrDTO` in the controller (OcrController.java:49). The `BatchOcrDTO` correctly has `@Valid`, but `TriggerOcrDTO` does not. If validation annotations are added to `TriggerOcrDTO` later, they won't fire without `@Valid`.
- `AnnotationShape.svelte:53` opacity logic — the `faded` prop controls the dimming of non-active annotations. The naming is good, but consider documenting the prop with a JSDoc comment, since the semantics of `faded` vs `dimmed` are subtle (dimmed = all annotations muted in non-transcription mode; faded = another annotation is active).

🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
The architecture is sound. ADRs are present, the Python microservice is justified, the job state is in PostgreSQL (not in-memory), and the interface boundaries (
`OcrClient`, `OcrHealthClient`) are clean. Two structural concerns.

Blockers
- `OcrController` injects two repositories directly (OcrController.java:39-40). `OcrJobRepository` and `OcrJobDocumentRepository` are accessed directly from the controller for `getJobStatus()` and `getDocumentOcrStatus()`. This violates the layering rule: controllers delegate to services, services own repositories. Create read methods on `OcrService` or a dedicated `OcrJobService` to wrap these queries. The controller should never touch a repository.
- `@Transactional` on `startOcr()` wraps an external HTTP call — same as Felix flagged. From the architecture perspective: the transaction boundary must match the data boundary. The OCR HTTP call is not part of the data boundary. The method should be split: `@Transactional` for clearing blocks + creating blocks, non-transactional for the HTTP call.

Suggestions
- `OcrBatchService` has a direct dependency on `OcrService` rather than on an interface. For now this is fine (both are in the same package, single deployment), but if the batch path ever needs to be swapped (e.g. for a queue-based implementation), extracting an interface for document-level OCR processing would make that possible without touching the batch service.
- The `processDocument` method on `OcrService` takes a `Document` entity object as a parameter (OcrService.java:78). This leaks the persistence model across the service boundary. Consider taking just the fields needed (documentId, filePath, fileHash, scriptType) or a value object. Not a blocker — the two classes are in the same package — but worth noting for when package-by-feature refactoring happens.
- The `ocr_job_documents` table has no unique constraint on `(job_id, document_id)` (V25__add_ocr_job_tables.sql). The same document could be added to the same batch job twice. Consider adding `UNIQUE(job_id, document_id)`.
- The SSE endpoint returns `SseEmitter` directly from the controller (OcrController.java:76). This works but ties the controller to the Servlet API. If SvelteKit proxies this via a `+server.ts` endpoint (as the discussion resolved), the proxy must correctly pass through the `text/event-stream` content type and disable response buffering. Worth a comment in the controller or ADR noting this constraint.

🧪 Sara Holt — QA Engineer
Verdict: ⚠️ Approved with concerns
36 new backend tests with good coverage of the core OCR flow. 687 frontend tests still passing. The TDD approach is visible in the commit history. Some gaps in test coverage.
Blockers
- No integration test for the polygon round-trip — the `PolygonConverterTest` covers serialization in isolation, and `AnnotationServiceTest` mocks the repository. But there is no `@DataJpaTest` that writes an annotation with a polygon to a real PostgreSQL (Testcontainers) and reads it back. The JSONB type mismatch bug that was caught and fixed (`@JdbcTypeCode` commit) would have been caught earlier by such a test. This is a regression risk.
- `OcrBatchService.processBatchAsync()` is `@Async` but has no integration test verifying async behavior. The unit test calls `processBatchAsync` directly (synchronously). There should be a `@SpringBootTest` integration test that verifies `startBatch` returns immediately and the job eventually transitions to `DONE`, using Awaitility. Without this, there's no test confirming the `@Async` annotation is wired correctly.

Suggestions
- `OcrServiceTest` doesn't verify text sanitization on OCR-produced blocks — Felix flagged that `OcrService` bypasses `TranscriptionService.createBlock()`. Until that's fixed, there should be a test asserting that OCR text exceeding 10K characters is handled gracefully (currently it's not sanitized at all).
- `OcrControllerTest` doesn't test the SSE progress endpoint (`/api/ocr/jobs/{jobId}/progress`). Testing SSE with MockMvc requires `MockMvcBuilders.standaloneSetup()` with an async dispatcher, or `WebTestClient`. I'd add at least a test confirming the endpoint returns the `text/event-stream` content type.
- `UniquePointsValidatorTest` doesn't test the edge case of exactly 4 identical points (all four points the same coordinate). The current validator checks `new HashSet<>(polygon).size() == polygon.size()`, which would correctly reject this, but an explicit test documents the behavior.
- No test for `OcrController.getDocumentOcrStatus()` when a RUNNING job exists — only the `NONE` case is tested. Add a test where `ocrJobDocumentRepository.findFirstByDocumentIdAndStatusIn()` returns a result and verify the DTO is populated correctly.
- The `OcrBatchServiceTest.processBatchAsync_continuesAfterSingleDocumentFailure` test is excellent — it verifies failure isolation, which is the most important batch behavior. Good work.
- Frontend tests: no component test for `OcrTrigger` or `OcrProgress` — these are new UI components with non-trivial logic (confirmation dialog flow, EventSource lifecycle). Consider adding Vitest component tests that verify: (1) OcrTrigger shows confirmation when `existingBlockCount > 0`, (2) OcrTrigger calls `onTrigger` with the selected script type, (3) OcrProgress renders the progress bar with correct aria attributes.

🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
Good security posture overall:
`@RequirePermission` on all OCR endpoints, `@Size(max=500)` on the batch DTO, OCR service on internal network only. One blocker and several items to address.

Blockers
- SSRF via the `pdfUrl` parameter in the Python OCR service (main.py:59, 77-81). The OCR service downloads a PDF from whatever URL is provided in the `pdfUrl` field. Spring Boot constructs this URL from the S3 internal path (OcrService.java:117-118), but the Python service has no validation that the URL points to the expected MinIO host. A compromised or misconfigured Spring Boot instance could send `http://169.254.169.254/latest/meta-data/` (cloud metadata) or any internal service URL. Fix: validate the URL scheme and host in the Python service against an allowlist (e.g. `ALLOWED_PDF_HOSTS=minio:9000`). This is CWE-918 (Server-Side Request Forgery).

Suggestions
- `OcrController.resolveUserId()` silently returns `null` on failure (OcrController.java:105-112). If the user lookup fails, OCR proceeds with `userId=null`. This means the `created_by` field on the job and all created annotations/blocks will be `null` — losing the audit trail. Consider throwing `DomainException.unauthorized()` instead of silently degrading. The same pattern exists on `AnnotationController` — a pre-existing issue, but worth fixing here since this is security-sensitive (who triggered the destructive block replacement?).
- No rate limiting on `POST /api/documents/{id}/ocr` — each call triggers a CPU-intensive operation on the Python service. A user with `WRITE_ALL` permission could trigger hundreds of concurrent OCR jobs, exhausting the server. Consider adding a per-user or global concurrency check (e.g. only one per-document OCR at a time, reject if a job is already RUNNING for this document).
- The `BatchOcrDTO.documentIds` list is validated for size but not for duplicates (BatchOcrDTO.java). Submitting the same document ID 500 times would pass the `@Size(max=500)` check but process the same document 500 times, wasting CPU. Add a `@UniqueElements` validation or deduplicate in the service.
- `RestClientOcrClient` logs the OCR service error message at WARN level (RestClientOcrClient.java:54). If the error message contains the presigned URL (which includes S3 credentials), this leaks credentials to the log. Use `log.warn("OCR service health check failed")` without `e.getMessage()`, or sanitize the message.
- The Python service's `_download_and_convert_pdf` has no response size limit (main.py:77-81). A crafted URL could serve a multi-GB file, exhausting the container's 6GB memory limit. Add a `Content-Length` check before reading the full response, or stream with a size cap.

Good: `@RequirePermission(Permission.ADMIN)` on the batch endpoint, `@RequirePermission(Permission.WRITE_ALL)` on per-document OCR — matches the discussion. The SSE progress endpoint correctly uses `READ_ALL`. `BatchOcrDTO` has `@Size(max=500)`. The OCR service has no host port mapping in Docker Compose. The polygon DTO validation (`@DecimalMin`, `@DecimalMax`, `@Size`, `@UniquePoints`) is thorough.

🎨 Leonie Voss — UX Design & Accessibility
Verdict: ⚠️ Approved with concerns
The component decomposition follows the discussion agreements:
`ScriptTypeSelect` is a native `<select>` with a visible `<label>`, `OcrTrigger` wraps it with the confirmation flow, `OcrProgress` shows the SSE-driven progress bar. Accessibility is partially addressed. One blocker.

Blockers
- The `OcrProgress.svelte` progress bar is missing `role="progressbar"` and ARIA attributes. The discussion explicitly required `role="progressbar"` with `aria-valuenow`, `aria-valuemin`, `aria-valuemax`, and `aria-label`. The current implementation renders a styled `<div>` for the progress bar without any ARIA attributes. Screen readers will not announce progress changes. Fix: add `role="progressbar" aria-valuenow={progressPercent} aria-valuemin={0} aria-valuemax={100} aria-label={m.ocr_progress_heading()}` to the progress bar track element.

Suggestions
- `ScriptTypeSelect` — the disabled placeholder option should have `value=""` so that when no script type is selected, `bind:value` cleanly produces an empty string. Verify this is the case; if the placeholder has no explicit `value`, some browsers return the text content as the value.
- `OcrTrigger` — the "OCR starten" button uses `disabled` correctly (native attribute, not just `aria-disabled`), which matches the discussion agreement. Good.
- `AnnotationShape.svelte` clip-path rendering — the polygon clip-path approach is clever and avoids the complexity of switching to SVG. However, `clip-path: polygon()` clips the background fill and the block number badge. If a polygon annotation's top-left corner is not at the bounding box top-left, the badge will be clipped and invisible. Consider positioning the badge outside the clip-path by placing it as a sibling element rather than a child of the clipped div.
- No visual distinction between rectangular and polygon annotations — the discussion (Leonie's comment) asked: "Will users be able to distinguish polygon annotations from rectangle annotations?" The current implementation makes them visually identical. Consider a subtle visual cue — for example, a slightly different stroke style on polygon annotations (dashed border vs solid) — so users can see that Kraken produced a precise shape fit rather than a generic rectangle.
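As a reference point for the clip-path discussion, the underlying transform can be sketched in a few lines — a Python sketch with illustrative names (not the component's actual code), assuming both the AABB and the polygon points are normalized page coordinates in [0,1] and that `clip-path: polygon()` percentages are relative to the AABB element:

```python
def polygon_clip_path(aabb, points):
    """Convert normalized page-space polygon points into a CSS
    clip-path: polygon() string relative to the annotation's AABB.

    aabb   -- (x, y, width, height) in normalized page coordinates
    points -- list of (px, py) in normalized page coordinates
    """
    x, y, w, h = aabb
    parts = []
    for px, py in points:
        # Express each point as a percentage of the AABB box.
        rel_x = (px - x) / w * 100
        rel_y = (py - y) / h * 100
        parts.append(f"{rel_x:.2f}% {rel_y:.2f}%")
    return f"polygon({', '.join(parts)})"


# A polygon that exactly fills its AABB maps to the four corners.
aabb = (0.1, 0.2, 0.2, 0.1)
square = [(0.1, 0.2), (0.3, 0.2), (0.3, 0.3), (0.1, 0.3)]
print(polygon_clip_path(aabb, square))
# → polygon(0.00% 0.00%, 100.00% 0.00%, 100.00% 100.00%, 0.00% 100.00%)
```

Any point whose relative coordinate is greater than 0% at the top-left means the clip region starts inside the box — which is exactly the case where a badge pinned to the box's top-left corner gets clipped, as described above.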
- Mobile layout for ScriptTypeSelect + OcrTrigger — the discussion specified a full-width stacked layout at narrow viewports with `min-h-[44px]` touch targets. The button has `min-h-[44px]` and `w-full`. The select also has `min-h-[44px]`. The stacked layout uses `flex-col gap-3`. This matches the spec. Good.
- Error state in OcrProgress — a red left border and heading is good. The retry button text "Erneut versuchen" matches the discussion. The error heading uses `<h3>` as specified for screen reader announcement. Verify that focus moves to the error heading when the error state appears (currently it likely doesn't — that would require programmatic focus management).

🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
The Docker Compose integration is well done:
`mem_limit: 6g`, `memswap_limit: 6g`, health check with `start_period: 60s`, no host port exposure, named volume for models. One blocker and several operational concerns.

Blockers
- `depends_on: ocr-service: condition: service_healthy` makes the entire application fail to start if the OCR service is unhealthy (docker-compose.yml). The discussion explicitly resolved that OCR should degrade gracefully — the rest of the app must work without OCR. The hard dependency means that if the OCR Docker image fails to build, or if model loading crashes (OOM), or if the user hasn't downloaded Kraken models yet, the entire stack won't start. Fix: remove the `depends_on` for `ocr-service` from the backend, or move it to a dev-only override. The `OcrHealthClient.isHealthy()` check already handles runtime unavailability gracefully.

Suggestions
- The `ocr-service` Dockerfile installs `torch==2.5.1`, which may not be the latest CPU build (ocr-service/Dockerfile:13). Pinning intentionally is fine, but add a comment noting the version choice and that it should be updated with the other dependencies. Also, `--no-cache-dir` is already used — good.
- No `.dockerignore` in `ocr-service/` — the build context will include any model files, `__pycache__`, `.git`, etc. that happen to be in the directory. Create an `ocr-service/.dockerignore` excluding at least those.
- `surya-ocr==0.6.3` and `kraken==5.2.9` pinned versions (requirements.txt) — good for reproducibility. Consider adding a comment noting when these were last verified, or set up Dependabot/Renovate for the Python service.
- The `ocr_models` volume is empty by default — the container will start and load Surya models from pip-installed packages (bundled), but Kraken will log a warning and be unavailable. The runbook for downloading the Kraken model is mentioned in the issue discussion but not documented anywhere in the repo. Add an `ocr-service/README.md` or a script `scripts/download-kraken-model.sh` so the setup step isn't lost.
- No restart behavior specified for the backend when `ocr-service` crashes — since `ocr-service` has `restart: unless-stopped`, it will auto-restart. But the backend's `depends_on` only applies at initial startup. If the OCR service crashes and restarts mid-operation, in-flight OCR jobs will fail. The `OcrBatchService` failure isolation handles this per-document, which is correct. No action needed, just noting the behavior.
- `APP_OCR_BASE_URL` and `APP_S3_INTERNAL_URL` are hardcoded in docker-compose.yml rather than using `.env` variables. For consistency with the rest of the compose file (which uses `${VAR}` patterns), consider `APP_OCR_BASE_URL: ${OCR_BASE_URL:-http://ocr-service:8000}`.

Good: `memswap_limit: 6g` equal to `mem_limit` disables swap — correct, OOM-kill is better than thrashing. Health check `retries: 12` with `interval: 10s` gives 2 minutes for model loading. `restart: unless-stopped` on the OCR service. Named `ocr_models` volume for model persistence across rebuilds.

- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller, OcrService source=OCR verification

The reviewed flag enables the Kraken fine-tuning pipeline: only blocks marked as reviewed by a human are exported as training data.

Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR processing logic (processDocument, clearExistingBlocks, createBlocks) into OcrAsyncRunner. OcrService is now a thin entry point that validates, creates the job, and dispatches to OcrAsyncRunner.

Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies

Single-document OCR is now async (returns jobId immediately). Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.

816 backend tests pass, 687 frontend tests pass.

Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn rejects the upgrade ("Unsupported upgrade request"), causing the request body to be lost and a 422 "Field required" from FastAPI. Force HTTP/1.1 since the OCR service is internal and doesn't need h2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Single-document OCR now creates an OcrJobDocument row so GET /api/documents/{id}/ocr-status can find running jobs. OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED). Frontend checks OCR status when entering transcription mode — if a job is running, resumes polling and shows the spinner.

Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The ANALYZING message appeared while the Python service was still downloading the PDF and loading models. Remove it so the LOADING message ("Lade Modell und Dokument…") stays visible until the first ANALYZING_PAGE event arrives from the stream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Title changed from "feat: local OCR pipeline + polygon annotations (#226, #227)" to "feat: OCR pipeline with NDJSON streaming and real-time progress (#226, #227, #231)"

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Solid feature implementation with good test coverage across all three layers. The sealed interface for
`OcrStreamEvent` with pattern matching is clean Java 21 usage, and the component decomposition on the frontend (`AnnotationShape`, `OcrTrigger`, `OcrProgress`, `ScriptTypeSelect`) follows the visual boundary rule well. I have a few concerns:

Blockers
None — the code is functional and well-tested.
Suggestions
- OcrController injects repositories directly (OcrController.java:95-96). The controller injects `OcrJobRepository` and `OcrJobDocumentRepository` directly and queries them in `getJobStatus()` and `getDocumentOcrStatus()`. This violates the layering rule: controllers never inject or call repositories directly. These queries belong in `OcrService` or a dedicated read service.
- `OcrAsyncRunner` is doing too much (OcrAsyncRunner.java, 230 lines). This class handles single-document streaming, batch processing, block creation, existing block cleanup, and progress updates. That's at least 3 responsibilities. Consider splitting: an `OcrBlockCreator` for the annotation + block creation logic, keeping `OcrAsyncRunner` as the async orchestrator only.
- `resolveUserId` swallows exceptions silently (OcrController.java:162-169). The catch-all `Exception` handler returns `null` — if user resolution fails for an unexpected reason, the OCR job runs without a user ID and there's no log entry. At minimum, log at `warn` level.
- `clearExistingBlocks` iterates and deletes one-by-one (OcrAsyncRunner.java:131-135). For documents with many blocks, this generates N delete queries. A bulk `deleteAllByDocumentId` on the repository would be cleaner and faster.
- The `OcrProgress.svelte` retry button resets state but doesn't re-create the EventSource (line 81). Setting `status = 'running'` doesn't reconnect the SSE stream — the `$effect` won't re-run because `jobId` hasn't changed. The retry is cosmetic only.
- Inconsistent import style: `OcrTrigger.svelte` uses `import * as m` while `OcrProgress.svelte` uses `import { m }`. Should be consistent across the codebase.
- `AnnotationShape.svelte` uses inline styles extensively — the `shapeStyle` computed string is 10+ CSS properties concatenated. This is functional but harder to read than a class-based approach. Not a blocker given the dynamic nature of the values.
- `NDJSON_MAPPER` has `FAIL_ON_UNKNOWN_PROPERTIES = true` (RestClientOcrClient.java:1393). This makes the Java consumer brittle — if the Python service adds a new field to the NDJSON events, the Java side will break. For a streaming protocol between two services you control, `false` is safer and more forward-compatible.

Overall: clean implementation, good TDD evidence in the test suite, and the streaming protocol is well-designed. The controller-repository layering violation is the most important item to address.
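To illustrate the forward-compatibility trade-off on the NDJSON protocol, here is a minimal sketch (Python, hypothetical names — not the project's actual parser) of a tolerant consumer that keys on the event `type`, ignores extra fields, and skips unknown event types instead of failing:

```python
import json

def parse_ndjson_events(lines):
    """Parse start/page/error/done NDJSON events, tolerating
    unknown fields and unknown event types (skipped, not fatal)."""
    known = {"start", "page", "error", "done"}
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines in the stream
        event = json.loads(line)
        if event.get("type") not in known:
            continue  # forward compatibility: skip events we don't know
        events.append(event)
    return events


stream = [
    '{"type": "start", "pages": 2}',
    '{"type": "page", "page": 1, "blocks": 12, "engine": "surya"}',  # extra field: ok
    '{"type": "telemetry", "ms": 5120}',  # unknown event type: ignored
    '{"type": "done", "pages": 2}',
]
print([e["type"] for e in parse_ndjson_events(stream)])
# → ['start', 'page', 'done']
```

This is the behavior that setting `FAIL_ON_UNKNOWN_PROPERTIES` to `false` would give the Java consumer for extra fields: a new field like `engine` above is simply ignored rather than crashing the stream.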
🏗️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
The architectural decisions here are sound. The Python OCR microservice is justified — Surya and Kraken have no Java equivalents, and the ADRs (001 and 002) clearly document the reasoning. NDJSON streaming is the right transport choice: simpler than WebSockets, works through reverse proxies, and the per-page persistence eliminates the timeout problem. Good call on keeping the batch path on the old
`extractBlocks()` for now.

Blockers
None.
Suggestions
- Controller-to-repository coupling in `OcrController`. The controller directly queries `OcrJobRepository` and `OcrJobDocumentRepository`. This bypasses the service layer and creates a shortcut that will bite you when you need to add authorization checks or caching on job status reads. Move these queries behind `OcrService`.
- `OcrService.startOcr()` mutates the Document entity without saving (OcrService.java:1311-1313). `doc.setScriptType(scriptTypeOverride)` modifies the entity in the persistence context, but there's no explicit `documentService.save(doc)` call. This relies on JPA dirty checking within a transaction — but `startOcr()` is not annotated `@Transactional`. The mutation might or might not persist depending on the Hibernate flush mode.
- Thread pool sizing (AsyncConfig.java). The pool was changed from 1/1/1 to 2/2/100. A queue capacity of 100 with max 2 threads means up to 100 OCR jobs can be queued. Each OCR job can consume 5-8 GB of RAM on the Python side. Consider whether the queue should be smaller (e.g., 5-10) to prevent resource exhaustion, or add backpressure signaling.
- No `updated_at` trigger on `ocr_jobs`/`ocr_job_documents`. The migration creates these tables with `updated_at TIMESTAMPTZ NOT NULL DEFAULT now()` but doesn't add a trigger to auto-update it. The JPA `@UpdateTimestamp` handles it on the Java side, but any direct SQL updates (admin queries, debugging) won't update the timestamp.
- Presigned URL TTL (FileService.java). The 15-minute TTL is documented as "enough for OCR processing on CPU." But the issue description mentions a 7-page document taking 2+ hours. If a single page can take 10+ minutes on CPU, the presigned URL might expire before the Python service finishes downloading. Consider increasing to 30 minutes or making it configurable.
- SSE timeout is 5 minutes (OcrProgressService.java:1216). For long OCR jobs, the SSE connection will time out and the frontend will lose progress updates. The `OcrProgress.svelte` error handler fires but the retry doesn't reconnect (as Felix noted). Consider a longer timeout or a reconnect mechanism.
- ADR-001 mentions WireMock for integration tests but none are present. The ADR states "Integration tests require WireMock stub" but the test suite only has unit tests with mocks. This is fine for now, but worth noting as a gap.
The NDJSON protocol design with
`start`/`page`/`error`/`done` events is clean and extensible. The polygon JSONB storage with a CHECK constraint is the right balance of flexibility and safety. Good architectural work overall.

🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Impressive test coverage for a PR of this size. The backend adds 14 new test classes with thorough coverage of the OCR pipeline, streaming protocol, and edge cases. The frontend
`translateOcrProgress.spec.ts` is well-structured. Python tests at 38 with 88% coverage is solid. My concerns are around gaps and test quality.
None.
Suggestions
No integration tests for the OCR pipeline. All Java tests are unit tests with mocks. The critical path —
OcrAsyncRunnercallingstreamBlocks→ creating annotations → creating transcription blocks — is only tested with mocked dependencies. A@SpringBootTestintegration test with Testcontainers that exercises the full flow (minus the actual Python service) would catch JPA mapping issues, Flyway migration problems, and transaction boundary bugs. This is especially important given the JSONB polygon column and the new@JdbcTypeCode(SqlTypes.JSON)annotation.OcrProgress.sveltehas no tests. This component manages SSE connections, parses events, and drives the progress UI — all untested. At minimum, test the state transitions (running → done, running → error).OcrTrigger.sveltehas no tests. The confirmation dialog flow (existing blocks → confirm → trigger) is a critical user path that should have component tests.AnnotationShape.svelte— no tests for polygon rendering. TheclipPathderived computation with coordinate transforms is non-trivial math. A unit test for the clip-path calculation would prevent regression.FileServiceTestpassesnullforS3Presigner(FileServiceTest.java:508). This means the presigned URL generation path is completely untested at the unit level. If any test callsgeneratePresignedUrl, it will NPE.RestClientOcrClientStreamTesttests parsing but not HTTP fallback. The 404 → fallback-to-/ocrpath instreamBlocks()is tested only in the staticparseNdjsonStreammethod. The HTTP-level fallback logic is untested.Python tests don't test the streaming endpoint (
/ocr/stream).test_stream.pypresumably covers this, but I notice onlytest_confidence.pyandtest_engines.pyin the diff. The stream endpoint's error handling (per-page failure → continue) should have explicit test coverage.No test for
reviewBlocktoggle behavior in the frontend. TheTranscriptionBlock.sveltereview toggle is connected but never tested from the component level.Batch path (
runBatch) uses oldextractBlocks()butprocessDocumentis only tested viaOcrAsyncRunnerTest— the batch loop's skip/fail/success state transitions are tested inOcrBatchServiceTestbut only at the service level, not the runner level.Test quality is good where tests exist — descriptive names, single assertions per behavior, proper use of
@ExtendWith(MockitoExtension.class)and AssertJ. The gaps are in integration coverage and frontend component testing.🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The OCR pipeline introduces a new external service boundary and presigned URL generation — both expand the attack surface. Overall the security posture is reasonable, but there are items to address.
Blockers
None — no critical vulnerabilities found.
Suggestions
Presigned URL exposure via OCR service (
FileService.java:generatePresignedUrl). The presigned URL is passed to the Python service over the internal Docker network, which is acceptable. However, the 15-minute TTL means the URL is valid for anyone who intercepts it. Verify that:loggercalls)Risk: Low (internal network only), but worth a security smell note.
- No input validation on `TriggerOcrDTO.scriptType` (`TriggerOcrDTO.java`). The DTO accepts any `ScriptType` enum value, including `UNKNOWN`. If `UNKNOWN` is passed, Surya processes it (which is fine), but there's no explicit `@NotNull` validation. A `null` `scriptType` is silently accepted and may cause unexpected behavior in `OcrService.startOcr()`, where `scriptTypeOverride != null` is the guard.
- `resolveUserId` silently swallows authentication failures (`OcrController.java:162-169`). If user resolution throws, the OCR job runs with `userId = null`. This means:
  - `createdBy` on `OcrJob` is null — the audit trail is broken
  - `createdBy` on annotations is null — attribution is lost

  CWE-778 (Insufficient Logging): at minimum, log the exception at `warn` level.
- Batch endpoint requires ADMIN, but single-document requires WRITE_ALL (`OcrController.java`). This is intentional (batch is more destructive), but verify that WRITE_ALL users cannot abuse the single-document endpoint to re-run OCR repeatedly and cause resource exhaustion. Consider rate limiting or a check for concurrent OCR jobs on the same document.
- Python `_download_and_convert_pdf` and redirect handling (`main.py`). Note that `httpx.AsyncClient` only follows redirects when `follow_redirects=True` is set, so verify the client configuration. If the client does follow redirects and an attacker can inject a crafted presigned URL (unlikely, given it's generated server-side), they could redirect the Python service to an arbitrary URL. This is a defense-in-depth note — set `follow_redirects=False` explicitly to harden the client.
- No SSRF protection on the `pdfUrl` parameter (`OcrRequest.pdfUrl`). The Python service downloads whatever URL it receives. In the current architecture the URL is always a presigned MinIO URL generated by Spring Boot, so the input is trusted. But if the `/ocr` or `/ocr/stream` endpoints are ever exposed beyond the internal network, this becomes an SSRF vector. Add a comment documenting the trust assumption.
- `NDJSON_MAPPER` has `FAIL_ON_UNKNOWN_PROPERTIES = true`. From a security perspective, strict parsing is actually good — it prevents deserialization of unexpected fields. However, it also means a malicious or buggy Python service sending extra fields will crash the Java consumer. The trade-off is acceptable.
- `OcrBlockResult` record fields are not validated (`OcrBlockResult.java`). The `x`, `y`, `width`, `height` values from the Python service are trusted and stored directly. A malicious OCR response could inject coordinates outside [0,1], which the frontend would render as overflowing annotations. The DTO validation on `CreateAnnotationDTO.polygon` has range checks, but the AABB fields don't. Low risk (internal service), but worth noting.

No critical vulnerabilities. The main action items are logging improvements and documenting trust boundaries.
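Defense-in-depth sketches for two of the items above — a trust-boundary check on `pdfUrl` before download, and a range check mirroring the polygon validation for the AABB fields. Names and the allow-list values are hypothetical, not code from this PR:

```python
from urllib.parse import urlparse

# Hypothetical allow-list: in the compose setup, the only legitimate
# source of presigned URLs is the internal MinIO service.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"minio"}

def is_trusted_pdf_url(pdf_url: str) -> bool:
    """SSRF guard documenting the trust assumption on pdfUrl.

    Today the URL is always generated server-side by Spring Boot, so this
    should never fire; it only matters if /ocr is ever exposed wider.
    """
    parsed = urlparse(pdf_url)
    return parsed.scheme in ALLOWED_SCHEMES and parsed.hostname in ALLOWED_HOSTS

def has_valid_aabb(x: float, y: float, width: float, height: float) -> bool:
    """Normalized-coordinate range check for an OCR block's bounding box,
    analogous to the [0, 1] checks CreateAnnotationDTO.polygon already has."""
    in_range = all(0.0 <= v <= 1.0 for v in (x, y, width, height))
    return in_range and x + width <= 1.0 and y + height <= 1.0
```

On the Java side, the equivalent would be Bean Validation annotations (`@DecimalMin`/`@DecimalMax`) on the `OcrBlockResult` record components, which Jackson-deserialized payloads can then be validated against.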
🎨 Leonie Voss — UI/UX Design Lead
Verdict: ⚠️ Approved with concerns
The frontend additions follow the project's brand guidelines well — correct use of `brand-navy`, `brand-mint`, and `brand-sand`, proper `font-serif`/`font-sans` separation, and card patterns that match the existing design language. The component decomposition is clean: `OcrTrigger`, `OcrProgress`, `ScriptTypeSelect`, and `AnnotationShape` each represent one visual region.

Blockers
None.
Suggestions
- Progress bar has no text alternative for screen readers (`OcrProgress.svelte:459-467`). The `role="progressbar"` with `aria-valuenow` is correct, but there's no `aria-label` or visible text describing what the progress bar represents. Add `aria-label={m.ocr_progress_heading()}` to the progressbar div.
- `ScriptTypeSelect.svelte` — the `<select>` touch target is correct (`min-h-[44px]`), good. But the label text at `text-xs` (12px) is at the absolute minimum. For the senior user audience, consider `text-sm` (14px) for form labels.
- Error state border treatment (`OcrProgress.svelte:474`). The `border-l-4 border-l-red-500` is a good visual signal, but red alone as the error indicator may not be sufficient for color-blind users. The heading text "OCR fehlgeschlagen" ("OCR failed") provides the textual cue, which is good — but consider adding an error icon (⚠ or ✕) for redundant signaling.
- "OCR erneut ausführen…" ("Run OCR again…") trigger (`TranscriptionEditView.svelte:716-730`). The `<details>` element is semantically correct for progressive disclosure. The `text-xs` summary text is very small — at 12px it's the minimum. Acceptable, but tight for the target audience.
- Review toggle icon (`TranscriptionBlock.svelte:243-268`). The checkmark circle icon toggles between filled and outlined states — a clean interaction pattern. The `text-turquoise` color for the reviewed state aligns with the brand accent. The `aria-label` toggles between review/unreview states, which is correct for accessibility.
- Block number badge in `AnnotationShape.svelte` (lines 344-367). The badge uses `font-size: 11px`, which is below the 12px minimum. This is a regression from the original code, but worth flagging. At a 20px circle diameter with 11px text, the numbers will be hard to read on mobile.
- `OcrTrigger.svelte` button uses `min-h-[44px]` — a correct touch target. The disabled state uses `opacity-50`, which reduces contrast below WCAG AA for the white-on-navy text. Consider a distinct disabled color instead of opacity.
- Polygon clip-path rendering (`AnnotationShape.svelte:290-300`). The CSS `clip-path: polygon(...)` approach is correct for displaying quadrilateral shapes. Note that `clip-path` clips the entire element including the box-shadow, so active/hover states will not show the inset shadow on polygon-clipped annotations. This may be intentional, but is worth verifying visually.

Overall the UI additions are clean and follow the established patterns. The accessibility items are minor — the fundamentals (ARIA roles, touch targets, semantic HTML) are in place.
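If the missing shadow on polygon annotations does turn out to matter visually, one known workaround is `filter: drop-shadow(...)` on a wrapper element: the filter is applied to the already-clipped child, so the shadow follows the polygon outline, which `box-shadow` (clipped away with the rest of the element) cannot do. A sketch with hypothetical class names, not tested against the actual component:

```css
/* drop-shadow on the wrapper follows the clipped shape of the child;
   box-shadow on the clipped element itself would be cut off */
.annotation-wrapper.is-active {
  filter: drop-shadow(0 2px 4px rgb(0 0 0 / 0.4));
}
.annotation-wrapper > .annotation-shape {
  clip-path: polygon(var(--poly-points, 0 0, 100% 0, 100% 100%, 0 100%));
}
```

Note the filter must live on a separate wrapper: clipping is applied after filter effects on the same element, so a `drop-shadow` on the clipped element itself would be clipped too.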
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
The Docker Compose additions for the OCR service are well-structured. The healthcheck with `start_period: 60s` accounts for model loading, the `mem_limit: 8g` is a sensible guard, and the internal-only networking is correct. Two ADRs documenting the architectural decisions — that's rare and appreciated.

Blockers
None.
Suggestions
- `ocr-service` has no host port mapping — good, it's internal only. But there's also no `expose` directive. While Docker Compose services on the same network can reach each other by service name without `expose`, adding `expose: ["8000"]` makes the intent explicit and helps with documentation.
- `mem_limit: 8g` and `memswap_limit: 8g` means no swap for the OCR container. On a 16-32 GB server running the full stack (PostgreSQL + MinIO + Spring Boot + SvelteKit + OCR), this is tight. If Surya loads at ~5 GB idle and processes a large document, the OOM killer may hit. Consider raising `memswap_limit` to 10g to allow some swap headroom.
- Backend now depends on `ocr-service: condition: service_healthy` (`docker-compose.yml:1359-1360`). This means the entire backend won't start if the OCR service is unhealthy. OCR is a non-essential feature — the app should work without it. Consider changing to `condition: service_started` or removing the dependency entirely, since the `OcrHealthClient.isHealthy()` check already handles the unavailable case gracefully.
- Dockerfile uses unpinned `python:3.11-slim`. For reproducible builds, pin to a specific digest or at least a patch version: `python:3.11.12-slim`. Also, `torch==2.7.1` is pinned but installed from the CPU index in a separate `RUN` layer and then listed again in `requirements.txt`. If `pip install -r requirements.txt` re-resolves torch from PyPI (not the CPU index), it could pull in the CUDA variant (2+ GB larger). Verify that the pre-installed CPU wheel satisfies the `requirements.txt` constraint.
- The `ocr_models` and `ocr_cache` named volumes are correct for persistence. But the Kraken model (`german_kurrent.mlmodel`) needs to be downloaded separately — there's a `scripts/download-kraken-models.sh` script in the diff, but it's not integrated into the Docker build or documented in the compose file. First-time users will get a healthy OCR service that fails on Kurrent requests with a confusing "model not loaded" warning.
- No resource limits on the existing containers. The OCR service has `mem_limit`, but PostgreSQL, MinIO, and the backend don't. On a shared VPS, the OCR service's 8 GB allocation could starve the other services. Consider adding `mem_limit` to the backend and PostgreSQL containers as well, or at minimum documenting the total RAM budget.
- The `APP_S3_INTERNAL_URL` environment variable is added (`docker-compose.yml:1369`), but I don't see it referenced in the Spring Boot code. Is this used? If it's for the presigned URL base, the presigned URL is generated by `S3Presigner`, which uses the endpoint from `MinioConfig`. Verify this isn't a dead env var.
- No CI workflow changes. The OCR service adds a Dockerfile and tests, but the Gitea Actions workflow isn't updated to build/test the Python service. The `cd ocr-service && python -m pytest` step in the test plan is manual-only.

Good infrastructure work. The main operational concern is the hard dependency on OCR service health blocking the backend startup.
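The three compose-level suggestions above could look roughly like this — a config sketch against the service names used in the review, with illustrative values rather than a drop-in patch:

```yaml
services:
  ocr-service:
    expose: ["8000"]        # documents the internal-only port explicitly
    mem_limit: 8g
    memswap_limit: 10g      # small swap headroom before the OOM killer hits

  backend:
    mem_limit: 2g           # illustrative; budget against total host RAM
    depends_on:
      ocr-service:
        # service_started instead of service_healthy: OCR is optional, and
        # OcrHealthClient.isHealthy() already handles an unavailable service
        condition: service_started
```

With `service_started`, the backend only waits for the OCR container to exist, not to pass its healthcheck, so a broken model download no longer takes the whole app down.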
Changed `OcrTrigger` and `ScriptTypeSelect` from `import * as m` to `import { m }` to match the rest of the codebase. Increased the `ScriptTypeSelect` label to `text-sm` and the annotation badge font to 12px for better readability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>