feat: local OCR pipeline (batch + per-document) with Surya and Kraken #226
Overview
Add a local OCR pipeline that pre-populates TranscriptionBlocks automatically, with two entry points: per-document OCR and batch OCR. Documents are scans of typewritten and handwritten letters (including historical German scripts like Kurrent). OCR results are always editable — the existing transcription editor is the correction surface.
Engine Strategy
The engine is selected by the document's scriptType field (set by the user): typewriter, handwriting-modern, handwriting-kurrent, or unknown (the default). No automatic confidence-based switching between engines — this avoids fragile threshold tuning and gives the user control.

Data Model Changes
Document entity: add a scriptType field. Flyway migration: ALTER TABLE documents ADD COLUMN script_type VARCHAR(30) NOT NULL DEFAULT 'UNKNOWN'.

DocumentDTO: expose scriptType in DocumentUpdateDTO so users can set it from the edit form and from the per-document OCR trigger.

API Endpoints
Per-document OCR
- Sets scriptType on the document if provided
- Returns 202 Accepted with a job reference (or document ID to poll)
- Replaces existing TranscriptionBlocks for the document (with confirmation in the UI)

Batch OCR

- Accepts a list of document IDs and an optional scriptType (defaults to UNKNOWN → Surya)
- Returns 202 Accepted; processing happens in the background
- Status polling (optional, but needed for batch UX)
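The batch endpoint's skip rule for documents without a file can be sketched as a filtering step. A minimal sketch, assuming a simple status string per document; Doc and partition_batch are illustrative names, not types from the codebase:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: int
    status: str  # e.g. "UPLOADED" or "PLACEHOLDER"

def partition_batch(docs):
    """Split a batch into documents OCR can process and ones to skip.

    Only UPLOADED documents have a file present; everything else is
    skipped rather than treated as an error.
    """
    to_process = [d for d in docs if d.status == "UPLOADED"]
    skipped = [d for d in docs if d.status != "UPLOADED"]
    return to_process, skipped

batch = [Doc(1, "UPLOADED"), Doc(2, "PLACEHOLDER"), Doc(3, "UPLOADED")]
to_process, skipped = partition_batch(batch)
```

The same partition later feeds the per-document status records, so skipped documents stay visible in the job result instead of silently disappearing.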
OCR Microservice
A separate Python container (ocr-service) in docker-compose.yml. Responsibilities of the Python service:

- Fetch the document and select an engine from the scriptType parameter
- Surya path: layout detection + recognition in one pass; returns bounding boxes natively
- Kraken path: Kraken's own baseline segmentation → recognition with the specified historical model; models are bundled into the Docker image or mounted as a volume
Interface (request):
Interface (response):
Coordinates are normalized (0–1) relative to page dimensions, matching how the PDF viewer handles annotations.
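The request/response snippets referenced above did not survive in this copy of the issue. A hedged sketch of a plausible block shape (field names are assumptions, not the final contract), including the pixel-to-normalized conversion the viewer convention requires:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class OcrBlock:
    # Hypothetical field names: the issue's actual payload snippets were lost.
    page: int
    text: str
    x: float       # all coordinates normalized to 0-1
    y: float
    width: float
    height: float

def normalize(px_box, page_w, page_h):
    """Convert a pixel-space (x, y, w, h) box into the normalized 0-1
    convention the PDF viewer uses."""
    x, y, w, h = px_box
    return (x / page_w, y / page_h, w / page_w, h / page_h)

x, y, w, h = normalize((100, 500, 400, 50), page_w=1000, page_h=2000)
block = OcrBlock(page=1, text="Sehr geehrter Herr ...", x=x, y=y, width=w, height=h)
payload = json.dumps(asdict(block))
```

Whatever the final field names turn out to be, the important invariant is that the Python service divides by page dimensions before the response leaves the container, so the backend never sees pixel values.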
Backend Integration
An OcrService in Spring Boot:

- calls the Python service via RestClient
- maps the response to CreateTranscriptionBlockDTO objects
- calls TranscriptionService.createBlock() for each result
- uses @Async for non-blocking execution

An OcrJobService (or reuse of the MassImportService pattern) tracks job state.

Architectural Considerations
Block replacement on re-run
If OCR is triggered on a document that already has blocks, the existing blocks must be cleared first. Options:
Show a confirmation dialog in the UI if blocks already exist.
Async processing and failure handling
Failed documents must be surfaced (e.g. an ocrError field or log entry).

Resource constraints
Surya and Kraken are CPU-heavy (GPU optional); on a home NAS this is a real constraint.
Kraken model management

Which Kraken model to use for Kurrent is unresolved and needs evaluation against real documents.

sortOrder for auto-created blocks

Blocks returned by OCR are already ordered by page, then by vertical position. Assign sortOrder sequentially (0, 1, 2, …) in that order.

Bounding box coordinate system
The existing DocumentAnnotation stores x, y, width, height as doubles. Confirm whether these are normalized (0–1) or pixel values — the OCR service must output the same convention. The current frontend PDF viewer appears to use normalized values.

MassImport integration
MassImportService creates PLACEHOLDER documents. OCR only makes sense on UPLOADED documents (file present). The batch OCR endpoint must skip or reject documents not in UPLOADED (or later) status.

Open Questions
Should scriptType be shown on the document list/search, or only on the detail view?

Out of Scope
Hardware clarification
The server has no GPU. CPU-only inference is the target. The server RAM can be upgraded to meet requirements.
Hardware assumption: 16–32 GB system RAM, no GPU.
This resolves the resource constraint concern in the architectural considerations: the ocr-service Docker container can be given a generous memory limit (e.g. mem_limit: 6g) rather than being squeezed.

Processing speed expectations on CPU:
For overnight batch imports this is fully acceptable. Per-document OCR will feel slow if triggered interactively — the UI should reflect that it is a background job (spinner / async status) rather than a synchronous response.
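To make "acceptable overnight, slow interactively" concrete, a back-of-envelope estimate helps. The per-page rate below is an illustrative assumption; the issue's measured numbers did not survive in this copy:

```python
def batch_eta_minutes(num_docs, pages_per_doc, seconds_per_page=30.0):
    """Rough batch duration estimate.

    30 s/page is an assumed CPU rate, not a measured number from the issue.
    """
    return num_docs * pages_per_doc * seconds_per_page / 60.0

eta = batch_eta_minutes(47, 10)  # a multi-hour run: an overnight job, not an interactive one
```

Even if the real rate is several times faster, the conclusion holds: batch OCR belongs in the background, and the per-document trigger needs async UI feedback.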
No GPU, no CUDA, no NVIDIA Container Toolkit needed. PyTorch CPU-only build in the Docker image is sufficient and keeps the image smaller.
Progress reporting
Given that OCR is slow on CPU (minutes per document), the user needs live feedback while it runs — both for per-document and batch jobs.
Recommended approach: Server-Sent Events (SSE)
SSE is the right fit here: a one-way server→client stream, no WebSocket overhead, works natively with Spring Boot's SseEmitter on the existing Jetty stack, and the frontend can consume it with a plain EventSource.

Event shape:
The Python service emits progress per page (it processes pages sequentially). Spring Boot relays these to the SSE stream as it receives them from the OCR microservice.
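The concrete event shape was lost in this copy of the thread. A hedged sketch of what one progress frame could look like on the SSE wire; the field names borrow the status values and currentPage/totalPages from the ocr-status discussion later in the thread, but remain assumptions:

```python
import json

def sse_progress_frame(job_id, current_page, total_pages, status="RUNNING"):
    """Render one 'progress' event in the SSE wire format.

    An SSE frame is plain text: an 'event:' line, a 'data:' line, and a
    blank line terminating the frame.
    """
    payload = {
        "jobId": job_id,
        "status": status,            # NONE / PENDING / RUNNING / DONE / FAILED
        "currentPage": current_page,
        "totalPages": total_pages,
    }
    return f"event: progress\ndata: {json.dumps(payload)}\n\n"

frame = sse_progress_frame("job-42", 3, 12)
```

On the frontend, a named event like this is consumed with EventSource's addEventListener("progress", …) rather than the default onmessage handler.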
Per-document OCR
Show a progress bar in the transcription panel: "Seite 3 von 12 wird analysiert…". Blocks appear incrementally as each page completes — the user sees results filling in rather than a blank screen until the end.
Batch OCR
Show a batch progress overlay (or a persistent status bar): "47 Dokumente · 5 abgeschlossen · 2 Fehler". Each completed document can be linked so the user can jump straight to its transcription.
Fallback: polling
If SSE proves complex to integrate with the async job infrastructure, a simple GET /api/ocr/jobs/{jobId} polling endpoint (hit every 2–3 seconds from the frontend) is an acceptable fallback. Coarser granularity, but much simpler to implement. Worth starting here and upgrading to SSE if the UX feels laggy.

Connection loss
If the user closes the tab or loses the connection mid-job, the OCR job must continue running on the server. The SSE stream is display-only — job lifecycle is independent of the client connection.
👨💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
OcrService responsibilities are too broad
The proposed OcrService does four things: fetch from MinIO, call the Python service, map the response to DTOs, and call TranscriptionService. That's four reasons to change. I'd split it:

- OcrService — orchestrates the flow, owns the job lifecycle
- OcrMicroserviceClient — the RestClient wrapper; one method, one responsibility
- response-to-DTO mapping stays inside OcrService as a private helper

TDD question: how do we test OcrService in isolation?

The Python microservice is an external HTTP dependency. That means OcrMicroserviceClient needs an interface so we can inject a mock in unit tests. Without an interface, OcrService is not unit-testable.

Command-query separation on the POST /api/documents/{id}/ocr endpoint

The endpoint both triggers OCR (command) and returns a job reference (query). That's fine for REST semantics, but the service method should not mix the two internally.
OcrService.startOcr(documentId) should return a job ID as a pure creation result — not a status read.

In-memory job tracking is a smell

"In-memory map (simple first)" will lose all job state on restart. That's fine for a first iteration, but the issue should explicitly flag it as tech debt with a follow-up ticket, otherwise it will never get addressed. A simple ocr_jobs table (Flyway-managed) is not much more work and gives you restart resilience from day one.

Frontend: EventSource in SvelteKit

EventSource is a browser API — it does not work in SvelteKit server load functions. Progress display must be purely client-side (an onMount-triggered EventSource or a Svelte action). This is not a problem architecturally, but it means the progress UI cannot use the standard +page.server.ts data flow. Worth calling out explicitly so there's no confusion during implementation.

Naming: HANDWRITING_MODERN vs HANDWRITING_LATIN_CURSIVE

MODERN is ambiguous — modern relative to what? A Kurrent writer in 1920 would have called their script "modern". Consider HANDWRITING_LATIN as a clearer contrast to HANDWRITING_KURRENT.

Suggestions
- Extract OcrClient as an interface from the start — TDD requires it
- Add a ScriptType validator at the controller boundary that rejects unknown values with a 400, not a 500
- sortOrder assignment should be a named constant or helper, not an inline (i, _) -> i lambda scattered across the mapping code

🏛️ Markus Keller — Application Architect
Questions & Observations
The microservice is justified — but document it
I'm generally skeptical of extracting services prematurely. In this case, the Python microservice is genuinely justified: the OCR engines (Surya, Kraken) have no Java bindings and exist only in the Python ecosystem. That's a concrete, present requirement. However, this should be captured in an ADR (docs/adr/) before implementation starts, covering: why a separate service, why not Tess4J, and what the interface contract guarantees. Otherwise future maintainers won't know why the complexity exists.

Job state belongs in PostgreSQL, not memory
The issue says "in-memory map (simple first)" for job tracking. I'd push back on this being the starting point. An ocr_jobs table is straightforward, gives you restart resilience, and can be queried by the SSE endpoint directly. In-memory state requires careful synchronization across @Async threads and disappears on any restart or crash — which is exactly when you need job state most.

The scriptType field: processing hint or document attribute?

There's a design question here. scriptType describes how the document was written (a permanent fact about the document) — not just a processing hint for OCR. That argues for it living on Document permanently, which the issue proposes. I agree with this. But it also means scriptType should appear in the document edit form, not only as part of the OCR trigger — a user should be able to set it without triggering OCR.

POST /api/ocr/batch layering

The batch endpoint sits at /api/ocr/batch, outside the document resource tree. That's clean. But make sure OcrController only injects OcrService — it must not reach into DocumentService or TranscriptionService directly. The domain boundary must be: the OCR service coordinates; the transcription service owns blocks.

SSE is the right transport here
Agreed with the progress reporting comment: SSE via SseEmitter is the correct choice for one-way progress streaming. One consideration: SseEmitter has a default timeout. Set it explicitly (e.g. to the max expected batch duration) or the connection will close silently partway through a long batch. Also, each SseEmitter holds a thread — for a home server processing one batch at a time this is fine, but worth noting.

LISTEN/NOTIFY as an alternative to in-memory job tracking
A pattern worth considering: when the Python service completes a page, it calls back to Spring Boot, which updates the ocr_jobs table and sends a PostgreSQL NOTIFY ocr_progress. The SSE endpoint holds a connection listening for notifications and pushes them to the browser. This eliminates the need for a separate polling mechanism entirely and keeps state durable. Probably overkill for v1, but worth flagging as the clean long-term direction.

Suggestions

- Use an ocr_jobs table, not an in-memory map — same effort, much more resilient
- Set the SseEmitter timeout explicitly to prevent silent connection drops
- Ensure OcrController only depends on OcrService — no cross-domain repository or service injection

🧪 Sara Holt — QA Engineer
Missing acceptance criteria
The issue has a thorough architectural description but no acceptance criteria. Before implementation starts, we need testable definitions of done. Proposed:
- POST /api/documents/{id}/ocr on a PLACEHOLDER document returns 400 (no file present)
- POST /api/documents/{id}/ocr on an UPLOADED document returns 202 and creates a job
- OCR-created blocks are assigned sequential sortOrder
- GET /api/ocr/jobs/{jobId} returns 404 for an unknown job ID
- The SSE stream emits a done event after all documents in a batch are processed

Test strategy by layer
Unit tests (JUnit 5 + Mockito)
- OcrService: mock OcrClient and TranscriptionService; test status transitions, block mapping, failure isolation, the PLACEHOLDER guard
- OcrJobService (if in-memory): test the state machine — PENDING → RUNNING → DONE/FAILED
- ScriptType enum: validate all values accepted at the controller boundary

Integration tests (Testcontainers + WireMock)
- POST /api/documents/{id}/ocr full flow through the Spring context: job created, blocks persisted, annotations created
- POST /api/ocr/batch with a mix of UPLOADED and PLACEHOLDER documents: only UPLOADED ones are processed

E2E tests (Playwright)
Key testability concerns
The Python OCR service must be mockable in CI
CI should not spin up Surya or Kraken — they are too large and slow. The OcrClient interface (as Felix suggests) makes this clean. WireMock handles the integration test layer. The Python service itself needs its own test suite (pytest), but that runs independently.

Async behavior — use Awaitility, not Thread.sleep()

OCR processing is @Async. Integration tests that verify block creation after triggering OCR must use Awaitility. Never use Thread.sleep() — flaky tests are worse than no tests.

SSE is hard to test at the integration layer
Testing SseEmitter behavior requires either Spring's MockMvc SSE support or a small WebClient-based integration test. Make sure the SSE endpoint is covered at the integration layer, not just manually verified in a browser.

Block re-run test: preserve version history
When OCR replaces existing blocks, verify that TranscriptionBlockVersion records from the previous (manual) edits are also cleaned up. If they are orphaned in the DB, the history table grows unbounded.

Suggestions

- Verify the script_type column migration runs cleanly from a clean DB

🔒 Nora "NullX" Steiner — Application Security
Findings & Questions
1. Authorization on OCR endpoints — who can trigger OCR?
The issue specifies WRITE_ALL permission for transcription block creation, but doesn't explicitly state what permission guards POST /api/documents/{id}/ocr and POST /api/ocr/batch. These must be annotated with @RequirePermission — OCR replaces all existing blocks, making it a destructive operation. Triggering it without authorization could wipe manually verified transcriptions.

Recommended: WRITE_ALL for per-document OCR, ADMIN or WRITE_ALL for batch (batch is higher impact).

2. IDOR risk on POST /api/ocr/batch

The batch endpoint accepts an arbitrary list of documentIds. The implementation must verify that every document ID in the list belongs to a document the requesting user is permitted to access. Without this check, a user could supply document IDs they don't own and trigger OCR (and block replacement) on other users' documents.

This also applies to the per-document endpoint — documentService.getById() must validate access, not just existence.

3. Presigned URL lifetime
The issue proposes passing a MinIO presigned URL to the Python service. Presigned URLs have a configurable TTL. For large documents processed slowly on CPU, a very short TTL risks expiry mid-processing. A very long TTL widens the exposure window if the URL is logged by the Python service. Recommended: set TTL to 15–30 minutes (enough for a large document on CPU) and ensure the Python service does not log the full URL.
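One way to honor "the Python service does not log the full URL" is to strip the query string (which carries the signature and expiry) before anything reaches a log line. A sketch; the helper name is illustrative:

```python
from urllib.parse import urlsplit

def redact_presigned(url):
    """Drop the query string (signature, expiry) from a presigned URL
    so log lines never contain a usable credential."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}?<redacted>"

safe = redact_presigned(
    "http://minio:9000/archive/letter-17.pdf?X-Amz-Signature=abc&X-Amz-Expires=1800"
)
```

Routing all request logging through a helper like this is cheaper than auditing every log statement for accidental URL leakage.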
4. The Python OCR service must not be publicly reachable
The ocr-service container should be on the internal Docker network only — no ports: mapping in docker-compose.yml. Spring Boot calls it via the internal service name (http://ocr-service:8000). If someone exposes it to the host, anyone on the network can submit arbitrary PDFs for processing, potentially exhausting CPU resources.

5. Job ID enumeration

GET /api/ocr/jobs/{jobId} must use UUIDs as job IDs (which the proposed schema does). Verify that the endpoint returns 404 (not 403) for job IDs that exist but belong to another user — 403 confirms the job exists, which leaks information.

6. Batch endpoint size limit

POST /api/ocr/batch accepts an unbounded list of document IDs. Add a @Size(max = 500) constraint on documentIds to prevent a single request from queuing an unbounded number of documents and starving the server.

7. OCR output is user-visible — XSS via OCR text?
OCR output is stored as text in TranscriptionBlock and rendered in the transcription panel. If the OCR engine produces malicious strings (unlikely, but possible with crafted PDFs) and the frontend renders them without escaping, XSS is possible. Verify that TranscriptionBlock.text is rendered as plain text (.textContent, not .innerHTML) in the Svelte components — a quick grep of TranscriptionReadView.svelte and TranscriptionBlock.svelte should confirm this.

Suggestions

- State the @RequirePermission explicitly in the issue and in implementation
- Add @Size(max = 500) to documentIds in the batch DTO
- Enforce per-document access checks in OcrService
- No ports: on the ocr-service in docker-compose.yml

🎨 Leonie Voss — UX Design & Accessibility
Questions & Observations
Where does the OCR trigger live in the existing UI?
The issue describes the feature but not the interaction entry point. For per-document OCR, there are two plausible locations:
1. In the transcription panel (next to the scriptType selector — a paired selector + trigger)
2. On the document edit page

Option 1 is better. The transcription panel is where the user is already thinking about text extraction. The scriptType selector and the "OCR starten" button should live there together, not scattered across two pages.

The scriptType selector needs a human-readable UI

The enum values (UNKNOWN, TYPEWRITER, HANDWRITING_MODERN, HANDWRITING_KURRENT) are developer names — they should never appear in the UI as-is. Proposed German labels:

- UNKNOWN → "(nicht festgelegt)"
- TYPEWRITER → "Schreibmaschine"
- HANDWRITING_MODERN → "Handschrift (lateinisch)"
- HANDWRITING_KURRENT → "Handschrift (Kurrent/Sütterlin)"

These should be Paraglide translation keys, not hardcoded strings.
Progress UX: incremental block appearance vs. spinner
The issue proposes "Seite 3 von 12 wird analysiert…" with blocks appearing incrementally. This is the right pattern — incremental display gives the user immediate value and makes the wait feel shorter. However, the blocks that appear during processing should be visually marked as "draft/processing" so the user doesn't start editing a block that might be replaced when the next page finishes. A subtle muted border or opacity reduction would work.
The confirmation dialog for block replacement needs care
"You have existing blocks. OCR will replace them." — this dialog is the right pattern, but the phrasing must make the consequence explicit, especially for older users who may have spent hours on manual transcription:
The destructive action button ("Ersetzen") should not be the primary (blue) style — use a red/danger style consistent with other destructive confirmations in the app.
Batch OCR progress: where does it surface?
The issue mentions "a persistent status bar" for batch progress but doesn't specify where. I'd suggest a dismissible notification banner at the top of the document list — the user can navigate away and come back. The banner should show:
Error state in the transcription panel
The open question "should failed OCR documents be surfaced with a badge?" — yes, they should, but keep it subtle. A small grey label "OCR fehlgeschlagen" below the document title in the list is enough. In the transcription panel itself, show the reason in the empty state: "OCR konnte nicht abgeschlossen werden. [Erneut versuchen]"
Accessibility
- The scriptType dropdown must have a visible <label> — not placeholder-only labeling
- The progress bar needs role="progressbar" with aria-valuenow, aria-valuemin, aria-valuemax
- The trigger button should be aria-disabled="true" while a job is already running, not just visually greyed out

Suggestions
- Put the scriptType selector and OCR trigger together in the transcription panel, not on the edit page
- Define human-readable labels for the ScriptType enum values before implementation

🛠️ Tobias Wendt — DevOps & Platform
Questions & Observations
The Docker image will be very large
A PyTorch CPU-only image with Surya installed is typically 4–8 GB. Kraken adds less (it uses lighter models) but the base image is still heavy. A few things to keep this manageable:
- Use pytorch/pytorch:2.x.x-cpu as the base — not the default PyTorch image, which includes CUDA and is 10+ GB. The CPU-only build is a fraction of that.
- .dockerignore must exclude any model files from the build context — they will accidentally bloat the image if left loose in the directory.

Model files must not be in the Docker image
Large binary models (Kraken HTR-United models can be 200–500 MB each) should be stored in a named Docker volume or a bind mount, not baked into the image. This keeps the image portable and lets models be updated without rebuilding. Add a docker-compose.yml volume for them. Document the one-time model download step in the runbook.
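Inside the service, resolving the model from the mounted volume can fail loudly instead of at inference time. A sketch; KRAKEN_MODEL_PATH is the env var named later in the thread, while the default path below is an assumed location, not the project's actual layout:

```python
import os

def kraken_model_path():
    """Resolve the Kraken model file from the mounted volume.

    The default path is an assumption for illustration; the real location
    is whatever the compose volume mounts.
    """
    path = os.environ.get("KRAKEN_MODEL_PATH", "/models/kurrent.mlmodel")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Kraken model not found at {path}: run the one-time download step from the runbook"
        )
    return path
```

Checking at startup (before reporting healthy) turns a missing volume into an obvious deploy error rather than a mid-batch failure.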
Startup time and health checks
Surya loads transformer models into RAM at startup — this takes 30–60 seconds on first launch. The healthcheck must wait for models to be loaded, not just for the HTTP server to bind. The Python service should expose a /health endpoint that returns 200 only after models are ready. Spring Boot must not start sending OCR jobs until this health check passes. And Spring Boot's depends_on should use condition: service_healthy.

Restart behavior during a running batch
If the ocr-service container crashes mid-batch (OOM, segfault — both realistic on CPU-heavy workloads), in-flight documents are lost. The job tracking must be in the database (as Markus noted) so that on restart the job can be resumed or at least marked as failed. A restart: unless-stopped policy on the container is the minimum.

CI must not run the OCR service
The real OCR service must not be part of the Gitea Actions CI pipeline — the Docker image is too large to pull and the inference is too slow. The backend integration tests should mock the OCR service (WireMock stub, as Sara describes). Add a comment in the Gitea Actions workflow file making this explicit so no one accidentally adds it later.
mem_limit in Docker Compose

Set an explicit memory limit on the ocr-service to prevent it from taking down the entire host on a runaway job. Setting memswap_limit equal to mem_limit disables swap — better to OOM-kill the container than to thrash the disk at 20 MB/s while the rest of the stack grinds to a halt.

No ports: on the OCR service

As NullX flagged: no host port mapping. Internal Docker network only. The service name ocr-service is the hostname Spring Boot uses.

Suggestions

- Use a CPU-only base image (pytorch/pytorch:*-cpu) — significantly smaller than the default
- Give the healthcheck a start_period (60s) and retries: 12 to account for model loading
- Set mem_limit: 6g + memswap_limit: 6g on the ocr-service
- Use restart: unless-stopped on the container

🏛️ Markus Keller — Architecture Discussion Summary
Discussion of the seven open architectural items. All resolved. ADR written to docs/adr/001-ocr-python-microservice.md.

Resolved decisions
1. ADR first
ADRs go in docs/adr/, numbered sequentially. 001-ocr-python-microservice.md covers the Python microservice decision and must exist before implementation starts.

2. Job persistence — ocr_jobs + ocr_job_documents tables

No in-memory map. Two Flyway-managed tables:

- ocr_jobs — batch-level tracking (status, total, timestamps)
- ocr_job_documents — per-document status (PENDING/RUNNING/DONE/FAILED)

The Python service stays stateless. Spring Boot owns all state. Batches are resumable after OOM or restart — unprocessed documents remain PENDING and can be retried.

3. Graceful degradation in production, hard dependency in dev
OcrService checks the OCR service health before each job. If it's down, a clear DomainException is returned; the rest of the app is unaffected. The hard depends_on: service_healthy is kept in the dev compose only — removed in the production overlay.

4. SSE emitter timeout — ~5 minutes with auto-reconnect
SseEmitter is set to ~5 minutes. The browser EventSource reconnects automatically. On reconnect, the endpoint reads the current job state from the DB and resumes streaming. The DB is the source of truth, not the emitter.

5. LISTEN/NOTIFY explicitly deferred
Multiple concurrent jobs (even from a single user) are handled via a ConcurrentHashMap<UUID, List<SseEmitter>> — an ephemeral routing table, fine to lose on restart. LISTEN/NOTIFY is the upgrade path for multi-instance deployments; not needed for a single home server instance. Noted in the ADR as the future direction.

6. REST routes confirmed
- POST /api/documents/{id}/ocr — per-document
- POST /api/ocr/batch — batch
- GET /api/ocr/jobs/{jobId}/progress — SSE stream

OcrController is a top-level controller, not nested under documents.

7. Package placement — follow existing convention
New classes go into the existing package-by-layer structure (controller/, service/, model/, repository/, dto/). The ADR notes package-by-feature as the preferred future direction.

The architecture is solid. The two things that matter most before writing any code: get the ADR committed, and nail down the Flyway migrations for ocr_jobs and ocr_job_documents — those tables are the backbone everything else hangs on.

👨💻 Felix Brandt — Developer Discussion Summary
Nine implementation decisions resolved. These lock down the class structure, frontend pattern, and annotation integration before a line of code is written.
Resolved decisions
1. Interface split — OcrClient + OcrHealthClient

Two interfaces, not one. OcrClient exposes extractBlocks(), OcrHealthClient exposes isHealthy(). RestClientOcrClient implements both. OcrBlockResult is a record. Unit tests mock only the interface they need.

2. Class breakdown — four classes
- OcrService — single-document work (presigned URL → OCR call → block mapping → TranscriptionService)
- OcrBatchService — batch loop, owns ocr_job_documents state, calls OcrService per document
- OcrProgressService — owns the ConcurrentHashMap<UUID, List<SseEmitter>>, exposes register() and emit()
- RestClientOcrClient — HTTP infrastructure only

Per-document OCR (POST /api/documents/{id}/ocr) bypasses OcrBatchService entirely — it goes directly to OcrService and creates one ocr_jobs row, no ocr_job_documents rows.

3. ScriptType enum values

UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT. These are persisted as strings in the DB column — they must be correct in the Flyway migration before anything else touches the column.

4. Frontend EventSource pattern

The SSE stream is proxied through a SvelteKit +server.ts that pipes the Spring Boot response body. The browser opens a same-origin EventSource — no Spring Boot URL is exposed client-side, and the auth cookie is included automatically. OCR trigger POSTs use the generated typed API client as normal.

5. Frontend component decomposition
New components in the transcription panel:
- OcrTrigger — wraps the ScriptTypeSelect dropdown, the "OCR starten" button, and the replacement confirmation dialog
- OcrProgress — owns the EventSource lifecycle, receives jobId + onDone props, renders the page progress bar

TranscriptionPanel holds the job ID state and switches between OcrTrigger, OcrProgress, and the existing edit/read views.

6. Batch progress — document detail page, not a list banner
No cross-page banner. Each document detail page calls GET /api/documents/{id}/ocr-status in the server load function. If the status is PENDING or RUNNING, the transcription panel renders OcrProgress with the job ID. The user finds batch progress where they already look for transcriptions.

7. ScriptType controller validation

No custom validator. Jackson's default behaviour throws HttpMessageNotReadableException → 400 Bad Request for unknown enum values. The service null-checks dto.scriptType() and falls back to the document's stored scriptType if not provided.

8. Annotation overlap check — createOcrAnnotation()

OCR creates many adjacent text-line annotations that would fail the existing overlap check. Solution: a separate createOcrAnnotation() method on AnnotationService that skips the overlap check. No boolean flag argument. TranscriptionService calls createOcrAnnotation() when the block source is OCR.

9. Polygon annotation sequencing
Kraken returns polygon boundaries, not rectangles. Rather than permanently storing AABB approximations, #227 (polygon annotation support — a polygon JSONB column on document_annotations) ships first. Once the DB and backend are in place, Kraken integration in this feature can output proper quadrilateral shapes from day one. createOcrAnnotation() accepts the optional polygon field from the start so no rework is needed later.

The implementation order is: #227 DB + backend → then #226. Everything above is locked in before implementation starts.
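For anyone who does need the interim rectangle, collapsing a Kraken quadrilateral to its axis-aligned bounding box is pure geometry. A sketch of that AABB fallback, not the project's actual mapping code:

```python
def polygon_to_aabb(points):
    """Reduce a polygon (list of (x, y) vertices) to an axis-aligned
    bounding box (x, y, width, height).

    This is the lossy AABB approximation the discussion avoids storing
    permanently once the polygon column from #227 exists.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

# A skewed text-line quadrilateral, in pixels, as a baseline segmenter might return:
quad = [(100, 200), (550, 210), (550, 260), (100, 250)]
aabb = polygon_to_aabb(quad)  # (100, 200, 450, 60)
```

The box is strictly larger than the quadrilateral for any skewed line, which is exactly why storing the original polygon is the better long-term shape.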
🎨 Leonie Voss — UX Design & Accessibility Discussion Summary
Eight UI decisions resolved. These cover every visual state of the OCR feature from empty panel to error recovery, with full accessibility and mobile specification.
Resolved decisions
1. Empty state & trigger placement
- OcrTrigger is the primary CTA; manual drawing is secondary text below
- ocrAvailable is checked server-side in the page load function — the trigger is not rendered at all if the OCR service is down (no flicker, no broken button)

2. ScriptTypeSelect — three options, Paraglide

Native <select> with a visible <label>. Three options only — UNKNOWN is removed from the UI:

- script_type_typewriter → "Schreibmaschine" / "Typewriter" / "Máquina de escribir"
- script_type_handwriting_latin → "Handschrift (lateinisch)" / "Handwriting (Latin)" / "Escritura manuscrita (latina)"
- script_type_handwriting_kurrent → "Handschrift (Kurrent/Sütterlin)" / "Handwriting (Kurrent/Sütterlin)" / "Escritura manuscrita (Kurrent/Sütterlin)"

The "OCR starten" button is disabled until a script type is selected. The document's stored scriptType is pre-selected when available. When the stored value is UNKNOWN, no option is pre-selected and the button stays disabled.

3. Confirmation dialog — only when blocks exist
Uses ConfirmModal from #207 (a dependency — it must ship first or in parallel). No dialog on the first OCR run. When blocks exist: dynamic block count + explicit mention of comments in the body text. "Ersetzen" uses the destructive button style.

4. In-progress block visual state

A muted border-brand-sand (#E4E2D7) left border on blocks during OCR processing signals the provisional state without reducing readability. Blocks are rendered as a read-only preview (TranscriptionReadView) beneath OcrProgress. Edit controls are not rendered while OCR runs — the architecture prevents interaction, so no explicit disabling is needed. The normal turquoise border is restored when edit mode activates on completion.

5. OcrProgress component design

- Label in the text-xs font-bold uppercase tracking-widest text-gray-400 style
- Turquoise (#A6DAD8) fill on a brand-sand (#E4E2D7) track
- Status text in text-sm text-gray-500

6. Error state
- On failure, OcrProgress transitions to an error state — red left border, failure message with the page count reached, "Erneut versuchen" button
- Retry reuses the scriptType set when OCR was triggered — no extra state needed

7. Accessibility

- role="progressbar" with aria-valuenow, aria-valuemin, aria-valuemax, aria-label
- The trigger button uses the disabled attribute (not just aria-disabled) when no script type is selected
- ConfirmModal from #207 handles role="dialog", focus trapping, focus return, and touch targets ≥ 44px
- ScriptTypeSelect: native <select> with <label for=""> — keyboard navigation and screen reader support come free
- The error message is an <h3> element so screen readers announce it when focus moves to the error state

8. Mobile layout
Full-width stacked layout at all narrow viewports. OcrProgress is full-width by nature — no mobile-specific changes needed.

Dependency note: the OCR confirmation dialog requires ConfirmModal from #207. Both issues should be tracked together for implementation ordering.

👨💻 Felix Brandt — Developer Discussion Summary (Python Microservice)
Six implementation decisions resolved, plus the Docker Compose and Dockerfile spec for the ocr-service.

Resolved decisions
1. HTTP framework — FastAPI
FastAPI + Pydantic v2 for the Python microservice. Typed request/response models keep the Python interface in sync with RestClientOcrClient on the Java side. The auto-generated /docs is free. Uvicorn as the ASGI server.

2. Model loading — eager load both at startup
Both Surya and Kraken models are loaded when the container starts. The /health endpoint returns 200 only after both are ready. Tobias's start_period: 60s + retries: 12 covers model load time. No lazy-loading — the first request is always fast, RAM is committed upfront.

3. GET /api/documents/{id}/ocr-status response shape

A new endpoint, called from the document page server load function to decide whether to render OcrProgress. status values: NONE / PENDING / RUNNING / DONE / FAILED. currentPage + totalPages give the progress bar an initial value on page load — avoids always starting at 0% mid-job.

4. Batch OCR trigger — always manual, via the Admin system page
MassImportService stays untouched — no OCR coupling. Batch OCR is triggered manually from the Admin system page. The admin controller calls OcrBatchService.startBatch(), protected by the ADMIN permission (as NullX specified).

5. Non-UPLOADED documents in batch — SKIPPED as a distinct terminal state

PLACEHOLDER documents in a batch are recorded as SKIPPED in ocr_job_documents — not counted as errors. The SSE done event and the ocr_jobs table carry a separate skipped_count alongside processed_count and error_count. The admin sees: "45 verarbeitet · 3 übersprungen · 2 fehlgeschlagen".

6. Python project structure
Docker Compose addition
Add the ocr-service to docker-compose.yml services:, add an ocr-service dependency to backend:, and add the ocr_models volume to the volumes: section.

Dockerfile + dependencies
ocr-service/Dockerfile and ocr-service/requirements.txt. Notes:

- fastapi[standard] pulls in uvicorn, pydantic v2, python-multipart
- kraken==5.2.9 — pin to 5.x; the 4.x→5.x jump changed the model format
- pypdfium2 converts PDF pages to PIL images before passing to Surya (no poppler system dependency, keeps the image lean)
- httpx for fetching the presigned MinIO URL

Kraken model selection — requires evaluation
The issue correctly flags this as unresolved. Two HTR-United candidates for 19th–20th century German Kurrent:
- german_kurrent_manu_9
- kurrent-de

Do not bake the model choice into the issue. The one-time runbook step: download both models into the ocr_models volume, run against 2–3 sample documents from the actual archive, and keep the better one at the path KRAKEN_MODEL_PATH points to. The env var abstraction in Docker Compose means the model file can be swapped without touching any code.

Implementation Complete — OCR Pipeline
Branch:
feat/issue-226-227-ocr-pipeline-polygon (implements #226 and #227 together, as agreed in the discussion)

Commits
- ec32d22
- 878a90a
- c19c41f — createOcrAnnotation() on AnnotationService (skips overlap check)
- d194b6b
- ff39907
- aea46c5
- 6737bd6
- cf8dc35
- a4651aa
- 931fbc2

What was implemented
Backend (Java/Spring Boot):
- ScriptType enum: UNKNOWN, TYPEWRITER, HANDWRITING_LATIN, HANDWRITING_KURRENT
- OcrClient + OcrHealthClient interfaces (mockable for TDD)
- OcrService: single-document OCR orchestration (health check → clear blocks → OCR → create annotations + blocks)
- OcrBatchService: batch processing with @Async, per-document status, SKIPPED for PLACEHOLDER docs, failure isolation
- OcrProgressService: SSE emitter registry per job ID
- OcrController: POST /api/documents/{id}/ocr, POST /api/ocr/batch, GET /api/ocr/jobs/{id}, GET /api/ocr/jobs/{id}/progress (SSE), GET /api/documents/{id}/ocr-status
- RestClientOcrClient: HTTP client to the Python microservice

Python microservice (ocr-service/):
- /ocr and /health endpoints

Frontend (SvelteKit):
- AnnotationShape.svelte: renders rect or polygon via CSS clip-path: polygon()
- ScriptTypeSelect, OcrTrigger, OcrProgress components

Test results
Next steps