feat: Kraken fine-tuning pipeline (block origin tracking + training export + admin UI) #230
Overview
Kraken's zero-shot Kurrent recognition produces ~70-80% character error rate on family letters. Fine-tuning on human-corrected transcriptions would dramatically improve quality. This issue tracks the full pipeline: origin tracking on blocks, training data export, and an admin UI to trigger and monitor training runs.
Depends on #226 (OCR pipeline) and #227 (polygon annotations).
Part 1: Block origin tracking ✅ (implemented)
Already committed on `feat/issue-226-227-ocr-pipeline-polygon`:

- `BlockSource` enum: `MANUAL`, `OCR`
- `source` + `reviewed` columns on `transcription_blocks`
- `OcrService` sets `source = OCR` when creating blocks
- `TranscriptionService.reviewBlock()` toggles the `reviewed` flag
- `PUT /api/documents/{id}/transcription-blocks/{blockId}/review` endpoint

Still needed for Part 1: a frontend review toggle button in `TranscriptionBlock.svelte` (checkmark icon in the block toolbar).

Part 2: Recognition training data export
Query

Single JPQL query on `TranscriptionBlockRepository` — includes both manually authored blocks and reviewed OCR blocks:

- `(source = MANUAL OR (source = OCR AND reviewed = true))`
- document `scriptType = HANDWRITING_KURRENT`

Rationale: `MANUAL` blocks are implicitly reviewed (a human authored them from scratch), so they are equal or better quality training data than corrected OCR blocks.

Service: `TrainingDataExportService`

- Downloads source PDFs via `FileService.downloadFileBytes()`
- Renders pages with `PDFRenderer` at 300 DPI
- Crops each block by its annotation's `x, y, width, height`, writes PNG + `.gt.txt` into a `ZipOutputStream`
- Streams the result as a `StreamingResponseBody`

ZIP structure: `<block-id>.png` + `<block-id>.gt.txt` pairs (Kraken's `ketos train` format).

Endpoint: `GET /api/ocr/training-data/export`. Returns 204 if no eligible blocks exist.
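For orientation, the flat pair layout `ketos train` consumes can be sketched in a few lines of Python. This is illustrative only — the real export is the Java `ZipOutputStream` code, and `build_training_zip` plus the sample data are hypothetical:

```python
import io
import uuid
import zipfile

def build_training_zip(pairs: dict) -> bytes:
    """pairs maps block-id -> (png_bytes, ground_truth_text); emits the flat
    <block-id>.png / <block-id>.gt.txt layout Kraken's ketos train expects."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for block_id, (png_bytes, text) in pairs.items():
            zf.writestr(f"{block_id}.png", png_bytes)
            # the ground-truth text file shares the image's basename
            zf.writestr(f"{block_id}.gt.txt", text)
    return buf.getvalue()

block_id = str(uuid.uuid4())
data = build_training_zip({block_id: (b"\x89PNG\r\n\x1a\n", "Lieber Vater,")})
names = sorted(zipfile.ZipFile(io.BytesIO(data)).namelist())
```

The point is the naming contract: each image and its transcription differ only in extension, at the ZIP root, with no directory prefixes.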
Dependencies

Add `org.apache.pdfbox:pdfbox:3.0.4` to `pom.xml`.

Part 3: Python recognition training endpoint
- `POST /train` on the OCR service (`main.py`)
- Calls the `ketos train` Python API with `--load` from the current model (transfer learning)
- Runs synchronously — training on 10-30 page crops takes seconds to minutes on CPU.
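If the Python API proves awkward, the same transfer-learning run can go through the CLI. A sketch of assembling that invocation — the paths and output prefix are placeholders, and the exact flag set should be checked against the installed kraken version:

```python
from pathlib import Path

def build_train_command(model_path: str, gt_dir: str, out_prefix: str = "finetuned"):
    """Assemble a `ketos train` CLI call for transfer learning: --load
    continues from the current model, -o sets the output prefix, and the
    positional args are the PNGs (each with a sibling .gt.txt)."""
    images = sorted(str(p) for p in Path(gt_dir).glob("*.png"))
    if not images:
        raise ValueError("no training images in " + gt_dir)
    return ["ketos", "train", "--load", model_path, "-o", out_prefix, *images]
```

Running this via `subprocess.run(cmd, check=True)` keeps the endpoint synchronous, matching the spec.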
Java side

- `RestClientOcrClient.trainModel(byte[] trainingDataZip)` sends the ZIP to `POST /train`
- `OcrTrainingService` orchestrates: calls `TrainingDataExportService` → sends to Python → records the result

Part 4: Training history
Entity: `OcrTrainingRun`

Info endpoint

`GET /api/ocr/training-info` returns `availableBlocks`, `totalOcrBlocks`, `availableDocuments`, `ocrServiceAvailable`, `lastRun`, and `runs`.

Part 5: Admin panel UI
Add a card to the System tab (`admin/system/+page.svelte`) following the existing pattern (same as the mass import / backfill cards).

Info display: "{reviewedCount} geprüfte Blöcke aus {docCount} Dokumenten verfügbar (von {totalOcrCount} OCR-Blöcken gesamt)".

Action: "Training starten" button → calls `POST /api/ocr/train`.

History table (last 5 runs): status as a colored badge (green = DONE, red = FAILED).
Part 6: Segmentation training (layout-only, no text required)
Kraken has two completely separate models:
- Segmentation (layout detection, default model `blla`) — trained with `ketos segtrain`
- Recognition (text) — trained with `ketos train` on image + `.gt.txt` pairs

Training the segmentation model only requires drawing bounding boxes around text lines — no Kurrent reading ability is needed. This makes it accessible to anyone and is the right fix when OCR produces hundreds of phantom line detections on a document.
Annotation workflow
Users draw boxes in the existing annotation UI, creating `MANUAL` transcription blocks with empty text (`text = ""`). The `text` field must be made nullable (or default to `""`) to support segmentation-only blocks. A visual indicator (e.g. a "Nur Segmentierung" tag) should distinguish these from full transcription blocks in the UI.
Export: `SegmentationTrainingExportService`

- `MANUAL` blocks where `text IS NULL OR text = ''`, document has `scriptType = HANDWRITING_KURRENT`
- PAGE XML format per page
Coords come from the annotation's `polygon` (4-point quad, de-normalized to pixel space). Baseline is approximated as the bottom edge of the quad.

ZIP structure: `page-{docId}-{pageNum}.png` + `page-{docId}-{pageNum}.xml` pairs.

Endpoint: `GET /api/ocr/segmentation-training-data/export`.
Python: `POST /segtrain` — `ketos segtrain --load` from the current segmentation model (transfer learning on `blla`).

Admin UI
Second card in the System tab: "Segmentierungsmodell trainieren"
- Button calls `POST /api/ocr/segtrain`
- Shares the `OcrTrainingRun` history table with a `model_name` of `blla`

Frontend: Review toggle in transcription editor
Each block in `TranscriptionBlock.svelte` gets a checkmark toggle that calls `PUT .../review`.

Open questions

- Is `text = ""` on a `MANUAL` block sufficient to distinguish segmentation-only blocks?

👨‍💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
TrainingDataExportService is doing a lot: query blocks → group by document → download PDFs → render pages with PDFBox → crop images → write ZIP. That's 5+ responsibilities in one service method. I'd want to see this decomposed into smaller, single-responsibility methods (e.g. `queryReviewedBlocks()`, `renderPageImage(document, page, dpi)`, `cropBlockImage(pageImage, annotation)`, `writeTrainingPair(zipOut, blockId, image, text)`). Each becomes independently testable.

The `TrainingDataExportService` depends on `FileService` for PDF downloads: that's correct per layering rules. But it also needs to parse annotations from blocks — is the annotation data already on the `TranscriptionBlock` entity, or does it require a separate repository call? The spec mentions "look up its annotation" without specifying the access path.

`OcrTrainingService` orchestrates export + HTTP call + record result: this feels like the right boundary — one service owns the workflow and delegates to specialized services. Good separation.

Frontend review toggle: The spec says "click calls `PUT .../review`" — is this an optimistic UI update, or does it wait for the server response? For a toggle in a transcription editor where users are working fast, an optimistic update with rollback on failure would be better UX, but the TDD test needs to cover both the success and rollback paths.

Component splitting for the admin card: The training card in `admin/system/+page.svelte` has info display + action button + history table — that's at least two visual regions (info/action area + history table). I'd expect `OcrTrainingCard.svelte` as the container with `TrainingHistory.svelte` extracted for the table.

Suggestions
Test strategy for `TrainingDataExportService`: The PDF rendering + image cropping logic is the hardest part to test. I'd want an integration test with a real small PDF fixture that verifies the exported ZIP contains correctly named `.png` + `.gt.txt` pairs. Unit tests can cover the query and grouping logic with mocked repositories.

The review toggle in `TranscriptionBlock.svelte`: Keep the toggle state as a prop flowing from the parent, not local `$state` — the parent should own the reviewed/not-reviewed state and re-derive the progress counter via `$derived`. This avoids the parent and child getting out of sync.

204 for "no reviewed blocks": This is fine for a programmatic API, but the admin UI should handle this gracefully — the button should be disabled with a hint before the user clicks, not after they get a 204. The spec already describes this ("If 0 reviewed: disabled button + hint") — just confirming the frontend should check `availableBlocks` from the info endpoint, not rely on the export endpoint returning 204.

Open question — sync vs async training: For 10-30 page crops taking seconds to minutes, synchronous is fine now. But the `OcrTrainingRun` entity already has `RUNNING`/`DONE`/`FAILED` status, which implies async. I'd suggest: implement synchronous first (KISS), but keep the status model — it's cheap and makes the eventual async migration trivial. The TDD tests should assert on the status transitions regardless.

🏗️ Markus Keller — Application Architect
Questions & Observations
New dependency: PDFBox 3.0.4: This is a significant addition — PDFBox pulls in ~5 transitive JARs (fontbox, commons-logging, etc.). It's the right tool for server-side PDF rendering, but be aware it's memory-hungry when rendering at 300 DPI. For a single admin-triggered export this is fine; if it ever becomes a frequent operation, you'd want to cap concurrent renders. For now: KISS, add it, move on.
Domain boundary question — where does `TrainingDataExportService` live? It crosses three domains: transcription blocks, documents/files, and OCR. I'd place it in the OCR package since its purpose is OCR model training. It should depend on `TranscriptionBlockRepository` (or a `TranscriptionService` method) and `FileService` — never reach into `DocumentRepository` directly.

`OcrTrainingRun` entity — is this a new domain or part of OCR? It tracks training runs, which is OCR-specific. Keep it in the OCR package alongside `OcrTrainingService`. Don't create a separate `training` package — that's premature given there's only one entity and one service.

The Python OCR service now has two responsibilities: inference (`POST /ocr`) and training (`POST /train`). This is fine for now — it's one process, one model, and training needs access to the loaded model for transfer learning. But document in a comment that training mutates the in-process model state. If the service ever gets replicated, this becomes a consistency problem.

`StreamingResponseBody` for the ZIP export: Good choice — avoids buffering the entire ZIP in memory. But the controller must not annotate the method with `@Transactional`, or the DB connection stays open for the entire streaming duration. Make sure the query executes and collects results before entering the streaming phase.

Suggestions
Consider a dedicated JPQL query method on `TranscriptionBlockRepository` rather than filtering in Java — e.g. a method whose `WHERE` clause encodes `(source = MANUAL OR (source = OCR AND reviewed = true))` plus the `HANDWRITING_KURRENT` script type. Push the filtering down to the database. This is cleaner and more efficient than loading all OCR blocks and filtering in memory.
The info endpoint (`GET /api/ocr/training-info`) aggregates multiple queries: `availableBlocks`, `totalOcrBlocks`, `availableDocuments`, `ocrServiceAvailable`, `lastRun`, `runs`. Consider whether these are all needed on every page load or if some can be lazy-loaded. For an admin panel that's rarely visited, a single endpoint is fine — don't over-optimize.

`triggered_by UUID` in `ocr_training_runs`: Add a foreign key to `app_users` with `ON DELETE SET NULL`. If a user is deleted, the training history should survive but the reference should null out gracefully.

Sync vs async: Start synchronous. The `OcrTrainingRun` status model already supports async if needed later. The cost of adding `@Async` + a status polling endpoint later is minimal. The cost of building async infrastructure now for a feature that processes 10-30 images is wasted time.

🧪 Sara Holt — QA Engineer & Test Strategist
Questions & Observations
TrainingDataExportService is the testing crux of this issue. It involves PDF rendering, image cropping, and ZIP assembly — all I/O-heavy operations that are hard to mock meaningfully. I need clarity on the test strategy.
The Python `POST /train` endpoint is a black box from the Java side. How do we test `OcrTrainingService.trainModel()` in CI without a running Python service? Mocking the `OcrClient` interface at the unit test level is fine for testing orchestration logic — I'd go with that for CI, plus a manual/nightly check against the real Python service.
Review toggle endpoint already has 5 tests — good. But I want to verify: do the existing tests cover the case where a non-OCR block (`source = MANUAL`) is toggled? The spec doesn't say whether manual blocks can be reviewed, but the review endpoint presumably doesn't check `source`. Should it?
Admin UI testing: The training card has conditional states (0 blocks → disabled, OCR unavailable → disabled, happy path → enabled). Each state needs a test. The history table with status badges (DONE/FAILED) needs visual verification.
Suggestions
Test plan by layer:

- Unit: `TrainingDataExportService` query logic, `OcrTrainingService` orchestration (mock `OcrClient`), ZIP structure validation
- Integration: exported ZIP contains correctly named `.png` + `.gt.txt` pairs; `OcrTrainingRun` persistence and status transitions; info endpoint aggregation correctness

PDF fixture file: Create a minimal 1-page PDF with known dimensions in `src/test/resources/fixtures/`. This avoids depending on real uploaded documents in tests and makes the test deterministic.

Edge case checklist:
🔒 Nora "NullX" Steiner — Application Security Engineer
Questions & Observations
`GET /api/ocr/training-data/export` returns a ZIP containing cropped document images + transcription text. This endpoint is `@RequirePermission(Permission.ADMIN)` — good. But the ZIP contains actual family document content, so verify it is handled with the same sensitivity as other document downloads.

ZIP Slip risk in the Python `POST /train` endpoint: When the Python service extracts the uploaded ZIP to a temp directory, it must validate that extracted file paths don't escape the temp directory. This is a classic ZIP Slip vulnerability (CWE-22). The Java side constructs the ZIP so it should be safe, but the Python side should still validate defensively.

`POST /train` on the Python service — is it authenticated? The issue doesn't mention authentication between Spring Boot and the Python OCR service. If the OCR service is on the Docker network only (no host port exposure), network isolation provides some protection. But if the port is exposed (even accidentally), anyone could trigger training or upload malicious training data. At minimum, use a shared secret/API key header.

Training data poisoning: A malicious admin (or compromised admin account) could mark adversarial blocks as "reviewed" to poison the training data. This is a low-likelihood but high-impact risk for an ML pipeline. Consider: should there be a minimum threshold of reviewed blocks before training is allowed? The spec mentions this implicitly (button disabled when 0 blocks) but doesn't set a meaningful minimum.
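The defensive check suggested above is short. A sketch, assuming Python 3.9+ for `Path.is_relative_to`; `safe_extract` is an illustrative name, not code from the repo:

```python
import zipfile
from pathlib import Path

def safe_extract(zip_file, dest_dir):
    """Reject ZIP entries that would land outside dest_dir (ZIP Slip,
    CWE-22): absolute paths and ../ traversal both fail the check."""
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(zip_file) as zf:
        for info in zf.infolist():
            target = (dest / info.filename).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe ZIP entry: {info.filename}")
        zf.extractall(dest)
```

Using a `tempfile.TemporaryDirectory()` as `dest_dir` additionally guarantees cleanup whether training succeeds or raises.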
`triggered_by UUID` stores who triggered training: Good for audit. Make sure this is populated from the authenticated session, not from a request parameter that could be spoofed.

Suggestions
Sanitize block IDs in the ZIP: The ZIP entries use `<block-id>.png` and `<block-id>.gt.txt`. Block IDs are UUIDs (safe characters), but still validate that the ID format matches a UUID before using it as a filename — defense in depth.

Rate-limit the training endpoint: Training is CPU-intensive. Without rate limiting, a compromised admin session could trigger repeated training runs as a denial-of-service against the OCR service. Even a simple "reject if a run is already in RUNNING status" check (which the `OcrTrainingRun` entity supports) would suffice.

The export ZIP should not include the full file path or document metadata beyond what's needed for training (block ID + image + text). The current spec looks clean on this — just confirming it stays that way during implementation.
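The UUID check suggested above is a one-liner with the standard library (`zip_entry_names` is a hypothetical helper, not code from the repo):

```python
import uuid

def zip_entry_names(block_id: str):
    """Parse the id as a UUID before using it in a filename; uuid.UUID
    raises ValueError on anything that isn't a well-formed UUID."""
    canonical = str(uuid.UUID(block_id))
    return f"{canonical}.png", f"{canonical}.gt.txt"
```

Round-tripping through `uuid.UUID` both validates and canonicalizes, so no character from the input ever reaches the filename unchecked.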
Python temp directory cleanup: After training completes (success or failure), the extracted ZIP contents in the temp directory must be cleaned up. Use a `try/finally` or Python's `tempfile.TemporaryDirectory` context manager to ensure cleanup even on exceptions.

🎨 Leonie Voss — UI/UX Design Lead
Questions & Observations
Review toggle in `TranscriptionBlock.svelte`: The spec says "outline checkmark, muted" → "filled checkmark, brand-mint" on toggle. This is good, but color alone cannot be the only cue (WCAG 1.4.1). The outline vs. filled visual difference helps, but I'd also recommend an `aria-label` that changes: "Als geprüft markieren" ↔ "Prüfung aufheben".

Progress indicator "12 von 17 geprüft": Where exactly in the panel header? If the transcription editor has a toolbar, this should be in the toolbar — not in the page header. It's contextual to the current document's blocks, so it belongs visually close to the blocks.
Admin training card: Following the existing card pattern is correct. But the info text "{reviewedCount} geprüfte Blöcke aus {docCount} Dokumenten verfügbar (von {totalOcrCount} OCR-Blöcken gesamt)" is dense for a single line. Consider splitting it across two lines so the actionable number (23 reviewed) stands out — that creates visual hierarchy.
History table status badges: Green for DONE, red for FAILED — fine, but pair with an icon (checkmark for DONE, X for FAILED) for colorblind users. Also consider the RUNNING state — the spec mentions it in the entity but not in the table. If training becomes async later, you'll need a spinner or animated indicator for RUNNING.
Suggestions
Disabled button states need clear communication: "OCR-Dienst nicht erreichbar" and "Keine geprüften Blöcke vorhanden" — these hints should be visible without hovering. Use a small text line below the disabled button, not a tooltip, because disabled elements often don't fire hover events and tooltips don't exist at all on touch devices.
The review toggle should have a focus style: When navigating by keyboard through transcription blocks, the checkmark toggle needs a visible focus ring. Use `focus-visible:ring-2 focus-visible:ring-brand-navy`, consistent with other interactive elements.

Consider the flow from admin panel to transcription editor: When the admin sees "0 reviewed blocks" with a hint to mark blocks in the transcription view — is there a direct link to a document with unreviewed OCR blocks? A "Dokumente mit OCR-Blöcken anzeigen" link would reduce friction significantly.
⚙️ Tobias Wendt — DevOps & Platform Engineer
Questions & Observations
PDFBox at 300 DPI is memory-hungry: Rendering a single A4 page at 300 DPI produces a ~35 MB uncompressed image in memory (2480 × 3508 × 4 bytes ARGB ≈ 34.8 MB). If you're processing 10-30 blocks across multiple documents, peak memory could spike significantly. The current Spring Boot container likely has a default heap of 256-512 MB. You may need to:

- raise the heap limit in `docker-compose.yml`
- close each `PDDocument` promptly after rendering
- watch `/actuator/metrics/jvm.memory.used` during a test run

The Python OCR service now does training on CPU: "seconds to minutes" for 10-30 crops. That's fine, but during training the OCR service is busy — what happens to incoming OCR inference requests? If `POST /train` is synchronous and blocks the Python process (likely a single-worker FastAPI/Flask), all OCR requests will queue or time out while training runs.

Model file replacement during training: "backs up old model, replaces with fine-tuned model, reloads in-process". This is a file system operation inside the container. If the container restarts between backup and reload, which model loads? The spec should define the swap order and which file is canonical after a restart.
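One crash-safe ordering (a sketch; `swap_model` is an illustrative name): write the new model fully to a temp file, copy — don't rename — the old model to `.bak`, then atomically rename the temp file over the canonical path, so a restart always finds a complete model there:

```python
import os
import shutil
from pathlib import Path

def swap_model(model_path: str, new_model_bytes: bytes) -> None:
    model = Path(model_path)
    tmp = model.with_name(model.name + ".tmp")
    tmp.write_bytes(new_model_bytes)  # 1. write new model fully, off to the side
    if model.exists():
        # 2. backup by copy, so the canonical path is never empty
        shutil.copy2(model, model.with_name(model.name + ".bak"))
    os.replace(tmp, model)  # 3. atomic rename: old or new, never half-written
```

A crash before step 3 leaves the old model in place; a crash after it leaves the new one — either way the canonical path is loadable.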
The `ocr_training_runs` table needs a Flyway migration: The spec includes the SQL. Make sure this is the next sequential migration number after whatever V26 added for the block source/reviewed columns.

Suggestions
Named volume for OCR models: The model file should live on a named Docker volume, not baked into the image or on the container filesystem. This ensures the fine-tuned model survives container rebuilds and restarts.
Health check awareness: If the OCR service is training and temporarily unavailable for inference, the `ocrServiceAvailable` flag in the info endpoint should reflect this. The admin UI already plans to show availability — make sure the health check endpoint on the Python side returns a meaningful status during training (e.g. `{"status": "training", "inference_available": false}`).

Container resource limits: Add explicit memory limits to the backend container in docker-compose for the PDF rendering workload, and similarly for the OCR service during training.
Log the training run: Both the Java side (`OcrTrainingService`) and the Python side should log at INFO level: training started, block count, duration, success/failure. This is essential for debugging when a training run fails silently. The `OcrTrainingRun` entity stores this, but structured logs make it searchable in Loki without querying the database.

Concurrent training guard: Add a check in `OcrTrainingService`: if a run with status `RUNNING` already exists, reject the new request with a 409 Conflict. The spec doesn't mention this explicitly, but it's a necessary safeguard — two concurrent training runs would corrupt the model file.

🗂️ Implementation Plan — Felix Brandt
After reading the issue, all comments, and exploring the codebase, here is the full implementation plan.
Key finding: All of Part 1's backend + frontend work is already committed on `feat/issue-226-227-ocr-pipeline-polygon` (V26 migration, `BlockSource` enum, `reviewBlock()` service/controller, `TranscriptionBlock.svelte` toggle, `TranscriptionEditView` wiring, document page `reviewToggle()`). The only remaining Part 1 gap is the "X von Y geprüft" progress counter in the panel.

Part 1 — Remaining progress counter
- [frontend] Add `reviewedCount` and `totalCount` derived counters to `TranscriptionEditView.svelte` and render "X von Y geprüft" in the panel's top bar — `$derived` from the `blocks` prop, no new state

Part 2 — Recognition training data export
- [backend] Add `org.apache.pdfbox:pdfbox:3.0.4` to `backend/pom.xml`
- [backend] Add `findEligibleKurrentBlocks()` JPQL query to `TranscriptionBlockRepository`: `(source = MANUAL OR (source = OCR AND reviewed = true))` AND `document.scriptType = HANDWRITING_KURRENT`
- [backend] `TrainingDataExportService` with decomposed single-responsibility methods:
  - `queryEligibleBlocks()` → repository
  - `renderPageImage(PDDocument, int pageIdx)` → PDFBox 300 DPI, `BufferedImage`
  - `cropBlockImage(BufferedImage, DocumentAnnotation)` → de-normalize coords × image dims, crop
  - `writeTrainingPair(ZipOutputStream, UUID, BufferedImage, String)` → `<id>.png` + `<id>.gt.txt`
  - `exportToZip(StreamingResponseBody)` → outer orchestrator; query results collected before entering the lambda (Markus: no open DB txn during streaming); each `PDDocument` released after processing (Tobias: avoid memory spike); block IDs validated as UUIDs before use as filenames (NullX: defense in depth)
- [backend] `GET /api/ocr/training-data/export` in `OcrController` — `@RequirePermission(ADMIN)`, `StreamingResponseBody` response, 204 if no eligible blocks
- [test] Integration test with a minimal 1-page PDF fixture in `src/test/resources/fixtures/`; unit tests for query eligibility logic and the 204 path

Part 3 — Python recognition training endpoint
- [python] `POST /train` in `ocr-service/main.py`:
  - Accepts the ZIP as `UploadFile`; ZIP Slip validation on all entries (NullX)
  - Extracts into `tempfile.TemporaryDirectory()` (auto-cleanup on success or failure)
  - Calls the `ketos train` Python API with `--load` from `KRAKEN_MODEL_PATH` (transfer learning)
  - Backs up the old model (`german_kurrent.mlmodel.bak`), replaces it, reloads the engine in-process
  - Returns metrics `{"loss": ..., "accuracy": ..., "epochs": ...}`
- [backend] Add `trainModel(byte[] trainingDataZip)` to the `OcrClient` interface
- [backend] Implement `trainModel()` in `RestClientOcrClient` — POST to `/train` with multipart ZIP, 10-minute timeout

Part 4 — Training history
- [migration] `V29__add_ocr_training_runs.sql`
- [backend] `OcrTrainingRun` entity in `model/` — Lombok `@Data @Builder @NoArgsConstructor @AllArgsConstructor`
- [backend] `OcrTrainingRunRepository` — `findFirstByStatusOrderByCreatedAtDesc()` (concurrent-run guard) + `findTop5ByOrderByCreatedAtDesc()` (info endpoint)
- [backend] `OcrTrainingService.triggerTraining(UUID triggeredBy)`:
  - `triggered_by` from the session, never the request body (NullX)
  - Calls `TrainingDataExportService` → collects ZIP bytes
  - Calls `ocrClient.trainModel(zipBytes)`
- [backend] Add `ErrorCode.TRAINING_ALREADY_RUNNING` + mirror in `errors.ts` + Paraglide keys (de/en/es)
- [backend] Two new endpoints in `OcrController`:
  - `POST /api/ocr/train` — `@RequirePermission(ADMIN)` → `triggerTraining(userId)`
  - `GET /api/ocr/training-info` — `@RequirePermission(ADMIN)` → `TrainingInfoResponse` (availableBlocks, totalOcrBlocks, availableDocuments, ocrServiceAvailable, lastRun, runs)
- [test] `OcrTrainingService` tests: concurrent guard (409), happy path (DONE + completedAt), failure path (FAILED + errorMessage) — mock `OcrClient` and `TrainingDataExportService`

Part 5 — Admin panel UI
- [frontend] Regenerate API types after the backend is built
- [frontend] `TrainingHistory.svelte` — table with Datum | Status | Blöcke | Dokumente | Gestartet von; status badge: green + ✓ = DONE, red + × = FAILED, spinner = RUNNING (icon + color, Leonie: colorblind-safe); keyed `{#each runs as run (run.id)}`
- [frontend] `OcrTrainingCard.svelte`:
  - "Training starten" button → `POST /api/ocr/train` → reload info
  - Embeds `TrainingHistory.svelte`; `focus-visible:ring-2 focus-visible:ring-brand-navy` on all interactive elements
- [frontend] Wire `OcrTrainingCard` into `admin/system/+page.svelte` — fetch info via `$effect`, pass to card
- [test] Vitest tests for `OcrTrainingCard`: disabled (0 blocks), disabled (OCR unavailable), enabled (happy path)

Part 6 — Segmentation training
- [migration] `V30__make_transcription_block_text_nullable.sql` — `ALTER COLUMN text DROP NOT NULL; SET DEFAULT ''`
- [backend] Update `TranscriptionBlock.text` — remove `nullable = false`; `sanitizeText()` maps null → `""`
- [backend] Add `findSegmentationBlocks()` JPQL: `MANUAL` blocks where `text IS NULL OR text = ''`, `HANDWRITING_KURRENT` documents
- [backend] `SegmentationTrainingExportService` — PAGE XML export: `querySegmentationBlocks()`, `renderFullPage()`, `buildPageXml()` (de-normalize polygon to pixel space, baseline = bottom edge of quad), `exportToZip()` producing `page-{docId}-{pageNum}.png` + `page-{docId}-{pageNum}.xml`
- [backend] `GET /api/ocr/segmentation-training-data/export` — same streaming + 204 pattern
- [python] `POST /segtrain` — same ZIP Slip + `TemporaryDirectory` pattern; calls `ketos segtrain --load blla`; backs up + replaces + reloads the segmentation model; returns metrics
- [backend] Add `segtrainModel(byte[] zip)` to the `OcrClient` interface + implement in `RestClientOcrClient`
- [backend] Extend `OcrTrainingService` + info endpoint for segmentation block counts; `model_name = "blla"` in the run record
- [frontend] "Nur Segmentierung" badge in `TranscriptionBlock.svelte` — small muted tag when `text === '' && source === 'MANUAL'`
- [frontend] `SegmentationTrainingCard.svelte` — second admin card; "Segmentierung trainieren" button → `POST /api/ocr/segtrain`; shares `TrainingHistory.svelte` filtered to `model_name = 'blla'`
- [frontend] Wire `SegmentationTrainingCard` into `admin/system/+page.svelte` below the recognition card

Files touched
Backend: `pom.xml`, `TrainingDataExportService`, `SegmentationTrainingExportService`, `OcrTrainingService`, `OcrClient`, `RestClientOcrClient`, `OcrController`, `OcrTrainingRun`, `TranscriptionBlockRepository`, `OcrTrainingRunRepository`, `ErrorCode`, `TranscriptionBlock`, `V29` + `V30` migrations

Python: `ocr-service/main.py` — `POST /train`, `POST /segtrain`

Frontend: `TranscriptionEditView.svelte`, `TranscriptionBlock.svelte`, `TrainingHistory.svelte` (new), `OcrTrainingCard.svelte` (new), `SegmentationTrainingCard.svelte` (new), `admin/system/+page.svelte`, `errors.ts`, Paraglide messages (de/en/es), regenerated `api.ts`

👨‍💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
JPQL join on `annotationId` needs care: The plan's `findEligibleKurrentBlocks()` query does `JOIN DocumentAnnotation a ON a.id = b.annotationId`. That's a non-FK join (no `@ManyToOne` between `TranscriptionBlock` and `DocumentAnnotation`). JPQL handles this with `JOIN … ON`, but I'd want to verify the query compiles and doesn't produce a Cartesian product before writing tests against it. Consider wrapping the result in a projection that includes the annotation data in one shot, rather than doing N+1 lookups per block in `TrainingDataExportService`.

`TrainingDataExportService.exportToZip()` collects all block + annotation + document data before the lambda — right call, but how? The plan says "query results collected before entering StreamingResponseBody lambda." That means the method returns a `StreamingResponseBody` that captures the already-queried data as a local variable. Make sure the query returns everything needed for rendering (block, annotation coords, document fileHash/s3Key) in one go — not lazy-loaded associations that will fail once the Hibernate session closes.

`sanitizeText()` null handling for Part 6: The existing `sanitizeText()` in `TranscriptionService` returns `""` for null. But `TranscriptionBlock.text` is currently annotated `@Column(nullable = false)`. When we make it nullable in V30, we need to ensure `createOcrBlock()` and `createBlock()` in `TranscriptionService` still store `""`, not `null`, for OCR blocks — only segmentation-only manual blocks should store `null`. Worth making this intent explicit with a named constant or a factory method distinction.

Python `ketos train` API: The Kraken Python API for training uses a callback-based or iterator-based interface. The plan says "calls the `ketos train` Python API", but the synchronous call pattern needs to be verified — `ketos` training is not a simple function call; it typically involves iterating an LMDB or ground-truth loader. The implementation needs to handle the training loop correctly (even if just calling the CLI via subprocess if the Python API proves complex to integrate directly).

`OcrTrainingCard.svelte` `$effect` on mount: The plan says "fetch info via `$effect`". Since the admin system page has no `+page.server.ts`, this client-side fetch is correct, but the initial state while loading needs to be handled — the button must not appear enabled while `trainingInfo` is null/undefined. Use a loading state guard.

Suggestions
One integration test per export pipeline: Write `TrainingDataExportServiceIntegrationTest` that uses PDFBox itself to generate the fixture PDF (not a static file), asserting that: (1) the ZIP contains exactly N entries where N = 2 × eligibleBlocks, (2) each `.gt.txt` has the expected text content, (3) each `.png` is a valid image with non-zero dimensions.

`OcrTrainingRun` status as an enum, not a String: `status VARCHAR(20)` on the DB side maps cleanly to a Java enum `TrainingStatus { RUNNING, DONE, FAILED }`. Then `findFirstByStatus(TrainingStatus.RUNNING)` is typesafe and refactor-friendly. The migration stores the string; the entity maps it as `@Enumerated(EnumType.STRING)`.

Segmentation "Nur Segmentierung" badge check: `text === ''` is not the same as `text === null`. When the frontend receives a block from the API, the JSON serialization of a Java `null` String is `null` in JSON, but `""` if we default to `""`. Verify which value the API returns for segmentation-only blocks, and guard both: `(!block.text || block.text.trim() === '') && block.source === 'MANUAL'`.

🏗️ Markus Keller — Application Architect
Questions & Observations
Concurrent training race condition in `OcrTrainingService`: The plan rejects a second trigger if a RUNNING run exists, using `findFirstByStatusOrderByCreatedAtDesc()`. This is a read-then-write pattern — two simultaneous requests can both read "no RUNNING run" and both proceed to create one. The `ocr_training_runs` table should have a partial unique index (e.g. `CREATE UNIQUE INDEX ux_ocr_training_runs_running ON ocr_training_runs (status) WHERE status = 'RUNNING'`) to enforce the single-active-run invariant at the database layer.

The application check stays as a user-friendly 409. The DB constraint is the safety net.
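The invariant is easy to demonstrate: SQLite supports the same partial-unique-index shape as PostgreSQL, so a self-contained sketch (index and table names illustrative; the real migration targets Postgres):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ocr_training_runs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
# at most one row may have status = 'RUNNING'; DONE/FAILED rows are unconstrained
con.execute(
    "CREATE UNIQUE INDEX ux_one_running ON ocr_training_runs (status) "
    "WHERE status = 'RUNNING'"
)
con.execute("INSERT INTO ocr_training_runs (status) VALUES ('RUNNING')")
con.execute("INSERT INTO ocr_training_runs (status) VALUES ('DONE')")
con.execute("INSERT INTO ocr_training_runs (status) VALUES ('FAILED')")
try:
    con.execute("INSERT INTO ocr_training_runs (status) VALUES ('RUNNING')")
    second_running_rejected = False
except sqlite3.IntegrityError:
    second_running_rejected = True  # the race loser fails at the DB, not the app
```

Whichever of two racing requests commits second hits the constraint, regardless of what the application-level check saw.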
`TrainingDataExportService` domain boundary: This service touches three areas — transcription blocks, document/file storage, and annotations. That's fine as long as it goes through `FileService` for S3 (not the S3 client directly) and through the `TranscriptionBlockRepository` (not `DocumentRepository`). The annotation access via `AnnotationRepository` is the grey area: does `TranscriptionService` already provide a method to fetch blocks with their annotations? If so, delegate to it rather than having `TrainingDataExportService` reach directly into `AnnotationRepository`.

`SegmentationTrainingExportService` shares 80% of its logic with `TrainingDataExportService`: Both render PDF pages at 300 DPI, both download files via `FileService`, both write PNGs to a ZIP. Before creating two separate services, consider whether the shared rendering logic should live in a `PdfPageRenderer` utility class that both services use. KISS applies — only extract it if the duplication is actually painful to maintain.

`GET /api/ocr/training-info` is called on every page load of the admin system tab: This aggregates 5+ queries and an HTTP health check to the Python service. The health check alone adds latency and introduces a hard dependency on OCR service availability just to display the page. Consider: (a) make `ocrServiceAvailable` a separate lazy fetch, or (b) cache the health status with a short TTL (30–60 s) in `OcrHealthClient` to avoid hitting the Python service on every admin page open.

PAGE XML generation for segmentation training: The spec describes producing PAGE XML with polygon coordinates. The `DocumentAnnotation.polygon` is a JSONB field containing a 4-point quad. Verify the coordinate system: are these normalized (0.0–1.0) or pixel-absolute? The plan de-normalizes them to pixel space, which is correct — but the `imageWidth` and `imageHeight` in the PAGE XML must match the rendered image dimensions exactly (2480 × 3508 for A4 at 300 DPI), or Kraken's segtrain will reject them. This detail needs a unit test that verifies coordinate de-normalization is consistent with the rendered image dimensions.

Suggestions
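That unit-testable de-normalization, including the bottom-edge baseline, can be sketched as follows. Helper names are hypothetical, and the surrounding PcGts/Page wrapper (which carries the `imageWidth`/`imageHeight` attributes) is omitted:

```python
def to_pixels(quad, img_w, img_h):
    """De-normalize a 4-point quad of (0..1) floats to integer pixel coords."""
    return [(round(x * img_w), round(y * img_h)) for x, y in quad]

def page_xml_line(quad, img_w, img_h):
    """Coords/Baseline fragment for one text line; the baseline is the
    bottom edge of the quad, per the export spec."""
    pts = to_pixels(quad, img_w, img_h)
    coords = " ".join(f"{x},{y}" for x, y in pts)
    bottom = sorted(sorted(pts, key=lambda p: p[1])[-2:])  # two lowest points, left to right
    baseline = " ".join(f"{x},{y}" for x, y in bottom)
    return f'<Coords points="{coords}"/><Baseline points="{baseline}"/>'

# A4 at 300 DPI — must match the rendered PNG exactly, or segtrain rejects it
W, H = 2480, 3508
quad = [(0.1, 0.2), (0.5, 0.2), (0.5, 0.25), (0.1, 0.25)]
fragment = page_xml_line(quad, W, H)
```

The unit test Markus asks for reduces to: render the page, read the actual image dimensions, and assert `to_pixels` was called with those same dimensions.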
`OcrTrainingRun` completedAt vs. duration: Storing `created_at` + `completed_at` lets you compute duration, which is useful in the admin UI. But consider also storing `block_count` and `document_count` on the run at creation time (before training starts), not just on completion — this way you can display "training 23 blocks from 8 documents…" while RUNNING, not just retrospectively.

Error recovery: If the Spring Boot process crashes during a training run (OOM during PDFBox rendering, for example), the run stays in RUNNING forever. Add a startup recovery step in `OcrTrainingService`: on application start, find any RUNNING runs older than 1 hour and mark them FAILED with `error_message = "Abgebrochen: Dienst wurde neugestartet"`. This prevents the admin from being permanently locked out of triggering new runs.

🧪 Sara Holt — QA Engineer & Test Strategist
Questions & Observations
**PDF fixture strategy:** The plan mentions a "minimal 1-page PDF fixture" in `src/test/resources/fixtures/`. A static binary file in git is fine for a small fixture, but consider generating it programmatically with PDFBox in a `@BeforeAll` method instead — it avoids committing a binary and lets you control page dimensions precisely (exact A4 at 300 DPI = 2480 × 3508 px). Either way, the fixture must have at least one text region so a crop produces a non-blank PNG.

**Integration test data setup:** `TrainingDataExportServiceIntegrationTest` needs a full chain in the DB: Document (`scriptType=HANDWRITING_KURRENT`, with a real `s3Key`) → DocumentAnnotation (with `x`/`y`/`width`/`height`) → TranscriptionBlock (`source=MANUAL` or `source=OCR` + `reviewed=true`). That's 3 entities with valid FK relationships. Use `@Sql` fixture files or `@BeforeEach` builders — not inline `save()` chains inside the test body.

**The `findEligibleKurrentBlocks()` JPQL query needs explicit negative tests:**
- `source=OCR`, `reviewed=false` → NOT included
- `source=MANUAL` → included (regardless of `reviewed`)
- `source=OCR`, `reviewed=true` → included
- `scriptType=TYPEWRITER` → NOT included

All four cases should be in a `@DataJpaTest` against a real Postgres container.

**Concurrent training guard test:** Testing the 409 requires two near-simultaneous HTTP requests, which is hard to do deterministically in a unit test. A simpler approach: write a unit test that calls `triggerTraining()` twice on a mock that returns "no running run" on the first call and a running run on the second, and verify the second call throws `DomainException.conflict()`. The DB-level partial unique index (if added) would be tested by the integration test inserting two RUNNING rows and expecting a constraint violation.

**Python test coverage:** The `POST /train` and `POST /segtrain` endpoints in `main.py` have no test plan in the current spec. At minimum, pytest tests should cover: (1) ZIP Slip rejection (a ZIP entry with a `../../etc/passwd` path), (2) a valid ZIP is extracted to the correct temp dir, (3) cleanup happens even on exception. These don't require a real Kraken model — mock `ketos.train` at the module level with `unittest.mock.patch`.

**Missing edge case — PDF deleted from S3:** If `FileService.downloadFileBytes(s3Key)` throws because the file no longer exists in MinIO, `TrainingDataExportService` should skip that document gracefully (log + continue) rather than aborting the entire ZIP export. The test: one document with a valid block but a missing S3 file → the ZIP still contains blocks from the other documents.

**`V30` migration rollback path:** Making `text` nullable is easy to roll forward. Rolling back (setting NOT NULL again) would fail if any null rows exist. Verify the V30 migration includes a comment noting this is intentional and irreversible without a data cleanup step.
Test name convention reminder: The test names in the plan are described as behaviors, which is right. Make sure they follow the project pattern:
should_notIncludeUnreviewedOcrBlocks_whenQueryingEligibleKurrentBlocksor justshould return only eligible Kurrent blocks(sentence style). AvoidtestFindEligibleKurrentBlocks.Vitest test isolation for
OcrTrainingCard: When testing the disabled states, mockfetchat the Vitest level usingvi.stubGlobal('fetch', ...). The component's$effectfetch means the test mustawait tick()after render to let the async update propagate before asserting on the button state.🔒 Nora "NullX" Steiner — Application Security Engineer
Questions & Observations
**Python `/train` and `/segtrain` — network-level authentication gap:** The plan correctly applies `@RequirePermission(ADMIN)` on the Spring Boot side. But the Python service itself has no authentication — it accepts any POST to `/train` or `/segtrain`. The current compose exposes the OCR service only internally (`expose: 8000`, not `ports:`), so Docker network isolation is the only guard. This is acceptable for now, but document it explicitly: if `ocr-service` is ever given a host port binding (e.g. for debugging), anyone with network access can trigger training. A pre-shared secret header (`X-Training-Token`) would close this with minimal complexity: Spring Boot sends the header; the Python service checks it. Zero user friction, closes the gap.
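A minimal sketch of such a check (stdlib Python only; `check_training_token` is an illustrative name, not the actual service code — the header name is the one suggested above):

```python
import hmac
import os

# Sketch: pre-shared token gate for /train and /segtrain. An empty or unset
# TRAINING_TOKEN disables the check (dev-friendly default). The function and
# header handling are illustrative, not the actual service code.
def check_training_token(headers: dict) -> bool:
    expected = os.environ.get("TRAINING_TOKEN", "")
    if not expected:
        return True  # no token configured -> check disabled
    supplied = headers.get("X-Training-Token", "")
    # constant-time comparison avoids leaking the token via response timing
    return hmac.compare_digest(supplied, expected)
```

In FastAPI this would sit in a dependency that returns 403 on failure; on the Java side, `RestClientOcrClient` would attach the same header to every training call.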
**ZIP Slip — the fix is correct, but needs one more check:** The plan's validation checks that the extracted path doesn't escape `temp_dir`. Good. But also validate that ZIP entries do not contain absolute paths (entries starting with `/` or `C:\`). Both checks together cover the full CWE-22 surface.
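Both checks combined can be sketched like this (stdlib Python; `safe_member_path` is an illustrative name, not the actual service code):

```python
from pathlib import Path

# Sketch: full CWE-22 validation for ZIP entries -- rejects absolute paths
# AND traversal out of temp_dir. Illustrative, not the actual service code.
def safe_member_path(temp_dir: Path, member_name: str) -> Path:
    # check 1: absolute paths ('/etc/passwd') and Windows drive paths ('C:\...')
    if member_name.startswith(("/", "\\")) or (
        len(member_name) > 1 and member_name[1] == ":"
    ):
        raise ValueError(f"absolute path in ZIP entry: {member_name}")
    # check 2: resolved target must stay inside temp_dir ('../..' traversal)
    target = (temp_dir / member_name).resolve()
    if not target.is_relative_to(temp_dir.resolve()):
        raise ValueError(f"path traversal in ZIP entry: {member_name}")
    return target
```

Extraction then writes only to paths this function returns, never to `ZipInfo.filename` directly.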
**Training data as an indirect attack surface:** Cropped document images and their `.gt.txt` transcriptions are written to a temp dir during training, then (per the plan) cleaned up. But if the model file replacement step fails after the temp dir is created, are the training images cleaned up? The `tempfile.TemporaryDirectory()` context manager handles this correctly — as long as it wraps the entire operation including the model replacement, not just the extraction.

**`triggered_by` population:** The plan correctly specifies this comes from the authenticated session. Just confirming: the implementation should use `authentication.getName()` → `userService.findByUsername()` → `user.getId()`, same as `requireUserId()` in `TranscriptionBlockController`. Do not add a `triggered_by` field to any request body or DTO.
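On the temp-dir cleanup point above, the context-manager scope can be sketched as follows (`extract`, `train`, and `replace_model` are placeholders, not the real service API):

```python
import tempfile
from pathlib import Path

# Sketch: the TemporaryDirectory context spans extraction, training AND model
# replacement, so cropped images and .gt.txt files are removed even when the
# replacement step raises. The three callables are placeholders.
def run_training(zip_bytes, extract, train, replace_model):
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        extract(zip_bytes, workdir)       # unpack PNG + .gt.txt pairs
        new_model = train(workdir)        # e.g. ketos training on the crops
        replace_model(new_model)          # still inside the context!
    # workdir is gone here, whether or not replace_model raised
```

The failure mode to avoid is closing the `with` block after extraction only — then an exception in `replace_model` leaves the crops on disk.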
**Training data export endpoint and IDOR:** `GET /api/ocr/training-data/export` is admin-gated and queries all eligible blocks across all documents. No IDOR risk since there's no per-resource scoping. The main surface is the admin permission check — verify `@RequirePermission(ADMIN)` is actually enforced in the test suite with a test that sends the request without admin permission and expects 403.

Suggestions
**Add a security test for the export endpoint:** A `@WebMvcTest` with `@WithMockUser(roles = "READ_ALL")` calling `GET /api/ocr/training-data/export` should return 403. This is a regression guard — it ensures `@RequirePermission(ADMIN)` never gets accidentally removed.

**Minimum block threshold:** The plan currently allows training with as few as 1 reviewed block. Consider a minimum threshold (e.g. 5 blocks) to reduce the practical impact of training-data poisoning via a single tampered block. The admin UI can enforce this in the disabled-button logic; the backend should also validate and return a 422 with a clear message if the threshold isn't met.
🎨 Leonie Voss — UI/UX Design Lead
Questions & Observations
**Progress counter placement in `TranscriptionEditView.svelte`:** The plan adds "X von Y geprüft" to the "panel's top bar." The current component has no top bar — it's a scrollable flex column. The counter should appear as a sticky header row at the top of the scrollable area, above the block list, so it remains visible as the user scrolls through many blocks. A thin line below it visually separates it from the first block.

**`OcrTrainingCard` info text layout on narrow screens:** The two-line hierarchy (bold count + docs on line 1, muted total on line 2) works well on desktop. On mobile (320 px), long German strings like "23 geprüfte Blöcke aus 8 Dokumenten verfügbar" may overflow. Test at 320 px and use `break-words` or wrap inside a `<p>` with `text-sm` or smaller.

**History table on mobile:** A 5-column table (Datum | Status | Blöcke | Dokumente | Gestartet von) will not fit at 320 px without horizontal scrolling or column hiding. Recommend: at the `< md` breakpoint, show only Datum + Status + Blöcke and hide the rest. Use `hidden md:table-cell` on the lower-priority columns.

**RUNNING state spinner in `TrainingHistory`:** The plan mentions a spinner for RUNNING status. Animated spinners in a table column can be visually jarring if the page auto-polls. Use a subtle pulsing dot (`animate-pulse`) rather than a spinning circle — it's less distracting in a dense table row.

**`OcrTrainingCard` button disabled feedback timing:** When the user clicks "Training starten" and the request is in-flight, the button should show a loading state (disabled + spinner or "…") before the response arrives. Without this, a slow network makes it look like nothing happened, and users click again.

**"Nur Segmentierung" badge in `TranscriptionBlock`:** The badge should appear where the `label` is currently rendered (the header row inside the card), styled as a small muted chip — `text-xs font-medium text-ink-3 bg-muted rounded px-1.5 py-0.5`. It must not push the block's layout wider; keep it inline with the label. If the block also has a label, show the label first, then the badge.
**Progress counter with mini progress bar:** Consider pairing the "12 / 17 geprüft" text with a thin progress bar underneath (2 px height, brand-mint fill on brand-sand background). It gives immediate visual density — users grasp "70% done" faster than parsing the fraction. Total added height: ~10 px.
**"Training starten" → feedback after completion:** After a successful train call, show an inline success message in the card (e.g. a green bordered row: "Training abgeschlossen — Modell aktualisiert") for ~5 seconds, then reload the history. Don't rely on the history table update alone — users may miss a new row appearing at the top of a table they weren't watching.
⚙️ Tobias Wendt — DevOps & Platform Engineer
Questions & Observations
**Backend memory limit needs updating:** The current `docker-compose.yml` doesn't have explicit memory limits on the backend container. PDFBox at 300 DPI produces ~25 MB per page uncompressed in memory. For a training export across 8 documents with 3–4 pages each, that's ~800 MB peak on top of the normal JVM heap. Add an explicit limit. Without one, a large export could trigger the OOM killer and take down the entire backend container — including all other users' requests.
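One possible shape for that limit, as a sketch — the 2g value and the `deploy.resources` form are assumptions and should be sized against the real JVM heap settings:

```yaml
services:
  backend:
    deploy:
      resources:
        limits:
          memory: 2g   # JVM heap + ~800 MB PDFBox rendering headroom
```

With older compose file versions, the equivalent is the top-level `mem_limit` key on the service.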
**Stuck RUNNING runs after container restart:** Markus already raised the startup recovery logic. From an ops angle: the recovery should run in an `@EventListener(ApplicationReadyEvent.class)` method, not in the constructor or `@PostConstruct` — the JPA context needs to be fully ready before querying `OcrTrainingRunRepository`. Log the recovery at WARN level so it's visible in Loki: `"Found orphaned RUNNING training run {} — marking FAILED (service was restarted)"`.

**OCR service health check during training:** The Python service currently returns `{"status": "ok"}` or fails the health check entirely. During a synchronous training run, the `/health` endpoint should still respond (training blocks the request thread, but FastAPI can serve health on a separate worker). Verify the uvicorn worker count: if it's `--workers 1` (the default in the Dockerfile), health checks will time out during training. Either add `--workers 2` or make the `/health` endpoint non-blocking. Docker's health check interval is 10 s with a 5 s timeout — a blocked single worker would fail it.

**Model backup naming:** The plan backs up `german_kurrent.mlmodel` to `german_kurrent.mlmodel.bak`. A single `.bak` means each training run overwrites the previous backup — you lose the ability to roll back past one run. Consider timestamped backups: `german_kurrent.mlmodel.{timestamp}.bak`, keeping the last 3. This adds minimal disk usage (models are ~50–200 MB) and gives a recovery path if two consecutive training runs both degrade quality.

**`ocr_models` volume and the `blla` segmentation model:** The compose file mounts `ocr_models:/app/models`. The segmentation model (`blla`) is part of Kraken's built-in models — it's not in `/app/models`, it's in the Kraken package or the model cache at `/root/.cache`. The `ocr_cache` volume already covers this. Verify where `blla` lives after a `ketos segtrain` run: the fine-tuned model must be written to a path on a named volume, not to the container filesystem.

Suggestions
**Structured logging for training runs:** Add a `training_run_id` field to every log line emitted during a training run, both in Java (`MDC.put("trainingRunId", run.getId().toString())`) and in Python (pass it as a log-context header or prefix). This makes it trivial to filter all logs for a specific run in Loki: `{container="archive-backend"} | json | trainingRunId="uuid-here"`.

**Add `TRAINING_TOKEN` to the compose env block now:** Even if it's optional in the first iteration, add the env var to `docker-compose.yml` with a placeholder comment so it's not forgotten. An empty value means no token check (dev-friendly default); a non-empty value enables the check.
✅ Implementation complete
All 6 parts of the Kraken fine-tuning pipeline are implemented. Branch:
`feat/issue-226-227-ocr-pipeline-polygon`

Commits (chronological)
- `7322907` feat(transcription) — sticky review progress indicator ("X / Y geprüft" + progress bar) in TranscriptionEditView (Part 1)
- `fdf1eb9` feat(training) — training enrollment: `document_training_labels` table, `TrainingLabel` enum, `@ElementCollection` on `Document`, `PATCH /api/documents/{id}/training-labels`, auto-enrollment on Kurrent OCR, chip toggle in the transcription panel (Part 1b)
- `cfa3c4d` feat(training) — recognition training data export: `TrainingDataExportService` (PDFBox at 300 DPI, ZIP with PNG + GT text pairs), `GET /api/ocr/training-data/export` (Part 2)
- `bc97a2d` feat(ocr) — Python `/train` endpoint (ZIP Slip protection, `ketos.train`, backup rotation, in-process reload), `OcrClient.trainModel()`, `RestClientOcrClient` implementation with a 10-minute timeout + X-Training-Token (Part 3)
- `88e005e` feat(ocr) — training history: V30 migration (`ocr_training_runs`, partial unique index), `OcrTrainingRun` entity, `OcrTrainingService` (409 guard, orphan recovery), `POST /api/ocr/train`, `GET /api/ocr/training-info` (Part 4)
- `4e08d31` feat(admin) — admin UI: `TrainingHistory.svelte`, `OcrTrainingCard.svelte`, wired into `admin/system/+page.svelte`, Vitest tests (Part 5)
- `9b2f91e` feat(training) — segmentation training: `SegmentationTrainingExportService` (PAGE XML, polygon de-normalization), `/segtrain` Python endpoint, `segtrainModel()` in OcrClient, V31 migration (`text` nullable), Paraglide i18n for all training strings (Part 6)
- `86e9c05` feat(training) — all UI components switched to Paraglide (TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, "Nur Segmentierung" badge), `SegmentationTrainingCard` wired into the admin page, `availableSegBlocks` added to TrainingInfoResponse

Test results
Next steps
- `--workers 2`; set `TRAINING_TOKEN` in prod

Guided OCR implemented ✅
Branch:
`feat/issue-226-227-ocr-pipeline-polygon`
Commit: `ee58b63`

What was built
Python (`ocr-service`)
- `OcrRegion` model (`annotationId`, `pageNumber`, `x`/`y`/`width`/`height`)
- `OcrRequest` with `regions: list[OcrRegion] | None`; extended `OcrBlock` with `annotationId`
- `extract_region_text(image, x, y, w, h)` added to both the Kraken and Surya engines — crops to the normalized region, runs recognition on the crop
- `/ocr/stream`: when `regions` is present, groups them by page and recognizes each region without layout detection; the `annotationId` flows back in each block

Java (backend)
- `OcrBlockResult` gains an `annotationId` field (null in normal mode)
- `OcrClient.OcrRegion` record + updated `streamBlocks` signature
- `TriggerOcrDTO.useExistingAnnotations` flag (Boolean, defaults to false)
- `TranscriptionBlockRepository.findByAnnotationId`
- `TranscriptionService.upsertGuidedBlock` — creates a new OCR block, updates an existing OCR block, or leaves a MANUAL block unchanged
- `OcrAsyncRunner` dispatches to `upsertGuidedBlock` when `annotationId` is non-null; in guided mode it fetches the existing annotations and skips `clearExistingBlocks()`
- `OcrService.startOcr` threads `useExistingAnnotations` through

Frontend
- `OcrTrigger`: new `annotationCount` prop; when > 0, shows a checkbox "Nur annotierte Bereiche" with a hint; skips the destructive-replace confirmation in guided mode
- `TranscriptionEditView`: passes `annotationCount={blocks.length}` to `OcrTrigger`
- `triggerOcr` passes `useExistingAnnotations` in the POST body
- `ocr_use_existing_annotations` + `ocr_use_existing_annotations_hint` in de/en/es

Tests
- `TranscriptionServiceGuidedTest` (3 cases: creates new, updates OCR, preserves MANUAL)