feat: Kraken fine-tuning pipeline (block origin tracking + training export + admin UI) #230

Closed
opened 2026-04-12 21:51:54 +02:00 by marcel · 15 comments
Owner

Overview

Kraken's zero-shot Kurrent recognition produces a ~70-80% character error rate on family letters. Fine-tuning on human-corrected transcriptions would dramatically improve quality. This issue tracks the full pipeline: origin tracking on blocks, training data export, and an admin UI to trigger and monitor training runs.

Depends on #226 (OCR pipeline) and #227 (polygon annotations).


Part 1: Block origin tracking (implemented)

Already committed on feat/issue-226-227-ocr-pipeline-polygon:

  • BlockSource enum: MANUAL, OCR
  • V26 migration adds source + reviewed columns to transcription_blocks
  • OcrService sets source=OCR when creating blocks
  • TranscriptionService.reviewBlock() toggles the reviewed flag
  • PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
  • 5 tests covering toggle/untoggle/notfound/controller/source verification

Still needed for Part 1: Frontend review toggle button in TranscriptionBlock.svelte (checkmark icon in block toolbar).


Part 2: Recognition training data export

Query

Single JPQL query on TranscriptionBlockRepository — includes both manually authored blocks and reviewed OCR blocks:

  • (source = MANUAL OR (source = OCR AND reviewed = true))
  • document has scriptType = HANDWRITING_KURRENT

Rationale: MANUAL blocks are implicitly reviewed (a human authored them from scratch), so they make training data of equal or better quality than corrected OCR blocks.

Service: TrainingDataExportService

  1. Query eligible Kurrent blocks (see above)
  2. Group by documentId, download each PDF via FileService.downloadFileBytes()
  3. Render relevant pages with PDFBox PDFRenderer at 300 DPI
  4. For each block: look up its annotation, crop the page image using normalized x, y, width, height, write PNG + .gt.txt into a ZipOutputStream
  5. Stream the ZIP via StreamingResponseBody

ZIP structure: <block-id>.png + <block-id>.gt.txt pairs (Kraken's ketos train format).
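Step 4's crop is a de-normalization of the annotation box into pixel space, clamped to the page. A minimal sketch of the arithmetic (Python for brevity — the service itself is Java, and the annotation dict shape is illustrative):

```python
def crop_box(page_size, annotation):
    """Map a normalized annotation (x, y, width, height in [0, 1]) onto a rendered
    page of page_size = (width_px, height_px), clamping to the page bounds."""
    pw, ph = page_size
    left = max(0, round(annotation["x"] * pw))
    top = max(0, round(annotation["y"] * ph))
    right = min(pw, round((annotation["x"] + annotation["width"]) * pw))
    bottom = min(ph, round((annotation["y"] + annotation["height"]) * ph))
    if right <= left or bottom <= top:
        raise ValueError("annotation degenerates to an empty crop")
    return left, top, right, bottom
```

The clamping also covers annotations that exceed page boundaries, so oversized boxes crop to the page edge instead of failing.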

Endpoint

GET /api/ocr/training-data/export → ZIP download
@RequirePermission(Permission.ADMIN)

Returns 204 if no eligible blocks exist.

Dependencies

  • Add org.apache.pdfbox:pdfbox:3.0.4 to pom.xml

Part 3: Python recognition training endpoint

POST /train on the OCR service (main.py)

  1. Receives training data as a ZIP upload from Spring Boot
  2. Extracts to a temp directory
  3. Runs Kraken's ketos train Python API with --load from the current model (transfer learning)
  4. On success: backs up old model, replaces with fine-tuned model, reloads in-process
  5. Returns training metrics (loss, accuracy)

Runs synchronously — training on 10-30 page crops takes seconds to minutes on CPU.
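The training step can be sketched around the ketos CLI (a stdlib-only sketch; the helper names are illustrative, and the exact ketos flags and output-model naming should be verified against the installed kraken version):

```python
import subprocess
import tempfile
import zipfile
from pathlib import Path

def build_ketos_command(image_files, base_model, output_prefix):
    """Assemble a ketos train invocation that fine-tunes from base_model."""
    return ["ketos", "train", "--load", base_model, "-o", output_prefix, *image_files]

def run_recognition_training(training_zip: str, base_model: str, out_dir: str) -> None:
    """Extract <id>.png / <id>.gt.txt pairs and fine-tune the recognition model."""
    workdir = Path(tempfile.mkdtemp())
    with zipfile.ZipFile(training_zip) as zf:
        zf.extractall(workdir)  # production code must guard against ZIP Slip
    # pass the line images; ketos locates the .gt.txt siblings by name
    images = sorted(str(p) for p in workdir.glob("*.png"))
    cmd = build_ketos_command(images, base_model, str(Path(out_dir) / "finetuned"))
    subprocess.run(cmd, check=True)  # non-zero exit raises -> surface as FAILED run
```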

Java side

  • RestClientOcrClient.trainModel(byte[] trainingDataZip) sends ZIP to POST /train
  • OcrTrainingService orchestrates: calls TrainingDataExportService → sends to Python → records result

Part 4: Training history

Entity: OcrTrainingRun

CREATE TABLE ocr_training_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20) NOT NULL DEFAULT 'RUNNING',  -- RUNNING, DONE, FAILED
    block_count INT NOT NULL,
    document_count INT NOT NULL,
    model_name VARCHAR(100) NOT NULL,
    error_message TEXT,
    triggered_by UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at TIMESTAMPTZ
);

Info endpoint

GET /api/ocr/training-info returns:

{
  "availableBlocks": 23,
  "totalOcrBlocks": 150,
  "availableDocuments": 8,
  "ocrServiceAvailable": true,
  "lastRun": { "status": "DONE", "blockCount": 15, ... },
  "runs": [ ... ]
}

Part 5: Admin panel UI

Add a card to the System tab (admin/system/+page.svelte) following the existing pattern (same as mass import / backfill cards).

Info display:

  • "{reviewedCount} geprüfte Blöcke aus {docCount} Dokumenten verfügbar (von {totalOcrCount} OCR-Blöcken gesamt)"
  • If 0 reviewed: disabled button + hint to mark blocks as reviewed in the transcription view
  • If OCR service unavailable: disabled button + warning

Action: "Training starten" button → calls POST /api/ocr/train

History table (last 5 runs):

| Datum | Status | Blöcke | Dokumente | Gestartet von |
|---|---|---|---|---|

Status as colored badge (green=DONE, red=FAILED).


Part 6: Segmentation training (layout-only, no text required)

Kraken has two completely separate models:

| Model | Trained with | Kraken command | What it learns |
|---|---|---|---|
| Segmentation (blla) | Page image + line polygons | ketos segtrain | Where text lines are |
| Recognition (HTR) | Line crop image + .gt.txt | ketos train | What text says |

Training the segmentation model only requires drawing bounding boxes around text lines — no Kurrent reading ability needed. This makes it accessible to anyone and is the right fix when OCR produces hundreds of phantom line detections on a document.

Annotation workflow

Users draw boxes in the existing annotation UI, creating MANUAL transcription blocks with empty text (text = ""). The text field must be made nullable (or default to "") to support segmentation-only blocks.

A visual indicator (e.g. a tag "Nur Segmentierung") should distinguish these from full transcription blocks in the UI.

Export: SegmentationTrainingExportService

  • Query: MANUAL blocks where text IS NULL OR text = '', document has scriptType = HANDWRITING_KURRENT
  • Group by document → render full pages at 300 DPI with PDFBox
  • For each page: produce one PAGE XML file listing all line polygons on that page

PAGE XML format per page:

<PcGts>
  <Page imageFilename="page-1.png" imageWidth="2480" imageHeight="3508">
    <TextRegion>
      <TextLine id="block-uuid">
        <Coords points="x1,y1 x2,y2 x3,y3 x4,y4"/>
        <Baseline points="x1,y_base x2,y_base"/>
      </TextLine>
      ...
    </TextRegion>
  </Page>
</PcGts>

Coords come from the annotation's polygon (4-point quad, de-normalized to pixel space). Baseline is approximated as the bottom edge of the quad.
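The Coords/Baseline mapping above can be sketched with the stdlib XML API (Python for brevity; the real export is Java-side, and the quad ordering — clockwise from the top-left corner — is an assumption):

```python
import xml.etree.ElementTree as ET

def text_line_element(block_id, quad, page_w, page_h):
    """Build a PAGE XML TextLine from a normalized 4-point quad.
    quad: [(x, y), ...] in [0, 1], assumed clockwise from the top-left corner."""
    px = [(round(x * page_w), round(y * page_h)) for x, y in quad]
    line = ET.Element("TextLine", id=block_id)
    ET.SubElement(line, "Coords", points=" ".join(f"{x},{y}" for x, y in px))
    # Baseline approximated as the bottom edge: bottom-left -> bottom-right
    (bx1, by1), (bx2, by2) = px[3], px[2]
    ET.SubElement(line, "Baseline", points=f"{bx1},{by1} {bx2},{by2}")
    return line
```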

ZIP structure: page-{docId}-{pageNum}.png + page-{docId}-{pageNum}.xml pairs.

Endpoint

GET /api/ocr/segmentation-training-data/export → ZIP download
@RequirePermission(Permission.ADMIN)

Python: POST /segtrain

  1. Receives ZIP with PNG + PAGE XML pairs
  2. Extracts to temp directory
  3. Runs ketos segtrain --load from current segmentation model (transfer learning on blla)
  4. On success: backs up old segmentation model, replaces, reloads
  5. Returns training metrics

Admin UI

Second card in the System tab: "Segmentierungsmodell trainieren"

  • Shows count of segmentation-only annotation blocks available
  • "Segmentierung trainieren" button → calls POST /api/ocr/segtrain
  • Uses the same OcrTrainingRun history table, with model_name set to blla

Frontend: Review toggle in transcription editor

Each block in TranscriptionBlock.svelte gets a checkmark toggle:

  • Unchecked: outline checkmark, muted
  • Checked: filled checkmark, brand-mint
  • Click calls PUT .../review
  • Panel header shows progress: "12 von 17 geprüft"

Open questions

  • Should training be async with job tracking for large datasets, or is synchronous fine?
  • Should we show a diff of OCR quality before/after training in the admin panel?
  • Should segmentation-only blocks be a separate entity/flag, or is text = "" on a MANUAL block sufficient to distinguish them?
marcel added the feature label 2026-04-12 21:53:08 +02:00
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

  • TrainingDataExportService is doing a lot: query blocks → group by document → download PDFs → render pages with PDFBox → crop images → write ZIP. That's 5+ responsibilities in one service method. I'd want to see this decomposed into smaller, single-responsibility methods (e.g. queryReviewedBlocks(), renderPageImage(document, page, dpi), cropBlockImage(pageImage, annotation), writeTrainingPair(zipOut, blockId, image, text)). Each becomes independently testable.

  • The TrainingDataExportService depends on FileService for PDF downloads: That's correct per layering rules. But it also needs to parse annotations from blocks — is the annotation data already on the TranscriptionBlock entity, or does it require a separate repository call? The spec mentions "look up its annotation" without specifying the access path.

  • OcrTrainingService orchestrates export + HTTP call + record result: This feels like the right boundary — one service owns the workflow, delegates to specialized services. Good separation.

  • Frontend review toggle: The spec says "click calls PUT .../review" — is this an optimistic UI update or does it wait for the server response? For a toggle in a transcription editor where users are working fast, optimistic update with rollback on failure would be better UX, but the TDD test needs to cover both the success and rollback paths.

  • Component splitting for the admin card: The training card in admin/system/+page.svelte has info display + action button + history table — that's at least two visual regions (info/action area + history table). I'd expect OcrTrainingCard.svelte as the container with TrainingHistory.svelte extracted for the table.

Suggestions

  • Test strategy for TrainingDataExportService: The PDF rendering + image cropping logic is the hardest part to test. I'd want an integration test with a real small PDF fixture that verifies the exported ZIP contains correctly named .png + .gt.txt pairs. Unit tests can cover the query and grouping logic with mocked repositories.

  • The review toggle in TranscriptionBlock.svelte: Keep the toggle state as a prop flowing from the parent, not local $state — the parent should own the reviewed/not-reviewed state and re-derive the progress counter via $derived. This avoids the parent and child getting out of sync.

  • 204 for "no reviewed blocks": This is fine for a programmatic API, but the admin UI should handle this gracefully — the button should be disabled with a hint before the user clicks, not after they get a 204. The spec already describes this ("If 0 reviewed: disabled button + hint") — just confirming the frontend should check availableBlocks from the info endpoint, not rely on the export endpoint returning 204.

  • Open question — sync vs async training: For 10-30 page crops taking seconds to minutes, synchronous is fine now. But the OcrTrainingRun entity already has RUNNING/DONE/FAILED status, which implies async. I'd suggest: implement synchronous first (KISS), but keep the status model — it's cheap and makes the eventual async migration trivial. The TDD tests should assert on the status transitions regardless.

Author
Owner

🏗️ Markus Keller — Application Architect

Questions & Observations

  • New dependency: PDFBox 3.0.4: This is a significant addition — PDFBox pulls in ~5 transitive JARs (fontbox, commons-logging, etc.). It's the right tool for server-side PDF rendering, but be aware it's memory-hungry when rendering at 300 DPI. For a single admin-triggered export this is fine; if it ever becomes a frequent operation, you'd want to cap concurrent renders. For now: KISS, add it, move on.

  • Domain boundary question — where does TrainingDataExportService live? It crosses three domains: transcription blocks, documents/files, and OCR. I'd place it in the OCR package since its purpose is OCR model training. It should depend on TranscriptionBlockRepository (or a TranscriptionService method) and FileService — never reach into DocumentRepository directly.

  • OcrTrainingRun entity — is this a new domain or part of OCR? It tracks training runs, which is OCR-specific. Keep it in the OCR package alongside OcrTrainingService. Don't create a separate training package — that's premature given there's only one entity and one service.

  • The Python OCR service now has two responsibilities: inference (POST /ocr) and training (POST /train). This is fine for now — it's one process, one model, and training needs access to the loaded model for transfer learning. But document in a comment that training mutates the in-process model state. If the service ever gets replicated, this becomes a consistency problem.

  • StreamingResponseBody for the ZIP export: Good choice — avoids buffering the entire ZIP in memory. But the controller must not annotate the method with @Transactional, or the DB connection stays open for the entire streaming duration. Make sure the query executes and collects results before entering the streaming phase.

Suggestions

  • Consider a dedicated JPQL query method on TranscriptionBlockRepository rather than filtering in Java:

    @Query("SELECT b FROM TranscriptionBlock b JOIN b.document d WHERE (b.source = 'MANUAL' OR (b.source = 'OCR' AND b.reviewed = true)) AND d.scriptType = 'HANDWRITING_KURRENT'")
    List<TranscriptionBlock> findTrainingEligibleKurrentBlocks();
    

    Push the filtering down to the database. This is cleaner and more efficient than loading all OCR blocks and filtering in memory.

  • The info endpoint (GET /api/ocr/training-info) aggregates multiple queries: availableBlocks, totalOcrBlocks, availableDocuments, ocrServiceAvailable, lastRun, runs. Consider whether these are all needed on every page load or if some can be lazy-loaded. For an admin panel that's rarely visited, a single endpoint is fine — don't over-optimize.

  • triggered_by UUID in ocr_training_runs: Add a foreign key to app_users with ON DELETE SET NULL. If a user is deleted, the training history should survive but the reference should null out gracefully.

  • Sync vs async: Start synchronous. The OcrTrainingRun status model already supports async if needed later. The cost of adding @Async + a status polling endpoint later is minimal. The cost of building async infrastructure now for a feature that processes 10-30 images is wasted time.

Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

  • TrainingDataExportService is the testing crux of this issue. It involves PDF rendering, image cropping, and ZIP assembly — all I/O-heavy operations that are hard to mock meaningfully. I need clarity on the test strategy:

    • Unit tests: Can we test the query logic, grouping logic, and ZIP structure assembly separately from the actual PDF rendering?
    • Integration test: We need at least one test with a real small PDF (a fixture file) that goes through the full pipeline: query → render → crop → ZIP → verify ZIP contents. This is the test that catches the bugs mocking would hide.
  • The Python POST /train endpoint is a black box from the Java side. How do we test OcrTrainingService.trainModel() in CI without a running Python service? Options:

    1. Mock the OcrClient interface at the unit test level (fine for testing orchestration logic)
    2. Integration test with a real OCR container (expensive, probably overkill for CI)
    3. Contract test: verify the Java side sends the right ZIP format, verify the Python side accepts it — tested independently

    I'd go with option 1 for CI + option 3 as a manual/nightly check.

  • Review toggle endpoint already has 5 tests — good. But I want to verify: do the existing tests cover the case where a non-OCR block (source=MANUAL) is toggled? The spec doesn't say whether manual blocks can be reviewed, but the review endpoint presumably doesn't check source. Should it?

  • Admin UI testing: The training card has conditional states (0 blocks → disabled, OCR unavailable → disabled, happy path → enabled). Each state needs a test. The history table with status badges (DONE/FAILED) needs visual verification.

Suggestions

  • Test plan by layer:

    • Unit: TrainingDataExportService query logic, OcrTrainingService orchestration (mock OcrClient), ZIP structure validation
    • Integration: Full export pipeline with PDF fixture → verify ZIP contains expected .png + .gt.txt pairs; OcrTrainingRun persistence and status transitions; info endpoint aggregation correctness
    • Frontend (Vitest): Review toggle component (optimistic update + rollback), training card conditional rendering (disabled states, progress display), history table rendering
    • E2E: Not needed for this feature — it's admin-only with a small surface area. The integration layer covers it.
  • PDF fixture file: Create a minimal 1-page PDF with known dimensions in src/test/resources/fixtures/. This avoids depending on real uploaded documents in tests and makes the test deterministic.

  • Edge case checklist:

    • Export with 0 reviewed blocks → 204 response
    • Export where a document's PDF has been deleted from S3 → graceful error, not 500
    • Training run triggered while another is already running → what happens? (The issue doesn't specify — race condition risk)
    • Block reviewed, then document deleted → does the query still work or does the JOIN fail?
    • Very large block annotation that exceeds page boundaries → does cropping handle this?
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

  • GET /api/ocr/training-data/export returns a ZIP containing cropped document images + transcription text. This endpoint is @RequirePermission(Permission.ADMIN) — good. But the ZIP contains actual family document content. Verify that:

    1. The permission check cannot be bypassed (no path traversal in block IDs, no IDOR via manipulated query params)
    2. The endpoint doesn't expose blocks from documents the admin shouldn't see (though with ADMIN permission this is likely moot — just confirming there's no multi-tenancy concern)
  • ZIP Slip risk in the Python POST /train endpoint: When the Python service extracts the uploaded ZIP to a temp directory, it must validate that extracted file paths don't escape the temp directory. This is a classic ZIP Slip vulnerability (CWE-22). The Java side constructs the ZIP so it should be safe, but the Python side should still validate defensively:

    import os
    for entry in zip_file.namelist():
        target = os.path.abspath(os.path.join(temp_dir, entry))
        # A plain startswith() on the directory path is bypassable
        # ("/tmp/foo" matches "/tmp/foobar"), so include the separator:
        if not target.startswith(os.path.abspath(temp_dir) + os.sep):
            raise ValueError(f"Zip entry escapes target dir: {entry}")
    
  • POST /train on the Python service — is it authenticated? The issue doesn't mention authentication between Spring Boot and the Python OCR service. If the OCR service is on the Docker network only (no host port exposure), network isolation provides some protection. But if the port is exposed (even accidentally), anyone could trigger training or upload malicious training data. At minimum, use a shared secret/API key header.
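    A lightweight way to close this gap is a shared-secret check on the Python side. A minimal sketch, assuming a hypothetical OCR_API_KEY environment variable and X-Api-Key header (both names are illustrative, not from the spec):

    ```python
    import hmac
    import os

    def check_training_auth(request_headers: dict) -> bool:
        """Validate the shared secret sent by the Spring Boot backend.

        OCR_API_KEY / X-Api-Key are assumed names. An empty or unset
        secret always fails closed.
        """
        expected = os.environ.get("OCR_API_KEY", "")
        provided = request_headers.get("X-Api-Key", "")
        # compare_digest avoids timing side channels on the secret comparison
        return bool(expected) and hmac.compare_digest(expected, provided)
    ```

    The same check can be wired into FastAPI/Flask as a dependency or before-request hook.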

  • Training data poisoning: A malicious admin (or compromised admin account) could mark adversarial blocks as "reviewed" to poison the training data. This is a low-likelihood but high-impact risk for an ML pipeline. Consider: should there be a minimum threshold of reviewed blocks before training is allowed? The spec mentions this implicitly (button disabled when 0 blocks) but doesn't set a meaningful minimum.

  • triggered_by UUID stores who triggered training: Good for audit. Make sure this is populated from the authenticated session, not from a request parameter that could be spoofed.

Suggestions

  • Sanitize block IDs in the ZIP: The ZIP entries use <block-id>.png and <block-id>.gt.txt. Block IDs are UUIDs (safe characters), but still validate that the ID format matches UUID before using it as a filename — defense in depth.
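    The UUID check is a one-liner; a Python sketch of the idea (the helper name safe_training_filename is hypothetical, the Java side would use UUID.fromString() equivalently):

    ```python
    import uuid

    def safe_training_filename(block_id: str, suffix: str) -> str:
        """Return '<block-id><suffix>' only if block_id parses as a UUID.

        Defense in depth: even though block IDs come from our own database,
        reject anything that is not a canonical UUID before using it as a
        ZIP entry name.
        """
        parsed = uuid.UUID(block_id)  # raises ValueError on anything malformed
        return f"{parsed}{suffix}"
    ```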

  • Rate-limit the training endpoint: Training is CPU-intensive. Without rate limiting, a compromised admin session could trigger repeated training runs as a denial-of-service against the OCR service. Even a simple "reject if a run is already in RUNNING status" check (which the OcrTrainingRun entity supports) would suffice.

  • The export ZIP should not include the full file path or document metadata beyond what's needed for training (block ID + image + text). The current spec looks clean on this — just confirming it stays that way during implementation.

  • Python temp directory cleanup: After training completes (success or failure), the extracted ZIP contents in the temp directory must be cleaned up. Use a try/finally or Python's tempfile.TemporaryDirectory context manager to ensure cleanup even on exceptions.
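    A minimal sketch of that pattern, assuming the entries were already ZIP-Slip-validated before extraction:

    ```python
    import os
    import tempfile
    import zipfile

    def extract_training_zip(zip_path: str) -> int:
        """Extract a training ZIP into a self-cleaning temp dir and return
        the number of entries. The directory and its contents are removed
        when the 'with' block exits, even if training raises."""
        count = 0
        with tempfile.TemporaryDirectory(prefix="kraken-train-") as temp_dir:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(temp_dir)  # assume entries were ZIP-Slip-validated first
                count = len(zf.namelist())
            # ... run training against temp_dir here ...
        # temp_dir no longer exists at this point
        return count
    ```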


🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

  • Review toggle in TranscriptionBlock.svelte: The spec says "outline checkmark, muted" → "filled checkmark, brand-mint" on toggle. This is good, but color alone cannot be the only cue (WCAG 1.4.1). The outline vs. filled visual difference helps, but I'd also recommend:

    • Adding a tooltip or aria-label that changes: "Als geprüft markieren" ↔ "Prüfung aufheben"
    • A subtle background tint change on the block itself (very light brand-mint wash when reviewed) to give a secondary visual signal
    • Ensure the checkmark icon is at least 20×20px with a 44×44px touch target around it
  • Progress indicator "12 von 17 geprüft": Where exactly in the panel header? If the transcription editor has a toolbar, this should be in the toolbar — not in the page header. It's contextual to the current document's blocks, so it belongs visually close to the blocks. Consider:

    • A small progress bar (brand-mint fill on brand-sand background) alongside the text
    • Showing percentage as well for quick scanning: "12 / 17 geprüft (71%)"
  • Admin training card: Following the existing card pattern is correct. But the info text "{reviewedCount} geprüfte Blöcke aus {docCount} Dokumenten verfügbar (von {totalOcrCount} OCR-Blöcken gesamt)" is dense for a single line. Consider splitting:

    • Line 1: 23 geprüfte Blöcke aus 8 Dokumenten verfügbar
    • Line 2 (muted, smaller): von 150 OCR-Blöcken gesamt

    This creates visual hierarchy — the actionable number (23 reviewed) stands out.

  • History table status badges: Green for DONE, red for FAILED — fine, but pair with an icon (checkmark for DONE, X for FAILED) for colorblind users. Also consider the RUNNING state — the spec mentions it in the entity but not in the table. If training becomes async later, you'll need a spinner or animated indicator for RUNNING.

Suggestions

  • Disabled button states need clear communication: "OCR-Dienst nicht erreichbar" and "Keine geprüften Blöcke vorhanden" — these hints should be visible without hovering. Use a small text line below the disabled button, not a tooltip, because:

    1. Tooltips don't work on touch devices
    2. Disabled elements often can't receive focus, so the tooltip never triggers
    3. Seniors (60+) may not discover hover-only information
  • The review toggle should have a focus style: When navigating by keyboard through transcription blocks, the checkmark toggle needs a visible focus ring. Use focus-visible:ring-2 focus-visible:ring-brand-navy consistent with other interactive elements.

  • Consider the flow from admin panel to transcription editor: When the admin sees "0 reviewed blocks" with a hint to mark blocks in the transcription view — is there a direct link to a document with unreviewed OCR blocks? A "Dokumente mit OCR-Blöcken anzeigen" link would reduce friction significantly.


⚙️ Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

  • PDFBox at 300 DPI is memory-hungry: Rendering a single A4 page at 300 DPI produces a ~35 MB uncompressed image in memory (2480 × 3508 pixels × 4 bytes ARGB ≈ 34.8 MB). If you're processing 10-30 blocks across multiple documents, peak memory could spike significantly. The current Spring Boot container likely has a default heap of 256-512 MB. You may need to:

    • Increase the backend container's memory limit in docker-compose.yml
    • Process pages sequentially (one document at a time, release each PDFDocument after rendering)
    • Monitor with /actuator/metrics/jvm.memory.used during a test run
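    The per-page estimate can be sanity-checked with a few lines (A4 at 300 DPI lands near 35 MB uncompressed):

    ```python
    def rendered_page_bytes(width_in: float, height_in: float, dpi: int,
                            bytes_per_pixel: int = 4) -> int:
        """Approximate in-memory size of one rendered page (uncompressed ARGB)."""
        return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel

    # A4 is 8.27 x 11.69 inches -> 2481 x 3507 px at 300 DPI
    a4_300dpi = rendered_page_bytes(8.27, 11.69, 300)
    ```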
  • The Python OCR service now does training on CPU: "seconds to minutes" for 10-30 crops. That's fine, but during training the OCR service is busy — what happens to incoming OCR inference requests? If POST /train is synchronous and blocks the Python process (likely a single-worker FastAPI/Flask), all OCR requests will queue or timeout. Consider:

    • Documenting that OCR inference is unavailable during training
    • Or: the Python service already uses multiple workers? Check the Dockerfile/startup command
  • Model file replacement during training: "backs up old model, replaces with fine-tuned model, reloads in-process". This is a file system operation inside the container. If the container restarts between backup and reload, which model loads? The spec should define:

    • Where the model files live (a named volume, not the container filesystem)
    • The backup naming convention
    • Whether the backup is kept indefinitely or rotated
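    One way to make the swap crash-safe is to lean on os.replace(), which is atomic within a filesystem. A hedged sketch (the single overwritten .bak backup and function name are assumptions, not spec):

    ```python
    import os
    import shutil

    def replace_model(model_path: str, new_model_path: str) -> str:
        """Back up the current model and atomically swap in the fine-tuned one.

        os.replace() is an atomic rename on the same filesystem, so a crash
        leaves either the old or the new model in place, never a half-written
        file. Returns the backup path.
        """
        backup_path = model_path + ".bak"
        if os.path.exists(model_path):
            shutil.copy2(model_path, backup_path)  # keep a restorable copy
        os.replace(new_model_path, model_path)     # atomic rename, same fs
        return backup_path
    ```

    Note this only holds if both paths are on the same volume, another argument for a single named model volume.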
  • The ocr_training_runs table needs a Flyway migration: The spec includes the SQL. Make sure this is the next sequential migration number after whatever V26 added for the block source/reviewed columns.

Suggestions

  • Named volume for OCR models: The model file should live on a named Docker volume, not baked into the image or on the container filesystem. This ensures:

    • Models survive container rebuilds/restarts (the spec mentions V25 already does this for model cache — extend the same pattern)
    • Fine-tuned models persist across deployments
    • Backup/restore of models is a simple volume operation
  • Health check awareness: If the OCR service is training and temporarily unavailable for inference, the ocrServiceAvailable flag in the info endpoint should reflect this. The admin UI already plans to show availability — make sure the health check endpoint on the Python side returns a meaningful status during training (e.g. {"status": "training", "inference_available": false}).

  • Container resource limits: Add explicit memory limits to the backend container in docker-compose for the PDF rendering workload:

    backend:
      deploy:
        resources:
          limits:
            memory: 1G
    

    And similarly for the OCR service during training.

  • Log the training run: Both the Java side (OcrTrainingService) and the Python side should log at INFO level: training started, block count, duration, success/failure. This is essential for debugging when a training run fails silently. The OcrTrainingRun entity stores this, but structured logs make it searchable in Loki without querying the database.

  • Concurrent training guard: Add a check in OcrTrainingService: if a run with status RUNNING already exists, reject the new request with a 409 Conflict. The spec doesn't mention this explicitly, but it's a necessary safeguard — two concurrent training runs would corrupt the model file.


🗂️ Implementation Plan — Felix Brandt

After reading the issue, all comments, and exploring the codebase, here is the full implementation plan.

Key finding: All of Part 1's backend + frontend work is already committed on feat/issue-226-227-ocr-pipeline-polygon (V26 migration, BlockSource enum, reviewBlock() service/controller, TranscriptionBlock.svelte toggle, TranscriptionEditView wiring, document page reviewToggle()). The only remaining Part 1 gap is the "X von Y geprüft" progress counter in the panel.


Part 1 — Remaining progress counter

  1. [frontend] Add reviewedCount and totalCount derived counters to TranscriptionEditView.svelte and render "X von Y geprüft" in the panel's top bar — $derived from blocks prop, no new state

Part 2 — Recognition training data export

  1. [backend] Add org.apache.pdfbox:pdfbox:3.0.4 to backend/pom.xml

  2. [backend] Add findEligibleKurrentBlocks() JPQL query to TranscriptionBlockRepository:

    • (source = MANUAL OR (source = OCR AND reviewed = true)) AND document.scriptType = HANDWRITING_KURRENT
    • Rationale: MANUAL blocks are implicitly reviewed (human-authored), equal or better quality than corrected OCR blocks
  3. [backend] TrainingDataExportService with decomposed single-responsibility methods:

    • queryEligibleBlocks() → repository
    • renderPageImage(PDDocument, int pageIdx) → PDFBox 300 DPI, BufferedImage
    • cropBlockImage(BufferedImage, DocumentAnnotation) → de-normalize coords × image dims, crop
    • writeTrainingPair(ZipOutputStream, UUID, BufferedImage, String) → <id>.png + <id>.gt.txt
    • exportToZip(StreamingResponseBody) → outer orchestrator; query results collected before entering lambda (Markus: no open DB txn during streaming); each PDDocument released after processing (Tobias: avoid memory spike); block IDs validated as UUID before use as filename (NullX: defense in depth)
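    The de-normalize-and-crop step in cropBlockImage() reduces to a few lines of arithmetic. A Python sketch of the same logic (the Java service would do this on a BufferedImage), with clamping for the edge case where an annotation exceeds the page boundaries:

    ```python
    def crop_box_px(norm_x: float, norm_y: float, norm_w: float, norm_h: float,
                    img_w: int, img_h: int) -> tuple:
        """Convert a normalized (0..1) annotation rect to a clamped pixel box
        (x, y, width, height). Clamping ensures the crop never reads outside
        the rendered page image."""
        x0 = max(0, round(norm_x * img_w))
        y0 = max(0, round(norm_y * img_h))
        x1 = min(img_w, round((norm_x + norm_w) * img_w))
        y1 = min(img_h, round((norm_y + norm_h) * img_h))
        if x1 <= x0 or y1 <= y0:
            raise ValueError("annotation lies entirely outside the page")
        return x0, y0, x1 - x0, y1 - y0
    ```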
  4. [backend] GET /api/ocr/training-data/export in OcrController (@RequirePermission(ADMIN)), StreamingResponseBody response, 204 if no eligible blocks

  5. [test] Integration test with a minimal 1-page PDF fixture in src/test/resources/fixtures/; unit tests for query eligibility logic and 204 path


Part 3 — Python recognition training endpoint

  1. [python] POST /train in ocr-service/main.py:

    • Accepts ZIP UploadFile; ZIP Slip validation on all entries (NullX)
    • Extracts to tempfile.TemporaryDirectory() (auto-cleanup on success or failure)
    • Calls ketos train Python API with --load from KRAKEN_MODEL_PATH (transfer learning)
    • Backs up old model (german_kurrent.mlmodel.bak), replaces, reloads engine in-process
    • Returns {"loss": ..., "accuracy": ..., "epochs": ...}
    • Comment: "Training mutates in-process model state — not safe if service is replicated"
    • INFO-level logging: started, block count, duration, result (Tobias)
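    The extract/validate/train/report flow above can be sketched as a plain function with the training call injected, so the shape is testable without kraken installed (train_fn is a hypothetical seam standing in for the real ketos call; the actual endpoint would wrap this in FastAPI):

    ```python
    import glob
    import os
    import tempfile
    import zipfile

    def handle_train(zip_bytes: bytes, train_fn) -> dict:
        """Sketch of the POST /train flow: extract, validate, train, report.

        train_fn(gt_dir) -> metrics dict stands in for the kraken training
        call. The temp dir is cleaned up on success and failure alike.
        """
        with tempfile.TemporaryDirectory(prefix="train-") as tmp:
            zip_path = os.path.join(tmp, "data.zip")
            with open(zip_path, "wb") as f:
                f.write(zip_bytes)
            with zipfile.ZipFile(zip_path) as zf:
                for entry in zf.namelist():  # ZIP Slip validation
                    target = os.path.abspath(os.path.join(tmp, entry))
                    if not target.startswith(os.path.abspath(tmp) + os.sep):
                        raise ValueError(f"Zip entry escapes target dir: {entry}")
                zf.extractall(tmp)
            pairs = len(glob.glob(os.path.join(tmp, "*.gt.txt")))
            if pairs == 0:
                raise ValueError("no training pairs in upload")
            metrics = train_fn(tmp)  # e.g. {"loss": ..., "accuracy": ..., "epochs": ...}
            return {"pairs": pairs, **metrics}
    ```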
  2. [backend] Add trainModel(byte[] trainingDataZip) to OcrClient interface

  3. [backend] Implement trainModel() in RestClientOcrClient — POST to /train with multipart ZIP, 10-minute timeout


Part 4 — Training history

  1. [migration] V29__add_ocr_training_runs.sql:

    CREATE TABLE ocr_training_runs (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        status VARCHAR(20) NOT NULL DEFAULT 'RUNNING',
        block_count INT NOT NULL,
        document_count INT NOT NULL,
        model_name VARCHAR(100) NOT NULL,
        error_message TEXT,
        triggered_by UUID REFERENCES app_users(id) ON DELETE SET NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        completed_at TIMESTAMPTZ
    );
    
  2. [backend] OcrTrainingRun entity in model/ — Lombok @Data @Builder @NoArgsConstructor @AllArgsConstructor

  3. [backend] OcrTrainingRunRepository: findFirstByStatusOrderByCreatedAtDesc() (concurrent-run guard) + findTop5ByOrderByCreatedAtDesc() (info endpoint)

  4. [backend] OcrTrainingService.triggerTraining(UUID triggeredBy):

    1. 409 Conflict if a RUNNING run exists (Tobias: concurrent training guard)
    2. Create run with status=RUNNING; triggered_by from session, never request body (NullX)
    3. Export via TrainingDataExportService → collect ZIP bytes
    4. Call ocrClient.trainModel(zipBytes)
    5. Mark DONE + set completedAt on success; FAILED + errorMessage on exception
  5. [backend] Add ErrorCode.TRAINING_ALREADY_RUNNING + mirror in errors.ts + Paraglide keys (de/en/es)

  6. [backend] Two new endpoints in OcrController:

    • POST /api/ocr/train (@RequirePermission(ADMIN)) → triggerTraining(userId)
    • GET /api/ocr/training-info (@RequirePermission(ADMIN)) → TrainingInfoResponse (availableBlocks, totalOcrBlocks, availableDocuments, ocrServiceAvailable, lastRun, runs)
  7. [test] OcrTrainingService tests: concurrent guard (409), happy path (DONE + completedAt), failure path (FAILED + errorMessage) — mock OcrClient and TrainingDataExportService


Part 5 — Admin panel UI

  1. [frontend] Regenerate API types after backend is built

  2. [frontend] TrainingHistory.svelte — table with Datum | Status | Blöcke | Dokumente | Gestartet von; status badge: green + ✓ = DONE, red + × = FAILED, spinner = RUNNING (icon + color, Leonie: colorblind-safe); keyed {#each runs as run (run.id)}

  3. [frontend] OcrTrainingCard.svelte:

    • Two-line info display (Leonie: visual hierarchy) — bold reviewed count + docs on line 1, muted total OCR blocks on line 2
    • Disabled states: inline text below button, not tooltip (Leonie: accessibility for touch + seniors)
    • "Training starten" button → POST /api/ocr/train → reload info
    • Embeds TrainingHistory.svelte; focus-visible:ring-2 focus-visible:ring-brand-navy on all interactive elements
  4. [frontend] Wire OcrTrainingCard into admin/system/+page.svelte — fetch info via $effect, pass to card

  5. [test] Vitest tests for OcrTrainingCard: disabled (0 blocks), disabled (OCR unavailable), enabled (happy path)


Part 6 — Segmentation training

  1. [migration] V30__make_transcription_block_text_nullable.sql: ALTER COLUMN text DROP NOT NULL; SET DEFAULT ''

  2. [backend] Update TranscriptionBlock.text — remove nullable = false; sanitizeText() null → ""

  3. [backend] Add findSegmentationBlocks() JPQL: MANUAL blocks where text IS NULL OR text = '', HANDWRITING_KURRENT documents

  4. [backend] SegmentationTrainingExportService — PAGE XML export:

    • querySegmentationBlocks(), renderFullPage(), buildPageXml() (de-normalize polygon to pixel space, baseline = bottom edge of quad), exportToZip()
    • ZIP structure: page-{docId}-{pageNum}.png + page-{docId}-{pageNum}.xml
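    A minimal sketch of buildPageXml()'s output shape, using the PAGE 2019 namespace and approximating each baseline as the bottom edge of the quad, as planned above (the element set is deliberately minimal; the real export may need more attributes):

    ```python
    import xml.etree.ElementTree as ET

    PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

    def build_page_xml(image_filename: str, img_w: int, img_h: int,
                       polygons_px: list) -> bytes:
        """Build a minimal PAGE XML document for segmentation training.

        polygons_px: one pixel-space point list [(x, y), ...] per text line.
        The baseline is taken as the two bottom-most points of each quad.
        """
        ET.register_namespace("", PAGE_NS)
        root = ET.Element(f"{{{PAGE_NS}}}PcGts")
        page = ET.SubElement(root, f"{{{PAGE_NS}}}Page",
                             imageFilename=image_filename,
                             imageWidth=str(img_w), imageHeight=str(img_h))
        region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", id="r1")
        for i, pts in enumerate(polygons_px):
            line = ET.SubElement(region, f"{{{PAGE_NS}}}TextLine", id=f"l{i}")
            coords = " ".join(f"{x},{y}" for x, y in pts)
            ET.SubElement(line, f"{{{PAGE_NS}}}Coords", points=coords)
            bottom = sorted(pts, key=lambda p: p[1])[-2:]  # two bottom-most points
            baseline = " ".join(f"{x},{y}" for x, y in sorted(bottom))
            ET.SubElement(line, f"{{{PAGE_NS}}}Baseline", points=baseline)
        return ET.tostring(root, encoding="utf-8", xml_declaration=True)
    ```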
  5. [backend] GET /api/ocr/segmentation-training-data/export — same streaming + 204 pattern

  6. [python] POST /segtrain — same ZIP Slip + TemporaryDirectory pattern; calls ketos segtrain --load blla; backs up + replaces + reloads segmentation model; returns metrics

  7. [backend] Add segtrainModel(byte[] zip) to OcrClient interface + implement in RestClientOcrClient

  8. [backend] Extend OcrTrainingService + info endpoint for segmentation block counts; model_name = "blla" in run record

  9. [frontend] "Nur Segmentierung" badge in TranscriptionBlock.svelte — small muted tag when text === '' && source === 'MANUAL'

  10. [frontend] SegmentationTrainingCard.svelte — second admin card; "Segmentierung trainieren" button → POST /api/ocr/segtrain; shares TrainingHistory.svelte filtered to model_name = 'blla'

  11. [frontend] Wire SegmentationTrainingCard into admin/system/+page.svelte below recognition card


Files touched

Backend: pom.xml, TrainingDataExportService, SegmentationTrainingExportService, OcrTrainingService, OcrClient, RestClientOcrClient, OcrController, OcrTrainingRun, TranscriptionBlockRepository, OcrTrainingRunRepository, ErrorCode, TranscriptionBlock, V29 + V30 migrations

Python: ocr-service/main.py (POST /train, POST /segtrain)

Frontend: TranscriptionEditView.svelte, TranscriptionBlock.svelte, TrainingHistory.svelte (new), OcrTrainingCard.svelte (new), SegmentationTrainingCard.svelte (new), admin/system/+page.svelte, errors.ts, Paraglide messages (de/en/es), regenerated api.ts

`[frontend]` Wire `SegmentationTrainingCard` into `admin/system/+page.svelte` below recognition card --- ### Files touched **Backend:** `pom.xml`, `TrainingDataExportService`, `SegmentationTrainingExportService`, `OcrTrainingService`, `OcrClient`, `RestClientOcrClient`, `OcrController`, `OcrTrainingRun`, `TranscriptionBlockRepository`, `OcrTrainingRunRepository`, `ErrorCode`, `TranscriptionBlock`, `V29` + `V30` migrations **Python:** `ocr-service/main.py` — `POST /train`, `POST /segtrain` **Frontend:** `TranscriptionEditView.svelte`, `TranscriptionBlock.svelte`, `TrainingHistory.svelte` (new), `OcrTrainingCard.svelte` (new), `SegmentationTrainingCard.svelte` (new), `admin/system/+page.svelte`, `errors.ts`, Paraglide messages (de/en/es), regenerated `api.ts`
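The ZIP handling that items 7 and 27 prescribe (ZIP Slip validation on every entry, extraction into a `tempfile.TemporaryDirectory()` that cleans up on success or failure) can be sketched as follows. This is a minimal illustration, not the actual `main.py` code; `extract_training_zip` and `run_training` are hypothetical names, and the real `ketos` call is elided:

```python
import io
import os
import tempfile
import zipfile

def extract_training_zip(zip_bytes: bytes, dest_dir: str) -> list[str]:
    """Extract a training ZIP into dest_dir, rejecting ZIP Slip entries."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for entry in zf.namelist():
            # Reject absolute paths and parent-directory traversal outright.
            if os.path.isabs(entry) or ".." in entry.split("/"):
                raise ValueError(f"Unsafe zip entry: {entry}")
            target = os.path.realpath(os.path.join(dest_dir, entry))
            # Belt and braces: the resolved path must stay inside dest_dir.
            if not target.startswith(os.path.realpath(dest_dir) + os.sep):
                raise ValueError(f"Entry escapes extraction dir: {entry}")
            zf.extract(entry, dest_dir)
            extracted.append(entry)
    return extracted

def run_training(zip_bytes: bytes) -> int:
    # TemporaryDirectory cleans up on success *and* on exception,
    # which is the guarantee Part 3 relies on.
    with tempfile.TemporaryDirectory() as tmp:
        entries = extract_training_zip(zip_bytes, tmp)
        # ... hand tmp to the actual ketos training call here ...
        return len(entries)
```

The two checks (entry-string validation plus resolved-path containment) cover both attack shapes Nora describes further down the thread.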
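The back-up/replace/reload step in item 7 is safest as an atomic rename, so the in-process engine never sees a half-written model file. A sketch under the assumption that training writes the candidate model to its own temporary path first; `install_new_model` is a hypothetical helper, and the engine-reload step is omitted:

```python
import os
import shutil

def install_new_model(new_model_path: str, live_model_path: str) -> str:
    """Back up the live model, then atomically swap in the new one."""
    backup_path = live_model_path + ".bak"
    if os.path.exists(live_model_path):
        # Keep a copy so a bad training run can be rolled back by hand.
        shutil.copy2(live_model_path, backup_path)
    # os.replace is atomic on POSIX when source and destination are on the
    # same filesystem, so readers never observe a partially written model.
    os.replace(new_model_path, live_model_path)
    return backup_path
```

Writing the candidate next to the live model (same filesystem) is what makes the `os.replace` step atomic.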

## 👨‍💻 Felix Brandt — Senior Fullstack Developer

### Questions & Observations

- **JPQL join on `annotationId` needs care**: The plan's `findEligibleKurrentBlocks()` query does `JOIN DocumentAnnotation a ON a.id = b.annotationId`. That's a non-FK join (no `@ManyToOne` between `TranscriptionBlock` and `DocumentAnnotation`). JPQL handles this with `JOIN … ON`, but I'd want to verify that the query compiles and doesn't produce a Cartesian product before writing tests against it. Consider wrapping the result in a projection that includes the annotation data in one shot, rather than doing N+1 lookups per block in `TrainingDataExportService`.
- **`TrainingDataExportService.exportToZip()` collects all block+annotation+document data before the lambda — right call, but how?** The plan says "query results collected before entering StreamingResponseBody lambda". That means the method returns a `StreamingResponseBody` that captures the already-queried data as a local variable. Make sure the query returns everything needed for rendering (block, annotation coords, document fileHash/s3Key) in one go — not lazy-loaded associations that will fail once the Hibernate session closes.
- **`sanitizeText()` null handling for Part 6**: The existing `sanitizeText()` in `TranscriptionService` returns `""` for null. But `TranscriptionBlock.text` is currently annotated `@Column(nullable = false)`. When we make it nullable in V30, we need to ensure `createOcrBlock()` and `createBlock()` in `TranscriptionService` still store `""`, not `null`, for OCR blocks — only segmentation-only manual blocks should store `null`. Worth making this intent explicit with a named constant or a factory-method distinction.
- **Python `ketos train` API**: The Kraken Python API for training (`ketos train`) uses a callback-based or iterator-based interface. The plan says "calls `ketos train` Python API", but the synchronous call pattern needs to be verified — `ketos` training is not a simple function call; it typically involves iterating an LMDB or ground-truth loader. The implementation needs to handle the training loop correctly (even if that means calling the CLI via subprocess, should the Python API prove complex to integrate directly).
- **`OcrTrainingCard.svelte` `$effect` on mount**: The plan says "fetch info via `$effect`". Since the admin system page has no `+page.server.ts`, this client-side fetch is correct, but the initial state while loading needs to be handled — the button must not appear enabled while `trainingInfo` is null/undefined. Use a loading-state guard.

### Suggestions

- **One integration test per export pipeline**: Write a `TrainingDataExportServiceIntegrationTest` that uses PDFBox itself to *generate* the fixture PDF (not a static file), asserting that: (1) the ZIP contains exactly `N` entries where `N = 2 × eligibleBlocks`, (2) each `.gt.txt` has the expected text content, (3) each `.png` is a valid image with non-zero dimensions.
- **`OcrTrainingRun` status as an enum, not a String**: `status VARCHAR(20)` on the DB side maps cleanly to a Java enum `TrainingStatus { RUNNING, DONE, FAILED }`. Then `findFirstByStatus(TrainingStatus.RUNNING)` is typesafe and refactor-friendly. The migration stores the string; the entity maps it as `@Enumerated(EnumType.STRING)`.
- **Segmentation "Nur Segmentierung" badge check**: `text === ''` is not the same as `text === null`. When the frontend receives a block from the API, the JSON serialization of a Java `null` String is `null` in JSON, but `""` if we default to `""`. Verify which value the API returns for segmentation-only blocks, and guard both: `(!block.text || block.text.trim() === '') && block.source === 'MANUAL'`.
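The `N = 2 × eligibleBlocks` assertion above is really a pairing invariant: every `.png` must have a matching `.gt.txt`, and nothing else may appear in the ZIP. The real test would be Java; here is a language-neutral sketch of the invariant in Python (`check_training_zip_pairing` is a hypothetical name, not project code):

```python
def check_training_zip_pairing(entry_names: list[str]) -> bool:
    """True iff entries are exactly matched <stem>.png / <stem>.gt.txt pairs."""
    pngs = {n[: -len(".png")] for n in entry_names if n.endswith(".png")}
    gts = {n[: -len(".gt.txt")] for n in entry_names if n.endswith(".gt.txt")}
    # Every image needs ground truth and vice versa ...
    paired = pngs == gts
    # ... and no stray files may sneak into the training ZIP.
    nothing_else = len(entry_names) == len(pngs) + len(gts)
    return paired and nothing_else
```

Stating the check this way also catches stray files, which a bare entry-count assertion would miss.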


## 🏗️ Markus Keller — Application Architect

### Questions & Observations

- **Concurrent training race condition in `OcrTrainingService`**: The plan rejects a second trigger if a RUNNING run exists, using `findFirstByStatusOrderByCreatedAtDesc()`. This is a read-then-write pattern — two simultaneous requests can both read "no RUNNING run" and both proceed to create one. The `ocr_training_runs` table should have a partial unique index to enforce the single-active-run invariant at the database layer:

  ```sql
  CREATE UNIQUE INDEX idx_ocr_training_runs_one_running
      ON ocr_training_runs (status)
      WHERE status = 'RUNNING';
  ```

  The application check stays as a user-friendly 409. The DB constraint is the safety net.
- **`TrainingDataExportService` domain boundary**: This service touches three areas — transcription blocks, document/file storage, and annotations. That's fine as long as it goes through `FileService` for S3 (not the S3 client directly) and through `TranscriptionBlockRepository` (not `DocumentRepository`). The annotation access via `AnnotationRepository` is the grey area: does `TranscriptionService` already provide a method to fetch blocks with their annotations? If so, delegate to it rather than having `TrainingDataExportService` reach directly into `AnnotationRepository`.
- **`SegmentationTrainingExportService` shares 80% of its logic with `TrainingDataExportService`**: Both render PDF pages at 300 DPI, both download files via `FileService`, both write PNGs to a ZIP. Before creating two separate services, consider whether the shared rendering logic should live in a `PdfPageRenderer` utility class that both services use. KISS applies — only extract it if the duplication is actually painful to maintain.
- **`GET /api/ocr/training-info` is called on every page load of the admin system tab**: This aggregates 5+ queries and an HTTP health check to the Python service. The health check alone adds latency and introduces a hard dependency on OCR-service availability just to *display* the page. Consider: (a) make `ocrServiceAvailable` a separate lazy fetch, or (b) cache the health status with a short TTL (30–60 s) in `OcrHealthClient` to avoid hitting the Python service on every admin page open.
- **PAGE XML generation for segmentation training**: The spec describes producing PAGE XML with polygon coordinates. `DocumentAnnotation.polygon` is a JSONB field containing a 4-point quad. Verify the coordinate system: are these normalized (0.0–1.0) or pixel-absolute? The plan de-normalizes them to pixel space, which is correct — but the `imageWidth` and `imageHeight` in the PAGE XML must match the rendered image dimensions exactly (2480 × 3508 for A4 at 300 DPI), or Kraken's segtrain will reject them. This detail needs a unit test that verifies coordinate de-normalization is consistent with the rendered image dimensions.
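The de-normalization unit test requested above pins down simple arithmetic. The backend implementation would be Java; this Python sketch (both function names are hypothetical) shows the expected mapping for A4 at 300 DPI (2480 × 3508), including the plan's "baseline = bottom edge of the quad" rule:

```python
def denormalize_quad(quad, image_width, image_height):
    """Map a normalized 4-point quad (0.0-1.0) to integer pixel coordinates."""
    return [(round(x * image_width), round(y * image_height)) for x, y in quad]

def baseline_from_quad(pixel_quad):
    """Baseline per the plan: the bottom edge of the quad, i.e. the two
    points with the largest y, ordered left to right."""
    bottom_two = sorted(pixel_quad, key=lambda p: p[1])[-2:]
    return sorted(bottom_two, key=lambda p: p[0])
```

Fixing `image_width`/`image_height` to the actual rendered-bitmap size (not the PDF's point size) is exactly the consistency the unit test should assert.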

### Suggestions

- **`OcrTrainingRun` completedAt vs. duration**: Storing `created_at` + `completed_at` lets you compute duration, which is useful in the admin UI. But consider also storing `block_count` and `document_count` on the run at creation time (before training starts), not just on completion — this way you can display "training 23 blocks from 8 documents…" while RUNNING, not just retrospectively.
- **Error recovery**: If the Spring Boot process crashes during a training run (OOM during PDFBox rendering, for example), the run stays in RUNNING forever. Add a startup recovery step in `OcrTrainingService`: on application start, find any RUNNING runs older than 1 hour and mark them FAILED with `error_message = "Abgebrochen: Dienst wurde neugestartet"` ("Aborted: service was restarted"). This prevents the admin from being permanently locked out of triggering new runs.


## 🧪 Sara Holt — QA Engineer & Test Strategist

### Questions & Observations

- **PDF fixture strategy**: The plan mentions a "minimal 1-page PDF fixture" in `src/test/resources/fixtures/`. A static binary file in git is fine for a small fixture, but consider generating it programmatically with PDFBox in a `@BeforeAll` method instead — that avoids committing a binary and lets you control page dimensions precisely (exact A4 at 300 DPI = 2480×3508 px). Either way, the fixture must have at least one text region so a crop produces a non-blank PNG.
- **Integration test data setup**: `TrainingDataExportServiceIntegrationTest` needs a full chain in the DB: Document (scriptType=HANDWRITING_KURRENT, with a real s3Key) → DocumentAnnotation (with x/y/width/height) → TranscriptionBlock (source=MANUAL, or source=OCR + reviewed=true). That's three entities with valid FK relationships. Use `@Sql` fixture files or `@BeforeEach` builders — not inline `save()` chains inside the test body.
- **The `findEligibleKurrentBlocks()` JPQL query needs explicit negative tests**:
  - Block is OCR + `reviewed=false` → NOT included
  - Block is MANUAL → included (regardless of `reviewed`)
  - Block is OCR + `reviewed=true` → included
  - Block's document is `scriptType=TYPEWRITER` → NOT included

  All four cases should be in a `@DataJpaTest` against a real Postgres container.
- **Concurrent training guard test**: Testing the 409 requires two near-simultaneous HTTP requests, which is hard to do deterministically in a unit test. A simpler approach: write a unit test that calls `triggerTraining()` twice on a mock that returns "no running run" on the first call but a running run on the second, and verify that the second call throws `DomainException.conflict()`. The DB-level partial unique index (if added) would be tested by the integration test inserting two RUNNING rows and expecting a constraint violation.
- **Python test coverage**: The `POST /train` and `POST /segtrain` endpoints in `main.py` have no test plan in the current spec. At minimum, pytest tests should cover: (1) ZIP Slip rejection (a ZIP entry with a `../../etc/passwd` path), (2) a valid ZIP being extracted to the correct temp dir, (3) cleanup happening even on exception. These don't require a real Kraken model — mock `ketos.train` at the module level with `unittest.mock.patch`.
- **Missing edge case: PDF deleted from S3**: If `FileService.downloadFileBytes(s3Key)` throws because the file no longer exists in MinIO, `TrainingDataExportService` should skip that document gracefully (log + continue) rather than aborting the entire ZIP export. The test: one document with a valid block but a missing S3 file → the ZIP still contains blocks from the other documents.
- **V30 migration rollback path**: Making `text` nullable is easy to roll forward. Rolling back (setting NOT NULL again) would fail if any null rows exist. Verify that the V30 migration includes a comment noting this is intentional and irreversible without a data-cleanup step.
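The cleanup-on-exception case from the Python test list above can be exercised without any Kraken dependency, because `tempfile.TemporaryDirectory()` provides the guarantee by itself. A sketch with a stand-in trainer; `run_with_cleanup` is hypothetical and merely mimics the shape of the real `/train` body:

```python
import os
import tempfile

def run_with_cleanup(trainer) -> str:
    """Stand-in for the /train body: make a temp dir, train, always clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        # (ZIP extraction would happen here)
        trainer(tmp)
        return tmp  # the directory is deleted by the time the caller sees this

def test_cleanup_on_trainer_failure():
    seen = {}
    def failing_trainer(path):
        seen["path"] = path
        raise RuntimeError("training blew up")
    try:
        run_with_cleanup(failing_trainer)
    except RuntimeError:
        pass
    # The context manager must have removed the directory despite the exception.
    assert not os.path.exists(seen["path"])
```

The same pattern, with `unittest.mock.patch` swapping the trainer for a mock, covers the other two pytest cases.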

### Suggestions

- **Test name convention reminder**: The test names in the plan are described as behaviors, which is right. Make sure they follow the project pattern: `should_notIncludeUnreviewedOcrBlocks_whenQueryingEligibleKurrentBlocks`, or just `should return only eligible Kurrent blocks` (sentence style). Avoid `testFindEligibleKurrentBlocks`.
- **Vitest test isolation for `OcrTrainingCard`**: When testing the disabled states, mock `fetch` at the Vitest level using `vi.stubGlobal('fetch', ...)`. The component's `$effect` fetch means the test must `await tick()` after render to let the async update propagate before asserting on the button state.


## 🔒 Nora "NullX" Steiner — Application Security Engineer

### Questions & Observations

- **Python `/train` and `/segtrain` — network-level authentication gap**: The plan correctly applies `@RequirePermission(ADMIN)` on the Spring Boot side. But the Python service itself has no authentication — it accepts any POST to `/train` or `/segtrain`. The current compose exposes the OCR service only internally (`expose: 8000`, not `ports:`), so Docker network isolation is the only guard. This is acceptable for now, but document it explicitly: if `ocr-service` is ever given a host port binding (e.g. for debugging), anyone with network access can trigger training. A pre-shared secret header (`X-Training-Token`) would close this with minimal complexity:

  ```python
  TRAINING_TOKEN = os.environ.get("TRAINING_TOKEN")
  # in the endpoint:
  if token != TRAINING_TOKEN:
      raise HTTPException(status_code=403, detail="Unauthorized")
  ```

  Spring Boot sends this header; the Python service checks it. Zero user friction, closes the gap.
- **ZIP Slip — the fix is correct, but needs one more check**: The plan's validation checks that the extracted path doesn't escape `temp_dir`. Good. But also validate that ZIP entries do not contain absolute paths (entries starting with `/` or `C:\`):

  ```python
  if os.path.isabs(entry) or entry.startswith('..'):
      raise ValueError(f"Unsafe zip entry: {entry}")
  ```

  Both checks together cover the full CWE-22 surface.
- **Training data as an indirect attack surface**: Cropped document images and their `.gt.txt` transcriptions are written to a temp dir during training, then (per the plan) cleaned up. But if the model-file replacement step fails after the temp dir is created, are the training images cleaned up? The `tempfile.TemporaryDirectory()` context manager handles this correctly — as long as it wraps the entire operation, including the model replacement, not just the extraction.
- **`triggered_by` population**: The plan correctly specifies that this comes from the authenticated session. Just confirming: the implementation should use `authentication.getName()` → `userService.findByUsername()` → `user.getId()`, same as `requireUserId()` in `TranscriptionBlockController`. Do not add a `triggered_by` field to any request body or DTO.
- **Training data export endpoint and IDOR**: `GET /api/ocr/training-data/export` is admin-gated and queries all eligible blocks across all documents. No IDOR risk, since there's no per-resource scoping. The main surface is the admin permission check — verify that `@RequirePermission(ADMIN)` is actually enforced in the test suite, with a test that sends the request without admin permission and expects 403.

### Suggestions

- **Add a security test for the export endpoint**: A `@WebMvcTest` with `@WithMockUser(roles = "READ_ALL")` calling `GET /api/ocr/training-data/export` should return 403. This is a regression guard — it ensures `@RequirePermission(ADMIN)` never gets accidentally removed.
- **Minimum block threshold**: The plan currently allows training with as few as one reviewed block. Consider a minimum threshold (e.g. 5 blocks) to reduce the practical impact of training-data poisoning via a single tampered block. The admin UI can enforce this in the disabled-button logic; the backend should also validate it and return a 422 with a clear message if the threshold isn't met.

Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

  • Progress counter placement in TranscriptionEditView.svelte: The plan adds "X von Y geprüft" to the "panel's top bar." The current component has no top bar — it's a scrollable flex column. The counter should appear as a sticky header row at the top of the scrollable area, above the block list, so it remains visible as the user scrolls through many blocks. A thin line below it visually separates it from the first block.

  • OcrTrainingCard info text layout on narrow screens: The two-line hierarchy (bold count + docs on line 1, muted total on line 2) works well on desktop. On mobile (320px), long German strings like "23 geprüfte Blöcke aus 8 Dokumenten verfügbar" may overflow. Test at 320px and use break-words or wrap inside a <p> with text-sm or smaller.

  • History table on mobile: A 5-column table (Datum | Status | Blöcke | Dokumente | Gestartet von) will not fit at 320px without horizontal scrolling or column hiding. Recommend: at < md breakpoint, show only Datum + Status + Blöcke, hide the rest. Use hidden md:table-cell on the lower-priority columns.

  • RUNNING state spinner in TrainingHistory: The plan mentions a spinner for RUNNING status. Animated spinners in a table column can be visually jarring if the page auto-polls. Use a subtle pulsing dot (animate-pulse) rather than a spinning circle — it's less distracting in a dense table row.

  • OcrTrainingCard button disabled feedback timing: When the user clicks "Training starten" and the request is in-flight, the button should show a loading state (disabled + spinner or "…") before the response arrives. Without this, a slow network makes it look like nothing happened, and users click again.

  • "Nur Segmentierung" badge in TranscriptionBlock: The badge should appear where the label is currently rendered (the header row inside the card), styled as a small muted chip — text-xs font-medium text-ink-3 bg-muted rounded px-1.5 py-0.5. It must not push the block's layout wider; keep it inline with the label. If the block also has a label, show label first, then the badge.

Suggestions

  • Progress counter with mini progress bar: Consider pairing the "12 / 17 geprüft" text with a thin progress bar underneath (2px height, brand-mint fill on brand-sand background). It gives immediate visual density — users grasp "70% done" faster than parsing the fraction. Total added height: ~10px.

  • "Training starten" → feedback after completion: After a successful train call, show an inline success message in the card (e.g. a green bordered row: "Training abgeschlossen — Modell aktualisiert") for ~5 seconds, then reload the history. Don't rely on the history table update alone — users may miss a new row appearing at the top of a table they weren't watching.

Author
Owner

⚙️ Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

  • Backend memory limit needs updating: The current docker-compose.yml doesn't have explicit memory limits on the backend container. PDFBox at 300 DPI produces ~25 MB per page uncompressed in memory. For a training export across 8 documents with 3–4 pages each, that's ~800 MB peak on top of the normal JVM heap. Add an explicit limit:

    backend:
      deploy:
        resources:
          limits:
            memory: 1500m
    

    Without this, a large export could trigger the OOM killer and take down the entire backend container — including all other users' requests.
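The ~25 MB/page and ~800 MB peak figures are consistent with the raw buffer geometry; a quick back-of-envelope sketch (assumes A4 pages and 3-byte RGB pixels — an int-packed BufferedImage would use 4 bytes/pixel and land closer to 33 MB/page):

```python
# Back-of-envelope check of the ~25 MB/page estimate at 300 DPI
dpi = 300
width_px = round(8.27 * dpi)    # A4 width  (8.27 in)
height_px = round(11.69 * dpi)  # A4 height (11.69 in)
mb_per_page = width_px * height_px * 3 / 1024 / 1024   # 3 bytes per RGB pixel
peak_mb = mb_per_page * 8 * 4   # 8 documents x ~4 pages each
print(f"{mb_per_page:.0f} MB/page, ~{peak_mb:.0f} MB peak")
```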

  • Stuck RUNNING runs after container restart: Markus already raised the startup recovery logic. From an ops angle: the recovery should run in a @EventListener(ApplicationReadyEvent.class) method, not in the constructor or @PostConstruct — the JPA context needs to be fully ready before querying OcrTrainingRunRepository. Log the recovery at WARN level so it's visible in Loki: "Found orphaned RUNNING training run {} — marking FAILED (service was restarted)".

  • OCR service health check during training: The Python service currently returns {"status": "ok"} or fails the health check entirely. During a synchronous training run, the /health endpoint should still respond (training blocks the request thread, but FastAPI can serve health on a separate worker). Verify the uvicorn worker count: if it's --workers 1 (the default in the Dockerfile), health checks will time out during training. Either add --workers 2 or make the /health endpoint non-blocking. Docker's health check interval is 10s with 5s timeout — a blocked single worker would fail it.

  • Model backup naming: The plan backs up german_kurrent.mlmodel to german_kurrent.mlmodel.bak. A single .bak means each training run overwrites the previous backup — you lose the ability to roll back past one run. Consider timestamped backups: german_kurrent.mlmodel.{timestamp}.bak. Keep the last 3. This adds minimal disk usage (models are ~50–200 MB) and gives a recovery path if two consecutive training runs both degrade quality.
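A sketch of that rotation (the helper name `backup_model` and the stamp format are illustrative, not the plan's actual code; the fixed-width timestamp makes lexicographic order equal chronological order):

```python
import os
import shutil
import time

def backup_model(model_path: str, keep: int = 3) -> str:
    """Copy the model to a timestamped .bak next to it and prune old backups."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_path = f"{model_path}.{stamp}.bak"
    shutil.copy2(model_path, backup_path)
    # Prune: keep only the `keep` newest backups for this model
    prefix = os.path.basename(model_path) + "."
    parent = os.path.dirname(model_path) or "."
    backups = sorted(
        f for f in os.listdir(parent)
        if f.startswith(prefix) and f.endswith(".bak")
    )
    for old in backups[:-keep]:
        os.remove(os.path.join(parent, old))
    return backup_path
```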

  • ocr_models volume and the blla segmentation model: The compose file mounts ocr_models:/app/models. The segmentation model (blla) is part of Kraken's built-in models — it's not in /app/models, it's in the Kraken package or the model cache at /root/.cache. The ocr_cache volume already covers this. Verify where blla lives after a ketos segtrain run: the fine-tuned model must be written to a path on a named volume, not to the container filesystem.

Suggestions

  • Structured logging for training runs: Add a training_run_id field to every log line emitted during a training run, both in Java (MDC.put("trainingRunId", run.getId().toString())) and in Python (pass it as a log context header or prefix). This makes it trivial to filter all logs for a specific run in Loki: {container="archive-backend"} | json | trainingRunId="uuid-here".
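On the Python side there is no MDC, but a `logging.Filter` achieves the same effect — every record emitted while the filter is attached carries the run id. A minimal sketch (names are illustrative, not the service's actual logging setup):

```python
import logging

class TrainingRunFilter(logging.Filter):
    """Inject the current training run id into every record (Python analogue of MDC)."""
    def __init__(self, run_id: str):
        super().__init__()
        self.run_id = run_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.training_run_id = self.run_id
        return True  # never suppress the record, only annotate it

logger = logging.getLogger("ocr.training")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s training_run_id=%(training_run_id)s %(message)s"
))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.addFilter(TrainingRunFilter("run-1234"))
logger.info("ketos train started")
```

Attach the filter when the `/train` request arrives and remove it when the run ends, so concurrent log lines from unrelated requests are not tagged.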

  • Add TRAINING_TOKEN to the compose env block now: Even if it's optional in the first iteration, add the env var to docker-compose.yml with a placeholder comment so it's not forgotten:

    ocr-service:
      environment:
        TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"  # set to a secret in production
    

    An empty value means no token check (dev-friendly default); a non-empty value enables the check.

Author
Owner

Implementation complete

All 6 parts of the Kraken fine-tuning pipeline are implemented. Branch: feat/issue-226-227-ocr-pipeline-polygon


Commits (chronological)

  1. 7322907 feat(transcription) — Sticky review progress indicator (X / Y reviewed + progress bar) in TranscriptionEditView (Part 1)

  2. fdf1eb9 feat(training) — Training enrollment: document_training_labels table, TrainingLabel enum, @ElementCollection on Document, PATCH /api/documents/{id}/training-labels, auto-enrollment on Kurrent OCR, chip toggle in the transcription panel (Part 1b)

  3. cfa3c4d feat(training) — Recognition training data export: TrainingDataExportService (PDFBox 300 DPI, ZIP with PNG + GT text pairs), GET /api/ocr/training-data/export (Part 2)

  4. bc97a2d feat(ocr) — Python /train endpoint (ZIP Slip protection, ketos.train, backup rotation, in-process reload), OcrClient.trainModel(), RestClientOcrClient implementation with 10-minute timeout + X-Training-Token (Part 3)

  5. 88e005e feat(ocr) — Training history: V30 migration (ocr_training_runs, partial unique index), OcrTrainingRun entity, OcrTrainingService (409 guard, orphan recovery), POST /api/ocr/train, GET /api/ocr/training-info (Part 4)

  6. 4e08d31 feat(admin) — Admin UI: TrainingHistory.svelte, OcrTrainingCard.svelte, wired into admin/system/+page.svelte, Vitest tests (Part 5)

  7. 9b2f91e feat(training) — Segmentation training: SegmentationTrainingExportService (PAGE XML, polygon denormalization), /segtrain Python endpoint, segtrainModel() in OcrClient, V31 migration (text nullable), Paraglide i18n for all training strings (Part 6)

  8. 86e9c05 feat(training) — All UI components migrated to Paraglide (TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, "Nur Segmentierung" badge), SegmentationTrainingCard wired into the admin page, availableSegBlocks added to TrainingInfoResponse


Test results

  • Backend: 874 tests
  • Frontend: 708 tests
  • Type check: no new errors

Next steps

  • Open the PR and run the review
  • Test the deployment: OCR service with --workers 2, set TRAINING_TOKEN in prod
Author
Owner

Guided OCR implemented

Branch: feat/issue-226-227-ocr-pipeline-polygon
Commit: ee58b63

What was built

Python (ocr-service)

  • Added OcrRegion model (annotationId, pageNumber, x/y/width/height)
  • Extended OcrRequest with regions: list[OcrRegion] | None; extended OcrBlock with annotationId
  • Added extract_region_text(image, x, y, w, h) to both Kraken and Surya engines — crops to normalized region, runs recognition on the crop
  • Added guided mode branch in /ocr/stream: when regions is present, groups by page and recognizes each region without layout detection, annotationId flows back in each block
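The crop math inside `extract_region_text` isn't shown here; under the stated normalized-coordinate contract it presumably reduces to a conversion like the following sketch (helper name and clamping behavior are assumptions):

```python
def region_to_pixel_box(x: float, y: float, w: float, h: float,
                        img_w: int, img_h: int) -> tuple[int, int, int, int]:
    """Convert a normalized (0..1) annotation region to a clamped pixel box
    (left, top, right, bottom) suitable for PIL's Image.crop()."""
    left = max(0, round(x * img_w))
    top = max(0, round(y * img_h))
    right = min(img_w, round((x + w) * img_w))
    bottom = min(img_h, round((y + h) * img_h))
    if right <= left or bottom <= top:
        raise ValueError("Region collapses to an empty crop")
    return left, top, right, bottom
```

Clamping matters because hand-drawn annotations can extend slightly past the page edge; without it, the crop call would fail on otherwise valid regions.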

Java (backend)

  • OcrBlockResult gains annotationId field (null in normal mode)
  • OcrClient.OcrRegion record + updated streamBlocks signature
  • TriggerOcrDTO.useExistingAnnotations flag (Boolean, defaults to false)
  • TranscriptionBlockRepository.findByAnnotationId
  • TranscriptionService.upsertGuidedBlock — creates new OCR block, updates existing OCR block, or skips MANUAL block unchanged
  • OcrAsyncRunner dispatches to upsertGuidedBlock when annotationId is non-null; in guided mode fetches existing annotations and skips clearExistingBlocks()
  • OcrService.startOcr threads useExistingAnnotations through

Frontend

  • OcrTrigger: new annotationCount prop; when > 0 shows a checkbox "Nur annotierte Bereiche" with a hint; skips the destructive-replace confirmation in guided mode
  • TranscriptionEditView: passes annotationCount={blocks.length} to OcrTrigger
  • Document page triggerOcr passes useExistingAnnotations in the POST body
  • Paraglide keys ocr_use_existing_annotations + ocr_use_existing_annotations_hint in de/en/es

Tests

  • TranscriptionServiceGuidedTest (3 cases: creates new, updates OCR, preserves MANUAL)
  • All existing OCR tests updated for new method signatures
  • 708 frontend + all backend tests green
Reference: marcel/familienarchiv#230