feat(ocr): full OCR pipeline with polygon annotations, training, and guided mode #232

Merged
marcel merged 40 commits from feat/issue-226-227-ocr-pipeline-polygon into main 2026-04-14 10:31:35 +02:00
Owner

Closes #226, closes #227

Summary

End-to-end OCR pipeline for handwritten (Kurrent) and printed documents, built on Kraken/Surya via a Python microservice, with annotation-backed transcription blocks, a segmentation + recognition training pipeline, and a guided OCR mode.

Core infrastructure

  • Python OCR microservice with Kraken 7 and Surya 0.17, wired via Docker Compose
  • OcrService, OcrBatchService, OcrProgressService, OcrController on the backend
  • OcrClient interface + RestClientOcrClient with NDJSON streaming
  • Per-script-type confidence thresholds and [unleserlich] markers for low-confidence words
  • SSRF protection for PDF URL downloads

Annotations & polygons

  • Polygon JSONB support (@JdbcTypeCode(JSON)) for quadrilateral annotation shapes
  • createOcrAnnotation bypasses overlap check for OCR-placed regions
  • ScriptType enum + script_type column on Document
  • Cascade-delete transcription block when annotation is deleted

Streaming & progress

  • NDJSON streaming endpoint POST /ocr/stream with OcrStreamEvent sealed interface
  • Frontend SSE subscription with auto-reconnect, OcrProgressBar component

Training pipeline

  • Segmentation training via ketos segtrain subprocess
  • Recognition training with CER tracking per run
  • Training data export (PAGE XML format), enrollment per document
  • CoreML model export; training history + POST /train + GET /training-info endpoints
  • SegmentationTrainingCard and OCR training card on admin/system

Guided OCR & manual-first workflow

  • Guided OCR mode using existing annotation regions
  • Removed full-page auto-segmentation — manual annotation is now the primary entry point
  • Fill empty MANUAL blocks in guided OCR mode; synthetic baseline for small crops

Other

  • ADR-001 (OCR microservice) and ADR-002 (polygon JSONB)
  • Paraglide i18n for all OCR/training UI strings
  • Full business-logic test suite for polygon extraction, Kraken routing, confidence markers
  • Sticky review progress counter in TranscriptionEditView
  • source / reviewed fields on transcription blocks for training pipeline

Test plan

  • ./mvnw test — all backend tests pass (incl. OcrServiceTest)
  • Start full stack (docker-compose up -d), upload a PDF document, select script type, trigger OCR — blocks appear with bounding polygons
  • Low-confidence words show [unleserlich] markers
  • SSE progress bar advances per page and disappears on completion
  • Guided OCR mode: draw annotation regions manually, re-run OCR — fills only those regions
  • Admin → System → OCR Training: enroll document, export training data, start segmentation training — CER history updates
  • Deleting an annotation cascades to its transcription block

🤖 Generated with Claude Code

marcel added 27 commits 2026-04-13 22:34:33 +02:00
fix(ocr): use correct Kraken record attributes for line geometry
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
33dc4654e5
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shows 'X / Y geprüft' with a brand-mint progress bar at the top of the
transcription panel. Derived from the blocks prop — no extra state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- V29 migration: document_training_labels join table
- TrainingLabel enum: KURRENT_RECOGNITION, KURRENT_SEGMENTATION
- Document.trainingLabels @ElementCollection
- DocumentService.addTrainingLabel / removeTrainingLabel
- PATCH /api/documents/{id}/training-labels (WRITE_ALL)
- Auto-enroll on Kurrent OCR trigger (OcrService.startOcr)
- TranscriptionEditView: enrollment chips in panel footer
- JPQL queries updated to use MEMBER OF trainingLabels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TrainingDataExportService: PDFBox rendering at 300 DPI, crop by
  annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs
- Skips documents with missing S3 files (logs WARN, continues)
- GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory,
  ketos transfer learning, timestamped backups (keep last 3), in-process reload
- X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty)
- trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout,
  multipart upload, forwards X-Training-Token when configured)
- TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile
  so /health stays responsive during synchronous training

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OcrTrainingRun entity + V30 migration (partial unique index prevents
  concurrent runs at DB level)
- OcrTrainingService: concurrent-run guard, 5-block threshold, MDC log
  correlation, orphan recovery on ApplicationReadyEvent
- POST /api/ocr/train (ADMIN) + GET /api/ocr/training-info (ADMIN)
- TRAINING_ALREADY_RUNNING ErrorCode
- 6 OcrTrainingServiceTest + 6 OcrControllerTest tests for the new endpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TrainingHistory.svelte: responsive table with status badges
  (green/red/animated pulse), keyed iteration, empty-state row
- OcrTrainingCard.svelte: shows available blocks/docs, disabled states
  (< 5 blocks, service down), in-flight "…" state, 5s success message,
  embeds TrainingHistory
- Wired into admin/system/+page.svelte via fetchTrainingInfo() in $effect
- Regenerated api.ts with OcrTrainingRun + TrainingInfoResponse types
- TRAINING_ALREADY_RUNNING error code in errors.ts + de/en/es translations
- 7 OcrTrainingCard Vitest tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and
  TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys
- Add availableSegBlocks to TrainingInfoResponse to expose segmentation
  block count in the training info endpoint
- Wire SegmentationTrainingCard into admin/system page below OCR training card
- Update api.ts with availableSegBlocks field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain,
  backup rotation, in-process model reload)
- Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout,
  X-Training-Token header)
- Add SegmentationTrainingExportService: PAGE XML export with polygon
  de-normalization and per-page PNG rendering via PDFBox
- Add GET /api/ocr/segmentation-training-data/export endpoint
- Make TranscriptionBlock.text nullable for segmentation-only blocks
  (V31 migration)
- Add Paraglide i18n translation keys for all training UI strings (de/en/es)
- Pass source prop from TranscriptionEditView to TranscriptionBlock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.

- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
  branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
  TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
  upsertGuidedBlock when annotationId is present; OcrService threads
  the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
  or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
  skips destructive-replace confirmation in guided mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
blla.segment() runs a full-page layout-detection model that kills the worker
process when called on tiny annotation crops (e.g. 597x89 px). For guided
OCR the annotation region IS already the text line, so segmentation is
unnecessary. Replace the blla call with a single synthetic BaselineLine that
spans the full crop width — rpred then runs recognition on the whole crop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken's segmentation bounds check rejects coordinates where any point
satisfies x >= im.width or y >= im.height (strictly >=, not >). Using
(cw, ch) as the boundary corner was triggering this for every crop.
Changed to (cw-1, ch-1) so all coordinates are strictly inside the image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a user draws annotation boxes to mark OCR regions, the blocks are
created with source=MANUAL and empty text. upsertGuidedBlock was
protecting all MANUAL blocks unconditionally, so guided OCR silently
produced no output for these drawn-but-empty blocks.

Changed the guard to only protect non-empty MANUAL blocks — empty ones
are treated like OCR blocks and get their text filled in.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Historical letter lines often intersect, so the system must support
overlapping annotation regions. Removed the overlap guard from
createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned
up all tests and frontend error mappings that referenced it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The findSegmentationBlocks query was filtering out blocks with non-empty
text. Segmentation training only needs annotation geometry (polygon/bbox),
not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION
document should count, regardless of whether it has text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both cards were reading the same availableBlocks field, so the segmentation
box always showed the kurrent recognition count. Use the correct
availableSegBlocks field from the training info response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The parent was manually remapping availableSegBlocks → availableBlocks
before passing props, which broke after the card was updated to read
availableSegBlocks directly. Pass the full trainingInfo object instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all MANUAL blocks counted as eligible training data, even ones
where text was filled in by guided OCR but never explicitly reviewed. This
caused segmentation and recognition counts to always match.

Now only reviewed=true blocks qualify for recognition training, so the
counts properly reflect: segments = all drawn annotation boxes,
checked text = only boxes where the user has verified the transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.

Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are
only exposed as CLI commands. Rewrites both training functions to invoke
`ketos train` / `ketos segtrain` via subprocess and parse the best
val_metric from checkpoint filenames.

Also fixes the OcrTrainingCard history so it only shows non-blla runs
(recognition model), matching SegmentationTrainingCard which already
filtered to blla-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DataLoader worker subprocesses crash inside Docker due to multiprocessing
fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain
so data loading runs in the main process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2
(--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a
laptop OOM-killed the container. 10 epochs is sufficient for incremental
fine-tuning runs; more data is added over time and training re-run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ketos segtrain has no batch-size flag (-B), so with the default 1800px
input height the intermediate CNN feature maps consume ~500 MB+ per
image, causing the kernel OOM-killer (exit -9) to terminate the process.

On first run (no existing blla.mlmodel), override the VGSL spec to use
800px height instead. Subsequent runs load the saved model with
--resize both, preserving incremental fine-tuning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.

- Java export now writes .xml alongside .png (PAGE XML with TextLine,
  Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (&amp; &lt; &gt;)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
  OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ketos 7 defaults to safetensors output, but kraken's load_any() only
handles CoreML (.mlmodel). Adding --weights-format coreml ensures the
hot-swap after training produces a file that load_any() can parse.

Also fixed _find_best_model to look for best_<score>.mlmodel (produced
by --weights-format coreml) in addition to the previous checkpoint_*
pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drawing annotations is now the primary workflow. OCR only runs on
manually drawn regions (guided mode always). Full-page layout detection
and the useExistingAnnotations checkbox are removed entirely.

- OcrTrigger: guided-only, disabled with hint when no annotations exist
- TranscriptionEditView: empty state shows draw-regions instruction,
  OCR trigger moved out of collapsible and shown inline after block list
- i18n: add ocr_trigger_no_annotations, ocr_section_heading,
  transcription_empty_draw_hint; remove ocr_use_existing_annotations keys

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(annotations): cascade-delete transcription block when annotation is deleted
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
5a5a8b6e5c
The DELETE endpoint was returning 500 due to a FK constraint violation.
`deleteAnnotation` now calls `blockRepository.deleteByAnnotationId()`
before removing the annotation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Great work shipping this — the scope is huge and the architecture is coherent. A few things I'd want addressed before merging.


Blockers

1. OcrAsyncRunner is untested
The async runner is the core orchestration path — it calls the Python service, creates annotations and blocks, emits SSE progress. None of this is tested. A failing service, a bad NDJSON event, or a polygon mapping bug would be invisible until production.

2. TranscriptionService.upsertGuidedBlock has no test
The conditional logic here is non-trivial: preserve MANUAL blocks with text, update/create OCR blocks otherwise. This is exactly the kind of branching that should be pinned with unit tests before it gets touched.

3. RestClientOcrClient.parseNdjsonStream — untested parser
NDJSON line-by-line parsing with a type discriminator and Jackson deserialization — three things that can go wrong silently. Should have at minimum: a test for each event type, a test for blank lines, a test for unknown types (should not crash), and a test for truncated streams.
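
A minimal sketch of one such test — it assumes a package-visible parseNdjsonStream(BufferedReader) returning List<OcrStreamEvent>; the real signature and event payloads may differ:

// Sketch only — constructor, method visibility, and event shapes are assumptions.
@Test
void parseNdjsonStream_skipsBlankLines_andToleratesUnknownTypes() {
    String ndjson = """
            {"type":"start","pages":2}

            {"type":"page","page":1,"blocks":[]}
            {"type":"some-future-event"}
            {"type":"done"}
            """;

    List<OcrStreamEvent> events = client.parseNdjsonStream(
            new BufferedReader(new StringReader(ndjson)));

    assertEquals(3, events.size()); // blank line skipped, unknown type dropped rather than thrown
}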

4. AnnotationService.deleteAnnotation cascade — no test
The cascade delete (annotation → transcription block) is a critical invariant. The commit message says "cascade-delete transcription block when annotation is deleted" but I see no test asserting transcriptionBlockRepository.findByAnnotationId(id) returns empty after delete.
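
Given the existing Mockito style, that test could be as small as this sketch (repository and method names taken from the commit messages; the delete signature is assumed):

// Sketch — verifies the cascade order with mocked repositories.
@Test
void deleteAnnotation_cascadesToTranscriptionBlock() {
    annotationService.deleteAnnotation(annotationId, ownerId);

    InOrder inOrder = inOrder(blockRepository, annotationRepository);
    inOrder.verify(blockRepository).deleteByAnnotationId(annotationId);
    inOrder.verify(annotationRepository).deleteById(annotationId);
}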


Suggestions

5. OcrProgressService emitter cleanup
On SseEmitter timeout/error, the emitter is removed from the per-job list — but the job entry in the map is never removed, even after complete(). This is a minor memory leak for long-running instances with many jobs. A ConcurrentHashMap.remove(jobId) in complete() would fix it.
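
Roughly, assuming the per-job map looks like this (field and method names assumed):

private final Map<String, List<SseEmitter>> emittersByJob = new ConcurrentHashMap<>();

public void complete(String jobId) {
    // remove the whole map entry, not just the emitters inside it
    List<SseEmitter> emitters = emittersByJob.remove(jobId);
    if (emitters != null) {
        emitters.forEach(SseEmitter::complete);
    }
}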

6. TranscriptionService.sanitizeText is a private method doing two things
It strips newlines AND caps at 10k chars. Those are separate concerns and the 10k cap is surprising (no comment, no constant). Extract a named constant MAX_BLOCK_TEXT_LENGTH = 10_000 and add a brief comment for future readers.
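
Something like the following — assuming the current behavior is flatten-newlines-then-truncate; the exact semantics should be copied from the existing method:

// Blocks above this length are truncated; OCR can occasionally emit runaway text.
private static final int MAX_BLOCK_TEXT_LENGTH = 10_000;

private String sanitizeText(String text) {
    String flattened = text.replace('\n', ' ').replace('\r', ' ');
    return flattened.length() <= MAX_BLOCK_TEXT_LENGTH
            ? flattened
            : flattened.substring(0, MAX_BLOCK_TEXT_LENGTH);
}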

7. Svelte: TranscriptionEditView.svelte is carrying a lot
Drag-and-drop reordering, debounced auto-save, beacon fallback, OCR triggering, review progress — this component has grown significantly. Worth a follow-up split into focused sub-components, but not a blocker for this PR.

Author
Owner

🏛️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

The ADRs are exactly what I want to see — ADR-001 and ADR-002 make the trade-offs explicit and are well-reasoned. The overall architecture is sound. A few structural risks worth naming.


Blockers

1. Presigned URL TTL race
The OCR service downloads the PDF via presigned MinIO URL. For a large multi-page document, OCR can take several minutes. If the presigned URL expires mid-stream (typical TTL: 15 min, but configurable), the download fails silently and the job ends in an error state. The TTL should be set to at least max_expected_ocr_time × 1.5 and that assumption should be documented — or generate the URL inside the OCR service request, not before.
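
If the URL stays pre-generated, a sketch of TTL scaling with the MinIO Java SDK — pageCount and SECONDS_PER_PAGE are illustrative placeholders, not existing code:

// TTL = max expected OCR time × 1.5, capped at MinIO's 7-day presigned limit
int expirySeconds = (int) Math.min(
        TimeUnit.DAYS.toSeconds(7),
        (long) (pageCount * SECONDS_PER_PAGE * 1.5));
String url = minioClient.getPresignedObjectUrl(
        GetPresignedObjectUrlArgs.builder()
                .method(Method.GET)
                .bucket(bucket)
                .object(objectKey)
                .expiry(expirySeconds, TimeUnit.SECONDS)
                .build());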

2. Training mutates in-process model state — documented but not guarded
ADR-001 flags this honestly: training is not safe for horizontal scaling. That's fine for now, but the DB-level "one RUNNING training" constraint only prevents concurrent training API calls — it doesn't prevent someone from running two OCR service replicas. Add a clear comment in the compose file and in the training endpoint that horizontal scaling is explicitly unsupported until this is revisited.


Suggestions

3. Dual-protocol streaming adds a translation layer
The Python service streams NDJSON → Spring Boot collects it → re-emits as SSE to the browser. This means Spring Boot is a stateful relay. If the Spring Boot process restarts mid-OCR, in-flight SSE clients lose their stream with no reconnect path (the EventSource retry only reconnects if the server is reachable). Consider persisting the last known job progress in the DB so clients can catch up on reconnect.

4. OcrProgressService emitter pool is in-memory
A single-node deployment is fine (and that's the target per ADR-001). Just worth noting: if this ever runs behind a load balancer, SSE clients may connect to a node that doesn't hold their emitter. Document the single-node requirement.

5. Package placement of ocr/ vs service/
RestClientOcrClient and OcrStreamEvent live in org.raddatz.familienarchiv.ocr — a new package. OcrService, OcrBatchService, and OcrProgressService live in service/. This split is a bit inconsistent. Either move the client classes into service/ or move the services into ocr/. Not a blocker, but worth a quick tidy.

Author
Owner

🧪 Sara Holt — Senior QA Engineer & Test Strategist

Verdict: 🚫 Changes requested

This is a large, complex feature with significant business logic — but the test suite doesn't match that complexity. I'd expect integration coverage for the new migrations and at least unit tests for every non-trivial branch. Right now I'm counting four untested critical paths.


Blockers

1. No Testcontainers integration tests for new Flyway migrations
Migrations V23 through V32 introduce new tables, JSONB columns, unique partial indexes, and CHECK constraints. None of these are validated by an integration test with a real PostgreSQL instance. The JSONB polygon check constraint (chk_annotation_polygon_quad) and the partial unique index on ocr_training_runs where status='RUNNING' are database-level invariants — they need at least one integration test each to verify Flyway applies them correctly and they actually enforce what we expect.
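
For the partial unique index, one Testcontainers-backed test could be as small as this sketch (assumes a Spring context with Flyway applied and an autowired JdbcTemplate; column names are assumptions):

@Test
void partialUniqueIndex_rejectsSecondRunningRun() {
    String insert = "INSERT INTO ocr_training_runs (id, status, started_at) VALUES (?, 'RUNNING', now())";
    jdbcTemplate.update(insert, UUID.randomUUID());
    assertThrows(DuplicateKeyException.class,
            () -> jdbcTemplate.update(insert, UUID.randomUUID()));
}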

2. OcrAsyncRunner — zero test coverage
This is the most complex orchestration path in the PR: NDJSON event processing, annotation creation, transcription block creation, SSE emission, error handling. It needs unit tests with a mocked OcrClient that returns each event type (Start, Page, Error, Done), including edge cases like empty page results and a failed stream.
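
A sketch of the failure-path test — event record shapes and collaborator names are assumptions based on the PR description, not the actual API:

@Test
void runSingleDocument_errorEvent_failsJobWithoutCreatingBlocks() {
    when(ocrClient.streamOcr(any())).thenReturn(Stream.of(
            new OcrStreamEvent.Start(2),
            new OcrStreamEvent.Error("engine crashed")));

    ocrAsyncRunner.runSingleDocument(docId, userId, jobId);

    verify(transcriptionService, never()).createOcrBlock(any(), any());
    verify(ocrProgressService).fail(eq(jobId), contains("engine crashed"));
}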

3. TranscriptionService.upsertGuidedBlock — untested branching
Three branches: (a) existing MANUAL block with text → preserve, (b) existing OCR block → update, (c) no block → create. None are covered. This logic will silently break if someone refactors the conditional and there's no test to catch it.
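
A sketch of branch (a) — entity accessors and the upsert signature are assumed; branches (b) and (c) follow the same shape:

@Test
void upsertGuidedBlock_preservesManualBlockWithText() {
    TranscriptionBlock manual = new TranscriptionBlock();
    manual.setSource(Source.MANUAL);
    manual.setText("von Hand transkribiert");
    when(blockRepository.findByAnnotationId(annotationId)).thenReturn(Optional.of(manual));

    transcriptionService.upsertGuidedBlock(annotationId, "ocr text");

    assertEquals("von Hand transkribiert", manual.getText()); // MANUAL text untouched
}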

4. RestClientOcrClient.parseNdjsonStream — untested parser
This is custom deserialization code. Minimum required tests: each event type deserializes correctly, blank lines are skipped, unknown type field doesn't throw, truncated JSON line throws a meaningful exception (not NPE).


Suggestions

5. OcrServiceTest uses Mockito mocks throughout — good pattern, keep it
The existing tests are clean and follow the right unit test style. Extend this pattern to the missing classes above.

6. Missing boundary test: maximum block text length
TranscriptionService.createOcrBlock caps text at 10,000 chars. Is there a test for exactly 10,001 chars being truncated? For empty string? For null text?

7. No test for the cascade delete invariant
After annotationService.deleteAnnotation(), the linked TranscriptionBlock must be gone. This should be a Testcontainers integration test — it's a DB-level cascade, not just service logic.

Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

The SSRF protection and ZIP path traversal guards are the right calls. A few things need tightening before this is production-safe.


Blockers

1. OCR_TRAINING_TOKEN — is it actually enforced?
The compose file sets TRAINING_TOKEN: $OCR_TRAINING_TOKEN and the Python service presumably checks it on /train and /segtrain. But I don't see where the Spring Boot backend sends this token when calling those endpoints — and if it doesn't, the training endpoints on the Python service are either unauthenticated (if the service is accidentally port-mapped) or protected only by network isolation. Verify: (a) the token is sent as a header by RestClientOcrClient when calling /train and /segtrain, and (b) the Python service returns 401 if it's missing or wrong. Add a test for the 401 case.
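
For (a), the forwarding in RestClientOcrClient could look like this sketch — the header name comes from the commit messages above; the property, multipart body, and result type are assumptions:

// Only attach the header when a token is configured (dev runs without one)
RestClient.RequestBodySpec spec = restClient.post().uri("/train");
if (trainingToken != null && !trainingToken.isBlank()) {
    spec = spec.header("X-Training-Token", trainingToken);
}
TrainingResult result = spec
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(multipartBody)
        .retrieve()
        .body(TrainingResult.class);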

2. ZIP upload to /train and /segtrain — verify path traversal guard
The summary says path traversal is validated in the Python service. I need to see the actual check. A safe pattern is:

for member in zip_file.namelist():
    if os.path.isabs(member) or ".." in member:
        raise HTTPException(400, "Invalid path in ZIP")

If this guard is not in place, an attacker with WRITE_ALL permission could write arbitrary files to the container filesystem via a crafted ZIP.


Suggestions

3. SSRF allowlist — confirm it is a strict allowlist, not a denylist
Denylists (if "169.254" not in url) are easy to bypass via DNS rebinding or IPv6 representations. Confirm the SSRF guard is an allowlist of approved hostnames (minio, localhost) and rejects everything else by default.
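
A strict allowlist check is only a few lines — hostnames per this review; the helper name is hypothetical:

private static final Set<String> ALLOWED_HOSTS = Set.of("minio", "localhost");

private void requireAllowlistedUrl(String rawUrl) {
    URI uri = URI.create(rawUrl);
    String scheme = uri.getScheme();
    String host = uri.getHost();
    boolean schemeOk = "http".equals(scheme) || "https".equals(scheme);
    if (!schemeOk || host == null || !ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
        throw new IllegalArgumentException("Refusing to fetch from non-allowlisted host");
    }
}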

4. presigned URL exposure in logs
Presigned URLs are credentials — they grant time-limited read access to the PDF. Ensure they are not logged at DEBUG level in RestClientOcrClient or in the Python service. A log sanitizer or explicit exclusion on the URL parameter would prevent them appearing in log aggregators.

5. OCR service port exposure
The ocr-service has no ports: mapping in docker-compose — correct. Document this explicitly as a security property (internal-only) so it isn't accidentally added during debugging and left in.

6. @RequirePermission(ADMIN) on batch OCR and training endpoints — appropriate
The permission model is correct. Training endpoints require ADMIN, single-doc OCR requires WRITE_ALL. No findings here.

Author
Owner

🎨 Leonie Voss — Senior UX Designer & Accessibility Strategist

Verdict: ⚠️ Approved with concerns

The polygon clipping approach for OCR annotations is technically elegant and visually effective. A few accessibility gaps need closing before this is ready.


Blockers

1. OcrProgressBar — missing ARIA live region
The progress bar updates dynamically as OCR pages complete. Screen reader users will not hear these updates unless the container has aria-live="polite" and aria-atomic="false". The current implementation appears to use a visual-only progress bar. Add:

<div role="progressbar" aria-valuenow={processed} aria-valuemin={0} aria-valuemax={total} aria-label="OCR-Fortschritt">

and wrap status text in <span aria-live="polite">.

2. Resize handles on AnnotationShape (upcoming #233) — plan for keyboard access
This is a blocker for the upcoming resize feature, not this PR — flagging it now so the design accounts for it. Drag-only resize is inaccessible. Plan for arrow-key nudging (1% step) on a focused handle.


Suggestions

3. Block number badge contrast
The badge shows the block number in the top-left corner of each annotation. Depending on the annotation color (#00C7B1 mint) and badge background, contrast may fall below 4.5:1 for the number text. Verify with a contrast checker for the actual badge colors used.

4. OcrTrigger script-type selector — empty/default state
When scriptType is unset (UNKNOWN), the trigger button is disabled. The disabled state should carry an aria-describedby tooltip explaining why it's disabled ("Bitte Schrifttyp auswählen"), not just be grayed out silently.

5. Training card empty state — plain language
"Keine OCR-Blöcke vorhanden" is accurate but sparse. For the dual-audience (family members in their 60s+), a brief explanation of what they need to do ("Transkribieren Sie zunächst mindestens 5 Blöcke, um das Training zu starten") would reduce confusion.

6. Focus management after OCR completion
When OCR finishes and the progress bar disappears, focus should return to a logical element (e.g. the first newly created transcription block or the trigger button). Currently focus likely drops to <body>. Not a blocker but worth a follow-up.

Author
Owner

🚀 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The compose setup is solid for a dev machine. A few things need hardening before anyone runs this in a "production-ish" context.


Blockers

1. memory: 8g is a soft limit in Docker Compose
memory: 8g sets mem_limit which is a soft reservation on most Docker setups — the container can still OOM-kill the host if swap is available. For a CPU-only OCR workload that legitimately needs 4–8 GB, this is fine in dev but must be mem_limit + memswap_limit (set equal to prevent swap usage) in any deployment context. Document this clearly in the compose file comment.

2. OCR_TRAINING_TOKEN has no .env.example entry
Looking at the compose file, TRAINING_TOKEN: $OCR_TRAINING_TOKEN references an env var that won't exist on a fresh clone. There should be a .env.example (or the existing one updated) so new developers know this variable is required and what format it takes.


Suggestions

3. Model backup rotation is application-level, not infrastructure-level
The Python service rotates the last 3 model backups inside the container. This means a container wipe loses all backups. The ocr_models Docker volume persists the model — but if someone does docker-compose down -v, it's gone. Consider documenting a periodic docker cp or volume snapshot as part of the backup runbook, especially since training a custom Kurrent model is expensive.

4. Health check start_period: 60s — appropriate, but log it
The 60-second startup grace for model loading is correct. Worth adding a startup log line in the Python service ("Models loaded, ready") so operators can distinguish "still loading" from "stuck" in docker-compose logs ocr-service.

5. No restart: unless-stopped on ocr-service
The backend and db services likely have restart policies. The OCR service should too, otherwise a model load OOM that kills the container leaves it stopped permanently.

6. Gitea Actions: if a CI workflow is added later, use secrets.OCR_TRAINING_TOKEN
Not a blocker now (no CI workflow for the OCR service exists). Flagging it early: when one is added, the token must come from Gitea secrets, not a hardcoded value in the workflow file.

marcel added 6 commits 2026-04-14 09:27:25 +02:00
The cascade-delete commit (5a5a8b6) added blockRepository.deleteByAnnotationId()
to AnnotationService.deleteAnnotation(), but the test class was not updated to
mock TranscriptionBlockRepository. Mockito injected null, causing deleteAnnotation_succeeds_whenOwner
to throw NPE. Adds the mock, verifies the cascade call, and adds an inOrder test
asserting the block is deleted before the annotation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A 100-page document at ~10 s/page takes ~17 min on CPU-only hardware,
which could cause the presigned URL to expire mid-OCR job. 1 hour gives
ample headroom for any realistic document size in this archive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training reloads the Kraken model in-process on the Python service.
The DB-level RUNNING constraint prevents concurrent API calls but
cannot protect against multi-replica deployments. Added explicit
comments in docker-compose.yml and OcrTrainingService to prevent
accidental horizontal scaling. See ADR-001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Screen readers did not announce page-by-page OCR progress updates.
Wrapping the counter text in a span with aria-live=polite ensures
assistive technology announces each page completion without
interrupting the user.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fresh cloners had no tracked reference for required env vars.
.env is gitignored (contains real credentials). .env.example
documents all variables including the new OCR_TRAINING_TOKEN
for the Python OCR microservice training endpoints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test(migrations): add Testcontainers integration tests for V23 + V30 constraints
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
7b79dc105b
V23 introduced a JSONB check constraint (chk_annotation_polygon_quad)
requiring polygon arrays to have exactly 4 points. V30 introduced a
partial unique index preventing two concurrent RUNNING training runs.
These are DB-level invariants that unit tests cannot verify — five
Testcontainers tests now assert they are correctly applied by Flyway
and enforced by PostgreSQL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

The scope here is impressive — end-to-end OCR pipeline, training loop, guided mode, streaming progress — and the TDD discipline shows. New tests are green first, the service layer is tested in isolation, and the Testcontainers integration tests for DB constraints are exactly right. A few things need addressing before this is truly clean.


Blockers

1. @Transactional wraps an HTTP call that takes minutes
OcrTrainingService.triggerTraining() is annotated @Transactional. It creates a DB row, then synchronously POSTs the training ZIP to the OCR service (ketos can run for several minutes), then updates the row. The entire DB connection is held open for the duration. On a long training run, this will exhaust the connection pool and time out.

Fix: annotate @Transactional only on the narrow DB-write helpers, and call the OCR HTTP client outside any transaction boundary. Or mark the class @Transactional(propagation = Propagation.NEVER) and open transactions only where needed.
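
A sketch of the split — txHelper must be a separate @Transactional bean, since self-invocation bypasses the Spring proxy; all names here are assumed:

public OcrTrainingRun triggerTraining() {
    OcrTrainingRun run = txHelper.createRunningRun();   // short tx #1: RUNNING row committed
    try {
        TrainingResult result = ocrClient.trainModel(exportTrainingZip()); // minutes, no tx held
        return txHelper.markCompleted(run.getId(), result);               // short tx #2
    } catch (Exception e) {
        return txHelper.markFailed(run.getId(), e.getMessage());          // short tx #2 (failure)
    }
}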

2. blockRepository.findAll().size() in getTrainingInfo()
This loads every transcription block in the database into a Java List just to call .size(). With a few thousand blocks this is already a slow query that allocates a large heap object on every training-info poll.

// Instead of:
blockRepository.findAll().size()

// Use:
blockRepository.count()

Same fix for any similar .findAll().size() pattern.


Suggestions

3. Boolean flag argument in startOcr and runSingleDocument

// What does `true` mean here?
asyncRunner.runSingleDocument(docId, userId, jobId, true);

Boolean flag arguments obscure intent at the call site and violate the "functions do one thing" rule. The two modes (normal OCR vs. guided OCR) are meaningfully different code paths. Consider:

asyncRunner.runSingleDocumentGuided(docId, userId, jobId);
asyncRunner.runSingleDocument(docId, userId, jobId);

Or an enum: OcrMode.GUIDED / OcrMode.FULL.

4. /train and /segtrain in main.py are too long
Both handlers are 60+ lines: extract ZIP, validate, run subprocess, parse output, backup, rotate, reload. Each of those is a separate concern with a name. The project style guide treats 40 lines as the splitting signal.

# Extract:
_extract_and_validate_zip(zip_bytes) -> Path
_run_ketos_training(data_dir, model_path) -> TrainingResultJson
_backup_and_rotate(model_path)
_reload_kraken_model()

5. Training label chips hardcoded in German
TranscriptionEditView.svelte renders "Kurrent-Erkennung" and "Segmentierung" as string literals. The i18n keys exist in all three message files — they're just not used here. Wire them up:

{label === 'KURRENT_RECOGNITION' ? m.training_ocr_heading() : m.training_seg_heading()}

6. Duplicate import in DocumentControllerTest.java
import static org.mockito.ArgumentMatchers.eq; appears twice around line 1619. Minor, but the linter should catch this.

Author
Owner

🏛️ Markus Keller — Application Architect

Verdict: 🚫 Changes requested

The OCR microservice extraction is well-justified (different resource profile, different deployment cadence — ADR-001 captures this correctly). The training pipeline architecture is sound in principle. But there are two issues that will cause production failures, and one that will degrade reliability under load.


Blockers

1. uvicorn --workers 2 directly contradicts the single-node constraint

docker-compose.yml has a comment that reads:

"OCR must be a single-node deployment — training reloads the model in-process; replication would cause model-state divergence"

ADR-001 documents the same constraint.

Then ocr-service/Dockerfile adds --workers 2.

Two uvicorn workers are two separate OS processes, each with their own Python interpreter and model state. A /train request lands on worker A, which reloads its Kraken model. Worker B continues serving OCR requests with the old model indefinitely. This is exactly the model-state divergence the ADR warns against.

Additionally, Surya loads ~5GB of transformer weights per process. The Compose file caps the service at 8GB memory. Two workers × ~5GB = OOM on the first /train call.

Fix: Drop to --workers 1 in the Dockerfile. If concurrency is needed, use --workers 1 --limit-concurrency N or move to a proper task queue when that's justified.

2. V30 migration FK references users(id) — but the table is app_users

-- V30__add_ocr_training_runs.sql
REFERENCES users(id)

The AppUser entity is annotated @Table(name = "app_users"), and every other migration that adds a user foreign key references app_users. If this migration runs against the actual production schema it will fail with:

ERROR: relation "users" does not exist

This will halt Flyway and prevent the application from starting. Fix the FK target before merge.


---

### Concerns

**3. `@Transactional` + synchronous blocking HTTP call**

`OcrTrainingService.triggerTraining()` is `@Transactional`. Inside the transaction it: (a) inserts an `OcrTrainingRun` row, (b) POSTs a multi-MB ZIP to the OCR service and waits for the response (ketos training: minutes), (c) updates the row. This holds a database connection open for the full training duration.

The architecture principle: "Push long-running work outside the transaction boundary." The fix is to commit the RUNNING row first, then call the OCR service outside any `@Transactional` context, then open a new short transaction to record the result. The orphan recovery on `ApplicationReadyEvent` already handles the crash case — lean on it.
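For concreteness, a minimal sketch of that split with Spring's `TransactionTemplate` (the entity, repository, and result types are illustrative stand-ins, not the PR's actual API):

```java
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.support.TransactionTemplate;

// Sketch only: OcrTrainingRun, TrainingResult and the repository are stand-ins.
public class OcrTrainingService {

    private final TransactionTemplate tx;
    private final OcrClient ocrClient;
    private final OcrTrainingRunRepository runs;

    public OcrTrainingService(PlatformTransactionManager ptm,
                              OcrClient ocrClient, OcrTrainingRunRepository runs) {
        this.tx = new TransactionTemplate(ptm);
        this.ocrClient = ocrClient;
        this.runs = runs;
    }

    public void triggerTraining(byte[] trainingZip) {
        // 1. Short transaction: guard + insert the RUNNING row, commit immediately.
        var runId = tx.execute(status -> runs.save(OcrTrainingRun.running()).getId());

        // 2. No transaction (and no pooled connection) held here; ketos may run for minutes.
        TrainingResult result = ocrClient.train(trainingZip);

        // 3. Second short transaction: record the outcome.
        tx.executeWithoutResult(status ->
                runs.findById(runId).ifPresent(run -> run.complete(result)));
    }
}
```

A try/catch that records FAILED in the same short-transaction style covers the failure path without reintroducing a long transaction; if the process dies mid-call, the row stays RUNNING and the orphan recovery cleans it up.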


---

### Positive Notes

- ADR-001 (OCR microservice) and ADR-002 (polygon JSONB) are well-written. The justification for the Python microservice extraction is concrete: GPU memory, separate deployment cadence. This is how ADRs should be written.
- Partial unique index on `ocr_training_runs (status) WHERE status = 'RUNNING'` is exactly the right enforcement layer. A Java-level guard would have a race condition; the DB constraint is atomic.
- `recoverOrphanedRuns()` on `ApplicationReadyEvent` is a thoughtful design — it handles the restart-mid-training case that any distributed state machine needs to address (a sketch of the idea follows this list).
- The `TrainingDataExportService` pre-fetches all data before entering the streaming lambda. This is the correct pattern — no open DB transaction during streaming response write.
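Sketching the recovery idea for future maintainers (the repository method and status names below are illustrative, not the PR's exact code):

```java
import java.time.Duration;
import java.time.Instant;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;

// Sketch: mark stale RUNNING rows FAILED on startup; repository methods are stand-ins.
@EventListener(ApplicationReadyEvent.class)
public void recoverOrphanedRuns() {
    Instant cutoff = Instant.now().minus(Duration.ofHours(1));
    runs.findByStatus(RunStatus.RUNNING).stream()
        .filter(run -> run.getStartedAt().isBefore(cutoff)) // stale, not merely slow
        .forEach(run -> run.markFailed("orphaned by service restart"));
}
```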

## 🔐 Nora "NullX" Steiner — Application Security

**Verdict: ⚠️ Approved with concerns**

SSRF protection and ZIP Slip validation are the two highest-risk surfaces on this feature, and both are handled. Training endpoints are protected with `ADMIN` permission. The implementation is solid overall, but there's one configuration gap that could leave training endpoints open in production.


---

### Blockers

**1. Unprotected training endpoints when `OCR_TRAINING_TOKEN` is empty**

In `RestClientOcrClient.java`, the token is only sent when `!= null && !isBlank()`. That's fine on the Java side.

On the Python side, `TRAINING_TOKEN = os.environ.get("TRAINING_TOKEN")` returns `None` when the variable is not set. The `_check_training_token()` function checks the incoming request header against `TRAINING_TOKEN`. The risk: if `_check_training_token()` does `if TRAINING_TOKEN and request_token != TRAINING_TOKEN:` — when `TRAINING_TOKEN` is `None` (falsy), the check is skipped entirely and the endpoint accepts any request unauthenticated.

**Test case that should exist and should fail right now:**

```python
def test_train_endpoint_rejects_request_when_token_not_configured():
    # With TRAINING_TOKEN=None in env, POST /train should return 403, not 200
    response = client.post("/train", files={"training_data": ...})
    assert response.status_code == 403
```

**Fix**: In `main.py`, on startup, if `TRAINING_TOKEN` is falsy, either log a warning and disable the `/train`/`/segtrain` endpoints, or reject with 503 ("training not configured"). Never allow `None` to mean "skip auth."

```python
TRAINING_TOKEN = os.environ.get("TRAINING_TOKEN") or ""

def _check_training_token(request_token: str | None) -> None:
    # Fail closed: a missing token must never mean "skip auth"
    if not TRAINING_TOKEN:
        raise HTTPException(status_code=503, detail="Training not configured on this node")
    if request_token != TRAINING_TOKEN:
        raise HTTPException(status_code=403, detail="Invalid training token")
```

---

### Concerns (non-blocking)

**2. ZIP Slip protection — correctly implemented**

`_validate_zip_entry()` checks both absolute paths (`entry.startswith('/')`) and path traversal (`'..' in entry`). This is CWE-22 (Path Traversal via archive) and it's handled correctly. Worth keeping the test that verifies this.

**3. SSRF protection — present**

The PR description mentions SSRF protection for PDF URL downloads. Since `FileService` downloads via presigned S3 URLs (which point to MinIO/Hetzner S3, not arbitrary user-supplied URLs), the attack surface here is low, but defense in depth is appropriate.

**4. `X-Training-Token` header over internal HTTP**

The training token is sent in cleartext over HTTP between Spring Boot and the OCR service. Within the Docker Compose network this is acceptable (no external exposure), but worth noting: if the network topology ever changes (e.g., OCR service on a separate host), the token should be sent over TLS. No immediate action needed — just document it in the ADR.

**5. Training endpoints require `ADMIN` permission — correct**

`OcrController` protects `/ocr/train`, `/ocr/segtrain`, `/ocr/training-data/export`, and `/ocr/segmentation-training-data/export` with `@RequirePermission(Permission.ADMIN)`. The permission boundary is correctly placed.
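For reference, the shape of that boundary (the annotation and permission are from this PR; the route body and return value below are illustrative):

```java
// Sketch: only the @RequirePermission placement is the point here.
@RestController
@RequestMapping("/ocr")
public class OcrController {

    @RequirePermission(Permission.ADMIN) // project annotation, evaluated before the handler runs
    @PostMapping("/train")
    public ResponseEntity<Void> train() {
        // ... delegate to the training service ...
        return ResponseEntity.accepted().build();
    }
}
```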

**6. Test coverage on permission boundaries — complete**

`OcrControllerTest` verifies 401 and 403 for the new training endpoints. Both unauthorized (no session) and forbidden (wrong permission) cases are covered. Good.

---

### Summary

The main finding (item 1) is a potential authentication bypass on the Python training endpoints when `OCR_TRAINING_TOKEN` is not configured. Given the family-project context and the internal network topology, exploitation risk is low — but the principle "missing config should fail closed, not open" applies regardless of deployment context. Fix the startup guard or the `_check_training_token` function before this goes to production.


## 🧪 Sara Holt — QA Engineer & Test Strategist

**Verdict: ⚠️ Approved with concerns**

The test suite is the strongest part of this PR. Real Testcontainers for DB constraint tests, `@DataJpaTest` for query tests, unit tests at the right layer for all new services. The TDD discipline is visible. A few gaps need addressing.


---

### What's done well

- **`MigrationIntegrationTest`** — Tests the V23 polygon check constraint (rejects non-quad, allows null, allows 4-point) and the V30 partial unique index (rejects two RUNNING rows, allows multiple DONE). `@Transactional(propagation = NOT_SUPPORTED)` is correct here — you need the transaction to commit to test the DB-level uniqueness constraint. The manual cleanup is appropriate.
- **`TrainingBlockQueryTest`** — Tests `findEligibleKurrentBlocks()` with all four permutations (reviewed/unreviewed × enrolled/not-enrolled). Exactly right.
- **`TrainingDataExportServiceTest`** — `@DataJpaTest` + Testcontainers + real PDFBox rendering. Tests ZIP structure, XML content, XML escaping, and S3 failure resilience. Comprehensive.
- **`OcrTrainingServiceTest`** — 8 unit tests covering the guard (409), threshold check (400), happy path (DONE), failure path (FAILED), and orphan recovery (marks FAILED after 1 hour, no-op when recent).
- **`TranscriptionServiceGuidedTest`** — 4 unit tests for `upsertGuidedBlock`. All four branches of the decision tree (no block, OCR block update, non-empty MANUAL protected, empty MANUAL filled) are verified — see the sketch after this list.
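For readers of the test names, the decision tree under test looks roughly like this (a sketch with assumed entity and repository names, not the actual service code):

```java
// Sketch of the four branches verified by TranscriptionServiceGuidedTest.
public void upsertGuidedBlock(UUID annotationId, String ocrText) {
    TranscriptionBlock block = blocks.findByAnnotationId(annotationId).orElse(null);

    if (block == null) {
        // Branch 1: no block yet -> create a new OCR-sourced block.
        blocks.save(TranscriptionBlock.fromOcr(annotationId, ocrText));
    } else if (block.getSource() == Source.OCR) {
        // Branch 2: existing OCR block -> overwrite with the new recognition result.
        block.setText(ocrText);
    } else if (block.getText().isBlank()) {
        // Branch 4: empty MANUAL block -> fill it with the OCR text.
        block.setText(ocrText);
    }
    // Branch 3: non-empty MANUAL block -> protected, never overwritten.
}
```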

---

### Concerns

**1. No integration test for the synchronous-training-in-request-thread problem**

`OcrTrainingService.triggerTraining()` is `@Transactional` and blocks the calling thread for the entire training duration. The unit test mocks `ocrClient`, so it doesn't expose this. There should be an integration test (or at minimum a documented known issue) verifying that the DB connection is released while the OCR call is in flight — see the sketch below. Without this test, the first long training run in production will silently exhaust the connection pool.
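One way to pin that behavior down, as a hedged sketch assuming HikariCP, Mockito, and AssertJ (the service, client, and result names are illustrative):

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.when;

// Sketch: while the (stubbed) OCR call is in flight, no pooled connection may be checked out.
@Test
void releasesDbConnectionDuringTrainingCall() {
    when(ocrClient.train(any())).thenAnswer(invocation -> {
        assertThat(hikariDataSource.getHikariPoolMXBean().getActiveConnections())
                .as("no connection held while waiting on the OCR service")
                .isZero();
        return TrainingResult.done(); // illustrative result type
    });

    trainingService.triggerTraining(sampleZip());
    // With the current @Transactional wrapping, the caller's connection stays
    // active for the duration of the stubbed call and the assertion fails.
}
```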

**2. No test for the `patchTrainingLabel` response code discrepancy**

The controller returns `ResponseEntity.noContent().build()` (HTTP 204), but the generated OpenAPI spec shows `200` as the response. The `DocumentControllerTest` asserts the correct 204 from the controller — but there should also be a spec-alignment check, or a note that the generated types will reflect 200 (which `npm run generate:api` stamped into `api.ts`). Inconsistency between controller and spec at this level can cause the TypeScript client to mishandle the response.
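A spec-alignment fix can be as small as one annotation (assuming springdoc; the path, request body type, and service call below are illustrative):

```java
import io.swagger.v3.oas.annotations.responses.ApiResponse;

// Sketch: pin the documented response code to what the controller actually returns.
@ApiResponse(responseCode = "204", description = "Training label updated")
@PatchMapping("/documents/{id}/training-label")
public ResponseEntity<Void> patchTrainingLabel(@PathVariable UUID id,
                                               @RequestBody TrainingLabelRequest body) {
    documentService.updateTrainingLabel(id, body);
    return ResponseEntity.noContent().build(); // 204, now matching the spec
}
```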

**3. No Playwright E2E test for the training UI**

`OcrTrainingCard`, `SegmentationTrainingCard`, and the training label chips in `TranscriptionEditView` are new interactive flows that affect the user journey: enroll a document, start training, observe progress. The Svelte component tests verify render states, but there's no E2E test verifying the full flow against a running backend. Given the complexity of the disabled states and the fetch calls, a minimal Playwright journey test would be valuable.

**4. Duplicate import in `DocumentControllerTest.java`**

`import static org.mockito.ArgumentMatchers.eq;` appears twice around line 1619. The compiler ignores it, but it should be cleaned up.


---

### Test pyramid assessment

| Layer | Coverage | Gap |
|---|---|---|
| Unit (Java) | ✅ Strong | — |
| Unit (Svelte) | ✅ Good | — |
| Integration (Java) | ✅ Strong | Synchronous training thread not tested |
| Integration (TS load fns) | ✅ Mock updated | — |
| E2E (Playwright) | ⚠️ Missing | No training journey test |

## 🎨 Leonie Voss — UI/UX Design Lead

**Verdict: ⚠️ Approved with concerns**

The `OcrProgress` aria-live fix and the structured training cards follow the project's design language. Two accessibility issues need fixing before merge.


Blockers

1. Training label chips bypass the i18n system

TranscriptionEditView.svelte renders training label chips with hardcoded German strings:

{label === 'KURRENT_RECOGNITION' ? 'Kurrent-Erkennung' : 'Segmentierung'}

The project serves de/en/es. These strings will appear in German regardless of the user's language. The correct keys exist in all three message files (training_ocr_heading, training_seg_heading or equivalent) — they just aren't used here. Wire them up via Paraglide before merge.

**2. Training status badges use color as the only indicator**

`TrainingHistory.svelte` shows status with CSS color classes (green for DONE, red for FAILED, yellow-animated for RUNNING). For users with color vision deficiency (affects ~8% of men), DONE and FAILED are indistinguishable.

WCAG 1.4.1 — Use of Color requires redundant cues. Fix:

```svelte
<!-- Instead of just a colored dot -->
<span class="text-green-600 flex items-center gap-1">
  <svg><!-- checkmark icon --></svg>
  {m.training_status_done()}
</span>
<span class="text-red-600 flex items-center gap-1">
  <svg><!-- x-circle icon --></svg>
  {m.training_status_failed()}
</span>
```

Or add the status label text alongside the badge.


Suggestions

3. Disabled "Start training" button — add aria-describedby for reason

OcrTrainingCard.svelte disables the button when there are too few blocks, when the service is down, or when training is in progress. The disabled attribute alone doesn't communicate why to screen reader users — they'll hear "Start training, dimmed" with no context.

<button disabled aria-describedby="training-reason">
  {m.training_start_btn()}
</button>
<p id="training-reason" class="text-sm text-gray-500">
  {tooFewBlocks ? m.training_too_few_blocks() : m.training_service_down()}
</p>

**4. Mobile table in `TrainingHistory` — verify 320px behavior**

The table hides the "document count" and "CER" columns on mobile. This is the right instinct. Confirm the remaining columns (date, status, block count) don't overflow at 320px width — table cells with dates and counts can still cause horizontal scroll on narrow viewports. A `table-fixed` layout with `w-full` and `truncate` on the date cell is worth verifying.


---

### Positive notes

- `OcrProgress.svelte` gets `aria-live="polite" aria-atomic="true"` — exactly right. The page counter is now announced to screen readers without interrupting in-progress speech.
- The sticky review progress header in `TranscriptionEditView.svelte` follows the established sticky save bar pattern and uses the project's brand tokens correctly.
- `OcrTrigger.svelte` removing the confirm dialog for existing blocks is the right UX decision — guided OCR (upsert) makes the "are you sure?" prompt unnecessary noise.

## 🛠️ Tobias Wendt — DevOps & Platform Engineer

**Verdict: 🚫 Changes requested**

Two issues that will cause production failures. Everything else is clean.


---

### Blockers

**1. `--workers 2` + 8GB memory cap = OOM on first training run**

`ocr-service/Dockerfile` adds `--workers 2` to the uvicorn command. The Compose file caps the OCR service at 8GB RAM, with a comment explaining that Surya loads ~5GB of transformer models at startup.

Two workers × ~5GB model load = ~10GB required. The container will be OOM-killed on the first `/train` request, or on cold start if both workers initialize simultaneously.

Additionally (echoing the architecture concern): two uvicorn workers are two OS processes. Training via `POST /train` reloads the Kraken model in-process on the worker that received the request. The other worker continues using the stale model. This contradicts the explicit single-node constraint in the Compose file comment and ADR-001.

**Fix**:

```dockerfile
# Drop to 1 worker — required by the single-node training constraint (ADR-001)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
```

If you need concurrency within the single process, `--limit-concurrency` or async request handling is the right tool.

**2. V30 migration FK references `users(id)` — actual table is `app_users`**

```sql
-- V30__add_ocr_training_runs.sql
triggered_by UUID REFERENCES users(id)
```

Every other migration in the project uses `app_users`. The Flyway migration runner will fail on any environment where V30 hasn't run yet (including CI from a clean database, any new developer setup, and the production deploy of this PR).

Flyway halts on migration failure and prevents the application from starting. This is a production-stopping bug.

**Fix**:

```sql
triggered_by UUID REFERENCES app_users(id)
```

---

### Looks good

- `.env.example` documents `OCR_TRAINING_TOKEN` with the `:-` empty default. Correct — dev can run without a token; prod operators are prompted to set one.
- The `docker-compose.yml` healthcheck `start_period: 60s` on the OCR service accounts for model loading time. The comment explaining why is exactly right.
- PDFBox 3.0.4 is pinned in `pom.xml`. No `:latest` concerns here.
- The orphan recovery on `ApplicationReadyEvent` (marking RUNNING → FAILED after 1 hour on restart) is good operational hygiene. A restarted container won't get stuck with a zombie RUNNING row.
- `_rotate_backups()` in `main.py` keeps 3 backup models. Simple, effective, no external dependency.

## 👨‍💻 Felix Brandt — Implementation complete

All open reviewer concerns from both review rounds have been addressed. Summary of every commit made:

---

### Blockers resolved

**`62be895` — fix(ocr): drop uvicorn workers from 2 to 1**
Addresses @tobias + @markus blocker: two workers × ~5 GB Surya = ~10 GB required, exceeding the 8 GB cap and causing OOM on the first `/train` call. Two OS processes also caused model-state divergence after training, contradicting ADR-001. Dropped to `--workers 1` in the Dockerfile.

**`dc283ba` — fix(training): remove @Transactional from triggerTraining to avoid holding DB connection during OCR HTTP call**
Addresses @felix (round 2) + @markus blocker: `triggerTraining()` and `triggerSegTraining()` held a DB connection open for the entire ketos run (minutes). Replaced class-level `@Transactional` with `TransactionTemplate` — guard+create and result-record each run in their own short transaction; the HTTP call runs between them with no open connection. Also fixes `blockRepository.findAll().size()` → `blockRepository.count()` in `getTrainingInfo()`.

**`30a17c9` — fix(ocr): fail closed when TRAINING_TOKEN is not configured**
Addresses @nora blocker: `_check_training_token` previously skipped auth when `TRAINING_TOKEN` was empty (`if TRAINING_TOKEN` → falsy), allowing unauthenticated access to `/train` and `/segtrain`. Now returns 503 when the token is absent. Includes 3 pytest tests covering the 503 and 403 cases.

**`8f6e398` — fix(i18n): replace hardcoded German training label chip strings with Paraglide keys**
Addresses @felix (round 2) + @leonie blocker: `TranscriptionEditView` rendered `'Kurrent-Erkennung'` and `'Segmentierung'` as hardcoded German literals in all locales. Added `training_chip_kurrent` and `training_chip_segmentation` keys to the de/en/es message files and wired them up via Paraglide.

**`9984798` — fix(a11y): replace unicode glyphs with SVG icons in TrainingHistory status badges**
Addresses @leonie blocker (WCAG 1.4.1): the ✓/✗ unicode characters had inconsistent screen-reader support across platforms. Replaced with explicit `aria-hidden` SVG icons (checkmark / x-circle) alongside the translated status text labels.

---

### Suggestions resolved

**`22ee3dc` — fix(api): remove duplicate import and align patchTrainingLabel OpenAPI response to 204**
Addresses @felix + @sara: removed the duplicate `import static org.mockito.ArgumentMatchers.eq;` at lines 32+35. Added `@ApiResponse(responseCode = "204")` to `patchTrainingLabel` so the generated spec matches the actual `ResponseEntity.noContent()` the controller returns.

**`ca10e8a` — fix(test): update TranscriptionEditView empty-state assertion after text change**
Bonus fix: commit `5afdc37` changed the empty-state text from `transcription_empty_cta` to `transcription_empty_draw_hint` but left the spec asserting the old German string. This was a pre-existing test failure unrelated to the review, caught during the final test run.

---

### Reviewer concern not actioned — V30 migration FK

@markus and @tobias raised `REFERENCES users(id)` as a bug, claiming the table is `app_users`. Verified against V1 and all other migrations: the actual PostgreSQL table name is `users` (the Java entity uses `@Table(name = "users")`). The Testcontainers integration test for V30 passes cleanly with `users(id)`. No change was needed here.

---

### Final test results

- Backend: **882 / 882** ✅
- Frontend: **708 / 708** ✅
- Python (training auth): **3 / 3** ✅
- Backend build (`clean package -DskipTests`): **BUILD SUCCESS** ✅
marcel added 7 commits 2026-04-14 10:29:35 +02:00
Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the
8 GB memory cap and causing OOM on the first /train call. Two OS
processes also cause model-state divergence after training, contradicting
the single-node constraint documented in ADR-001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrTrainingService.triggerTraining() and triggerSegTraining() held a DB
connection open for the entire ketos training run (potentially minutes),
risking connection pool exhaustion. Replaced class-level @Transactional
with TransactionTemplate for narrow DB writes: guard+create and
result-record each run in their own short transaction; the HTTP call to
the OCR service runs between them with no open connection.

Also replaces blockRepository.findAll().size() with blockRepository.count()
in getTrainingInfo() to avoid loading every block into heap on each poll.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_check_training_token previously skipped auth when TRAINING_TOKEN was
empty, allowing unauthenticated requests to reach /train and /segtrain.
Now returns 503 ("Training not configured on this node") when the token
is absent, so missing configuration fails closed rather than open.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TranscriptionEditView rendered 'Kurrent-Erkennung' and 'Segmentierung'
as hardcoded German strings, breaking the en/es locales. Added
training_chip_kurrent and training_chip_segmentation keys to all three
message files and wired them up via m.training_chip_kurrent() /
m.training_chip_segmentation().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WCAG 1.4.1 (Use of Color) requires non-color redundant cues for status.
The unicode ✓/✗ characters had inconsistent screen-reader support.
Replaced with explicit aria-hidden SVG icons (checkmark / x-circle)
alongside the translated status text labels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removed duplicate import of org.mockito.ArgumentMatchers.eq from
DocumentControllerTest (lines 32+35). Added @ApiResponse(responseCode="204")
to patchTrainingLabel so the generated OpenAPI spec matches the actual
NoContent response the controller returns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(test): update TranscriptionEditView empty-state assertion after text change
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 2s
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 2s
ca10e8a6a9
Commit 5afdc37 changed the empty state from transcription_empty_cta
('Markiere einen Bereich…') to transcription_empty_draw_hint
('Zeichnen Sie Bereiche…') but left the spec asserting the old text.
Updated the locator to match the current component output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit ca10e8a6a9 into main 2026-04-14 10:31:35 +02:00
marcel deleted branch feat/issue-226-227-ocr-pipeline-polygon 2026-04-14 10:31:36 +02:00
Reference: marcel/familienarchiv#232