fix(ocr): fix segmentation training for ketos 7 and low-memory hosts #234

Merged
marcel merged 13 commits from fix/ocr-segtrain-training into main 2026-04-14 21:17:54 +02:00

13 Commits

Author SHA1 Message Date
Marcel
6d7469e9b8 fix(deploy): increase OCR healthcheck start_period, comment ocr_cache volume, add token hint
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
- start_period 60s → 120s: Zenodo download on cold start can exceed 60s on slow connections
- ocr_cache volume comment: documents what the cache stores for future operators
- .env.example: add token generation command to prevent weak placeholder in production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:29:28 +02:00
Marcel
06e3ae141c test(frontend): add Vitest component tests for TrainingHistory expand/collapse
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:28:15 +02:00
Marcel
83900de787 fix(frontend): accessibility fixes for TrainingHistory expand/collapse and FAILED badge
- Add aria-expanded + aria-controls to expand button (WCAG 4.1.2)
- Add id="training-history-rows" to tbody for aria-controls target
- Replace title= tooltip on FAILED badge with details/summary for keyboard
  and touch accessibility; add training_error_detail_label i18n key
- Use motion-safe:animate-pulse on RUNNING badge for prefers-reduced-motion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:26:55 +02:00
Marcel
29b44e3f48 fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:24:44 +02:00
Marcel
fdae60a528 fix(ocr): narrow exception handling and add unit tests for ensure_blla_model
- _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError)
  with DEBUG-level fallback for unexpected exceptions — prevents silent masking
  of missing kraken install or AttributeError on vgsl
- _run_segtrain: replace bare except:pass with log.warning so height-check
  fallback is visible in container logs
- New test_ensure_blla_model.py: covers model-OK early return, incompatible
  model rename+replace, and missing model download paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:24:04 +02:00
Marcel
9b6b6f4f7e refactor(ocr): rename findTop5 to findTop10 for headroom as frontend shows 3 by default
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:20:11 +02:00
Marcel
1eaae2ca09 test(ocr): add unit tests for triggerSegTraining() — conflict, threshold, happy path, failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:19:04 +02:00
Marcel
a1694090ff refactor(ocr): extract assertNoRunningTraining() to eliminate duplicate guard
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:17:57 +02:00
Marcel
9ee39efb8b feat(frontend): limit training history to 3 runs with expand toggle
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
Both training panels (OCR and segmentation) share TrainingHistory.
Show only the 3 most recent runs by default; render a Mehr/Weniger
anzeigen button when there are more.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:08:08 +02:00
Marcel
ff565353c0 fix(backend): store error rate for segmentation training runs
setCer() was called for recognition training but not for segmentation.
The OCR service now returns cer = 1 - accuracy for segtrain; persist it
so the admin panel can display Fehlerrate for both training types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:07:39 +02:00
Marcel
4108cda520 fix(deploy): wire OCR training token to backend and raise container memory limit
- Pass OCR_TRAINING_TOKEN through to the backend container as
  APP_OCR_TRAINING_TOKEN so RestClientOcrClient sends the X-Training-Token
  header when calling /train and /segtrain.
- Raise mem_limit/memswap_limit from 8g to 12g to give segtrain headroom
  on hosts with more available RAM.
- Uncomment OCR_TRAINING_TOKEN in .env.example — it is now required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:07:11 +02:00
Marcel
9ca3e92387 fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM
Three issues fixed:

1. --resize both was removed in ketos 7; replaced with --resize union
   which extends the model's class mapping to include training data classes.

2. ketos ignores -s when -i is present, so the 1800px blla model caused
   7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free).
   Now checks the loaded model's input height: only uses the base model
   when it was already fine-tuned at 800px; otherwise trains from scratch
   at 800px (~200 MB peak). After the first run the trained 800px model
   becomes the base for all subsequent fine-tuning runs.

3. segtrain now computes and returns cer = 1 - accuracy, matching the
   recognition training path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:06:44 +02:00
Marcel
3f55af46e6 fix(ocr-service): add entrypoint that validates blla model format on startup
Adds ensure_blla_model.py which loads the blla segmentation model with
ketos on every container start. If the model is missing or in the legacy
PyTorch ZIP format (incompatible with ketos 7), it re-downloads the
correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569).
The Dockerfile now uses entrypoint.sh which runs this check before
starting uvicorn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:06:12 +02:00