marcel/familienarchiv

fix(ocr): fix segmentation training for ketos 7 and low-memory hosts #234

Merged

marcel merged 13 commits from fix/ocr-segtrain-training into main

2026-04-14 21:17:54 +02:00

Author	SHA1	Message	Date
Marcel	6d7469e9b8	fix(deploy): increase OCR healthcheck start_period, comment ocr_cache volume, add token hint. start_period 60s → 120s: Zenodo download on cold start can exceed 60s on slow connections; ocr_cache volume comment: documents what the cache stores for future operators; .env.example: add token generation command to prevent weak placeholder in production. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:29:28 +02:00
Marcel	06e3ae141c	test(frontend): add Vitest component tests for TrainingHistory expand/collapse Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:28:15 +02:00
Marcel	83900de787	fix(frontend): accessibility fixes for TrainingHistory expand/collapse and FAILED badge. Add aria-expanded + aria-controls to expand button (WCAG 4.1.2); add id="training-history-rows" to tbody for aria-controls target; replace title= tooltip on FAILED badge with details/summary for keyboard and touch accessibility, and add training_error_detail_label i18n key; use motion-safe:animate-pulse on RUNNING badge for prefers-reduced-motion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:26:55 +02:00
Marcel	29b44e3f48	fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:24:44 +02:00
Marcel	fdae60a528	fix(ocr): narrow exception handling and add unit tests for ensure_blla_model. _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError) with a DEBUG-level fallback for unexpected exceptions, which prevents silent masking of a missing kraken install or an AttributeError on vgsl; _run_segtrain: replace bare except:pass with log.warning so the height-check fallback is visible in container logs; new test_ensure_blla_model.py: covers the model-OK early return, the incompatible-model rename+replace, and the missing-model download paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:24:04 +02:00
Marcel	9b6b6f4f7e	refactor(ocr): rename findTop5 to findTop10 for headroom as frontend shows 3 by default Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:20:11 +02:00
Marcel	1eaae2ca09	test(ocr): add unit tests for triggerSegTraining() — conflict, threshold, happy path, failure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:19:04 +02:00
Marcel	a1694090ff	refactor(ocr): extract assertNoRunningTraining() to eliminate duplicate guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:17:57 +02:00
Marcel	9ee39efb8b	feat(frontend): limit training history to 3 runs with expand toggle. Both training panels (OCR and segmentation) share TrainingHistory. Show only the 3 most recent runs by default; render a Mehr/Weniger anzeigen ("show more/less") button when there are more. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:08:08 +02:00
Marcel	ff565353c0	fix(backend): store error rate for segmentation training runs. setCer() was called for recognition training but not for segmentation. The OCR service now returns cer = 1 - accuracy for segtrain; persist it so the admin panel can display Fehlerrate (error rate) for both training types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:07:39 +02:00
Marcel	4108cda520	fix(deploy): wire OCR training token to backend and raise container memory limit. Pass OCR_TRAINING_TOKEN through to the backend container as APP_OCR_TRAINING_TOKEN so RestClientOcrClient sends the X-Training-Token header when calling /train and /segtrain; raise mem_limit/memswap_limit from 8g to 12g to give segtrain headroom on hosts with more available RAM; uncomment OCR_TRAINING_TOKEN in .env.example, since it is now required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:07:11 +02:00
Marcel	9ca3e92387	fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM. Three issues fixed: (1) --resize both was removed in ketos 7; it is replaced with --resize union, which extends the model's class mapping to include training data classes. (2) ketos ignores -s when -i is present, so the 1800px blla model caused 7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free). segtrain now checks the loaded model's input height: it uses the base model only when that model was already fine-tuned at 800px; otherwise it trains from scratch at 800px (~200 MB peak). After the first run the trained 800px model becomes the base for all subsequent fine-tuning runs. (3) segtrain now computes and returns cer = 1 - accuracy, matching the recognition training path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:06:44 +02:00
Marcel	3f55af46e6	fix(ocr-service): add entrypoint that validates blla model format on startup. Adds ensure_blla_model.py, which loads the blla segmentation model with ketos on every container start. If the model is missing or in the legacy PyTorch ZIP format (incompatible with ketos 7), it re-downloads the correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569). The Dockerfile now uses entrypoint.sh, which runs this check before starting uvicorn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:06:12 +02:00
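
The base-model decision described in commit 9ca3e92387 can be sketched roughly as follows. This is an illustration only, not the code from the PR: `plan_segtrain`, `SegtrainPlan`, `error_rate`, and the convention of passing `None` for an unknown height are hypothetical names and assumptions. The only ketos flags used are the ones named in the commit message (`-i`, `-s`, `--resize union`); the VGSL spec is passed in as an opaque string, and training data arguments are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

TARGET_HEIGHT = 800  # px; the training height named in the commit message


@dataclass(frozen=True)
class SegtrainPlan:
    """Hypothetical container for the outcome of the base-model decision."""
    base_model: Optional[str]  # path handed to -i, or None to train from scratch
    args: List[str]            # partial argument list for `ketos` (no training data)


def plan_segtrain(model_path: str, model_height: Optional[int], spec: str) -> SegtrainPlan:
    """Choose between fine-tuning and training from scratch.

    model_height is the input height of the existing model in pixels, or
    None if it could not be read (assumption of this sketch: an unknown
    height is treated like a mismatching one).
    """
    if model_height == TARGET_HEIGHT:
        # The model was already trained at 800px, so fine-tuning from it keeps
        # memory low. --resize union extends the class mapping to cover the
        # classes found in the training data.
        return SegtrainPlan(model_path, ["segtrain", "-i", model_path, "--resize", "union"])
    # ketos ignores -s when -i is present, so the larger base model is dropped
    # entirely and only the spec is passed.
    return SegtrainPlan(None, ["segtrain", "-s", spec])


def error_rate(accuracy: float) -> float:
    """cer = 1 - accuracy, the value segtrain reports per the commit message."""
    return 1.0 - accuracy
```

With this shape, the first run against the 1800px model produces a plan without `-i`, and later runs against the resulting 800px model fine-tune with `--resize union`.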