fix(ocr): fix segmentation training for ketos 7 and low-memory hosts #234
Reference in New Issue
Block a user
Delete Branch "fix/ocr-segtrain-training"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
OCR_TRAINING_TOKENwas set on the OCR container but not passed to the backend, so every/segtraincall returned 503. Fixed by addingAPP_OCR_TRAINING_TOKENto the backend environment indocker-compose.yml.ensure_blla_model.pywhich validates the model on every container start and re-downloads from Zenodo if it's missing or incompatible.-swhen-iis present, so height can't be overridden when loading a base model. Fixed by checking the base model's height: if it's not already 800px, skip-iand train from scratch at 800px (~200 MB peak). After the first run the trained 800px model is reused as the base for all subsequent runs.--resize bothremoved in ketos 7: Replaced with--resize union.cer: nullfor segmentation runs and the backend wasn't storing it. Both now compute and persistcer = 1 - accuracy.Test plan
ensure_blla_model.pydownloads the correct model on first start🤖 Generated with Claude Code
Felix Brandt (@felixbrandt) — Developer Review
Verdict: APPROVE with suggestions
The fixes are targeted and correct. Each change addresses a real, reproducible bug and the code is clean. A few things to tighten up.
What's done well
ensure_blla_model.pyhas a clear module docstring and a cleanmain()/_download_blla()/_model_is_loadable()split. Each function does one thing.entrypoint.shusesset -euo pipefail— correct fail-fast behaviour._run_segtrain()logic inmain.pycleanly explains the height-check strategy with inline comments that explain why, not what._validate_zip_entry(),_rotate_backups(),_parse_best_checkpoint(),_find_best_model()are all well-named single-purpose helpers.OcrTrainingService.triggerSegTraining()mirrors the existingtriggerTraining()pattern exactly — consistent.Blockers
None.
Suggestions
1.
_model_is_loadablecatches bareException— too broadThe broad catch hides import errors,
AttributeErrorfrom a missingvgslsymbol, etc. Ifkrakenitself isn't installed, this silently returnsFalseand triggers a re-download on every startup. Narrow it to(RuntimeError, OSError, ValueError)or at least log the full traceback atDEBUGlevel so startup surprises are diagnosable.2. Bare
except Exception: passin_run_segtrainThis is the height-check block in
segtrain. Ifvgsl.TorchVGSLModel.load_modelraises an unexpected error (disk full, corrupted file), the code silently falls through to training-from-scratch. Alog.warning(...)here would at least make the fallback visible in logs.3.
TrainingHistory.svelte— missingaria-labelon the expand/collapse buttonThe button text already comes from Paraglide, so this is fine for sighted users. But screen reader users navigating by button list need an
aria-expandedattribute to understand the toggle state:4.
OcrTrainingService—findTop5vs. 3-item displaygetTrainingInfo()callsfindTop5ByOrderByCreatedAtDesc()but the frontend now shows only 3 with a toggle. The backend still sends 5. Consider renaming tofindTop10...or passing a configurable limit so the backend and frontend stay in sync as the display count evolves. Not a bug, just a maintenance friction point.5. No tests added for the new segtrain path
The
triggerSegTraining()method inOcrTrainingServiceis a write path with a DB guard, an HTTP call, and two transaction template executions — identical structure totriggerTraining(). Following the project's TDD expectation, aOcrTrainingServiceTestunit test covering the happy path and the "already running" conflict case would be appropriate here.Markus Keller (@mkeller) — Architect Review
Verdict: APPROVE with one structural note
The fixes are well-scoped and the architectural boundary between Java backend and Python OCR service is respected. The OOM mitigation strategy is pragmatic and the model lifecycle logic is appropriately placed in the OCR service, not leaked into the backend.
What's done well
triggerTraining()andtriggerSegTraining()correctly release the DB connection before the long-running HTTP call. This is the right pattern. TheTransactionTemplateapproach is explicit and readable.ensure_blla_model.pyexits non-zero on failure — this causes Docker to mark the container unhealthy rather than start silently with a broken model. Correct failure mode._rotate_backups()keeps only 3 model backups — prevents unbounded disk growth on a volume that's shared with large ML models.Structural concern: duplicated training run guard logic
triggerTraining()andtriggerSegTraining()share identical patterns:This guard exists in both methods. If the conflict check ever needs to change (e.g., per-model-type guards), it will be changed in one place and silently left stale in the other. Extract to a private
assertNoRunningTraining()helper. One line each, zero duplication:This isn't a blocker but it's the kind of thing that grows into a real bug on the third method.
Observation:
getTrainingInfo()loads all eligible blocks twice on the segtrain pathgetTrainingInfo()calls bothqueryEligibleBlocks()(for recognition training) andquerySegmentationBlocks()(for segmentation training) on every request. If the admin page polls this regularly, that's two potentially expensive queries per poll. Not a problem at current data volumes, but worth noting as a future optimization point if admin page latency becomes noticeable.Minor: Dockerfile base image not pinned to a patch version
Per project infrastructure standards, this should be pinned to a digest or at minimum a patch version (
python:3.11.9-slim) so builds are reproducible.:slimwithout a patch version will silently change ondocker pull. Renovate can automate the bump PRs.Sara Holt (@saraholt) — QA / Tester Review
Verdict: REQUEST CHANGES — missing test coverage on new paths
The fixes are correct and the logic is sound. However, two new code paths with meaningful branching logic have been added with no accompanying tests. For a codebase that practices TDD, this is a gap.
Blockers
1.
triggerSegTraining()inOcrTrainingService— no unit testsThis is a write path that contains:
DomainException.conflict)DomainException.badRequest)TransactionTemplateexecutions (success path and failure path)recoverOrphanedRuns()The equivalent
triggerTraining()method presumably has tests (it's the existing path).triggerSegTraining()needs the same coverage. Minimum test cases:2.
ensure_blla_model.py— no unit testsThis script has two distinct branches:
These are testable with
unittest.mock.patchon_model_is_loadableand_download_blla. Without tests, a regression in the rename/replace logic could silently wipe the model on every startup.Suggestions
3.
TrainingHistory.svelte— no component test for the expand/collapse toggleThe new
expandedstate andvisibleRunsderived value have visible user-facing behaviour (3 rows → all rows toggle). A Vitest component test would catch regressions:4.
_parse_best_checkpoint()— edge cases not coveredThe function parses filenames with a regex. Edge cases worth testing:
(None, 0)✓ (handled)checkpoint_03-0.9500.ckpt) → verify regex handles this5. Acceptance test for the model height detection path
The OOM fix depends on correctly reading
_m.input[2]from the loaded model. If ketos changes the input tensor shape format in a future version, this silently breaks and training goes back to OOM-on-start. A contract test or at least a comment pointing to the ketos version this was tested against would help future maintainers.Nora "NullX" Steiner (@nullx) — Security Review
Verdict: APPROVE — no blockers, one medium finding, one smell
The existing security controls are maintained and the new code adds no new attack surface. The ZIP Slip protection and SSRF allowlist carry over correctly to the new
/segtrainendpoint. Two items worth addressing before the next release.What's done well
_validate_zip_entry()is called in both/trainand/segtrain— ZIP Slip protection is correctly applied to the new endpoint. Theos.path.realpath()+startswith()check is the correct approach._check_training_token()fails closed — whenTRAINING_TOKENis not set, it returns 503 rather than allowing unauthenticated access. This is correct fail-closed behaviour.ALLOWED_PDF_HOSTSSSRF protection — carried forward unchanged, correctly rejects unexpected hostnames.OCR_TRAINING_TOKENis now wired to the backend — the fix correctly uses${OCR_TRAINING_TOKEN:-}as the default (empty string disables training). This is acceptable for dev; production must set a non-empty value.Medium:
.env.exampleuses a weak placeholder token that operators may deploy as-isThe comment says "Must not be empty in production" but provides a memorable, guessable placeholder. An operator who forgets to change it deploys with a known token. Consider:
This doesn't prevent the mistake but makes the intention unambiguous and removes the excuse of not knowing how to generate a suitable value.
Smell: bare
except Exception: passswallows security-relevant failuresIn
_run_segtrain():This is inside the model-height detection block. If this block raises an
OSError(e.g. model file disappeared between theos.path.exists()check andload_model()), the code falls through to training-from-scratch. That's a safe fallback behaviour, not a security issue per se. However, a silent exception here could mask an attacker-controlled file replacement between the check and the load (TOCTOU on the model path). The risk is very low in practice (Docker named volume, no external write access), but logging the exception would at minimum make the anomaly visible.Recommended fix:
Observation: model path is read from environment in
_run_segtrainbut not validatedThe path is used in
shutil.copy2()andos.rename()operations. In the current deployment (Docker, controlled environment variable), this is not exploitable. Document thatBLLA_MODEL_PATHmust be an absolute path within the models volume, and that arbitrary values from untrusted sources must not be allowed to set it.Leonie Voss (@leonievoss) — UI/UX Review
Verdict: APPROVE with minor accessibility fixes
The
TrainingHistory.sveltechanges are clean and well-structured. The Svelte 5 runes usage is correct. Two accessibility issues need addressing — one is a WCAG blocker for keyboard/AT users, one is minor.What's done well
$state,$derived, and$props()are all idiomatic.$derivedforvisibleRunsandhasMorekeeps the template clean.{#each visibleRuns as run (run.id)}— keyed iteration. Correct.aria-hidden="true"on decorative SVGs — correct, prevents screen readers from announcing raw path data.title={run.errorMessage}on the FAILED badge — provides the error detail on hover. Good for sighted users. (See suggestion below for AT users.)Blocker: missing
aria-expandedon the expand/collapse buttonWCAG 4.1.2 (Name, Role, Value) requires that the expanded/collapsed state of a toggle control be programmatically determinable. Without
aria-expanded, a screen reader user hears "Mehr anzeigen" but has no way to know they can also collapse — or whether the button already expanded content.Fix:
And add
id="training-history-rows"to the<tbody>element so the relationship is declared.Medium: FAILED badge error message is tooltip-only
titleattributes are:For a family archival tool used by a 60+ audience on tablets, this is a meaningful gap. The error detail should be reachable without hover. Options:
<dialog>with the error message<details>/<summary>element within the table row whenrun.status === 'FAILED'role="tooltip"and make the trigger focusableThe simplest fix that covers keyboard and touch:
Minor: RUNNING badge uses an animated element without reduced-motion check
animate-pulsewill still animate for users who haveprefers-reduced-motion: reduceset. Given the senior audience, this should be guarded:Tailwind's
motion-safe:variant handles this in one class change.Brand compliance
The table styling (
text-ink-2,text-ink-3,border-line) uses the project's semantic tokens correctly. The status badge colours (green-100/700, red-100/700, yellow-100/700) are standard semantic status colours, not brand palette — this is appropriate for status indicators.Tobias Wendt (@tobiwendt) — DevOps Review
Verdict: APPROVE with one blocker and several findings
The
docker-compose.ymlchanges are mostly correct. The newentrypoint.sh+ensure_blla_model.pypattern is clean and the healthcheck configuration is sensible. One real problem to fix before merge.What's done well
set -euo pipefailinentrypoint.sh— correct fail-fast shell scripting. Ifensure_blla_model.pyexits non-zero, the container exits and Docker marks it unhealthy. Exactly right.ocr_cachenamed volume — new volume for~/.cacheprevents model re-downloads on container recreate. Good operational decision.mem_limit: 12gwithmemswap_limit: 12g— setting both to the same value effectively disables swap for this container, forcing a clean OOM kill rather than thrashing. Correct for an ML workload on a constrained host.BLLA_MODEL_PATHnot exposed in compose — the default in code (/app/models/blla.mlmodel) is used, which is correct for the named volume mount at/app/models.APP_OCR_TRAINING_TOKENcorrectly added to backend environment — this was the root cause of the 503s. Fixed.Blocker: Dockerfile base image not pinned to a specific version
:slimwithout a patch version is equivalent to:latestfor reproducibility purposes — it moves on everydocker pull. A build that passes today may fail tomorrow after a silent Python patch update changes a system library.Fix:
Pin to the exact patch version. Add a Renovate rule for the
ocr-service/Dockerfileto automate future bumps:Finding:
minioandmailpitstill use:latestPre-existing issue, not introduced by this PR, but worth noting:
These will be addressed separately but flagging for visibility.
Finding: PostgreSQL data still on bind mount
Also pre-existing. Named volume (
postgres_data:) would be safer in production. Not blocking this PR.Suggestion:
start_periodfor OCR service healthcheck may be too short on first cold startOn a cold start with no model cached,
ensure_blla_model.pytriggers akraken getdownload from Zenodo, which can take several minutes on a slow connection. Thestart_period: 60splusretries: 12gives ~180s before Docker marks the container unhealthy. A Zenodo download on a home NAS connection can easily exceed this. Considerstart_period: 120sor document in.env.examplethat the first cold start requires a working internet connection and may take several minutes.Suggestion:
ocr_cachevolume not documentedAdd a comment explaining what
ocr_cachestores:Per project infrastructure standards, non-obvious configuration should be commented.
Review feedback addressed — Felix Brandt (@felixbrandt)
All reviewer concerns from the five reviews have been resolved. Here's what was committed in 8 commits on top of the original PR work:
Blockers resolved
B1 — @saraholt: No unit tests for
triggerSegTraining()→1eaae2cAdded four tests to
OcrTrainingServiceTest: 409 conflict guard, 422 threshold guard, happy path (verifiesmodelName="blla"andcer != null), and failure path (verifiesFAILEDstatus + error message).B2 — @saraholt: No unit tests for
ensure_blla_model.py→fdae60aNew
test_ensure_blla_model.pycovers all three branches: model OK (early return, no download), model incompatible (rename to.incompatible+ fresh download), and model missing (download without rename). Verified passing in the Docker container.B3 — @tobiwendt: Dockerfile base image unpinned →
29b44e3FROM python:3.11-slim→FROM python:3.11.9-slimB4 — @leonievoss: Missing
aria-expandedon expand/collapse button (WCAG 4.1.2) →83900deAdded
aria-expanded={expanded},aria-controls="training-history-rows"to the button andid="training-history-rows"to the<tbody>.Suggestions resolved
S1 — @felixbrandt:
_model_is_loadablecatches bareException→fdae60aNarrowed to
(RuntimeError, OSError, ValueError)with an explicitDEBUG-level fallback for unexpected exceptions.S2 — @felixbrandt + @nullx: Bare
except Exception: passin_run_segtrain→fdae60aReplaced with
except Exception as exc: log.warning(...)so the fallback is visible in container logs.S3 — @felixbrandt:
findTop5vs. 3-item display →9b6b6f4Renamed
findTop5ByOrderByCreatedAtDesc→findTop10ByOrderByCreatedAtDescin repository, service, and test stub.S4 — @mkeller: Duplicate
assertNoRunningTrainingguard →a169409Extracted to a private
assertNoRunningTraining()helper; bothtriggerTraining()andtriggerSegTraining()call it.S5 — @nullx:
.env.exampletoken generation command →6d7469eAdded
# Generate with: python3 -c "import secrets; print(secrets.token_hex(32))"above the placeholder.S6 — @leonievoss:
animate-pulsewithout reduced-motion guard →83900deanimate-pulse→motion-safe:animate-pulseon the RUNNING badge.S7 — @leonievoss: FAILED badge error message tooltip-only →
83900deReplaced
title=with an inline<details>/<summary>block, keyboard- and touch-reachable. Addedtraining_error_detail_labeli18n key in de/en/es.S8 — @saraholt: No component test for expand/collapse toggle →
06e3ae1New
TrainingHistory.svelte.spec.tswith three tests: shows 3 initially, expands to all on click, hides toggle when ≤3 runs.S9 — @tobiwendt:
start_periodtoo short on cold start →6d7469estart_period: 60s→start_period: 120sS10 — @tobiwendt:
ocr_cachevolume not commented →6d7469eAdded inline comment explaining what the cache stores.
Verification
./mvnw test -Dtest=OcrTrainingServiceTest→ 10/10 ✅./mvnw clean package -DskipTests→ BUILD SUCCESS ✅TrainingHistory.svelte.spec.ts→ 3/3 ✅test_ensure_blla_model.py→ 3/3 ✅ (run in container)