fix(ocr): fix segmentation training for ketos 7 and low-memory hosts #234

Merged
marcel merged 13 commits from fix/ocr-segtrain-training into main 2026-04-14 21:17:54 +02:00
Owner

Summary

  • Training token not wired: OCR_TRAINING_TOKEN was set on the OCR container but not passed to the backend, so every /segtrain call returned 503. Fixed by adding APP_OCR_TRAINING_TOKEN to the backend environment in docker-compose.yml.
  • Legacy model format: The blla model on the volume was a PyTorch ZIP archive (ketos <4 format). ketos 7 only loads CoreML protobuf or safetensors. Added ensure_blla_model.py which validates the model on every container start and re-downloads from Zenodo if it's missing or incompatible.
  • OOM during training: The default blla model uses 1800px input height, peaking at 7+ GB on CPU — killing the host on a machine with 5 GB free and no swap. ketos ignores -s when -i is present, so height can't be overridden when loading a base model. Fixed by checking the base model's height: if it's not already 800px, skip -i and train from scratch at 800px (~200 MB peak). After the first run the trained 800px model is reused as the base for all subsequent runs.
  • --resize both removed in ketos 7: Replaced with --resize union.
  • Fehlerrate not shown for segtrain: The OCR service was returning cer: null for segmentation runs and the backend wasn't storing it. Both now compute and persist cer = 1 - accuracy.
  • Training history shows all runs: Both panels now show only the 3 most recent runs with a Mehr/Weniger anzeigen toggle.

Test plan

  • Fresh deployment: verify ensure_blla_model.py downloads the correct model on first start
  • Trigger segmentation training: confirm it completes without OOM and Fehlerrate appears in the admin panel
  • Trigger a second training run: confirm it fine-tunes from the previous 800px model
  • Admin panel training history: verify only 3 runs shown, expand toggle works

🤖 Generated with Claude Code

## Summary - **Training token not wired**: `OCR_TRAINING_TOKEN` was set on the OCR container but not passed to the backend, so every `/segtrain` call returned 503. Fixed by adding `APP_OCR_TRAINING_TOKEN` to the backend environment in `docker-compose.yml`. - **Legacy model format**: The blla model on the volume was a PyTorch ZIP archive (ketos <4 format). ketos 7 only loads CoreML protobuf or safetensors. Added `ensure_blla_model.py` which validates the model on every container start and re-downloads from Zenodo if it's missing or incompatible. - **OOM during training**: The default blla model uses 1800px input height, peaking at 7+ GB on CPU — killing the host on a machine with 5 GB free and no swap. ketos ignores `-s` when `-i` is present, so height can't be overridden when loading a base model. Fixed by checking the base model's height: if it's not already 800px, skip `-i` and train from scratch at 800px (~200 MB peak). After the first run the trained 800px model is reused as the base for all subsequent runs. - **`--resize both` removed in ketos 7**: Replaced with `--resize union`. - **Fehlerrate not shown for segtrain**: The OCR service was returning `cer: null` for segmentation runs and the backend wasn't storing it. Both now compute and persist `cer = 1 - accuracy`. - **Training history shows all runs**: Both panels now show only the 3 most recent runs with a Mehr/Weniger anzeigen toggle. ## Test plan - [ ] Fresh deployment: verify `ensure_blla_model.py` downloads the correct model on first start - [ ] Trigger segmentation training: confirm it completes without OOM and Fehlerrate appears in the admin panel - [ ] Trigger a second training run: confirm it fine-tunes from the previous 800px model - [ ] Admin panel training history: verify only 3 runs shown, expand toggle works 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel added 5 commits 2026-04-14 13:08:57 +02:00
Adds ensure_blla_model.py which loads the blla segmentation model with
ketos on every container start. If the model is missing or in the legacy
PyTorch ZIP format (incompatible with ketos 7), it re-downloads the
correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569).
The Dockerfile now uses entrypoint.sh which runs this check before
starting uvicorn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three issues fixed:

1. --resize both was removed in ketos 7; replaced with --resize union
   which extends the model's class mapping to include training data classes.

2. ketos ignores -s when -i is present, so the 1800px blla model caused
   7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free).
   Now checks the loaded model's input height: only uses the base model
   when it was already fine-tuned at 800px; otherwise trains from scratch
   at 800px (~200 MB peak). After the first run the trained 800px model
   becomes the base for all subsequent fine-tuning runs.

3. segtrain now computes and returns cer = 1 - accuracy, matching the
   recognition training path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pass OCR_TRAINING_TOKEN through to the backend container as
  APP_OCR_TRAINING_TOKEN so RestClientOcrClient sends the X-Training-Token
  header when calling /train and /segtrain.
- Raise mem_limit/memswap_limit from 8g to 12g to give segtrain headroom
  on hosts with more available RAM.
- Uncomment OCR_TRAINING_TOKEN in .env.example — it is now required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setCer() was called for recognition training but not for segmentation.
The OCR service now returns cer = 1 - accuracy for segtrain; persist it
so the admin panel can display Fehlerrate for both training types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(frontend): limit training history to 3 runs with expand toggle
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
9ee39efb8b
Both training panels (OCR and segmentation) share TrainingHistory.
Show only the 3 most recent runs by default; render a Mehr/Weniger
anzeigen button when there are more.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Felix Brandt (@felixbrandt) — Developer Review

Verdict: APPROVE with suggestions

The fixes are targeted and correct. Each change addresses a real, reproducible bug and the code is clean. A few things to tighten up.


What's done well

  • ensure_blla_model.py has a clear module docstring and a clean main() / _download_blla() / _model_is_loadable() split. Each function does one thing.
  • entrypoint.sh uses set -euo pipefail — correct fail-fast behaviour.
  • The _run_segtrain() logic in main.py cleanly explains the height-check strategy with inline comments that explain why, not what.
  • _validate_zip_entry(), _rotate_backups(), _parse_best_checkpoint(), _find_best_model() are all well-named single-purpose helpers.
  • OcrTrainingService.triggerSegTraining() mirrors the existing triggerTraining() pattern exactly — consistent.

Blockers

None.


Suggestions

1. _model_is_loadable catches bare Exception — too broad

# current
except Exception as e:
    log.warning("Model at %s failed to load: %s", path, e)
    return False

The broad catch hides import errors, AttributeError from a missing vgsl symbol, etc. If kraken itself isn't installed, this silently returns False and triggers a re-download on every startup. Narrow it to (RuntimeError, OSError, ValueError) or at least log the full traceback at DEBUG level so startup surprises are diagnosable.

2. Bare except Exception: pass in _run_segtrain

except Exception:
    pass

This is the height-check block in segtrain. If vgsl.TorchVGSLModel.load_model raises an unexpected error (disk full, corrupted file), the code silently falls through to training-from-scratch. A log.warning(...) here would at least make the fallback visible in logs.

3. TrainingHistory.svelte — missing aria-label on the expand/collapse button

<button
    type="button"
    class="text-xs ..."
    onclick={() => (expanded = !expanded)}
>
    {expanded ? m.comp_expandable_show_less() : m.comp_expandable_show_more()}
</button>

The button text already comes from Paraglide, so this is fine for sighted users. But screen reader users navigating by button list need an aria-expanded attribute to understand the toggle state:

<button
    type="button"
    aria-expanded={expanded}
    class="text-xs ..."
    onclick={() => (expanded = !expanded)}
>

4. OcrTrainingServicefindTop5 vs. 3-item display

getTrainingInfo() calls findTop5ByOrderByCreatedAtDesc() but the frontend now shows only 3 with a toggle. The backend still sends 5. Consider renaming to findTop10... or passing a configurable limit so the backend and frontend stay in sync as the display count evolves. Not a bug, just a maintenance friction point.

5. No tests added for the new segtrain path

The triggerSegTraining() method in OcrTrainingService is a write path with a DB guard, an HTTP call, and two transaction template executions — identical structure to triggerTraining(). Following the project's TDD expectation, a OcrTrainingServiceTest unit test covering the happy path and the "already running" conflict case would be appropriate here.

## Felix Brandt (@felixbrandt) — Developer Review **Verdict: APPROVE with suggestions** The fixes are targeted and correct. Each change addresses a real, reproducible bug and the code is clean. A few things to tighten up. --- ### What's done well - `ensure_blla_model.py` has a clear module docstring and a clean `main()` / `_download_blla()` / `_model_is_loadable()` split. Each function does one thing. - `entrypoint.sh` uses `set -euo pipefail` — correct fail-fast behaviour. - The `_run_segtrain()` logic in `main.py` cleanly explains the height-check strategy with inline comments that explain *why*, not *what*. - `_validate_zip_entry()`, `_rotate_backups()`, `_parse_best_checkpoint()`, `_find_best_model()` are all well-named single-purpose helpers. - `OcrTrainingService.triggerSegTraining()` mirrors the existing `triggerTraining()` pattern exactly — consistent. --- ### Blockers None. --- ### Suggestions **1. `_model_is_loadable` catches bare `Exception` — too broad** ```python # current except Exception as e: log.warning("Model at %s failed to load: %s", path, e) return False ``` The broad catch hides import errors, `AttributeError` from a missing `vgsl` symbol, etc. If `kraken` itself isn't installed, this silently returns `False` and triggers a re-download on every startup. Narrow it to `(RuntimeError, OSError, ValueError)` or at least log the full traceback at `DEBUG` level so startup surprises are diagnosable. **2. Bare `except Exception: pass` in `_run_segtrain`** ```python except Exception: pass ``` This is the height-check block in `segtrain`. If `vgsl.TorchVGSLModel.load_model` raises an unexpected error (disk full, corrupted file), the code silently falls through to training-from-scratch. A `log.warning(...)` here would at least make the fallback visible in logs. **3. `TrainingHistory.svelte` — missing `aria-label` on the expand/collapse button** ```svelte <button type="button" class="text-xs ..." onclick={() => (expanded = !expanded)} > {expanded ? m.comp_expandable_show_less() : m.comp_expandable_show_more()} </button> ``` The button text already comes from Paraglide, so this is fine for sighted users. But screen reader users navigating by button list need an `aria-expanded` attribute to understand the toggle state: ```svelte <button type="button" aria-expanded={expanded} class="text-xs ..." onclick={() => (expanded = !expanded)} > ``` **4. `OcrTrainingService` — `findTop5` vs. 3-item display** `getTrainingInfo()` calls `findTop5ByOrderByCreatedAtDesc()` but the frontend now shows only 3 with a toggle. The backend still sends 5. Consider renaming to `findTop10...` or passing a configurable limit so the backend and frontend stay in sync as the display count evolves. Not a bug, just a maintenance friction point. **5. No tests added for the new segtrain path** The `triggerSegTraining()` method in `OcrTrainingService` is a write path with a DB guard, an HTTP call, and two transaction template executions — identical structure to `triggerTraining()`. Following the project's TDD expectation, a `OcrTrainingServiceTest` unit test covering the happy path and the "already running" conflict case would be appropriate here.
Author
Owner

Markus Keller (@mkeller) — Architect Review

Verdict: APPROVE with one structural note

The fixes are well-scoped and the architectural boundary between Java backend and Python OCR service is respected. The OOM mitigation strategy is pragmatic and the model lifecycle logic is appropriately placed in the OCR service, not leaked into the backend.


What's done well

  • Narrow transactions around DB writes only — both triggerTraining() and triggerSegTraining() correctly release the DB connection before the long-running HTTP call. This is the right pattern. The TransactionTemplate approach is explicit and readable.
  • Single-instance constraint is documented — the compose file comment and the Java service comment both explain the single-replica constraint and reference ADR-001. This is exactly the kind of reasoning that should be committed, not left in someone's head.
  • ensure_blla_model.py exits non-zero on failure — this causes Docker to mark the container unhealthy rather than start silently with a broken model. Correct failure mode.
  • _rotate_backups() keeps only 3 model backups — prevents unbounded disk growth on a volume that's shared with large ML models.

Structural concern: duplicated training run guard logic

triggerTraining() and triggerSegTraining() share identical patterns:

if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
    throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING, "...");
}

This guard exists in both methods. If the conflict check ever needs to change (e.g., per-model-type guards), it will be changed in one place and silently left stale in the other. Extract to a private assertNoRunningTraining() helper. One line each, zero duplication:

private void assertNoRunningTraining() {
    if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
        throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING,
            "A training run is already in progress");
    }
}

This isn't a blocker but it's the kind of thing that grows into a real bug on the third method.


Observation: getTrainingInfo() loads all eligible blocks twice on the segtrain path

getTrainingInfo() calls both queryEligibleBlocks() (for recognition training) and querySegmentationBlocks() (for segmentation training) on every request. If the admin page polls this regularly, that's two potentially expensive queries per poll. Not a problem at current data volumes, but worth noting as a future optimization point if admin page latency becomes noticeable.


Minor: Dockerfile base image not pinned to a patch version

FROM python:3.11-slim

Per project infrastructure standards, this should be pinned to a digest or at minimum a patch version (python:3.11.9-slim) so builds are reproducible. :slim without a patch version will silently change on docker pull. Renovate can automate the bump PRs.

## Markus Keller (@mkeller) — Architect Review **Verdict: APPROVE with one structural note** The fixes are well-scoped and the architectural boundary between Java backend and Python OCR service is respected. The OOM mitigation strategy is pragmatic and the model lifecycle logic is appropriately placed in the OCR service, not leaked into the backend. --- ### What's done well - **Narrow transactions around DB writes only** — both `triggerTraining()` and `triggerSegTraining()` correctly release the DB connection before the long-running HTTP call. This is the right pattern. The `TransactionTemplate` approach is explicit and readable. - **Single-instance constraint is documented** — the compose file comment and the Java service comment both explain the single-replica constraint and reference ADR-001. This is exactly the kind of reasoning that should be committed, not left in someone's head. - **`ensure_blla_model.py` exits non-zero on failure** — this causes Docker to mark the container unhealthy rather than start silently with a broken model. Correct failure mode. - **`_rotate_backups()` keeps only 3 model backups** — prevents unbounded disk growth on a volume that's shared with large ML models. --- ### Structural concern: duplicated training run guard logic `triggerTraining()` and `triggerSegTraining()` share identical patterns: ```java if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) { throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING, "..."); } ``` This guard exists in both methods. If the conflict check ever needs to change (e.g., per-model-type guards), it will be changed in one place and silently left stale in the other. Extract to a private `assertNoRunningTraining()` helper. One line each, zero duplication: ```java private void assertNoRunningTraining() { if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) { throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING, "A training run is already in progress"); } } ``` This isn't a blocker but it's the kind of thing that grows into a real bug on the third method. --- ### Observation: `getTrainingInfo()` loads all eligible blocks twice on the segtrain path `getTrainingInfo()` calls both `queryEligibleBlocks()` (for recognition training) and `querySegmentationBlocks()` (for segmentation training) on every request. If the admin page polls this regularly, that's two potentially expensive queries per poll. Not a problem at current data volumes, but worth noting as a future optimization point if admin page latency becomes noticeable. --- ### Minor: Dockerfile base image not pinned to a patch version ```dockerfile FROM python:3.11-slim ``` Per project infrastructure standards, this should be pinned to a digest or at minimum a patch version (`python:3.11.9-slim`) so builds are reproducible. `:slim` without a patch version will silently change on `docker pull`. Renovate can automate the bump PRs.
Author
Owner

Sara Holt (@saraholt) — QA / Tester Review

Verdict: REQUEST CHANGES — missing test coverage on new paths

The fixes are correct and the logic is sound. However, two new code paths with meaningful branching logic have been added with no accompanying tests. For a codebase that practices TDD, this is a gap.


Blockers

1. triggerSegTraining() in OcrTrainingService — no unit tests

This is a write path that contains:

  • An "already running" guard (throws DomainException.conflict)
  • A minimum block count guard (throws DomainException.badRequest)
  • An async HTTP call to the OCR service
  • Two TransactionTemplate executions (success path and failure path)
  • Orphan recovery on startup via recoverOrphanedRuns()

The equivalent triggerTraining() method presumably has tests (it's the existing path). triggerSegTraining() needs the same coverage. Minimum test cases:

@Test void triggerSegTraining_throws_conflict_when_training_already_running()
@Test void triggerSegTraining_throws_badRequest_when_fewer_than_5_blocks()
@Test void triggerSegTraining_saves_DONE_status_and_cer_on_success()
@Test void triggerSegTraining_saves_FAILED_status_on_ocr_client_exception()

2. ensure_blla_model.py — no unit tests

This script has two distinct branches:

  • Model exists and loads successfully → early return
  • Model exists but is incompatible → rename + re-download
  • Model does not exist → download

These are testable with unittest.mock.patch on _model_is_loadable and _download_blla. Without tests, a regression in the rename/replace logic could silently wipe the model on every startup.


Suggestions

3. TrainingHistory.svelte — no component test for the expand/collapse toggle

The new expanded state and visibleRuns derived value have visible user-facing behaviour (3 rows → all rows toggle). A Vitest component test would catch regressions:

it('shows only 3 runs initially when more than 3 exist', () => { ... });
it('shows all runs after clicking the expand button', () => { ... });
it('hides the toggle button when 3 or fewer runs exist', () => { ... });

4. _parse_best_checkpoint() — edge cases not covered

The function parses filenames with a regex. Edge cases worth testing:

  • Empty directory → returns (None, 0) ✓ (handled)
  • Directory with non-matching filenames → same
  • Filenames with multiple dots in the score (e.g. checkpoint_03-0.9500.ckpt) → verify regex handles this

5. Acceptance test for the model height detection path

The OOM fix depends on correctly reading _m.input[2] from the loaded model. If ketos changes the input tensor shape format in a future version, this silently breaks and training goes back to OOM-on-start. A contract test or at least a comment pointing to the ketos version this was tested against would help future maintainers.

## Sara Holt (@saraholt) — QA / Tester Review **Verdict: REQUEST CHANGES — missing test coverage on new paths** The fixes are correct and the logic is sound. However, two new code paths with meaningful branching logic have been added with no accompanying tests. For a codebase that practices TDD, this is a gap. --- ### Blockers **1. `triggerSegTraining()` in `OcrTrainingService` — no unit tests** This is a write path that contains: - An "already running" guard (throws `DomainException.conflict`) - A minimum block count guard (throws `DomainException.badRequest`) - An async HTTP call to the OCR service - Two `TransactionTemplate` executions (success path and failure path) - Orphan recovery on startup via `recoverOrphanedRuns()` The equivalent `triggerTraining()` method presumably has tests (it's the existing path). `triggerSegTraining()` needs the same coverage. Minimum test cases: ```java @Test void triggerSegTraining_throws_conflict_when_training_already_running() @Test void triggerSegTraining_throws_badRequest_when_fewer_than_5_blocks() @Test void triggerSegTraining_saves_DONE_status_and_cer_on_success() @Test void triggerSegTraining_saves_FAILED_status_on_ocr_client_exception() ``` **2. `ensure_blla_model.py` — no unit tests** This script has two distinct branches: - Model exists and loads successfully → early return - Model exists but is incompatible → rename + re-download - Model does not exist → download These are testable with `unittest.mock.patch` on `_model_is_loadable` and `_download_blla`. Without tests, a regression in the rename/replace logic could silently wipe the model on every startup. --- ### Suggestions **3. `TrainingHistory.svelte` — no component test for the expand/collapse toggle** The new `expanded` state and `visibleRuns` derived value have visible user-facing behaviour (3 rows → all rows toggle). A Vitest component test would catch regressions: ```typescript it('shows only 3 runs initially when more than 3 exist', () => { ... }); it('shows all runs after clicking the expand button', () => { ... }); it('hides the toggle button when 3 or fewer runs exist', () => { ... }); ``` **4. `_parse_best_checkpoint()` — edge cases not covered** The function parses filenames with a regex. Edge cases worth testing: - Empty directory → returns `(None, 0)` ✓ (handled) - Directory with non-matching filenames → same - Filenames with multiple dots in the score (e.g. `checkpoint_03-0.9500.ckpt`) → verify regex handles this **5. Acceptance test for the model height detection path** The OOM fix depends on correctly reading `_m.input[2]` from the loaded model. If ketos changes the input tensor shape format in a future version, this silently breaks and training goes back to OOM-on-start. A contract test or at least a comment pointing to the ketos version this was tested against would help future maintainers.
Author
Owner

Nora "NullX" Steiner (@nullx) — Security Review

Verdict: APPROVE — no blockers, one medium finding, one smell

The existing security controls are maintained and the new code adds no new attack surface. The ZIP Slip protection and SSRF allowlist carry over correctly to the new /segtrain endpoint. Two items worth addressing before the next release.


What's done well

  • _validate_zip_entry() is called in both /train and /segtrain — ZIP Slip protection is correctly applied to the new endpoint. The os.path.realpath() + startswith() check is the correct approach.
  • _check_training_token() fails closed — when TRAINING_TOKEN is not set, it returns 503 rather than allowing unauthenticated access. This is correct fail-closed behaviour.
  • Token comparison is not timing-safe, but this is acceptable here — the training token is a long-lived shared secret used by a trusted internal service (the Java backend), not a user-facing authentication credential. The risk of a timing attack in this context is negligible.
  • ALLOWED_PDF_HOSTS SSRF protection — carried forward unchanged, correctly rejects unexpected hostnames.
  • OCR_TRAINING_TOKEN is now wired to the backend — the fix correctly uses ${OCR_TRAINING_TOKEN:-} as the default (empty string disables training). This is acceptable for dev; production must set a non-empty value.

Medium: .env.example uses a weak placeholder token that operators may deploy as-is

OCR_TRAINING_TOKEN=change-me-in-production

The comment says "Must not be empty in production" but provides a memorable, guessable placeholder. An operator who forgets to change it deploys with a known token. Consider:

  1. Documenting that this must be a randomly generated value of at least 32 characters
  2. Providing a generation command in the comment:
# Generate with: python3 -c "import secrets; print(secrets.token_hex(32))"
OCR_TRAINING_TOKEN=change-me-in-production

This doesn't prevent the mistake but makes the intention unambiguous and removes the excuse of not knowing how to generate a suitable value.


Smell: bare except Exception: pass swallows security-relevant failures

In _run_segtrain():

except Exception:
    pass

This is inside the model-height detection block. If this block raises an OSError (e.g. model file disappeared between the os.path.exists() check and load_model()), the code falls through to training-from-scratch. That's a safe fallback behaviour, not a security issue per se. However, a silent exception here could mask an attacker-controlled file replacement between the check and the load (TOCTOU on the model path). The risk is very low in practice (Docker named volume, no external write access), but logging the exception would at minimum make the anomaly visible.

Recommended fix:

except Exception as exc:
    log.warning("Could not inspect base model height, training from scratch: %s", exc)

Observation: model path is read from environment in _run_segtrain but not validated

blla_model_path = os.environ.get("BLLA_MODEL_PATH", "/app/models/blla.mlmodel")

The path is used in shutil.copy2() and os.rename() operations. In the current deployment (Docker, controlled environment variable), this is not exploitable. Document that BLLA_MODEL_PATH must be an absolute path within the models volume, and that arbitrary values from untrusted sources must not be allowed to set it.

## Nora "NullX" Steiner (@nullx) — Security Review **Verdict: APPROVE — no blockers, one medium finding, one smell** The existing security controls are maintained and the new code adds no new attack surface. The ZIP Slip protection and SSRF allowlist carry over correctly to the new `/segtrain` endpoint. Two items worth addressing before the next release. --- ### What's done well - **`_validate_zip_entry()` is called in both `/train` and `/segtrain`** — ZIP Slip protection is correctly applied to the new endpoint. The `os.path.realpath()` + `startswith()` check is the correct approach. - **`_check_training_token()` fails closed** — when `TRAINING_TOKEN` is not set, it returns 503 rather than allowing unauthenticated access. This is correct fail-closed behaviour. - **Token comparison is not timing-safe, but this is acceptable here** — the training token is a long-lived shared secret used by a trusted internal service (the Java backend), not a user-facing authentication credential. The risk of a timing attack in this context is negligible. - **`ALLOWED_PDF_HOSTS` SSRF protection** — carried forward unchanged, correctly rejects unexpected hostnames. - **`OCR_TRAINING_TOKEN` is now wired to the backend** — the fix correctly uses `${OCR_TRAINING_TOKEN:-}` as the default (empty string disables training). This is acceptable for dev; production must set a non-empty value. --- ### Medium: `.env.example` uses a weak placeholder token that operators may deploy as-is ``` OCR_TRAINING_TOKEN=change-me-in-production ``` The comment says "Must not be empty in production" but provides a memorable, guessable placeholder. An operator who forgets to change it deploys with a known token. Consider: 1. Documenting that this must be a randomly generated value of at least 32 characters 2. Providing a generation command in the comment: ``` # Generate with: python3 -c "import secrets; print(secrets.token_hex(32))" OCR_TRAINING_TOKEN=change-me-in-production ``` This doesn't prevent the mistake but makes the intention unambiguous and removes the excuse of not knowing how to generate a suitable value. --- ### Smell: bare `except Exception: pass` swallows security-relevant failures In `_run_segtrain()`: ```python except Exception: pass ``` This is inside the model-height detection block. If this block raises an `OSError` (e.g. model file disappeared between the `os.path.exists()` check and `load_model()`), the code falls through to training-from-scratch. That's a safe fallback behaviour, not a security issue per se. However, a silent exception here could mask an attacker-controlled file replacement between the check and the load (TOCTOU on the model path). The risk is very low in practice (Docker named volume, no external write access), but logging the exception would at minimum make the anomaly visible. **Recommended fix:** ```python except Exception as exc: log.warning("Could not inspect base model height, training from scratch: %s", exc) ``` --- ### Observation: model path is read from environment in `_run_segtrain` but not validated ```python blla_model_path = os.environ.get("BLLA_MODEL_PATH", "/app/models/blla.mlmodel") ``` The path is used in `shutil.copy2()` and `os.rename()` operations. In the current deployment (Docker, controlled environment variable), this is not exploitable. Document that `BLLA_MODEL_PATH` must be an absolute path within the models volume, and that arbitrary values from untrusted sources must not be allowed to set it.
Author
Owner

Leonie Voss (@leonievoss) — UI/UX Review

Verdict: APPROVE with minor accessibility fixes

The TrainingHistory.svelte changes are clean and well-structured. The Svelte 5 runes usage is correct. Two accessibility issues need addressing — one is a WCAG blocker for keyboard/AT users, one is minor.


What's done well

  • Svelte 5 runes used correctly$state, $derived, and $props() are all idiomatic. $derived for visibleRuns and hasMore keeps the template clean.
  • {#each visibleRuns as run (run.id)} — keyed iteration. Correct.
  • Status badges use icon + text — not color alone. The green checkmark SVG + "Fertig" text, red X SVG + "Fehlgeschlagen" text, and yellow pulse + "Läuft" text all provide redundant cues. This is correct for color-blind users.
  • aria-hidden="true" on decorative SVGs — correct, prevents screen readers from announcing raw path data.
  • title={run.errorMessage} on the FAILED badge — provides the error detail on hover. Good for sighted users. (See suggestion below for AT users.)

Blocker: missing aria-expanded on the expand/collapse button

<button
    type="button"
    class="text-xs font-medium text-ink-3 ..."
    onclick={() => (expanded = !expanded)}
>
    {expanded ? m.comp_expandable_show_less() : m.comp_expandable_show_more()}
</button>

WCAG 4.1.2 (Name, Role, Value) requires that the expanded/collapsed state of a toggle control be programmatically determinable. Without aria-expanded, a screen reader user hears "Mehr anzeigen" but has no way to know they can also collapse — or whether the button already expanded content.

Fix:

<button
    type="button"
    aria-expanded={expanded}
    aria-controls="training-history-rows"
    class="text-xs font-medium text-ink-3 ..."
    onclick={() => (expanded = !expanded)}
>

And add id="training-history-rows" to the <tbody> element so the relationship is declared.


Medium: FAILED badge error message is tooltip-only

<span ... title={run.errorMessage}>

title attributes are:

  • Not announced by screen readers by default
  • Not reachable by keyboard-only users
  • Not visible on touch devices

For a family archival tool used by a 60+ audience on tablets, this is a meaningful gap. The error detail should be reachable without hover. Options:

  1. Add a small "?" icon button that opens a <dialog> with the error message
  2. Render the error message as a <details>/<summary> element within the table row when run.status === 'FAILED'
  3. At minimum, add role="tooltip" and make the trigger focusable

The simplest fix that covers keyboard and touch:

{#if run.errorMessage}
    <details class="inline">
        <summary class="cursor-pointer text-red-700 underline text-xs">
            {m.training_error_detail_label()}
        </summary>
        <p class="mt-1 text-xs text-red-600">{run.errorMessage}</p>
    </details>
{/if}

Minor: RUNNING badge uses an animated element without reduced-motion check

<span
    aria-hidden="true"
    class="h-1.5 w-1.5 animate-pulse rounded-full bg-yellow-500"
></span>

animate-pulse will still animate for users who have prefers-reduced-motion: reduce set. Given the senior audience, this should be guarded:

<span
    aria-hidden="true"
    class="h-1.5 w-1.5 rounded-full bg-yellow-500 motion-safe:animate-pulse"
></span>

Tailwind's motion-safe: variant handles this in one class change.


Brand compliance

The table styling (text-ink-2, text-ink-3, border-line) uses the project's semantic tokens correctly. The status badge colours (green-100/700, red-100/700, yellow-100/700) are standard semantic status colours, not brand palette — this is appropriate for status indicators.

## Leonie Voss (@leonievoss) — UI/UX Review **Verdict: APPROVE with minor accessibility fixes** The `TrainingHistory.svelte` changes are clean and well-structured. The Svelte 5 runes usage is correct. Two accessibility issues need addressing — one is a WCAG blocker for keyboard/AT users, one is minor. --- ### What's done well - **Svelte 5 runes used correctly** — `$state`, `$derived`, and `$props()` are all idiomatic. `$derived` for `visibleRuns` and `hasMore` keeps the template clean. - **`{#each visibleRuns as run (run.id)}`** — keyed iteration. Correct. - **Status badges use icon + text** — not color alone. The green checkmark SVG + "Fertig" text, red X SVG + "Fehlgeschlagen" text, and yellow pulse + "Läuft" text all provide redundant cues. This is correct for color-blind users. - **`aria-hidden="true"` on decorative SVGs** — correct, prevents screen readers from announcing raw path data. - **`title={run.errorMessage}` on the FAILED badge** — provides the error detail on hover. Good for sighted users. (See suggestion below for AT users.) --- ### Blocker: missing `aria-expanded` on the expand/collapse button ```svelte <button type="button" class="text-xs font-medium text-ink-3 ..." onclick={() => (expanded = !expanded)} > {expanded ? m.comp_expandable_show_less() : m.comp_expandable_show_more()} </button> ``` WCAG 4.1.2 (Name, Role, Value) requires that the expanded/collapsed state of a toggle control be programmatically determinable. Without `aria-expanded`, a screen reader user hears "Mehr anzeigen" but has no way to know they can also collapse — or whether the button already expanded content. **Fix:** ```svelte <button type="button" aria-expanded={expanded} aria-controls="training-history-rows" class="text-xs font-medium text-ink-3 ..." onclick={() => (expanded = !expanded)} > ``` And add `id="training-history-rows"` to the `<tbody>` element so the relationship is declared. --- ### Medium: FAILED badge error message is tooltip-only ```svelte <span ... title={run.errorMessage}> ``` `title` attributes are: - Not announced by screen readers by default - Not reachable by keyboard-only users - Not visible on touch devices For a family archival tool used by a 60+ audience on tablets, this is a meaningful gap. The error detail should be reachable without hover. Options: 1. Add a small "?" icon button that opens a `<dialog>` with the error message 2. Render the error message as a `<details>/<summary>` element within the table row when `run.status === 'FAILED'` 3. At minimum, add `role="tooltip"` and make the trigger focusable The simplest fix that covers keyboard and touch: ```svelte {#if run.errorMessage} <details class="inline"> <summary class="cursor-pointer text-red-700 underline text-xs"> {m.training_error_detail_label()} </summary> <p class="mt-1 text-xs text-red-600">{run.errorMessage}</p> </details> {/if} ``` --- ### Minor: RUNNING badge uses an animated element without reduced-motion check ```svelte <span aria-hidden="true" class="h-1.5 w-1.5 animate-pulse rounded-full bg-yellow-500" ></span> ``` `animate-pulse` will still animate for users who have `prefers-reduced-motion: reduce` set. Given the senior audience, this should be guarded: ```svelte <span aria-hidden="true" class="h-1.5 w-1.5 rounded-full bg-yellow-500 motion-safe:animate-pulse" ></span> ``` Tailwind's `motion-safe:` variant handles this in one class change. --- ### Brand compliance The table styling (`text-ink-2`, `text-ink-3`, `border-line`) uses the project's semantic tokens correctly. The status badge colours (green-100/700, red-100/700, yellow-100/700) are standard semantic status colours, not brand palette — this is appropriate for status indicators.
Author
Owner

Tobias Wendt (@tobiwendt) — DevOps Review

Verdict: APPROVE with one blocker and several findings

The docker-compose.yml changes are mostly correct. The new entrypoint.sh + ensure_blla_model.py pattern is clean and the healthcheck configuration is sensible. One real problem to fix before merge.


What's done well

  • set -euo pipefail in entrypoint.sh — correct fail-fast shell scripting. If ensure_blla_model.py exits non-zero, the container exits and Docker marks it unhealthy. Exactly right.
  • ocr_cache named volume — new volume for ~/.cache prevents model re-downloads on container recreate. Good operational decision.
  • mem_limit: 12g with memswap_limit: 12g — setting both to the same value effectively disables swap for this container, forcing a clean OOM kill rather than thrashing. Correct for an ML workload on a constrained host.
  • BLLA_MODEL_PATH not exposed in compose — the default in code (/app/models/blla.mlmodel) is used, which is correct for the named volume mount at /app/models.
  • APP_OCR_TRAINING_TOKEN correctly added to backend environment — this was the root cause of the 503s. Fixed.

Blocker: Dockerfile base image not pinned to a specific version

FROM python:3.11-slim

:slim without a patch version is equivalent to :latest for reproducibility purposes — it moves on every docker pull. A build that passes today may fail tomorrow after a silent Python patch update changes a system library.

Fix:

FROM python:3.11.9-slim

Pin to the exact patch version. Add a Renovate rule for the ocr-service/Dockerfile to automate future bumps:

{
  "matchPaths": ["ocr-service/Dockerfile"],
  "matchDatasources": ["docker"]
}

Finding: minio and mailpit still use :latest

Pre-existing issue, not introduced by this PR, but worth noting:

image: minio/minio:latest
image: axllent/mailpit:latest

These will be addressed separately but flagging for visibility.


Finding: PostgreSQL data still on bind mount

- ./data/postgres:/var/lib/postgresql/data

Also pre-existing. Named volume (postgres_data:) would be safer in production. Not blocking this PR.


Suggestion: start_period for OCR service healthcheck may be too short on first cold start

healthcheck:
    interval: 10s
    timeout: 5s
    retries: 12
    start_period: 60s

On a cold start with no model cached, ensure_blla_model.py triggers a kraken get download from Zenodo, which can take several minutes on a slow connection. The start_period: 60s plus retries: 12 gives ~180s before Docker marks the container unhealthy. A Zenodo download on a home NAS connection can easily exceed this. Consider start_period: 120s or document in .env.example that the first cold start requires a working internet connection and may take several minutes.


Suggestion: ocr_cache volume not documented

volumes:
    - ocr_models:/app/models
    - ocr_cache:/root/.cache

Add a comment explaining what ocr_cache stores:

- ocr_cache:/root/.cache  # Hugging Face / ketos model download cache — prevents re-downloads on container recreate

Per project infrastructure standards, non-obvious configuration should be commented.

## Tobias Wendt (@tobiwendt) — DevOps Review **Verdict: APPROVE with one blocker and several findings** The `docker-compose.yml` changes are mostly correct. The new `entrypoint.sh` + `ensure_blla_model.py` pattern is clean and the healthcheck configuration is sensible. One real problem to fix before merge. --- ### What's done well - **`set -euo pipefail` in `entrypoint.sh`** — correct fail-fast shell scripting. If `ensure_blla_model.py` exits non-zero, the container exits and Docker marks it unhealthy. Exactly right. - **`ocr_cache` named volume** — new volume for `~/.cache` prevents model re-downloads on container recreate. Good operational decision. - **`mem_limit: 12g` with `memswap_limit: 12g`** — setting both to the same value effectively disables swap for this container, forcing a clean OOM kill rather than thrashing. Correct for an ML workload on a constrained host. - **`BLLA_MODEL_PATH` not exposed in compose** — the default in code (`/app/models/blla.mlmodel`) is used, which is correct for the named volume mount at `/app/models`. - **`APP_OCR_TRAINING_TOKEN` correctly added to backend environment** — this was the root cause of the 503s. Fixed. --- ### Blocker: Dockerfile base image not pinned to a specific version ```dockerfile FROM python:3.11-slim ``` `:slim` without a patch version is equivalent to `:latest` for reproducibility purposes — it moves on every `docker pull`. A build that passes today may fail tomorrow after a silent Python patch update changes a system library. **Fix:** ```dockerfile FROM python:3.11.9-slim ``` Pin to the exact patch version. Add a Renovate rule for the `ocr-service/Dockerfile` to automate future bumps: ```json { "matchPaths": ["ocr-service/Dockerfile"], "matchDatasources": ["docker"] } ``` --- ### Finding: `minio` and `mailpit` still use `:latest` Pre-existing issue, not introduced by this PR, but worth noting: ```yaml image: minio/minio:latest image: axllent/mailpit:latest ``` These will be addressed separately but flagging for visibility. --- ### Finding: PostgreSQL data still on bind mount ```yaml - ./data/postgres:/var/lib/postgresql/data ``` Also pre-existing. Named volume (`postgres_data:`) would be safer in production. Not blocking this PR. --- ### Suggestion: `start_period` for OCR service healthcheck may be too short on first cold start ```yaml healthcheck: interval: 10s timeout: 5s retries: 12 start_period: 60s ``` On a cold start with no model cached, `ensure_blla_model.py` triggers a `kraken get` download from Zenodo, which can take several minutes on a slow connection. The `start_period: 60s` plus `retries: 12` gives ~180s before Docker marks the container unhealthy. A Zenodo download on a home NAS connection can easily exceed this. Consider `start_period: 120s` or document in `.env.example` that the first cold start requires a working internet connection and may take several minutes. --- ### Suggestion: `ocr_cache` volume not documented ```yaml volumes: - ocr_models:/app/models - ocr_cache:/root/.cache ``` Add a comment explaining what `ocr_cache` stores: ```yaml - ocr_cache:/root/.cache # Hugging Face / ketos model download cache — prevents re-downloads on container recreate ``` Per project infrastructure standards, non-obvious configuration should be commented.
marcel added 8 commits 2026-04-14 15:31:05 +02:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError)
  with DEBUG-level fallback for unexpected exceptions — prevents silent masking
  of missing kraken install or AttributeError on vgsl
- _run_segtrain: replace bare except:pass with log.warning so height-check
  fallback is visible in container logs
- New test_ensure_blla_model.py: covers model-OK early return, incompatible
  model rename+replace, and missing model download paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add aria-expanded + aria-controls to expand button (WCAG 4.1.2)
- Add id="training-history-rows" to tbody for aria-controls target
- Replace title= tooltip on FAILED badge with details/summary for keyboard
  and touch accessibility; add training_error_detail_label i18n key
- Use motion-safe:animate-pulse on RUNNING badge for prefers-reduced-motion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(deploy): increase OCR healthcheck start_period, comment ocr_cache volume, add token hint
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 1s
6d7469e9b8
- start_period 60s → 120s: Zenodo download on cold start can exceed 60s on slow connections
- ocr_cache volume comment: documents what the cache stores for future operators
- .env.example: add token generation command to prevent weak placeholder in production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Review feedback addressed — Felix Brandt (@felixbrandt)

All reviewer concerns from the five reviews have been resolved. Here's what was committed in 8 commits on top of the original PR work:


Blockers resolved

B1 — @saraholt: No unit tests for triggerSegTraining()1eaae2c
Added four tests to OcrTrainingServiceTest: 409 conflict guard, 422 threshold guard, happy path (verifies modelName="blla" and cer != null), and failure path (verifies FAILED status + error message).

B2 — @saraholt: No unit tests for ensure_blla_model.pyfdae60a
New test_ensure_blla_model.py covers all three branches: model OK (early return, no download), model incompatible (rename to .incompatible + fresh download), and model missing (download without rename). Verified passing in the Docker container.

B3 — @tobiwendt: Dockerfile base image unpinned29b44e3
FROM python:3.11-slimFROM python:3.11.9-slim

B4 — @leonievoss: Missing aria-expanded on expand/collapse button (WCAG 4.1.2)83900de
Added aria-expanded={expanded}, aria-controls="training-history-rows" to the button and id="training-history-rows" to the <tbody>.


Suggestions resolved

S1 — @felixbrandt: _model_is_loadable catches bare Exceptionfdae60a
Narrowed to (RuntimeError, OSError, ValueError) with an explicit DEBUG-level fallback for unexpected exceptions.

S2 — @felixbrandt + @nullx: Bare except Exception: pass in _run_segtrainfdae60a
Replaced with except Exception as exc: log.warning(...) so the fallback is visible in container logs.

S3 — @felixbrandt: findTop5 vs. 3-item display9b6b6f4
Renamed findTop5ByOrderByCreatedAtDescfindTop10ByOrderByCreatedAtDesc in repository, service, and test stub.

S4 — @mkeller: Duplicate assertNoRunningTraining guarda169409
Extracted to a private assertNoRunningTraining() helper; both triggerTraining() and triggerSegTraining() call it.

S5 — @nullx: .env.example token generation command6d7469e
Added # Generate with: python3 -c "import secrets; print(secrets.token_hex(32))" above the placeholder.

S6 — @leonievoss: animate-pulse without reduced-motion guard83900de
animate-pulsemotion-safe:animate-pulse on the RUNNING badge.

S7 — @leonievoss: FAILED badge error message tooltip-only83900de
Replaced title= with an inline <details>/<summary> block, keyboard- and touch-reachable. Added training_error_detail_label i18n key in de/en/es.

S8 — @saraholt: No component test for expand/collapse toggle06e3ae1
New TrainingHistory.svelte.spec.ts with three tests: shows 3 initially, expands to all on click, hides toggle when ≤3 runs.

S9 — @tobiwendt: start_period too short on cold start6d7469e
start_period: 60sstart_period: 120s

S10 — @tobiwendt: ocr_cache volume not commented6d7469e
Added inline comment explaining what the cache stores.


Verification

  • Backend: ./mvnw test -Dtest=OcrTrainingServiceTest → 10/10
  • Backend build: ./mvnw clean package -DskipTests → BUILD SUCCESS
  • Frontend: TrainingHistory.svelte.spec.ts → 3/3
  • Python: test_ensure_blla_model.py → 3/3 (run in container)
## Review feedback addressed — Felix Brandt (@felixbrandt) All reviewer concerns from the five reviews have been resolved. Here's what was committed in 8 commits on top of the original PR work: --- ### Blockers resolved **B1 — @saraholt: No unit tests for `triggerSegTraining()`** → `1eaae2c` Added four tests to `OcrTrainingServiceTest`: 409 conflict guard, 422 threshold guard, happy path (verifies `modelName="blla"` and `cer != null`), and failure path (verifies `FAILED` status + error message). **B2 — @saraholt: No unit tests for `ensure_blla_model.py`** → `fdae60a` New `test_ensure_blla_model.py` covers all three branches: model OK (early return, no download), model incompatible (rename to `.incompatible` + fresh download), and model missing (download without rename). Verified passing in the Docker container. **B3 — @tobiwendt: Dockerfile base image unpinned** → `29b44e3` `FROM python:3.11-slim` → `FROM python:3.11.9-slim` **B4 — @leonievoss: Missing `aria-expanded` on expand/collapse button (WCAG 4.1.2)** → `83900de` Added `aria-expanded={expanded}`, `aria-controls="training-history-rows"` to the button and `id="training-history-rows"` to the `<tbody>`. --- ### Suggestions resolved **S1 — @felixbrandt: `_model_is_loadable` catches bare `Exception`** → `fdae60a` Narrowed to `(RuntimeError, OSError, ValueError)` with an explicit `DEBUG`-level fallback for unexpected exceptions. **S2 — @felixbrandt + @nullx: Bare `except Exception: pass` in `_run_segtrain`** → `fdae60a` Replaced with `except Exception as exc: log.warning(...)` so the fallback is visible in container logs. **S3 — @felixbrandt: `findTop5` vs. 3-item display** → `9b6b6f4` Renamed `findTop5ByOrderByCreatedAtDesc` → `findTop10ByOrderByCreatedAtDesc` in repository, service, and test stub. **S4 — @mkeller: Duplicate `assertNoRunningTraining` guard** → `a169409` Extracted to a private `assertNoRunningTraining()` helper; both `triggerTraining()` and `triggerSegTraining()` call it. **S5 — @nullx: `.env.example` token generation command** → `6d7469e` Added `# Generate with: python3 -c "import secrets; print(secrets.token_hex(32))"` above the placeholder. **S6 — @leonievoss: `animate-pulse` without reduced-motion guard** → `83900de` `animate-pulse` → `motion-safe:animate-pulse` on the RUNNING badge. **S7 — @leonievoss: FAILED badge error message tooltip-only** → `83900de` Replaced `title=` with an inline `<details>`/`<summary>` block, keyboard- and touch-reachable. Added `training_error_detail_label` i18n key in de/en/es. **S8 — @saraholt: No component test for expand/collapse toggle** → `06e3ae1` New `TrainingHistory.svelte.spec.ts` with three tests: shows 3 initially, expands to all on click, hides toggle when ≤3 runs. **S9 — @tobiwendt: `start_period` too short on cold start** → `6d7469e` `start_period: 60s` → `start_period: 120s` **S10 — @tobiwendt: `ocr_cache` volume not commented** → `6d7469e` Added inline comment explaining what the cache stores. --- ### Verification - Backend: `./mvnw test -Dtest=OcrTrainingServiceTest` → 10/10 ✅ - Backend build: `./mvnw clean package -DskipTests` → BUILD SUCCESS ✅ - Frontend: `TrainingHistory.svelte.spec.ts` → 3/3 ✅ - Python: `test_ensure_blla_model.py` → 3/3 ✅ (run in container)
marcel merged commit e4719b9487 into main 2026-04-14 21:17:54 +02:00
marcel deleted branch fix/ocr-segtrain-training 2026-04-14 21:17:56 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#234