Commit Graph

78 Commits

Author SHA1 Message Date
Marcel
6839cf2a33 docs(ocr): clarify entrypoint comment and add manual run hint for skipped test
- entrypoint.sh: replace "cross-job ground-truth leakage" with plain
  "Remove stale partial downloads left by a previous docker-kill"
- test_tmpdir_is_inside_persistent_cache_volume: add docker exec command
  so future developers know how to run this deployment-contract test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:20:45 +02:00
Marcel
775b5c062e test(ocr): add orphan cleanup behavior tests for entrypoint.sh find -mtime
test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files
verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes
without mocking system time. Also extracts _run_entrypoint helper to remove
subprocess setup duplication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:19:33 +02:00
Marcel
e31dac5c9c test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir
A silent non-zero exit would previously cause the test to pass incorrectly
because only directory creation was checked. Exit code is now the first
assertion, catching regressions before the filesystem check runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:18:14 +02:00
Marcel
c2bd1b34f0 refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI
_validate_zip_entry has no ML-stack dependency; importing it via main.py
pulled in surya/torch and caused the test to be skipped in CI. Moving it
to utils.py (fastapi only) and adding fastapi to the CI lightweight install
lets test_zipslip_still_anchors_under_custom_tmpdir run on every push.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:17:15 +02:00
Marcel
cfd49ff69e docs(ocr): document TMPDIR convention and add ADR-021
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m7s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:58:10 +02:00
Marcel
240b373f68 fix(ocr): create TMPDIR on startup and clear day-old orphans
On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir
ensures the first Surya model download can proceed without ENOSPC on the
512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill
mid-download, preventing cross-job ground-truth leakage.

Fixes #614. See ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:54:17 +02:00
Marcel
09a043431e build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging
Without this, running the image outside compose loses the TMPDIR redirect
and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC).
See issue #614, ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:53:15 +02:00
Marcel
bead6f1811 fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback
os.environ.get(key, default) returns "" when the key exists but is blank —
the default is only used when the key is absent. The or-fallback treats both
absence and blank values as "use the default".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 18:53:26 +02:00
Marcel
fc8b4b164b security(ocr): redirect XDG cache and Torch home away from read-only HOME
Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is
on the read-only container filesystem — fixes Nora's blocker. Also
restores the explanatory comment on the ocr_cache volume mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:30:39 +02:00
Marcel
eb63df2000 test(ocr): add startup root canary tests for main.py lifespan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:29:47 +02:00
Marcel
53bd574660 test(ocr): replace vacuous startswith assertion with equality check
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:26:58 +02:00
Marcel
581ba01d8d security(ocr): log warning on startup when running as root
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m3s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 3m10s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
Adds a canary log line if os.getuid() == 0. Produces an observable
signal in container logs if the USER directive is ever removed from
the Dockerfile, without requiring an external audit tool.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 16:51:00 +02:00
Marcel
9db42d6cc1 fix(ocr): resolve HTRMOPO_DIR from env var, not ~ expansion
With --no-create-home, os.path.expanduser("~") resolves to "/" causing
kraken get to write to /.local/share/htrmopo. Replace with
os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is
explicit and override-friendly without a home directory.

Adds two tests verifying env-var resolution and ~-free default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 16:49:21 +02:00
Marcel
1aca4c4a41 security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile
CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root.
Creates /home/ocr and /app/cache with correct ownership so named
volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME
so ~ expansion and Hugging Face caching resolve under /app, not /root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 16:46:25 +02:00
Marcel
50b18f0849 docs(legibility): fix three review blockers in DOC-7
Some checks failed
CI / Unit & Component Tests (push) Failing after 3m29s
CI / OCR Service Tests (push) Successful in 32s
CI / Backend Unit Tests (push) Failing after 3m29s
- docs/README.md: remove duplicate infrastructure/ entry at end of folder tree
- ocr-service/CLAUDE.md: add **LLM reminder:** prefix to ALLOWED_PDF_HOSTS
  SSRF warning (consistent with all other machine-readable instructions)
- backend/CLAUDE.md: restore ResponseStatusException note for simple controller
  validation — avoids LLMs reaching for DomainException for trivial checks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:41:02 +02:00
Marcel
86c13a230c docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7
Processes all 7 CLAUDE.md files according to the 3-bucket classification.
Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md,
domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md
New `scripts/README.md` with full script documentation (preserving the
⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md`
reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md
New `.devcontainer/README.md` with all configuration, usage, and limitations.
`devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md
New `docs/README.md` covering the folder structure, ADR guide, infrastructure
docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md
Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6).
Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md
- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md
- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client,
  Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md
- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:41:02 +02:00
Marcel
a1b89670c0 docs(legibility): add 18 per-domain README.md files (DOC-6)
Backend (9): document, person, tag, user, geschichte, notification,
ocr, audit, dashboard.
Frontend (8): document, person, tag, user, geschichte, notification,
ocr, shared.
OCR service (1): ocr-service/README.md.

Each README covers: what the domain owns, explicit non-ownership,
public surface (verified by grep against the codebase), internal
layout, and cross-domain dependencies.

Closes #400
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:36:38 +02:00
Marcel
e85057bed2 refactor(document): move document domain core to document/ package
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:39:20 +02:00
Marcel
23cf88856e fix(ocr): guard Kraken block extraction against missing boundary/baseline
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m37s
CI / OCR Service Tests (push) Successful in 32s
CI / Backend Unit Tests (push) Failing after 2m51s
extract_page_blocks() walked `record.boundary` and `record.baseline`
unconditionally, so a record that arrived without either (malformed
kraken output, or a MagicMock in tests that iterates to nothing)
crashed with "min() arg is an empty sequence".

Coerce both attributes through list(), require at least 3 points for
the polygon path, fall back to the baseline path when the polygon is
missing, and skip the record entirely when neither is usable —
emitting no block is safer than emitting one with garbage coordinates.

The test helper now sets `boundary` and `baseline` explicitly to
mirror real Kraken 7.0 records (and so the happy-path test exercises
the polygon branch). A new regression test covers the skip path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 09:33:03 +02:00
Marcel
1f7b712dd0 fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m36s
CI / OCR Service Tests (push) Successful in 33s
CI / Backend Unit Tests (push) Has started running
main.py unifies the call to both engines and always passes
`sender_model_path` (None for non-Kurrent scripts). Surya's
extract_region_text / extract_page_blocks accepted one fewer positional
arg than Kraken's, so every guided-OCR run on a TYPEWRITER or
HANDWRITING_LATIN document raised "takes 5 positional arguments but 6
were given" and the stream returned 0 blocks / 1 skipped page.

Add an ignored `sender_model_path` kwarg to both Surya functions so the
signatures match Kraken's, and guard the regression with two signature
tests in test_engines.py that compare both engines' parameter lists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 09:28:25 +02:00
Marcel
64a854aad6 refactor(ocr): mark _SenderModelRegistry.contains as private (_contains)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:26:46 +02:00
Marcel
84c09e41ef test(ocr): add /train-sender auth tests and run sender registry tests in CI
Add 503/403 auth tests for the /train-sender endpoint, matching the pattern
already used for /train and /segtrain. Also surface test_sender_registry.py
in CI (it needs no ML stack) and add pytest-asyncio to the install step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:14:27 +02:00
Marcel
000079fd50 refactor(ocr): rename _contains to contains in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:53:16 +02:00
Marcel
07035b9fa9 style(ocr): add Image type hints to extract_page_blocks and extract_region_text
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:22:34 +02:00
Marcel
eab37b9ac9 test(ocr): verify load failure does not cache broken entry in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:19:40 +02:00
Marcel
64d27d6d61 feat(ocr): per-sender model registry and /train-sender endpoint
engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos
  train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 18:05:39 +02:00
Marcel
c5e6ed922b test(ocr): decouple correction tests from exact library dictionary state
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 36s
CI / Backend Unit Tests (pull_request) Failing after 2m47s
CI / Unit & Component Tests (push) Failing after 2m33s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m41s
Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:23:09 +02:00
Marcel
ec85f228c1 refactor(ocr): document > 50 frequency threshold rationale
Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:21:37 +02:00
Marcel
fea24aee25 refactor(ocr): make collapse_adjacent_markers a public function
Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.

Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:20:31 +02:00
Marcel
77100ab1e6 feat(ocr): integrate spell-check post-processing for handwriting script types
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:54:17 +02:00
Marcel
092131930c feat(ocr): add spell_check module with German spellchecker and historical wordlist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:52:50 +02:00
Marcel
47f9a0bf73 test(ocr): add failing tests for spell_check module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:51:38 +02:00
Marcel
30a6cbeb7f feat(ocr): add DTA-derived historical German wordlist and generation script
153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:48:26 +02:00
Marcel
6faaa3b7d6 feat(ocr): add pyspellchecker dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:41:24 +02:00
Marcel
77747aa556 refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:40:39 +02:00
Marcel
4cb7c975f5 test(ocr): add resilience tests for tiny image and unexpected exception propagation
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m27s
CI / Backend Unit Tests (pull_request) Failing after 2m37s
CI / Unit & Component Tests (push) Failing after 3m14s
CI / Backend Unit Tests (push) Has been cancelled
Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page
fallback from except Exception to (cv2.error, ValueError, MemoryError) so
programming errors propagate instead of being silently swallowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:16:17 +02:00
Marcel
b310caaeeb feat(ocr): integrate preprocessing into stream and batch endpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:16:47 +02:00
Marcel
615d404ba9 chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:14:47 +02:00
Marcel
7183fc4428 feat(ocr): add image preprocessing module with CLAHE + grayscale + blur
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:13:42 +02:00
Marcel
5c7efef307 fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
74c9046745 fix(ocr): narrow exception handling and add unit tests for ensure_blla_model
- _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError)
  with DEBUG-level fallback for unexpected exceptions — prevents silent masking
  of missing kraken install or AttributeError on vgsl
- _run_segtrain: replace bare except:pass with log.warning so height-check
  fallback is visible in container logs
- New test_ensure_blla_model.py: covers model-OK early return, incompatible
  model rename+replace, and missing model download paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
a5979c4069 fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM
Three issues fixed:

1. --resize both was removed in ketos 7; replaced with --resize union
   which extends the model's class mapping to include training data classes.

2. ketos ignores -s when -i is present, so the 1800px blla model caused
   7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free).
   Now checks the loaded model's input height: only uses the base model
   when it was already fine-tuned at 800px; otherwise trains from scratch
   at 800px (~200 MB peak). After the first run the trained 800px model
   becomes the base for all subsequent fine-tuning runs.

3. segtrain now computes and returns cer = 1 - accuracy, matching the
   recognition training path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
e8375d6c72 fix(ocr-service): add entrypoint that validates blla model format on startup
Adds ensure_blla_model.py which loads the blla segmentation model with
ketos on every container start. If the model is missing or in the legacy
PyTorch ZIP format (incompatible with ketos 7), it re-downloads the
correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569).
The Dockerfile now uses entrypoint.sh which runs this check before
starting uvicorn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
30a17c97e8 fix(ocr): fail closed when TRAINING_TOKEN is not configured
_check_training_token previously skipped auth when TRAINING_TOKEN was
empty, allowing unauthenticated requests to reach /train and /segtrain.
Now returns 503 ("Training not configured on this node") when the token
is absent, so missing configuration fails closed rather than open.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:02:13 +02:00
Marcel
62be895b9e fix(ocr): drop uvicorn workers from 2 to 1
Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the
8 GB memory cap and causing OOM on the first /train call. Two OS
processes also cause model-state divergence after training, contradicting
the single-node constraint documented in ADR-001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 09:55:55 +02:00
Marcel
669f2f8b98 fix(training): output CoreML format and fix best-model finder
ketos 7 defaults to safetensors output, but kraken's load_any() only
handles CoreML (.mlmodel). Adding --weights-format coreml ensures the
hot-swap after training produces a file that load_any() can parse.

Also fixed _find_best_model to look for best_<score>.mlmodel (produced
by --weights-format coreml) in addition to the previous checkpoint_*
pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:57:42 +02:00
Marcel
49c9022285 fix(training): switch to PAGE XML format for kurrent recognition training
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.

- Java export now writes .xml alongside .png (PAGE XML with TextLine,
  Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (&amp; &lt; &gt;)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
  OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:45:08 +02:00
Marcel
94b9c56527 fix(segtrain): reduce input height to 800px on first run to avoid OOM
ketos segtrain has no batch-size flag (-B), so with the default 1800px
input height the intermediate CNN feature maps consume ~500 MB+ per
image, causing the kernel OOM-killer (exit -9) to terminate the process.

On first run (no existing blla.mlmodel), override the VGSL spec to use
800px height instead. Subsequent runs load the saved model with
--resize both, preserving incremental fine-tuning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:37:24 +02:00
Marcel
89a18c430e fix(training): limit CPU threads and epochs to prevent RAM exhaustion
Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2
(--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a
laptop OOM-killed the container. 10 epochs is sufficient for incremental
fine-tuning runs; more data is added over time and training re-run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:09:13 +02:00
Marcel
8dec5b5976 fix(training): disable DataLoader workers in subprocess training
DataLoader worker subprocesses crash inside Docker due to multiprocessing
fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain
so data loading runs in the main process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 20:58:32 +02:00