familienarchiv

Author	SHA1	Message	Date
Marcel	6839cf2a33	docs(ocr): clarify entrypoint comment and add manual run hint for skipped test - entrypoint.sh: replace "cross-job ground-truth leakage" with plain "Remove stale partial downloads left by a previous docker-kill" - test_tmpdir_is_inside_persistent_cache_volume: add docker exec command so future developers know how to run this deployment-contract test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:20:45 +02:00
Marcel	775b5c062e	test(ocr): add orphan cleanup behavior tests for entrypoint.sh find -mtime test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes without mocking system time. Also extracts _run_entrypoint helper to remove subprocess setup duplication. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:19:33 +02:00
Marcel	e31dac5c9c	test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir A silent non-zero exit would previously cause the test to pass incorrectly because only directory creation was checked. Exit code is now the first assertion, catching regressions before the filesystem check runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:18:14 +02:00
Marcel	c2bd1b34f0	refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI _validate_zip_entry has no ML-stack dependency; importing it via main.py pulled in surya/torch and caused the test to be skipped in CI. Moving it to utils.py (fastapi only) and adding fastapi to the CI lightweight install lets test_zipslip_still_anchors_under_custom_tmpdir run on every push. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:17:15 +02:00
Marcel	cfd49ff69e	docs(ocr): document TMPDIR convention and add ADR-021 - ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows to the environment variables table - ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume - docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision, trade-offs, and rejected alternatives (Approach B / C) for issue #614 - ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:58:10 +02:00
Marcel	240b373f68	fix(ocr): create TMPDIR on startup and clear day-old orphans On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir ensures the first Surya model download can proceed without ENOSPC on the 512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill mid-download, preventing cross-job ground-truth leakage. Fixes #614. See ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:54:17 +02:00
Marcel	09a043431e	build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging Without this, running the image outside compose loses the TMPDIR redirect and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC). See issue #614, ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:53:15 +02:00
Marcel	bead6f1811	fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback os.environ.get(key, default) returns "" when the key exists but is blank — the default is only used when the key is absent. The or-fallback treats both absence and blank values as "use the default". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 18:53:26 +02:00
Marcel	fc8b4b164b	security(ocr): redirect XDG cache and Torch home away from read-only HOME Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is on the read-only container filesystem — fixes Nora's blocker. Also restores the explanatory comment on the ocr_cache volume mount. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:30:39 +02:00
Marcel	eb63df2000	test(ocr): add startup root canary tests for main.py lifespan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:29:47 +02:00
Marcel	53bd574660	test(ocr): replace vacuous startswith assertion with equality check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:26:58 +02:00
Marcel	581ba01d8d	security(ocr): log warning on startup when running as root Adds a canary log line if os.getuid() == 0. Produces an observable signal in container logs if the USER directive is ever removed from the Dockerfile, without requiring an external audit tool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:51:00 +02:00
Marcel	9db42d6cc1	fix(ocr): resolve HTRMOPO_DIR from env var, not ~ expansion With --no-create-home, os.path.expanduser("~") resolves to "/" causing kraken get to write to /.local/share/htrmopo. Replace with os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is explicit and override-friendly without a home directory. Adds two tests verifying env-var resolution and ~-free default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:49:21 +02:00
Marcel	1aca4c4a41	security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root. Creates /home/ocr and /app/cache with correct ownership so named volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME so ~ expansion and Hugging Face caching resolve under /app, not /root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:46:25 +02:00
Marcel	50b18f0849	docs(legibility): fix three review blockers in DOC-7 - docs/README.md: remove duplicate infrastructure/ entry at end of folder tree - ocr-service/CLAUDE.md: add LLM reminder: prefix to ALLOWED_PDF_HOSTS SSRF warning (consistent with all other machine-readable instructions) - backend/CLAUDE.md: restore ResponseStatusException note for simple controller validation — avoids LLMs reaching for DomainException for trivial checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	86c13a230c	docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7 Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last. ### scripts/CLAUDE.md → scripts/README.md New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder. ### .devcontainer/CLAUDE.md → .devcontainer/README.md New `.devcontainer/README.md` with all configuration, usage, and limitations. `devcontainer/CLAUDE.md` reduced to a single pointer line. ### docs/CLAUDE.md → docs/README.md New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder. ### ocr-service/CLAUDE.md Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk. 
### backend/CLAUDE.md - Layering Rules → pointer to docs/ARCHITECTURE.md - Error Handling → pointer to CONTRIBUTING.md + reminder - Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder - Package Structure → tagged TODO post-REFACTOR-1 - Fixed errors.ts path to frontend/src/lib/shared/errors.ts - Added ANNOTATE_ALL + BLOG_WRITE to permission list - Key Entities, Entity Code Style, Services → kept (Bucket-2) ### root CLAUDE.md - Stack, Infrastructure, Dev Container → pointers - Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders - Package Structure → tagged TODO post-REFACTOR-1 - Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2) ### frontend/CLAUDE.md - API Client Pattern, Date Handling → pointers + reminders - Key UI Components → pointer to domain READMEs - Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	a1b89670c0	docs(legibility): add 18 per-domain README.md files (DOC-6) Backend (9): document, person, tag, user, geschichte, notification, ocr, audit, dashboard. Frontend (8): document, person, tag, user, geschichte, notification, ocr, shared. OCR service (1): ocr-service/README.md. Each README covers: what the domain owns, explicit non-ownership, public surface (verified by grep against the codebase), internal layout, and cross-domain dependencies. Closes #400 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:36:38 +02:00
Marcel	e85057bed2	refactor(document): move document domain core to document/ package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:39:20 +02:00
Marcel	23cf88856e	fix(ocr): guard Kraken block extraction against missing boundary/baseline extract_page_blocks() walked `record.boundary` and `record.baseline` unconditionally, so a record that arrived without either (malformed kraken output, or a MagicMock in tests that iterates to nothing) crashed with "min() arg is an empty sequence". Coerce both attributes through list(), require at least 3 points for the polygon path, fall back to the baseline path when the polygon is missing, and skip the record entirely when neither is usable — emitting no block is safer than emitting one with garbage coordinates. The test helper now sets `boundary` and `baseline` explicitly to mirror real Kraken 7.0 records (and so the happy-path test exercises the polygon branch). A new regression test covers the skip path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:33:03 +02:00
Marcel	1f7b712dd0	fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works main.py unifies the call to both engines and always passes `sender_model_path` (None for non-Kurrent scripts). Surya's extract_region_text / extract_page_blocks accepted one fewer positional arg than Kraken's, so every guided-OCR run on a TYPEWRITER or HANDWRITING_LATIN document raised "takes 5 positional arguments but 6 were given" and the stream returned 0 blocks / 1 skipped page. Add an ignored `sender_model_path` kwarg to both Surya functions so the signatures match Kraken's, and guard the regression with two signature tests in test_engines.py that compare both engines' parameter lists. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:28:25 +02:00
Marcel	64a854aad6	refactor(ocr): mark _SenderModelRegistry.contains as private (_contains) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:26:46 +02:00
Marcel	84c09e41ef	test(ocr): add /train-sender auth tests and run sender registry tests in CI Add 503/403 auth tests for the /train-sender endpoint, matching the pattern already used for /train and /segtrain. Also surface test_sender_registry.py in CI (it needs no ML stack) and add pytest-asyncio to the install step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:14:27 +02:00
Marcel	000079fd50	refactor(ocr): rename _contains to contains in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:53:16 +02:00
Marcel	07035b9fa9	style(ocr): add Image type hints to extract_page_blocks and extract_region_text Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:22:34 +02:00
Marcel	eab37b9ac9	test(ocr): verify load failure does not cache broken entry in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:19:40 +02:00
Marcel	64d27d6d61	feat(ocr): per-sender model registry and /train-sender endpoint engines/kraken.py: - Add _SenderModelRegistry with LRU eviction (max configurable via OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(), and path whitelist (/app/models/ only) - Add _load_sender_model() helper for testability - extract_page_blocks() and extract_region_text() accept optional sender_model_path; route to sender registry when provided models.py: - OcrRequest gains senderModelPath: str | None = None field main.py: - /ocr and /ocr/stream pass request.senderModelPath to Kraken engine - New /train-sender endpoint: validates output_model_path, runs ketos train with base model as starting point, invalidates sender cache docker-compose.yml: - Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment test_sender_registry.py: - 4 tests: cache hit, LRU eviction, invalidate, path traversal guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:05:39 +02:00
Marcel	c5e6ed922b	test(ocr): decouple correction tests from exact library dictionary state Replace exact-string assertions in test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections with structural assertions that verify behavior (correction attempted, marker present, expected stem) without coupling to a specific pyspellchecker version's frequency weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:23:09 +02:00
Marcel	ec85f228c1	refactor(ocr): document > 50 frequency threshold rationale Strict greater-than avoids non-determinism: if multiple candidates share the minimum frequency value, pyspellchecker's ranking is undefined. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:21:37 +02:00
Marcel	fea24aee25	refactor(ocr): make collapse_adjacent_markers a public function Drop underscore prefix — the helper is part of confidence.py's effective public API since spell_check.py imports and calls it directly. Fixes reviewer concern: importing a _-prefixed name across module boundaries contradicts Python's private-by-convention signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:20:31 +02:00
Marcel	77100ab1e6	feat(ocr): integrate spell-check post-processing for handwriting script types Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:54:17 +02:00
Marcel	092131930c	feat(ocr): add spell_check module with German spellchecker and historical wordlist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:52:50 +02:00
Marcel	47f9a0bf73	test(ocr): add failing tests for spell_check module Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:51:38 +02:00
Marcel	30a6cbeb7f	feat(ocr): add DTA-derived historical German wordlist and generation script 153K words from dtak+dtae 1800-1899 corpora (min_freq=20), covering pre-reform spellings common in Kurrent/Sütterlin documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:48:26 +02:00
Marcel	6faaa3b7d6	feat(ocr): add pyspellchecker dependency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:41:24 +02:00
Marcel	77747aa556	refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:40:39 +02:00
Marcel	4cb7c975f5	test(ocr): add resilience tests for tiny image and unexpected exception propagation Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page fallback from except Exception to (cv2.error, ValueError, MemoryError) so programming errors propagate instead of being silently swallowed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 15:16:17 +02:00
Marcel	b310caaeeb	feat(ocr): integrate preprocessing into stream and batch endpoints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:16:47 +02:00
Marcel	615d404ba9	chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:14:47 +02:00
Marcel	7183fc4428	feat(ocr): add image preprocessing module with CLAHE + grayscale + blur Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:13:42 +02:00
Marcel	5c7efef307	fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	74c9046745	fix(ocr): narrow exception handling and add unit tests for ensure_blla_model - _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError) with DEBUG-level fallback for unexpected exceptions — prevents silent masking of missing kraken install or AttributeError on vgsl - _run_segtrain: replace bare except:pass with log.warning so height-check fallback is visible in container logs - New test_ensure_blla_model.py: covers model-OK early return, incompatible model rename+replace, and missing model download paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	a5979c4069	fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM Three issues fixed: 1. --resize both was removed in ketos 7; replaced with --resize union which extends the model's class mapping to include training data classes. 2. ketos ignores -s when -i is present, so the 1800px blla model caused 7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free). Now checks the loaded model's input height: only uses the base model when it was already fine-tuned at 800px; otherwise trains from scratch at 800px (~200 MB peak). After the first run the trained 800px model becomes the base for all subsequent fine-tuning runs. 3. segtrain now computes and returns cer = 1 - accuracy, matching the recognition training path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	e8375d6c72	fix(ocr-service): add entrypoint that validates blla model format on startup Adds ensure_blla_model.py which loads the blla segmentation model with ketos on every container start. If the model is missing or in the legacy PyTorch ZIP format (incompatible with ketos 7), it re-downloads the correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569). The Dockerfile now uses entrypoint.sh which runs this check before starting uvicorn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	30a17c97e8	fix(ocr): fail closed when TRAINING_TOKEN is not configured _check_training_token previously skipped auth when TRAINING_TOKEN was empty, allowing unauthenticated requests to reach /train and /segtrain. Now returns 503 ("Training not configured on this node") when the token is absent, so missing configuration fails closed rather than open. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:02:13 +02:00
Marcel	62be895b9e	fix(ocr): drop uvicorn workers from 2 to 1 Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the 8 GB memory cap and causing OOM on the first /train call. Two OS processes also cause model-state divergence after training, contradicting the single-node constraint documented in ADR-001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 09:55:55 +02:00
Marcel	669f2f8b98	fix(training): output CoreML format and fix best-model finder ketos 7 defaults to safetensors output, but kraken's load_any() only handles CoreML (.mlmodel). Adding --weights-format coreml ensures the hot-swap after training produces a file that load_any() can parse. Also fixed _find_best_model to look for best_<score>.mlmodel (produced by --weights-format coreml) in addition to the previous checkpoint_* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:57:42 +02:00
Marcel	49c9022285	fix(training): switch to PAGE XML format for kurrent recognition training Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard. - Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription) - XML special characters in transcription text are escaped (& < >) - Python trainer globs *.xml and passes -f page to ketos train - Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history) - Updated and extended TrainingDataExportServiceTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:45:08 +02:00
Marcel	94b9c56527	fix(segtrain): reduce input height to 800px on first run to avoid OOM ketos segtrain has no batch-size flag (-B), so with the default 1800px input height the intermediate CNN feature maps consume ~500 MB+ per image, causing the kernel OOM-killer (exit -9) to terminate the process. On first run (no existing blla.mlmodel), override the VGSL spec to use 800px height instead. Subsequent runs load the saved model with --resize both, preserving incremental fine-tuning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:37:24 +02:00
Marcel	89a18c430e	fix(training): limit CPU threads and epochs to prevent RAM exhaustion Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2 (--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a laptop OOM-killed the container. 10 epochs is sufficient for incremental fine-tuning runs; more data is added over time and training re-run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:09:13 +02:00
Marcel	8dec5b5976	fix(training): disable DataLoader workers in subprocess training DataLoader worker subprocesses crash inside Docker due to multiprocessing fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain so data loading runs in the main process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:58:32 +02:00
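
The or-fallback described in commit bead6f1811 (and the default path named in 9db42d6cc1) can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper names `resolve_htrmopo_dir` and `resolve_with_get_default` and the `environ` parameter are assumptions introduced here so the behavior can be exercised without touching the real process environment.

```python
import os

# Default path as quoted in the commit message for 9db42d6cc1.
DEFAULT_HTRMOPO_DIR = "/app/models/.htrmopo"


def resolve_htrmopo_dir(environ=None):
    """Hypothetical helper: return HTRMOPO_DIR, or the default when it is unset or blank."""
    env = os.environ if environ is None else environ
    # `or` falls back for both a missing key (None) and an empty string.
    return env.get("HTRMOPO_DIR") or DEFAULT_HTRMOPO_DIR


def resolve_with_get_default(environ):
    """The pre-fix pattern, for comparison: a blank value wins over the default."""
    return environ.get("HTRMOPO_DIR", DEFAULT_HTRMOPO_DIR)
```

With a mapping such as `{"HTRMOPO_DIR": ""}`, the first helper yields the default path while the second yields the empty string, which is the difference the commit message describes.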

78 Commits
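
The testing technique named in commit 775b5c062e, fabricating old modification times with `os.utime()` instead of mocking system time, might look roughly like the sketch below. It is an assumption-laden illustration: the real cleanup runs `find -mtime +1 -delete` inside entrypoint.sh, whereas the hypothetical functions here (`set_age_in_days`, `matches_mtime_plus_one`, `remove_orphans`) only mirror that rule in Python. `find -mtime +1` drops the fractional part of a file's age in 24-hour periods, so a file must be at least two days old to match.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def set_age_in_days(path, days, now=None):
    """Backdate a file's atime and mtime by `days`, without mocking the clock."""
    now = time.time() if now is None else now
    old = now - days * SECONDS_PER_DAY
    os.utime(path, (old, old))


def matches_mtime_plus_one(path, now=None):
    """Approximate `find -mtime +1`: whole 24-hour periods of age must exceed 1."""
    now = time.time() if now is None else now
    age_days = int((now - os.path.getmtime(path)) // SECONDS_PER_DAY)
    return age_days > 1


def remove_orphans(directory, now=None):
    """Delete regular files in `directory` that match the rule; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and matches_mtime_plus_one(path, now):
            os.remove(path)
            removed.append(name)
    return removed
```

A test can then create one backdated file and one fresh file in a `tempfile.TemporaryDirectory()`, call `remove_orphans`, and assert that only the backdated file is gone.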