familienarchiv

Author	SHA1	Message	Date
Marcel	c2bd1b34f0	refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI _validate_zip_entry has no ML-stack dependency; importing it via main.py pulled in surya/torch and caused the test to be skipped in CI. Moving it to utils.py (fastapi only) and adding fastapi to the CI lightweight install lets test_zipslip_still_anchors_under_custom_tmpdir run on every push. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:17:15 +02:00
Marcel	cfd49ff69e	docs(ocr): document TMPDIR convention and add ADR-021 - ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows to the environment variables table - ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume - docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision, trade-offs, and rejected alternatives (Approach B / C) for issue #614 - ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:58:10 +02:00
Marcel	240b373f68	fix(ocr): create TMPDIR on startup and clear day-old orphans On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir ensures the first Surya model download can proceed without ENOSPC on the 512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill mid-download, preventing cross-job ground-truth leakage. Fixes #614. See ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:54:17 +02:00
Marcel	09a043431e	build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging Without this, running the image outside compose loses the TMPDIR redirect and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC). See issue #614, ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:53:15 +02:00
Marcel	bead6f1811	fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback os.environ.get(key, default) returns "" when the key exists but is blank — the default is only used when the key is absent. The or-fallback treats both absence and blank values as "use the default". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 18:53:26 +02:00
Marcel	fc8b4b164b	security(ocr): redirect XDG cache and Torch home away from read-only HOME Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is on the read-only container filesystem — fixes Nora's blocker. Also restores the explanatory comment on the ocr_cache volume mount. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:30:39 +02:00
Marcel	eb63df2000	test(ocr): add startup root canary tests for main.py lifespan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:29:47 +02:00
Marcel	53bd574660	test(ocr): replace vacuous startswith assertion with equality check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:26:58 +02:00
Marcel	581ba01d8d	security(ocr): log warning on startup when running as root Adds a canary log line if os.getuid() == 0. Produces an observable signal in container logs if the USER directive is ever removed from the Dockerfile, without requiring an external audit tool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:51:00 +02:00
Marcel	9db42d6cc1	fix(ocr): resolve HTRMOPO_DIR from env var, not ~ expansion With --no-create-home, os.path.expanduser("~") resolves to "/" causing kraken get to write to /.local/share/htrmopo. Replace with os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is explicit and override-friendly without a home directory. Adds two tests verifying env-var resolution and ~-free default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:49:21 +02:00
Marcel	1aca4c4a41	security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root. Creates /home/ocr and /app/cache with correct ownership so named volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME so ~ expansion and Hugging Face caching resolve under /app, not /root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:46:25 +02:00
Marcel	50b18f0849	docs(legibility): fix three review blockers in DOC-7 - docs/README.md: remove duplicate infrastructure/ entry at end of folder tree - ocr-service/CLAUDE.md: add LLM reminder: prefix to ALLOWED_PDF_HOSTS SSRF warning (consistent with all other machine-readable instructions) - backend/CLAUDE.md: restore ResponseStatusException note for simple controller validation — avoids LLMs reaching for DomainException for trivial checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	86c13a230c	docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7 Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last. ### scripts/CLAUDE.md → scripts/README.md New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder. ### .devcontainer/CLAUDE.md → .devcontainer/README.md New `.devcontainer/README.md` with all configuration, usage, and limitations. `.devcontainer/CLAUDE.md` reduced to a single pointer line. ### docs/CLAUDE.md → docs/README.md New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder. ### ocr-service/CLAUDE.md Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.
### backend/CLAUDE.md - Layering Rules → pointer to docs/ARCHITECTURE.md - Error Handling → pointer to CONTRIBUTING.md + reminder - Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder - Package Structure → tagged TODO post-REFACTOR-1 - Fixed errors.ts path to frontend/src/lib/shared/errors.ts - Added ANNOTATE_ALL + BLOG_WRITE to permission list - Key Entities, Entity Code Style, Services → kept (Bucket-2) ### root CLAUDE.md - Stack, Infrastructure, Dev Container → pointers - Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders - Package Structure → tagged TODO post-REFACTOR-1 - Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2) ### frontend/CLAUDE.md - API Client Pattern, Date Handling → pointers + reminders - Key UI Components → pointer to domain READMEs - Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	a1b89670c0	docs(legibility): add 18 per-domain README.md files (DOC-6) Backend (9): document, person, tag, user, geschichte, notification, ocr, audit, dashboard. Frontend (8): document, person, tag, user, geschichte, notification, ocr, shared. OCR service (1): ocr-service/README.md. Each README covers: what the domain owns, explicit non-ownership, public surface (verified by grep against the codebase), internal layout, and cross-domain dependencies. Closes #400 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:36:38 +02:00
Marcel	e85057bed2	refactor(document): move document domain core to document/ package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:39:20 +02:00
Marcel	23cf88856e	fix(ocr): guard Kraken block extraction against missing boundary/baseline extract_page_blocks() walked `record.boundary` and `record.baseline` unconditionally, so a record that arrived without either (malformed kraken output, or a MagicMock in tests that iterates to nothing) crashed with "min() arg is an empty sequence". Coerce both attributes through list(), require at least 3 points for the polygon path, fall back to the baseline path when the polygon is missing, and skip the record entirely when neither is usable — emitting no block is safer than emitting one with garbage coordinates. The test helper now sets `boundary` and `baseline` explicitly to mirror real Kraken 7.0 records (and so the happy-path test exercises the polygon branch). A new regression test covers the skip path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:33:03 +02:00
Marcel	1f7b712dd0	fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works main.py unifies the call to both engines and always passes `sender_model_path` (None for non-Kurrent scripts). Surya's extract_region_text / extract_page_blocks accepted one fewer positional arg than Kraken's, so every guided-OCR run on a TYPEWRITER or HANDWRITING_LATIN document raised "takes 5 positional arguments but 6 were given" and the stream returned 0 blocks / 1 skipped page. Add an ignored `sender_model_path` kwarg to both Surya functions so the signatures match Kraken's, and guard the regression with two signature tests in test_engines.py that compare both engines' parameter lists. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:28:25 +02:00
Marcel	64a854aad6	refactor(ocr): mark _SenderModelRegistry.contains as private (_contains) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:26:46 +02:00
Marcel	84c09e41ef	test(ocr): add /train-sender auth tests and run sender registry tests in CI Add 503/403 auth tests for the /train-sender endpoint, matching the pattern already used for /train and /segtrain. Also surface test_sender_registry.py in CI (it needs no ML stack) and add pytest-asyncio to the install step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:14:27 +02:00
Marcel	000079fd50	refactor(ocr): rename _contains to contains in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:53:16 +02:00
Marcel	07035b9fa9	style(ocr): add Image type hints to extract_page_blocks and extract_region_text Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:22:34 +02:00
Marcel	eab37b9ac9	test(ocr): verify load failure does not cache broken entry in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:19:40 +02:00
Marcel	64d27d6d61	feat(ocr): per-sender model registry and /train-sender endpoint engines/kraken.py: - Add _SenderModelRegistry with LRU eviction (max configurable via OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(), and path whitelist (/app/models/ only) - Add _load_sender_model() helper for testability - extract_page_blocks() and extract_region_text() accept optional sender_model_path; route to sender registry when provided models.py: - OcrRequest gains senderModelPath: str | None = None field main.py: - /ocr and /ocr/stream pass request.senderModelPath to Kraken engine - New /train-sender endpoint: validates output_model_path, runs ketos train with base model as starting point, invalidates sender cache docker-compose.yml: - Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment test_sender_registry.py: - 4 tests: cache hit, LRU eviction, invalidate, path traversal guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:05:39 +02:00
Marcel	c5e6ed922b	test(ocr): decouple correction tests from exact library dictionary state Replace exact-string assertions in test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections with structural assertions that verify behavior (correction attempted, marker present, expected stem) without coupling to a specific pyspellchecker version's frequency weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:23:09 +02:00
Marcel	ec85f228c1	refactor(ocr): document > 50 frequency threshold rationale Strict greater-than avoids non-determinism: if multiple candidates share the minimum frequency value, pyspellchecker's ranking is undefined. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:21:37 +02:00
Marcel	fea24aee25	refactor(ocr): make collapse_adjacent_markers a public function Drop underscore prefix — the helper is part of confidence.py's effective public API since spell_check.py imports and calls it directly. Fixes reviewer concern: importing a _-prefixed name across module boundaries contradicts Python's private-by-convention signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:20:31 +02:00
Marcel	77100ab1e6	feat(ocr): integrate spell-check post-processing for handwriting script types Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:54:17 +02:00
Marcel	092131930c	feat(ocr): add spell_check module with German spellchecker and historical wordlist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:52:50 +02:00
Marcel	47f9a0bf73	test(ocr): add failing tests for spell_check module Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:51:38 +02:00
Marcel	30a6cbeb7f	feat(ocr): add DTA-derived historical German wordlist and generation script 153K words from dtak+dtae 1800-1899 corpora (min_freq=20), covering pre-reform spellings common in Kurrent/Sütterlin documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:48:26 +02:00
Marcel	6faaa3b7d6	feat(ocr): add pyspellchecker dependency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:41:24 +02:00
Marcel	77747aa556	refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:40:39 +02:00
Marcel	4cb7c975f5	test(ocr): add resilience tests for tiny image and unexpected exception propagation Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page fallback from except Exception to (cv2.error, ValueError, MemoryError) so programming errors propagate instead of being silently swallowed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 15:16:17 +02:00
Marcel	b310caaeeb	feat(ocr): integrate preprocessing into stream and batch endpoints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:16:47 +02:00
Marcel	615d404ba9	chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:14:47 +02:00
Marcel	7183fc4428	feat(ocr): add image preprocessing module with CLAHE + grayscale + blur Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:13:42 +02:00
Marcel	5c7efef307	fix(ocr): pin Dockerfile base image to python:3.11.9-slim for reproducible builds Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	74c9046745	fix(ocr): narrow exception handling and add unit tests for ensure_blla_model - _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError) with DEBUG-level fallback for unexpected exceptions — prevents silent masking of missing kraken install or AttributeError on vgsl - _run_segtrain: replace bare except:pass with log.warning so height-check fallback is visible in container logs - New test_ensure_blla_model.py: covers model-OK early return, incompatible model rename+replace, and missing model download paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	a5979c4069	fix(ocr-service): fix ketos 7 segtrain compatibility and prevent OOM Three issues fixed: 1. --resize both was removed in ketos 7; replaced with --resize union which extends the model's class mapping to include training data classes. 2. ketos ignores -s when -i is present, so the 1800px blla model caused 7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free). Now checks the loaded model's input height: only uses the base model when it was already fine-tuned at 800px; otherwise trains from scratch at 800px (~200 MB peak). After the first run the trained 800px model becomes the base for all subsequent fine-tuning runs. 3. segtrain now computes and returns cer = 1 - accuracy, matching the recognition training path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	e8375d6c72	fix(ocr-service): add entrypoint that validates blla model format on startup Adds ensure_blla_model.py which loads the blla segmentation model with ketos on every container start. If the model is missing or in the legacy PyTorch ZIP format (incompatible with ketos 7), it re-downloads the correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569). The Dockerfile now uses entrypoint.sh which runs this check before starting uvicorn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	30a17c97e8	fix(ocr): fail closed when TRAINING_TOKEN is not configured _check_training_token previously skipped auth when TRAINING_TOKEN was empty, allowing unauthenticated requests to reach /train and /segtrain. Now returns 503 ("Training not configured on this node") when the token is absent, so missing configuration fails closed rather than open. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:02:13 +02:00
Marcel	62be895b9e	fix(ocr): drop uvicorn workers from 2 to 1 Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the 8 GB memory cap and causing OOM on the first /train call. Two OS processes also cause model-state divergence after training, contradicting the single-node constraint documented in ADR-001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 09:55:55 +02:00
Marcel	669f2f8b98	fix(training): output CoreML format and fix best-model finder ketos 7 defaults to safetensors output, but kraken's load_any() only handles CoreML (.mlmodel). Adding --weights-format coreml ensures the hot-swap after training produces a file that load_any() can parse. Also fixed _find_best_model to look for best_<score>.mlmodel (produced by --weights-format coreml) in addition to the previous checkpoint_* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:57:42 +02:00
Marcel	49c9022285	fix(training): switch to PAGE XML format for kurrent recognition training Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard. - Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription) - XML special characters in transcription text are escaped (& < >) - Python trainer globs *.xml and passes -f page to ketos train - Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history) - Updated and extended TrainingDataExportServiceTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:45:08 +02:00
Marcel	94b9c56527	fix(segtrain): reduce input height to 800px on first run to avoid OOM ketos segtrain has no batch-size flag (-B), so with the default 1800px input height the intermediate CNN feature maps consume ~500 MB+ per image, causing the kernel OOM-killer (exit -9) to terminate the process. On first run (no existing blla.mlmodel), override the VGSL spec to use 800px height instead. Subsequent runs load the saved model with --resize both, preserving incremental fine-tuning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:37:24 +02:00
Marcel	89a18c430e	fix(training): limit CPU threads and epochs to prevent RAM exhaustion Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2 (--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a laptop OOM-killed the container. 10 epochs is sufficient for incremental fine-tuning runs; more data is added over time and training re-run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:09:13 +02:00
Marcel	8dec5b5976	fix(training): disable DataLoader workers in subprocess training DataLoader worker subprocesses crash inside Docker due to multiprocessing fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain so data loading runs in the main process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:58:32 +02:00
Marcel	e33164c4aa	fix(training): use ketos CLI subprocess instead of missing Python API kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are only exposed as CLI commands. Rewrites both training functions to invoke `ketos train` / `ketos segtrain` via subprocess and parse the best val_metric from checkpoint filenames. Also fixes the OcrTrainingCard history so it only shows non-blla runs (recognition model), matching SegmentationTrainingCard which already filtered to blla-only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:50:21 +02:00
Marcel	22954f348a	feat(training): track and display CER per training run After each training run, the Character Error Rate (CER = 1 - accuracy), loss, accuracy, and epoch count are now stored on the OcrTrainingRun record and shown in the training history table. Also adds the missing POST /api/ocr/segtrain endpoint and the triggerSegTraining service method so the segmentation training card can actually trigger training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 19:01:10 +02:00
Marcel	3e34366702	fix(ocr): use cw-1/ch-1 for synthetic baseline bounds to pass Kraken's >= check Kraken's segmentation bounds check rejects coordinates where any point satisfies x >= im.width or y >= im.height (strictly >=, not >). Using (cw, ch) as the boundary corner was triggering this for every crop. Changed to (cw-1, ch-1) so all coordinates are strictly inside the image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:21:00 +02:00


75 Commits
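
Commit bead6f1811 above describes why `os.environ.get(key, default)` is not enough when a variable is set but blank. The following is a minimal sketch of that or-fallback, not the repository's actual code: the function name `resolve_htrmopo_dir` and the injectable `env` parameter are hypothetical, added only so the behaviour can be checked without touching the real environment. The default path is the one quoted in commit 9db42d6cc1.

```python
import os

# Default quoted in commit 9db42d6cc1; everything else here is illustrative.
DEFAULT_HTRMOPO_DIR = "/app/models/.htrmopo"


def resolve_htrmopo_dir(env=None):
    """Return HTRMOPO_DIR, treating an unset or blank variable as 'use the default'.

    `env` is a hypothetical injection point (any mapping); it falls back to
    os.environ when omitted.
    """
    env = os.environ if env is None else env
    # env.get("HTRMOPO_DIR", DEFAULT_HTRMOPO_DIR) would return "" for a blank
    # value, because the default only applies when the key is absent.
    # The `or` covers both the absent key (None) and the blank value ("").
    return env.get("HTRMOPO_DIR") or DEFAULT_HTRMOPO_DIR
```

Called with an empty mapping or with `{"HTRMOPO_DIR": ""}`, the sketch returns the default path; an explicit non-empty value is passed through unchanged.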