familienarchiv

Author	SHA1	Message	Date
Marcel	67368b4413	docs(ocr): annotate metrics binding + /metrics exposure + pin client Three small drops that pay back later: - Note that main.metrics is import-time bound and tests must monkeypatch `main.metrics`, not the registry. - Flag the /metrics endpoint as unauthenticated and cross-link the Caddy-block snippet in docs/OBSERVABILITY.md. - Pin prometheus-client to the exact 0.25.0 patch version already resolved by prometheus-fastapi-instrumentator 7.0.0, so an upstream bump cannot silently slip in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:04:28 +02:00
Marcel	ddf6cf4cbc	test(ocr): collapse shared client setup into ocr_client helper Each metrics test was repeating the same five-line block — patch kraken_engine.load_models, patch load_spell_checker, instantiate the AsyncClient, force _models_ready True, restore it. Lift the lot into a single async context manager so each test body shrinks to its real arrange / act / assert intent. Tests that drive the lifespan directly (models_ready gauge) or stub asyncio.to_thread for /train (which already patches _models_ready) stay unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:03:29 +02:00
Marcel	df952861c4	refactor(ocr): extract _record_training for shared metric bookkeeping The /train, /train-sender, and /segtrain endpoints each duplicated the same eight-line try/except + counter + gauge block around the asyncio.to_thread call. Lift it into _record_training(runner, kind), which accepts a sync- or async-returning callable for flexibility. Each endpoint now ends with a single return line. Behaviour preserved — status codes, error propagation, and metric labels stay identical. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:58:40 +02:00
Marcel	22a5ee816a	refactor(ocr): extract _observe_block_words for word counter sites The two block-iteration loops (/ocr and /ocr/stream's standard generator) both ran the same word-total and illegible-word increments. Lift them into a single helper so each call site becomes one line and the counter intent reads cleanly. Pure refactor — no behavior change, tests stay green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:57:18 +02:00
Marcel	0179e93a4b	test(ocr): narrow training error test to subprocess.run seam The asyncio.to_thread patch stubbed out the entire _run_training call, hiding the real error path. Replacing it with a failing CompletedProcess from subprocess.run exercises the actual ketos-failed branch and keeps the test's intent — error counter bumps, 500 surfaces — intact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:55:14 +02:00
Marcel	0fc0cbcffd	test(ocr): lock in MetricsPathFilter fail-open behavior If uvicorn's access log format ever changes (args=None, or shorter than 3 elements), the filter must keep forwarding records rather than silently dropping them. Two extra LogRecords cover both edge cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:54:24 +02:00
Marcel	549cb15845	test(ocr): cover /train-sender counter and accuracy=None gauge default Two regression tests: - /train-sender hitting the success path bumps the recognition counter (previously only /train and /segtrain were covered). - A successful run whose result.accuracy is None must not call set() on ocr_model_accuracy — the gauge stays at its default 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:48 +02:00
Marcel	74ddf16b01	feat(ocr): time only engine work in guided stream histogram Previously the guided generator's page_started timer wrapped the entire region loop including the synchronous correct_text() call, inflating ocr_processing_seconds with spell-check latency. Sum the per-region engine.extract_region_text durations instead so the histogram matches the unguided stream's "engine only" semantic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:04 +02:00
Marcel	ebaedb1af0	test(ocr): assert ocr_jobs_total stays zero when stream download fails Locks in the post-download placement of the counter increment so a regression that moves it back above _download_and_convert_pdf would fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:51:23 +02:00
Marcel	525f091b3a	feat(ocr): suppress uvicorn access logs for /metrics and /health Adds a logging.Filter on uvicorn.access that drops records whose request path is /metrics or /health. Each is hit on a tight schedule (Prometheus scrape interval and Docker healthcheck), so unfiltered they dominate the access log without carrying any information about real traffic. Refs #652 (Nora's recommendation) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:16:14 +02:00
Marcel	d6abf990c7	feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes Mirrors the existing _models_ready bool so Prometheus has a time-series liveness/readiness signal for future alerting rules (e.g. ocr_models_ready < 1 for 2m). Refs #652 (AC7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:15:11 +02:00
Marcel	77d59c5d83	test(ocr): assert ocr_model_accuracy gauge is set per kind on success Hits /train then /segtrain through the same test, each with a distinct mocked accuracy, and asserts the labelled gauges reflect the two values. Locks down the kind-label separation between recognition and segmentation accuracy (decision #2). Refs #652 (AC6) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:13:05 +02:00
Marcel	6c2b9af10b	feat(ocr): record training runs in ocr_training_runs_total per kind and outcome Wraps the await asyncio.to_thread(_run_*) calls in /train, /train-sender, and /segtrain with try/except. Recognition training (/train, /train-sender) shares kind="recognition"; /segtrain uses kind="segmentation". The ocr_model_accuracy gauge is set per kind on success. Refs #652 (AC6, decision #2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:12:26 +02:00
Marcel	2e3744d9ef	feat(ocr): observe ocr_processing_seconds around engine.to_thread calls Wraps every asyncio.to_thread(engine.extract_*) call with time.monotonic() deltas in /ocr (per document) and in both /ocr/stream generators (per page). Streaming buckets are the useful operational signal; the non-streaming observation is a bonus. Refs #652 (AC5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:09:25 +02:00
Marcel	131ed336bc	feat(ocr): count words and illegible words at the OCR call sites Walks block["words"] before apply_confidence_markers strips the list, then increments ocr_words_total by len(words) and ocr_illegible_words_total by the count below threshold. Same pattern in both /ocr and /ocr/stream so the ratio illegible/words is a faithful quality signal across endpoints. Refs #652 (AC4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:07:59 +02:00
Marcel	3fa3460dbf	feat(ocr): increment ocr_skipped_pages_total on per-page engine failure Bumps the counter in both /ocr/stream except blocks (standard and guided generators) so the existing skipped_pages local variable now also flows into Prometheus. Refs #652 (AC3b) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:06:50 +02:00
Marcel	79edb94558	feat(ocr): increment ocr_pages_total per successful page in stream Bumps the counter inside both the standard and guided /ocr/stream generators after a page yields its blocks, before the per-page json line is emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after engine selection so the counter still fires when a page later errors out. Refs #652 (AC3a) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:05:36 +02:00
Marcel	52d8dc2b20	test(ocr): assert ocr_jobs_total label is engine=surya for typewriter Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that sets engine_name from script_type now has explicit coverage for both HANDWRITING_KURRENT → kraken and TYPEWRITER → surya. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:04:20 +02:00
Marcel	696b71da5a	feat(ocr): increment ocr_jobs_total with engine and script_type labels Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise, then increment after the blocks have been extracted. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:03:37 +02:00
Marcel	f3e3545d06	feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and exposes a `build_metrics(registry)` factory. Production main.py binds against the default REGISTRY; tests construct a fresh CollectorRegistry per case and monkeypatch main.metrics, so counter values stay isolated between tests (decision #3 on issue #652, Option A). Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:02:20 +02:00
Marcel	4bb6685edb	test(ocr): assert http_* metrics appear after an /ocr request Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing http_requests_total and http_request_duration_seconds for application traffic, not just register the /metrics endpoint. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:00:33 +02:00
Marcel	18c93d4eaa	feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator Mount the instrumentator immediately after FastAPI app creation, excluding /health and /metrics from request metrics to keep http_requests_total focused on real application traffic. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 15:59:37 +02:00
Marcel	6839cf2a33	docs(ocr): clarify entrypoint comment and add manual run hint for skipped test - entrypoint.sh: replace "cross-job ground-truth leakage" with plain "Remove stale partial downloads left by a previous docker-kill" - test_tmpdir_is_inside_persistent_cache_volume: add docker exec command so future developers know how to run this deployment-contract test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:20:45 +02:00
Marcel	775b5c062e	test(ocr): add orphan cleanup behavior tests for entrypoint.sh find -mtime test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes without mocking system time. Also extracts _run_entrypoint helper to remove subprocess setup duplication. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:19:33 +02:00
Marcel	e31dac5c9c	test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir A silent non-zero exit would previously cause the test to pass incorrectly because only directory creation was checked. Exit code is now the first assertion, catching regressions before the filesystem check runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:18:14 +02:00
Marcel	c2bd1b34f0	refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI _validate_zip_entry has no ML-stack dependency; importing it via main.py pulled in surya/torch and caused the test to be skipped in CI. Moving it to utils.py (fastapi only) and adding fastapi to the CI lightweight install lets test_zipslip_still_anchors_under_custom_tmpdir run on every push. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 11:17:15 +02:00
Marcel	cfd49ff69e	docs(ocr): document TMPDIR convention and add ADR-021 - ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows to the environment variables table - ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume - docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision, trade-offs, and rejected alternatives (Approach B / C) for issue #614 - ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:58:10 +02:00
Marcel	240b373f68	fix(ocr): create TMPDIR on startup and clear day-old orphans On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir ensures the first Surya model download can proceed without ENOSPC on the 512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill mid-download, preventing cross-job ground-truth leakage. Fixes #614. See ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:54:17 +02:00
Marcel	09a043431e	build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging Without this, running the image outside compose loses the TMPDIR redirect and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC). See issue #614, ADR-021. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-18 10:53:15 +02:00
Marcel	bead6f1811	fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback os.environ.get(key, default) returns "" when the key exists but is blank — the default is only used when the key is absent. The or-fallback treats both absence and blank values as "use the default". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 18:53:26 +02:00
Marcel	fc8b4b164b	security(ocr): redirect XDG cache and Torch home away from read-only HOME Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is on the read-only container filesystem — fixes Nora's blocker. Also restores the explanatory comment on the ocr_cache volume mount. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:30:39 +02:00
Marcel	eb63df2000	test(ocr): add startup root canary tests for main.py lifespan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:29:47 +02:00
Marcel	53bd574660	test(ocr): replace vacuous startswith assertion with equality check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:26:58 +02:00
Marcel	581ba01d8d	security(ocr): log warning on startup when running as root Adds a canary log line if os.getuid() == 0. Produces an observable signal in container logs if the USER directive is ever removed from the Dockerfile, without requiring an external audit tool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:51:00 +02:00
Marcel	9db42d6cc1	fix(ocr): resolve HTRMOPO_DIR from env var, not ~ expansion With --no-create-home, os.path.expanduser("~") resolves to "/" causing kraken get to write to /.local/share/htrmopo. Replace with os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is explicit and override-friendly without a home directory. Adds two tests verifying env-var resolution and ~-free default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:49:21 +02:00
Marcel	1aca4c4a41	security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root. Creates /home/ocr and /app/cache with correct ownership so named volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME so ~ expansion and Hugging Face caching resolve under /app, not /root. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 16:46:25 +02:00
Marcel	50b18f0849	docs(legibility): fix three review blockers in DOC-7 - docs/README.md: remove duplicate infrastructure/ entry at end of folder tree - ocr-service/CLAUDE.md: add LLM reminder: prefix to ALLOWED_PDF_HOSTS SSRF warning (consistent with all other machine-readable instructions) - backend/CLAUDE.md: restore ResponseStatusException note for simple controller validation — avoids LLMs reaching for DomainException for trivial checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	86c13a230c	docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7 Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last. ### scripts/CLAUDE.md → scripts/README.md New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder. ### .devcontainer/CLAUDE.md → .devcontainer/README.md New `.devcontainer/README.md` with all configuration, usage, and limitations. `.devcontainer/CLAUDE.md` reduced to a single pointer line. ### docs/CLAUDE.md → docs/README.md New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder. ### ocr-service/CLAUDE.md Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.
### backend/CLAUDE.md - Layering Rules → pointer to docs/ARCHITECTURE.md - Error Handling → pointer to CONTRIBUTING.md + reminder - Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder - Package Structure → tagged TODO post-REFACTOR-1 - Fixed errors.ts path to frontend/src/lib/shared/errors.ts - Added ANNOTATE_ALL + BLOG_WRITE to permission list - Key Entities, Entity Code Style, Services → kept (Bucket-2) ### root CLAUDE.md - Stack, Infrastructure, Dev Container → pointers - Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders - Package Structure → tagged TODO post-REFACTOR-1 - Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2) ### frontend/CLAUDE.md - API Client Pattern, Date Handling → pointers + reminders - Key UI Components → pointer to domain READMEs - Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:41:02 +02:00
Marcel	a1b89670c0	docs(legibility): add 18 per-domain README.md files (DOC-6) Backend (9): document, person, tag, user, geschichte, notification, ocr, audit, dashboard. Frontend (8): document, person, tag, user, geschichte, notification, ocr, shared. OCR service (1): ocr-service/README.md. Each README covers: what the domain owns, explicit non-ownership, public surface (verified by grep against the codebase), internal layout, and cross-domain dependencies. Closes #400 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:36:38 +02:00
Marcel	e85057bed2	refactor(document): move document domain core to document/ package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:39:20 +02:00
Marcel	23cf88856e	fix(ocr): guard Kraken block extraction against missing boundary/baseline extract_page_blocks() walked `record.boundary` and `record.baseline` unconditionally, so a record that arrived without either (malformed kraken output, or a MagicMock in tests that iterates to nothing) crashed with "min() arg is an empty sequence". Coerce both attributes through list(), require at least 3 points for the polygon path, fall back to the baseline path when the polygon is missing, and skip the record entirely when neither is usable — emitting no block is safer than emitting one with garbage coordinates. The test helper now sets `boundary` and `baseline` explicitly to mirror real Kraken 7.0 records (and so the happy-path test exercises the polygon branch). A new regression test covers the skip path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:33:03 +02:00
Marcel	1f7b712dd0	fix(ocr): accept sender_model_path in Surya engine so non-Kurrent OCR works main.py unifies the call to both engines and always passes `sender_model_path` (None for non-Kurrent scripts). Surya's extract_region_text / extract_page_blocks accepted one fewer positional arg than Kraken's, so every guided-OCR run on a TYPEWRITER or HANDWRITING_LATIN document raised "takes 5 positional arguments but 6 were given" and the stream returned 0 blocks / 1 skipped page. Add an ignored `sender_model_path` kwarg to both Surya functions so the signatures match Kraken's, and guard the regression with two signature tests in test_engines.py that compare both engines' parameter lists. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-23 09:28:25 +02:00
Marcel	64a854aad6	refactor(ocr): mark _SenderModelRegistry.contains as private (_contains) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:26:46 +02:00
Marcel	84c09e41ef	test(ocr): add /train-sender auth tests and run sender registry tests in CI Add 503/403 auth tests for the /train-sender endpoint, matching the pattern already used for /train and /segtrain. Also surface test_sender_registry.py in CI (it needs no ML stack) and add pytest-asyncio to the install step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:14:27 +02:00
Marcel	000079fd50	refactor(ocr): rename _contains to contains in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:53:16 +02:00
Marcel	07035b9fa9	style(ocr): add Image type hints to extract_page_blocks and extract_region_text Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:22:34 +02:00
Marcel	eab37b9ac9	test(ocr): verify load failure does not cache broken entry in SenderModelRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:19:40 +02:00
Marcel	64d27d6d61	feat(ocr): per-sender model registry and /train-sender endpoint engines/kraken.py: - Add _SenderModelRegistry with LRU eviction (max configurable via OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(), and path whitelist (/app/models/ only) - Add _load_sender_model() helper for testability - extract_page_blocks() and extract_region_text() accept optional sender_model_path; route to sender registry when provided models.py: - OcrRequest gains senderModelPath: str | None = None field main.py: - /ocr and /ocr/stream pass request.senderModelPath to Kraken engine - New /train-sender endpoint: validates output_model_path, runs ketos train with base model as starting point, invalidates sender cache docker-compose.yml: - Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment test_sender_registry.py: - 4 tests: cache hit, LRU eviction, invalidate, path traversal guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:05:39 +02:00
Marcel	c5e6ed922b	test(ocr): decouple correction tests from exact library dictionary state Replace exact-string assertions in test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections with structural assertions that verify behavior (correction attempted, marker present, expected stem) without coupling to a specific pyspellchecker version's frequency weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:23:09 +02:00
Marcel	ec85f228c1	refactor(ocr): document > 50 frequency threshold rationale Strict greater-than avoids non-determinism: if multiple candidates share the minimum frequency value, pyspellchecker's ranking is undefined. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:21:37 +02:00
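
Commits 525f091b3a and 0fc0cbcffd describe a `logging.Filter` on `uvicorn.access` that drops records for `/metrics` and `/health` but keeps forwarding records when `args` is `None` or has fewer than 3 elements. The repository's implementation is not shown in this log, so the following is only a minimal sketch of that behaviour. It assumes the request path sits at index 2 of the record's `args` tuple (consistent with the "shorter than 3 elements" wording); the query-string stripping and the `make_access_record` helper are illustrative assumptions, not code from the repository.

```python
import logging


class MetricsPathFilter(logging.Filter):
    """Drop access-log records for probe paths; forward everything else."""

    # Paths named in commit 525f091b3a.
    SUPPRESSED_PATHS = frozenset({"/metrics", "/health"})

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Fail open: a record that does not look like an access-log entry
        # is kept rather than silently dropped.
        if not isinstance(args, tuple) or len(args) < 3:
            return True
        # Assumption: args[2] is the request path, possibly with a query string.
        path = str(args[2]).split("?", 1)[0]
        return path not in self.SUPPRESSED_PATHS


def make_access_record(args):
    """Hypothetical helper: build a LogRecord shaped like an access-log entry."""
    return logging.LogRecord(
        name="uvicorn.access",
        level=logging.INFO,
        pathname="example.py",
        lineno=0,
        msg='%s - "%s %s HTTP/%s" %d',
        args=args,
        exc_info=None,
    )
```

A filter like this would typically be attached with `logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter())`. Returning `True` on unexpected input keeps real traffic visible if the access-log format ever changes, which is the trade-off commit 0fc0cbcffd locks in.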

100 Commits