feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

Merged

marcel merged 27 commits from feat/issue-652-ocr-metrics into main

2026-05-21 18:16:48 +02:00

Author	SHA1	Message	Date
Marcel	0801da8df0	docs(ocr): explain why two metrics tests skip fresh_metrics fixture Sara's cycle-2 S2: clarify the latent (but not actual) cross-test state risk on the two metrics tests that hit the global REGISTRY instead of the per-test fresh_metrics fixture. Migrating them would actually break them — the /metrics endpoint is served by prometheus-fastapi-instrumentator which binds to the default REGISTRY at app-construction time, and the http_requests_total assertion only finds counters on that global registry. Both tests already assert response shape only (status code, content-type substring, body substrings), not numeric values, so the shared-registry caveat is documented for future readers rather than treated as a bug to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 17:23:32 +02:00
Marcel	e0e1578bdd	test(ocr): widen spell-check exclusion bound to 0.09s with rationale Sara's cycle-2 S1: the wall-clock assertion at < 0.05s could trip on a slow CI runner under load even when the timer correctly excludes spell-check. Sara's preferred structural fix (patch main.time.monotonic with a deterministic sequence) proved awkward — the patched attribute is the global time.monotonic which httpx and asyncio consume, exhausting the sequence before the request reaches the engine loop. Take the documented fallback: widen the bound to 0.09s and explain why. The failure mode the test guards against (spell-check inside the timer) would add 0.1s (2 × 0.05s sleep), so 0.09s catches the bug while leaving ~90ms of headroom for slow CI runners. Verified red→green by temporarily moving correct_text inside the timer block: bound trips at 0.101s; the fixed code reads ~0.001s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 17:22:49 +02:00
Marcel	2df71beb7e	docs: add ADR-023 and glossary entries for OCR metrics ADR-023 captures why prometheus-fastapi-instrumentator was chosen, the build_metrics(registry) factory pattern, and the test rebinding seam. The glossary gains four ops-aligned terms — illegible word, models-ready gauge, recognition vs segmentation accuracy — so the metrics documentation in OBSERVABILITY.md has a vocabulary to lean on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:06:44 +02:00
Marcel	2dbb3c37b4	docs(observability): document ocr metrics, scrape edge, and access-log filter - L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge (plus the previously-undrawn Prometheus -> backend edge for symmetry). - OBSERVABILITY.md gains a full ocr_* metrics table with labels, units, and the canonical example queries from issue #652. - New "Internal-only endpoints" subsection captures the unauthenticated /metrics caveat and provides the Caddy block snippet for the case where the service ever gets a host port. - Explicit note that MetricsPathFilter only quiets uvicorn stdout, and the OCR metrics must never carry PII or document content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:05:27 +02:00
Marcel	67368b4413	docs(ocr): annotate metrics binding + /metrics exposure + pin client Three small drops that pay back later: - Note that main.metrics is import-time bound and tests must monkeypatch `main.metrics`, not the registry. - Flag the /metrics endpoint as unauthenticated and cross-link the Caddy-block snippet in docs/OBSERVABILITY.md. - Pin prometheus-client to the exact 0.25.0 patch version already resolved by prometheus-fastapi-instrumentator 7.0.0, so an upstream bump cannot silently slip in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:04:28 +02:00
Marcel	ddf6cf4cbc	test(ocr): collapse shared client setup into ocr_client helper Each metrics test was repeating the same five-line block — patch kraken_engine.load_models, patch load_spell_checker, instantiate the AsyncClient, force _models_ready True, restore it. Lift the lot into a single async context manager so each test body shrinks to its real arrange / act / assert intent. Tests that drive the lifespan directly (models_ready gauge) or stub asyncio.to_thread for /train (which already patches _models_ready) stay unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:03:29 +02:00
Marcel	df952861c4	refactor(ocr): extract _record_training for shared metric bookkeeping The /train, /train-sender, and /segtrain endpoints each duplicated the same eight-line try/except + counter + gauge block around the asyncio.to_thread call. Lift it into _record_training(runner, kind), which accepts a sync- or async-returning callable for flexibility. Each endpoint now ends with a single return line. Behaviour preserved — status codes, error propagation, and metric labels stay identical. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:58:40 +02:00
Marcel	22a5ee816a	refactor(ocr): extract _observe_block_words for word counter sites The two block-iteration loops (/ocr and /ocr/stream's standard generator) both ran the same word-total and illegible-word increments. Lift them into a single helper so each call site becomes one line and the counter intent reads cleanly. Pure refactor — no behavior change, tests stay green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:57:18 +02:00
Marcel	0179e93a4b	test(ocr): narrow training error test to subprocess.run seam The asyncio.to_thread patch stubbed out the entire _run_training call, hiding the real error path. Replacing it with a failing CompletedProcess from subprocess.run exercises the actual ketos-failed branch and keeps the test's intent — error counter bumps, 500 surfaces — intact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:55:14 +02:00
Marcel	0fc0cbcffd	test(ocr): lock in MetricsPathFilter fail-open behavior If uvicorn's access log format ever changes (args=None, or shorter than 3 elements), the filter must keep forwarding records rather than silently dropping them. Two extra LogRecords cover both edge cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:54:24 +02:00
Marcel	549cb15845	test(ocr): cover /train-sender counter and accuracy=None gauge default Two regression tests: - /train-sender hitting the success path bumps the recognition counter (previously only /train and /segtrain were covered). - A successful run whose result.accuracy is None must not call set() on ocr_model_accuracy — the gauge stays at its default 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:48 +02:00
Marcel	74ddf16b01	feat(ocr): time only engine work in guided stream histogram Previously the guided generator's page_started timer wrapped the entire region loop including the synchronous correct_text() call, inflating ocr_processing_seconds with spell-check latency. Sum the per-region engine.extract_region_text durations instead so the histogram matches the unguided stream's "engine only" semantic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:04 +02:00
Marcel	ebaedb1af0	test(ocr): assert ocr_jobs_total stays zero when stream download fails Locks in the post-download placement of the counter increment so a regression that moves it back above _download_and_convert_pdf would fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:51:23 +02:00
Marcel	e75ac8ec45	ops(observability): drop TODO from ocr-service scrape job in prometheus.yml The TODO was a placeholder for this work — the OCR service now exposes /metrics so the target will flip from DOWN to UP on next image rebuild. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:16:51 +02:00
Marcel	525f091b3a	feat(ocr): suppress uvicorn access logs for /metrics and /health Adds a logging.Filter on uvicorn.access that drops records whose request path is /metrics or /health. Each is hit on a tight schedule (Prometheus scrape interval and Docker healthcheck), so unfiltered they dominate the access log without carrying any information about real traffic. Refs #652 (Nora's recommendation) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:16:14 +02:00
Marcel	d6abf990c7	feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes Mirrors the existing _models_ready bool so Prometheus has a time-series liveness/readiness signal for future alerting rules (e.g. ocr_models_ready < 1 for 2m). Refs #652 (AC7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:15:11 +02:00
Marcel	77d59c5d83	test(ocr): assert ocr_model_accuracy gauge is set per kind on success Hits /train then /segtrain through the same test, each with a distinct mocked accuracy, and asserts the labelled gauges reflect the two values. Locks down the kind-label separation between recognition and segmentation accuracy (decision #2). Refs #652 (AC6) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:13:05 +02:00
Marcel	6c2b9af10b	feat(ocr): record training runs in ocr_training_runs_total per kind and outcome Wraps the await asyncio.to_thread(_run_*) calls in /train, /train-sender, and /segtrain with try/except. Recognition training (/train, /train-sender) shares kind="recognition"; /segtrain uses kind="segmentation". The ocr_model_accuracy gauge is set per kind on success. Refs #652 (AC6, decision #2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:12:26 +02:00
Marcel	2e3744d9ef	feat(ocr): observe ocr_processing_seconds around engine.to_thread calls Wraps every asyncio.to_thread(engine.extract_*) call with time.monotonic() deltas in /ocr (per document) and in both /ocr/stream generators (per page). Streaming buckets are the useful operational signal; the non-streaming observation is a bonus. Refs #652 (AC5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:09:25 +02:00
Marcel	131ed336bc	feat(ocr): count words and illegible words at the OCR call sites Walks block["words"] before apply_confidence_markers strips the list, then increments ocr_words_total by len(words) and ocr_illegible_words_total by the count below threshold. Same pattern in both /ocr and /ocr/stream so the ratio illegible/words is a faithful quality signal across endpoints. Refs #652 (AC4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:07:59 +02:00
Marcel	3fa3460dbf	feat(ocr): increment ocr_skipped_pages_total on per-page engine failure Bumps the counter in both /ocr/stream except blocks (standard and guided generators) so the existing skipped_pages local variable now also flows into Prometheus. Refs #652 (AC3b) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:06:50 +02:00
Marcel	79edb94558	feat(ocr): increment ocr_pages_total per successful page in stream Bumps the counter inside both the standard and guided /ocr/stream generators after a page yields its blocks, before the per-page json line is emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after engine selection so the counter still fires when a page later errors out. Refs #652 (AC3a) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:05:36 +02:00
Marcel	52d8dc2b20	test(ocr): assert ocr_jobs_total label is engine=surya for typewriter Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that sets engine_name from script_type now has explicit coverage for both HANDWRITING_KURRENT → kraken and TYPEWRITER → surya. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:04:20 +02:00
Marcel	696b71da5a	feat(ocr): increment ocr_jobs_total with engine and script_type labels Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise, then increment after the blocks have been extracted. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:03:37 +02:00
Marcel	f3e3545d06	feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and exposes a `build_metrics(registry)` factory. Production main.py binds against the default REGISTRY; tests construct a fresh CollectorRegistry per case and monkeypatch main.metrics, so counter values stay isolated between tests (decision #3 on issue #652, Option A). Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:02:20 +02:00
Marcel	4bb6685edb	test(ocr): assert http_* metrics appear after an /ocr request Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing http_requests_total and http_request_duration_seconds for application traffic, not just register the /metrics endpoint. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:00:33 +02:00
Marcel	18c93d4eaa	feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator Mount the instrumentator immediately after FastAPI app creation, excluding /health and /metrics from request metrics to keep http_requests_total focused on real application traffic. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 15:59:37 +02:00
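
The access-log filter described in commits 525f091b3a and 0fc0cbcffd could look roughly like the sketch below. It assumes that uvicorn's access records carry the request path as the third element of `record.args`, which is consistent with the "shorter than 3 elements" edge case named in the commit message. The `SUPPRESSED_PATHS` constant, the `install_filter` helper, and the query-string stripping are illustrative additions, not taken from the PR.

```python
import logging

# Hypothetical constant: paths polled by Prometheus and the Docker healthcheck.
SUPPRESSED_PATHS = frozenset({"/metrics", "/health"})


class MetricsPathFilter(logging.Filter):
    """Drop access-log records for polling endpoints; forward everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Fail open: if the record does not look like a uvicorn access entry
        # (args missing, not a tuple, or too short), keep forwarding it.
        if not isinstance(args, tuple) or len(args) < 3:
            return True
        # Assumption: args[2] is the request path; ignore any query string.
        path = str(args[2]).split("?", 1)[0]
        return path not in SUPPRESSED_PATHS


def install_filter() -> None:
    # Hypothetical helper: attach the filter to uvicorn's access logger.
    logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter())
```

Returning `True` for unrecognised records is the fail-open choice the regression tests lock in: a change in uvicorn's log format then produces extra log lines rather than silently dropped ones.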
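
To illustrate the "engine only" timing semantic from commit 74ddf16b01 and the 0.09s bound from e0e1578bdd, here is a minimal sketch. The function name `extract_page` and its `extract` / `correct` parameters are hypothetical stand-ins for `engine.extract_region_text` and `correct_text`; the real generator is asynchronous and calls the engine through `asyncio.to_thread`.

```python
import time
from typing import Callable, Iterable, List, Tuple


def extract_page(
    regions: Iterable[str],
    extract: Callable[[str], str],
    correct: Callable[[str], str],
) -> Tuple[List[str], float]:
    """Return corrected texts plus the seconds spent in the engine alone."""
    engine_seconds = 0.0
    texts: List[str] = []
    for region in regions:
        started = time.monotonic()
        raw = extract(region)
        # Stop the clock before spell-check so it never reaches the histogram.
        engine_seconds += time.monotonic() - started
        texts.append(correct(raw))
    return texts, engine_seconds
```

With two regions and a spell-check stub that sleeps 0.05s per call, the wall-clock time is at least 0.1s while `engine_seconds` stays far below it. A regression that moves `correct` inside the timed span would push the measured value to about 0.1s, which is why a 0.09s bound still catches it.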
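
The word bookkeeping from commits 131ed336bc and 22a5ee816a can be sketched as a pure counting function. This is not the PR's `_observe_block_words`: it returns the two counts instead of incrementing `ocr_words_total` and `ocr_illegible_words_total`, so it runs without prometheus-client. The shape of `block["words"]` (a list of dicts with a `confidence` key) and the threshold value are assumptions.

```python
from typing import Any, Mapping, Tuple

# Hypothetical value; the real confidence threshold is not stated in the PR.
ILLEGIBLE_THRESHOLD = 0.5


def count_block_words(
    block: Mapping[str, Any],
    threshold: float = ILLEGIBLE_THRESHOLD,
) -> Tuple[int, int]:
    """Return (total words, words below the confidence threshold)."""
    # Must run before any post-processing strips block["words"].
    words = block.get("words") or []
    # Assumption: a word without a confidence value counts as legible.
    illegible = sum(1 for w in words if w.get("confidence", 1.0) < threshold)
    return len(words), illegible
```

Calling the same helper from both `/ocr` and `/ocr/stream` is what keeps the illegible-to-total ratio comparable across endpoints.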
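
The shared training bookkeeping from commits 6c2b9af10b, 549cb15845, and df952861c4 might be structured like the sketch below. `TrainingStats` is a dependency-free stand-in for the `ocr_training_runs_total` counter and the `ocr_model_accuracy` gauge, `TrainingResult` is a hypothetical result type, and the outcome label values `"success"` and `"error"` are assumptions; only the `kind` values and the accuracy=None behaviour come from the commit messages.

```python
import collections
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class TrainingStats:
    """Stand-in for the Prometheus counter and gauge (hypothetical)."""
    runs: collections.Counter = field(default_factory=collections.Counter)
    accuracy: Dict[str, float] = field(default_factory=dict)


@dataclass
class TrainingResult:
    accuracy: Optional[float]


async def record_training(
    stats: TrainingStats,
    runner: Callable[[], Any],
    kind: str,
) -> TrainingResult:
    """Run a training job and record its outcome under the given kind."""
    try:
        result = runner()
        # Accept runners that return a value directly or an awaitable.
        if inspect.isawaitable(result):
            result = await result
    except Exception:
        stats.runs[(kind, "error")] += 1
        raise  # error propagation stays unchanged for the endpoint
    stats.runs[(kind, "success")] += 1
    # Leave the gauge untouched when the run reports no accuracy.
    if result.accuracy is not None:
        stats.accuracy[kind] = result.accuracy
    return result
```

Under this shape, `/train` and `/train-sender` would both pass `kind="recognition"` and `/segtrain` would pass `kind="segmentation"`, so each endpoint body can end with a single return line.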
