diff --git a/docs/GLOSSARY.md b/docs/GLOSSARY.md index 21addb4c..99da1775 100644 --- a/docs/GLOSSARY.md +++ b/docs/GLOSSARY.md @@ -80,6 +80,14 @@ _See also [DocumentStatus lifecycle](#documentstatus-lifecycle)._ **Sütterlin** — A specific standardized style of Kurrent taught in German schools from 1915 to 1941. +**Illegible word** — a word whose recognition confidence falls below the configured threshold; replaced with the literal token `[unleserlich]` in the rendered block text and counted in the `ocr_illegible_words_total` Prometheus counter. + +**Models-ready gauge** — the `ocr_models_ready` Prometheus gauge, flipped from `0` to `1` once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker. Used both for the `/health` endpoint and as the supervised signal for the `ocr_models_ready < 1 for 2m` alert. + +**Recognition model accuracy** — the accuracy reported by `ketos train` for the recognition (text-line) model, exposed as `ocr_model_accuracy{kind="recognition"}`. Sourced from `_parse_best_checkpoint` on the highest-scoring checkpoint after training. + +**Segmentation model accuracy** — the accuracy reported by `ketos segtrain` for the baseline layout analysis (`blla`) model, exposed as `ocr_model_accuracy{kind="segmentation"}`. Distinct from recognition accuracy because the two models are trained and improved independently. + --- ## Other Domain Terms diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md index b895e849..2b3855d6 100644 --- a/docs/OBSERVABILITY.md +++ b/docs/OBSERVABILITY.md @@ -118,11 +118,14 @@ To find a trace for a specific request in staging/production, either increase th ## Metrics (Prometheus → Grafana) -Prometheus scrapes the backend management endpoint every 15 s: +Prometheus scrapes two targets every 15 s: ``` Target: backend:8081/actuator/prometheus Labels: job="spring-boot", application="Familienarchiv" + +Target: ocr:8000/metrics +Labels: job="ocr-service" ``` All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service. @@ -146,6 +149,70 @@ jvm_memory_used_bytes{area="heap", application="Familienarchiv"} hikaricp_connections_active ``` +### OCR-service custom metrics + +Exposed at `ocr:8000/metrics` by `prometheus-fastapi-instrumentator`. The +`http_*` metrics describe the FastAPI request layer; the `ocr_*` series are +domain-specific. **Never label these with PII or document content** — labels +have unbounded cardinality risk and are visible to anyone with Grafana access. + +| Metric | Type | Labels | Unit | What it tracks | +|---|---|---|---|---| +| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download | +| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator | +| `ocr_skipped_pages_total` | Counter | — | pages | Pages skipped because the engine raised on them | +| `ocr_words_total` | Counter | — | words | Recognized words summed across every block | +| `ocr_illegible_words_total` | Counter | — | words | Words below the confidence threshold (rendered as `[unleserlich]`) | +| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing | +| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs | +| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run | +| `ocr_models_ready` | Gauge | — | 0\|1 | 1 once the lifespan startup has finished loading models | + +Canonical example queries (the same ones referenced in issue #652): + +```promql +# OCR throughput by engine +sum by (engine) (rate(ocr_pages_total[5m])) + +# Share of words rendered as [unleserlich] +sum(rate(ocr_illegible_words_total[5m])) + / sum(rate(ocr_words_total[5m])) + +# p95 page processing time per engine +histogram_quantile(0.95, sum by (engine, le) ( + rate(ocr_processing_seconds_bucket[5m]) +)) + +# Training error rate +sum(rate(ocr_training_runs_total{outcome="error"}[1h])) + / sum(rate(ocr_training_runs_total[1h])) + +# Latest recognition vs segmentation accuracy +ocr_model_accuracy +``` + +### Internal-only endpoints + +`/metrics` is exposed by the OCR service over plain HTTP without +authentication. The container is reachable only on the internal Docker +network — Caddy never proxies to it directly. If the service is ever +exposed (e.g. a `ports:` mapping is added), block the endpoint at the +reverse proxy: + +```caddy +ocr.example.com { + @internal_only path /metrics /health + respond @internal_only 404 + reverse_proxy ocr:8000 +} +``` + +The `MetricsPathFilter` in `ocr-service/main.py` suppresses uvicorn's +**stdout** access log lines for `/metrics` and `/health` so the container +console stays focused on real OCR traffic. Promtail/Loki still receive +access lines from any other source. Treat the filter as console +noise-control, not an audit-suppression mechanism. + ## Errors (GlitchTip) GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error. diff --git a/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md b/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md new file mode 100644 index 00000000..5e8a1020 --- /dev/null +++ b/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md @@ -0,0 +1,94 @@ +# ADR-023: Prometheus Instrumentator and Metrics Registry Injection + +## Status + +Accepted + +## Context + +Until issue #652 the OCR service exposed no `/metrics` endpoint. The +observability stack already scrapes the Spring Boot backend's actuator +endpoint, but it had nothing to scrape on the Python side. Without HTTP- +and domain-level metrics from `ocr-service` we cannot answer questions +like "what is the share of words rendered as `[unleserlich]`" or +"is the training error rate above its budget" from Grafana. + +Two implementation requirements influenced the design: + +1. **Counter / gauge isolation in tests.** `prometheus_client` collectors + are module-level singletons keyed by name on the global `REGISTRY`. + Re-importing or naively re-instantiating them raises a duplicated- + collector error and cross-test state leaks (a `.inc()` in test A is + still readable by test B). A test harness needs a way to swap the + active container for a fresh per-test instance. + +2. **Minimal blast radius on the request path.** We did not want to + hand-instrument every endpoint with FastAPI middleware. The + `prometheus-fastapi-instrumentator` library already provides + `http_requests_total`, `http_request_duration_seconds`, and the + `/metrics` exposition route, all idiomatic Prometheus names. + +## Decision + +- Add `prometheus-fastapi-instrumentator==7.0.0` and pin its transitive + dependency `prometheus-client==0.25.0` explicitly in + `ocr-service/requirements.txt`. +- Mount the instrumentator once at module load: + `Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app)`. + This adds `/metrics` and an HTTP-level dashboard surface without + changing any endpoint code. +- Define every domain metric (`ocr_jobs_total`, `ocr_pages_total`, + `ocr_processing_seconds`, …) inside a `build_metrics(registry)` + factory in `ocr-service/metrics.py` that returns a frozen `OcrMetrics` + dataclass. Production code binds the container to the default + `REGISTRY` once: `metrics: OcrMetrics = build_metrics(REGISTRY)`. +- Tests use a `fresh_metrics` fixture that builds a new + `CollectorRegistry()` per test and monkeypatches `main.metrics` with + a container bound to it. The endpoint code keeps reading + `metrics.` without knowing whether it is talking to the global + registry or a per-test one. + +## Consequences + +**Positive** + +- One reusable factory captures the metric definitions; future metrics + go in one place. +- Tests run with full counter isolation. Cross-test state leakage is + impossible because each test sees its own dataclass instance. +- The instrumentator gives us `http_*` metrics for free, including a + Grafana-ready histogram that pairs with the Spring Boot one. + +**Negative** + +- One extra level of indirection: any test that asserts on metric + values must remember to monkeypatch `main.metrics`, not the registry + directly. Rebinding through the registry is harmless but useless — + the dataclass holds references to the original collectors. +- `prometheus-client` is now pinned. Upgrading it requires an explicit + bump and re-checking the instrumentator's compatibility range. +- `/metrics` is exposed unauthenticated and relies on the Docker + internal network for confidentiality. See + [docs/OBSERVABILITY.md §Internal-only endpoints](../OBSERVABILITY.md) + for the Caddy snippet that must be added if the service ever gets a + host-side port mapping. + +## Alternatives considered + +- **Hand-roll the `/metrics` endpoint.** Rejected: would have meant + duplicating what `prometheus-fastapi-instrumentator` ships, plus + middleware for the HTTP histograms. +- **Skip the factory; pass `registry` as a function argument + everywhere.** Rejected: clutters every endpoint signature and breaks + the symmetry with the Spring Boot side, which also relies on a + process-global Micrometer registry. +- **Use a `pytest` autouse fixture that resets `REGISTRY` between + tests.** Rejected: `prometheus_client` does not expose a clean + "unregister all" hook, and we would be relying on private APIs. + +## References + +- Issue: [#652](https://git.raddatz.cloud/marcel/familienarchiv/issues/652) +- Library: +- Code: `ocr-service/metrics.py`, `ocr-service/main.py`, + `ocr-service/test_metrics.py` diff --git a/docs/architecture/c4/l2-containers.puml b/docs/architecture/c4/l2-containers.puml index 346efe75..8d66a614 100644 --- a/docs/architecture/c4/l2-containers.puml +++ b/docs/architecture/c4/l2-containers.puml @@ -43,6 +43,8 @@ Rel(ocr, storage, "Fetches PDF via presigned URL", "HTTP / S3 presigned") Rel(mc, storage, "Bootstraps bucket + service account on startup", "MinIO Client CLI") Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API") Rel(backend, tempo, "Sends distributed traces via OTLP", "HTTP / OTLP / port 4318 (archiv-net)") +Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus") +Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics") Rel(grafana, prometheus, "Queries metrics", "HTTP 9090") Rel(grafana, loki, "Queries logs", "HTTP 3100") Rel(grafana, tempo, "Queries traces", "HTTP 3200") diff --git a/infra/observability/prometheus/prometheus.yml b/infra/observability/prometheus/prometheus.yml index a29cc75b..4838bc1c 100644 --- a/infra/observability/prometheus/prometheus.yml +++ b/infra/observability/prometheus/prometheus.yml @@ -20,7 +20,4 @@ scrape_configs: - job_name: ocr-service metrics_path: /metrics static_configs: - # TODO: remove or add prometheus-client to ocr-service. - # The Python OCR service does not currently expose Prometheus metrics. - # This target will show as DOWN until prometheus-client is added to ocr-service. - targets: ['ocr:8000'] diff --git a/ocr-service/main.py b/ocr-service/main.py index 409cc78f..bde3fb2f 100644 --- a/ocr-service/main.py +++ b/ocr-service/main.py @@ -2,6 +2,7 @@ import asyncio import glob +import inspect import io import json import logging @@ -10,9 +11,11 @@ import re import shutil import subprocess import tempfile +import time import zipfile from contextlib import asynccontextmanager from datetime import datetime, timezone +from typing import Awaitable, Callable from urllib.parse import urlparse import httpx @@ -20,8 +23,11 @@ import pypdfium2 as pdfium from fastapi import FastAPI, Form, Header, HTTPException, UploadFile from fastapi.responses import StreamingResponse from PIL import Image +from prometheus_client import REGISTRY +from prometheus_fastapi_instrumentator import Instrumentator from confidence import apply_confidence_markers, get_threshold +from metrics import OcrMetrics, build_metrics from spell_check import correct_text, load_spell_checker from engines import kraken as kraken_engine from engines import surya as surya_engine @@ -37,6 +43,12 @@ logger = logging.getLogger(__name__) _models_ready = False +# One-shot import-time binding to the default REGISTRY. Tests that need a +# clean counter state must monkeypatch `main.metrics` with a container built +# from a fresh CollectorRegistry — rebinding through the registry directly +# will not retarget the references stored in the OcrMetrics dataclass. +metrics: OcrMetrics = build_metrics(REGISTRY) + ALLOWED_PDF_HOSTS = set( h.strip() for h in os.getenv("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1").split(",") ) @@ -44,6 +56,42 @@ ALLOWED_PDF_HOSTS = set( _SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"} +async def _record_training( + runner: Callable[[], Awaitable[dict] | dict], + kind: str, +) -> dict: + """Run a training callable and record outcome + accuracy metrics. + + Wraps the per-endpoint try/except + outcome counter + accuracy gauge + block that used to be repeated at /train, /train-sender, and /segtrain. + The runner returns a dict with at least an `accuracy` key; if its value + is None, the gauge is left at its default. + """ + try: + result = runner() + if inspect.isawaitable(result): + result = await result + except Exception: + metrics.ocr_training_runs_total.labels(kind=kind, outcome="error").inc() + raise + metrics.ocr_training_runs_total.labels(kind=kind, outcome="success").inc() + if result.get("accuracy") is not None: + metrics.ocr_model_accuracy.labels(kind=kind).set(result["accuracy"]) + return result + + +def _observe_block_words(words: list[dict], threshold: float) -> None: + """Record per-block word counts and below-threshold word counts. + + Pre: `words` is non-empty. Caller checks for that — keeping the helper + branch-free makes the call sites read as a single line. + """ + metrics.ocr_words_total.inc(len(words)) + metrics.ocr_illegible_words_total.inc( + sum(1 for w in words if w["confidence"] < threshold) + ) + + def _validate_url(url: str) -> None: """Validate that the PDF URL points to an allowed host (SSRF protection).""" parsed = urlparse(url) @@ -63,6 +111,7 @@ async def lifespan(app: FastAPI): kraken_engine.load_models() load_spell_checker() _models_ready = True + metrics.ocr_models_ready.set(1) logger.info("Startup complete — ready to accept requests") yield @@ -72,6 +121,28 @@ async def lifespan(app: FastAPI): app = FastAPI(title="Familienarchiv OCR Service", lifespan=lifespan) +# /metrics is unauthenticated — relies on Docker-internal-network exposure +# only (CWE-200 risk if `ports:` ever maps 8000 to host). See +# docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy block snippet. +Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app) + + +class MetricsPathFilter(logging.Filter): + """Drop uvicorn.access entries for /metrics and /health to keep logs focused.""" + + _SUPPRESSED_PATHS = {"/metrics", "/health"} + + def filter(self, record: logging.LogRecord) -> bool: + # uvicorn.access formats as: '%s - "%s %s HTTP/%s" %d' + if record.args and len(record.args) >= 3: + path = record.args[2] + if isinstance(path, str) and path in self._SUPPRESSED_PATHS: + return False + return True + + +logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter()) + @app.get("/health") def health(): @@ -99,7 +170,9 @@ async def run_ocr(request: OcrRequest): del img script_type = request.scriptType.upper() + engine_name = "kraken" if script_type == "HANDWRITING_KURRENT" else "surya" + extract_started = time.monotonic() if script_type == "HANDWRITING_KURRENT": if not kraken_engine.is_available(): raise HTTPException( @@ -111,11 +184,18 @@ async def run_ocr(request: OcrRequest): else: # TYPEWRITER, HANDWRITING_LATIN, UNKNOWN — all use Surya blocks = await asyncio.to_thread(surya_engine.extract_blocks, images, request.language) + metrics.ocr_processing_seconds.labels(engine=engine_name).observe( + time.monotonic() - extract_started + ) + + metrics.ocr_jobs_total.labels(engine=engine_name, script_type=script_type).inc() threshold = get_threshold(script_type) for block in blocks: - if block.get("words"): - block["text"] = apply_confidence_markers(block["words"], threshold) + words = block.get("words") or [] + if words: + _observe_block_words(words, threshold) + block["text"] = apply_confidence_markers(words, threshold) block.pop("words", None) if script_type in _SPELL_CHECK_SCRIPT_TYPES: block["text"] = correct_text(block["text"]) @@ -146,6 +226,9 @@ async def run_ocr_stream(request: OcrRequest): ) engine = kraken_engine if use_kraken else surya_engine + engine_name = "kraken" if use_kraken else "surya" + + metrics.ocr_jobs_total.labels(engine=engine_name, script_type=script_type).inc() if request.regions: # Guided mode: recognize only the user-drawn annotation regions @@ -176,12 +259,15 @@ async def run_ocr_stream(request: OcrRequest): image = await asyncio.to_thread(preprocess_page, image) blocks = [] sender_path = request.senderModelPath if use_kraken else None + engine_seconds = 0.0 for region in page_regions: + region_started = time.monotonic() text = await asyncio.to_thread( engine.extract_region_text, image, region.x, region.y, region.width, region.height, sender_path, ) + engine_seconds += time.monotonic() - region_started if script_type in _SPELL_CHECK_SCRIPT_TYPES: text = correct_text(text) blocks.append({ @@ -195,7 +281,11 @@ async def run_ocr_stream(request: OcrRequest): "annotationId": region.annotationId, }) + metrics.ocr_processing_seconds.labels(engine=engine_name).observe( + engine_seconds + ) total_blocks += len(blocks) + metrics.ocr_pages_total.labels(engine=engine_name).inc() yield json.dumps({ "type": "page", "pageNumber": page_idx, @@ -205,6 +295,7 @@ async def run_ocr_stream(request: OcrRequest): except Exception: logger.exception("Guided OCR failed on page %d", page_idx) skipped_pages += 1 + metrics.ocr_skipped_pages_total.inc() yield json.dumps({ "type": "error", "pageNumber": page_idx, @@ -238,18 +329,25 @@ async def run_ocr_stream(request: OcrRequest): yield json.dumps({"type": "preprocessing", "pageNumber": page_idx}) + "\n" image = await asyncio.to_thread(preprocess_page, image) sender_path = request.senderModelPath if use_kraken else None + page_started = time.monotonic() blocks = await asyncio.to_thread( engine.extract_page_blocks, image, page_idx, request.language, sender_path ) + metrics.ocr_processing_seconds.labels(engine=engine_name).observe( + time.monotonic() - page_started + ) for block in blocks: - if block.get("words"): - block["text"] = apply_confidence_markers(block["words"], threshold) + words = block.get("words") or [] + if words: + _observe_block_words(words, threshold) + block["text"] = apply_confidence_markers(words, threshold) block.pop("words", None) if script_type in _SPELL_CHECK_SCRIPT_TYPES: block["text"] = correct_text(block["text"]) total_blocks += len(blocks) + metrics.ocr_pages_total.labels(engine=engine_name).inc() yield json.dumps({ "type": "page", "pageNumber": page_idx, @@ -259,6 +357,7 @@ async def run_ocr_stream(request: OcrRequest): except Exception: logger.exception("OCR failed on page %d", page_idx) skipped_pages += 1 + metrics.ocr_skipped_pages_total.inc() yield json.dumps({ "type": "error", "pageNumber": page_idx, @@ -438,8 +537,7 @@ async def train_model( return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs} - result = await asyncio.to_thread(_run_training) - return result + return await _record_training(lambda: asyncio.to_thread(_run_training), kind="recognition") @app.post("/train-sender") @@ -518,8 +616,9 @@ async def train_sender_model( return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs} - result = await asyncio.to_thread(_run_sender_training) - return result + return await _record_training( + lambda: asyncio.to_thread(_run_sender_training), kind="recognition" + ) @app.post("/segtrain") @@ -628,8 +727,7 @@ async def segtrain_model( return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs} - result = await asyncio.to_thread(_run_segtrain) - return result + return await _record_training(lambda: asyncio.to_thread(_run_segtrain), kind="segmentation") async def _download_and_convert_pdf(url: str) -> list[Image.Image]: diff --git a/ocr-service/metrics.py b/ocr-service/metrics.py new file mode 100644 index 00000000..66efb9f2 --- /dev/null +++ b/ocr-service/metrics.py @@ -0,0 +1,92 @@ +"""Prometheus metric definitions for the OCR service. + +`build_metrics(registry)` returns a fresh `OcrMetrics` instance bound to the +given `CollectorRegistry`. Production code calls it once at module load with +the default `REGISTRY`; tests pass a per-test `CollectorRegistry()` to keep +counter values isolated between cases (decision #3 on issue #652). +""" + +from __future__ import annotations + +from dataclasses import dataclass + +from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram + + +@dataclass(frozen=True) +class OcrMetrics: + """Container for every custom OCR metric. + + Counters and gauges are immutable references to `prometheus_client` + instances. Mutating them (`.inc()`, `.observe()`, `.set()`) is safe; + rebinding the field on the dataclass is not — use `build_metrics` to get + a new container. + """ + + ocr_jobs_total: Counter + ocr_pages_total: Counter + ocr_skipped_pages_total: Counter + ocr_words_total: Counter + ocr_illegible_words_total: Counter + ocr_processing_seconds: Histogram + ocr_training_runs_total: Counter + ocr_model_accuracy: Gauge + ocr_models_ready: Gauge + + +def build_metrics(registry: CollectorRegistry) -> OcrMetrics: + """Create one OcrMetrics instance bound to `registry`.""" + return OcrMetrics( + ocr_jobs_total=Counter( + "ocr_jobs_total", + "Number of OCR jobs processed, labelled by engine and script type.", + ["engine", "script_type"], + registry=registry, + ), + ocr_pages_total=Counter( + "ocr_pages_total", + "Number of pages successfully OCR'd, labelled by engine.", + ["engine"], + registry=registry, + ), + ocr_skipped_pages_total=Counter( + "ocr_skipped_pages_total", + "Number of pages skipped because the OCR engine raised.", + registry=registry, + ), + ocr_words_total=Counter( + "ocr_words_total", + "Number of words recognized across all OCR blocks.", + registry=registry, + ), + ocr_illegible_words_total=Counter( + "ocr_illegible_words_total", + "Number of words below the confidence threshold " + "(replaced with [unleserlich]).", + registry=registry, + ), + ocr_processing_seconds=Histogram( + "ocr_processing_seconds", + "OCR processing time per page (streaming) or per document (non-streaming).", + ["engine"], + registry=registry, + ), + ocr_training_runs_total=Counter( + "ocr_training_runs_total", + "Number of training runs, labelled by kind (recognition|segmentation) " + "and outcome (success|error).", + ["kind", "outcome"], + registry=registry, + ), + ocr_model_accuracy=Gauge( + "ocr_model_accuracy", + "Latest model accuracy reported by a successful training run.", + ["kind"], + registry=registry, + ), + ocr_models_ready=Gauge( + "ocr_models_ready", + "1 once the lifespan startup has finished loading models, 0 before.", + registry=registry, + ), + ) diff --git a/ocr-service/requirements.txt b/ocr-service/requirements.txt index befaafcb..39c492cc 100644 --- a/ocr-service/requirements.txt +++ b/ocr-service/requirements.txt @@ -10,3 +10,5 @@ pyvips>=2.2.0 httpx==0.28.1 pyspellchecker==0.9.0 opencv-python-headless==4.11.0.86 +prometheus-fastapi-instrumentator==7.0.0 +prometheus-client==0.25.0 diff --git a/ocr-service/test_metrics.py b/ocr-service/test_metrics.py new file mode 100644 index 00000000..42afd275 --- /dev/null +++ b/ocr-service/test_metrics.py @@ -0,0 +1,638 @@ +"""Tests for Prometheus metrics exposed by the OCR service. + +Each test that asserts on a counter/gauge value uses a fresh CollectorRegistry +(see decision #3 on issue #652) to keep the metrics isolated between tests. +""" + +import contextlib +import io +import zipfile +from unittest.mock import AsyncMock, patch + +import pytest +from httpx import ASGITransport, AsyncClient +from PIL import Image +from prometheus_client import CollectorRegistry + +from main import app +from metrics import build_metrics + + +@contextlib.asynccontextmanager +async def ocr_client(*, raise_app_exceptions: bool = True): + """Yield an AsyncClient with model-loaders patched and _models_ready forced on. + + The shared setup for almost every metrics test: stub the heavy lifecycle + hooks (kraken_engine.load_models, load_spell_checker), flip the readiness + flag so request handlers do not 503, and restore it afterwards. + """ + with patch("main.kraken_engine.load_models"), \ + patch("main.load_spell_checker"): + transport = ASGITransport(app=app, raise_app_exceptions=raise_app_exceptions) + async with AsyncClient(transport=transport, base_url="http://test") as client: + import main as main_module + main_module._models_ready = True + try: + yield client + finally: + main_module._models_ready = False + + +def _minimal_zip() -> bytes: + """Return a ZIP containing one fake .xml so endpoint validation passes.""" + buf = io.BytesIO() + with zipfile.ZipFile(buf, "w") as zf: + zf.writestr("page_01.xml", "") + return buf.getvalue() + + +def _fake_training_result(accuracy: float = 0.91) -> dict: + return {"loss": None, "accuracy": accuracy, "cer": round(1 - accuracy, 4), "epochs": 5} + + +@pytest.fixture +def fresh_metrics(monkeypatch): + """Replace the module-level `main.metrics` with one bound to a fresh registry.""" + registry = CollectorRegistry() + test_metrics = build_metrics(registry) + monkeypatch.setattr("main.metrics", test_metrics) + return test_metrics + + +@pytest.mark.asyncio +async def test_metrics_endpoint_returns_200(): + """`GET /metrics` returns 200 with Prometheus exposition content. + + Uses the global REGISTRY by design — does NOT take the `fresh_metrics` fixture. + The `/metrics` endpoint is wired by `prometheus-fastapi-instrumentator`, which + binds to the default REGISTRY at app-construction time; swapping `main.metrics` + via the fixture would not redirect what `/metrics` exposes. This test only + asserts response shape (status code + content-type substring), not numeric + counter values, so cross-test state leakage cannot affect it. + """ + with patch("main.kraken_engine.load_models"), \ + patch("main.load_spell_checker"): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.get("/metrics") + + assert response.status_code == 200 + assert "text/plain" in response.headers.get("content-type", "") + + +@pytest.mark.asyncio +async def test_metrics_includes_http_request_metrics_after_ocr_call(): + """After a request to /ocr, `/metrics` exposes auto-instrumented http_* metrics. + + Uses the global REGISTRY by design — does NOT take the `fresh_metrics` fixture. + The `http_requests_total` / `http_request_duration_seconds` metrics live on + the instrumentator's default REGISTRY (not on `main.metrics`), so a fresh + CollectorRegistry would never see them. This test only asserts response shape + (substring presence in the exposition body), not numeric counter values, so + cross-test state leakage cannot affect it. + """ + mock_images = [Image.new("RGB", (100, 100))] + mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "hi", "words": []}] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_blocks", return_value=mock_blocks): + async with ocr_client() as client: + ocr_response = await client.post("/ocr", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + }) + assert ocr_response.status_code == 200, ocr_response.text + + metrics_response = await client.get("/metrics") + + body = metrics_response.text + assert "http_requests_total" in body + assert "http_request_duration_seconds" in body + + +def test_build_metrics_registers_all_custom_metrics_on_given_registry(): + """`build_metrics` returns an OcrMetrics bound to the supplied registry.""" + registry = CollectorRegistry() + metrics = build_metrics(registry) + + metric_names = {m.name for m in registry.collect()} + expected = { + "ocr_jobs", + "ocr_pages", + "ocr_skipped_pages", + "ocr_words", + "ocr_illegible_words", + "ocr_processing_seconds", + "ocr_training_runs", + "ocr_model_accuracy", + "ocr_models_ready", + } + assert expected <= metric_names, f"missing: {expected - metric_names}" + + # A second registry yields a separate container — no shared state. + other_metrics = build_metrics(CollectorRegistry()) + assert metrics is not other_metrics + + +async def _drive_ocr(client: AsyncClient, *, script_type: str) -> None: + """Helper — fires /ocr with a single mocked page and asserts a 200.""" + response = await client.post("/ocr", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": script_type, + "language": "de", + }) + assert response.status_code == 200, response.text + + +@pytest.mark.asyncio +async def test_ocr_jobs_total_incremented_with_kraken_engine_label_for_kurrent(fresh_metrics): + """A /ocr call with HANDWRITING_KURRENT increments engine=kraken.""" + mock_images = [Image.new("RGB", (100, 100))] + mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "hi", "words": []}] + + with patch("main.correct_text", side_effect=lambda t: t), \ + patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.kraken_engine.is_available", return_value=True), \ + patch("main.kraken_engine.extract_blocks", return_value=mock_blocks): + async with ocr_client() as client: + await _drive_ocr(client, script_type="HANDWRITING_KURRENT") + + value = fresh_metrics.ocr_jobs_total.labels( + engine="kraken", script_type="HANDWRITING_KURRENT" + )._value.get() + assert value == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_jobs_total_incremented_with_surya_engine_label_for_typewriter(fresh_metrics): + """A /ocr call with TYPEWRITER increments engine=surya.""" + mock_images = [Image.new("RGB", (100, 100))] + mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "hi", "words": []}] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_blocks", return_value=mock_blocks): + async with ocr_client() as client: + await _drive_ocr(client, script_type="TYPEWRITER") + + value = fresh_metrics.ocr_jobs_total.labels( + engine="surya", script_type="TYPEWRITER" + )._value.get() + assert value == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_pages_total_incremented_once_per_page_in_stream(fresh_metrics): + """The /ocr/stream generator increments ocr_pages_total per successful page.""" + mock_images = [Image.new("RGB", (100, 100)) for _ in range(3)] + mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "hi", "words": []}] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_page_blocks", return_value=mock_blocks): + async with ocr_client() as client: + async with client.stream("POST", "/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + }) as response: + assert response.status_code == 200 + # Drain the stream so all per-page increments fire. + async for _ in response.aiter_lines(): + pass + + value = fresh_metrics.ocr_pages_total.labels(engine="surya")._value.get() + assert value == 3.0 + + +@pytest.mark.asyncio +async def test_ocr_skipped_pages_total_incremented_when_engine_raises_for_a_page(fresh_metrics): + """When the engine raises on a page, ocr_skipped_pages_total bumps and the stream finishes.""" + mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)] + good_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "ok", "words": []}] + + call_count = {"n": 0} + + def extract_side_effect(*args, **kwargs): + call_count["n"] += 1 + if call_count["n"] == 1: + raise RuntimeError("synthetic engine failure") + return good_blocks + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_page_blocks", side_effect=extract_side_effect): + async with ocr_client() as client: + async with client.stream("POST", "/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + }) as response: + assert response.status_code == 200 + saw_error = False + async for line in response.aiter_lines(): + if line and '"type": "error"' in line: + saw_error = True + assert saw_error + + assert fresh_metrics.ocr_skipped_pages_total._value.get() == 1.0 + # The second page still succeeds. + assert fresh_metrics.ocr_pages_total.labels(engine="surya")._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_words_and_illegible_words_total_sum_across_blocks(fresh_metrics): + """Counters reflect totals summed over every block in the request. + + Threshold defaults to THRESHOLD_DEFAULT (0.3) for non-Kurrent scripts. Two + blocks: 3 words above + 2 words below threshold across blocks. + """ + mock_images = [Image.new("RGB", (100, 100))] + mock_blocks = [ + {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "ignored", + "words": [{"text": "Lieber", "confidence": 0.9}, + {"text": "Freund", "confidence": 0.1}]}, + {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "ignored", + "words": [{"text": "Gruss", "confidence": 0.8}, + {"text": "verschmiert", "confidence": 0.05}, + {"text": "Karl", "confidence": 0.95}]}, + ] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_blocks", return_value=mock_blocks): + async with ocr_client() as client: + await _drive_ocr(client, script_type="TYPEWRITER") + + assert fresh_metrics.ocr_words_total._value.get() == 5.0 + assert fresh_metrics.ocr_illegible_words_total._value.get() == 2.0 + + +def _histogram_count_sum(histogram, **labels) -> tuple[float, float]: + """Read the per-label-set _count and _sum from a prometheus_client Histogram.""" + child = histogram.labels(**labels) + return child._sum.get(), sum(b.get() for b in child._buckets) + + +@pytest.mark.asyncio +async def test_ocr_processing_seconds_histogram_observed_per_page_in_stream(fresh_metrics): + """The streaming generator observes ocr_processing_seconds once per page.""" + mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)] + mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, + "polygon": None, "text": "ok", "words": []}] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_page_blocks", return_value=mock_blocks): + async with ocr_client() as client: + async with client.stream("POST", "/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + }) as response: + assert response.status_code == 200 + async for _ in response.aiter_lines(): + pass + + sum_seconds, count = _histogram_count_sum( + fresh_metrics.ocr_processing_seconds, engine="surya" + ) + assert count == 2.0 + assert sum_seconds >= 0.0 + + +@pytest.mark.asyncio +async def test_ocr_training_runs_total_incremented_with_recognition_success_label(fresh_metrics): + """/train success increments ocr_training_runs_total{kind=recognition, outcome=success}.""" + async def fake_to_thread(func, *args, **kwargs): + return _fake_training_result() + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True), \ + patch("main.asyncio.to_thread", side_effect=fake_to_thread): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.post( + "/train", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + + assert response.status_code == 200 + assert fresh_metrics.ocr_training_runs_total.labels( + kind="recognition", outcome="success" + )._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_training_runs_total_incremented_with_recognition_error_label(fresh_metrics): + """When ketos exits non-zero, the error counter bumps and the exception propagates. + + Uses the narrowest available seam — `subprocess.run` returning a failing + CompletedProcess — instead of stubbing the asyncio.to_thread boundary, + so the test exercises the real _run_training error path. + """ + from subprocess import CompletedProcess + + failing_proc = CompletedProcess( + args=["ketos"], returncode=1, stdout="", stderr="synthetic ketos failure" + ) + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True), \ + patch("main.subprocess.run", return_value=failing_proc): + transport = ASGITransport(app=app, raise_app_exceptions=False) + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.post( + "/train", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + + assert response.status_code == 500 + assert fresh_metrics.ocr_training_runs_total.labels( + kind="recognition", outcome="error" + )._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_training_runs_total_incremented_with_segmentation_success_label(fresh_metrics): + """/segtrain success increments ocr_training_runs_total{kind=segmentation, outcome=success}.""" + async def fake_to_thread(func, *args, **kwargs): + return _fake_training_result(accuracy=0.83) + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True), \ + patch("main.asyncio.to_thread", side_effect=fake_to_thread): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.post( + "/segtrain", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + + assert response.status_code == 200 + assert fresh_metrics.ocr_training_runs_total.labels( + kind="segmentation", outcome="success" + )._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender(fresh_metrics): + """/train-sender success increments ocr_training_runs_total{kind=recognition, outcome=success}.""" + async def fake_to_thread(func, *args, **kwargs): + return _fake_training_result() + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True), \ + patch("main.asyncio.to_thread", side_effect=fake_to_thread): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.post( + "/train-sender", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + data={"output_model_path": "/app/models/sender_test.mlmodel"}, + headers={"X-Training-Token": "secret-token"}, + ) + + assert response.status_code == 200, response.text + assert fresh_metrics.ocr_training_runs_total.labels( + kind="recognition", outcome="success" + )._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy(fresh_metrics): + """When the runner returns accuracy=None, ocr_model_accuracy must remain at its default 0.""" + async def fake_to_thread(func, *args, **kwargs): + return {"loss": None, "accuracy": None, "cer": None, "epochs": 5} + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True), \ + patch("main.asyncio.to_thread", side_effect=fake_to_thread): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.post( + "/train", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + + assert response.status_code == 200 + # Gauge was never .set() — accessing the label child still creates it with default 0.0. + assert fresh_metrics.ocr_model_accuracy.labels( + kind="recognition" + )._value.get() == 0.0 + + +@pytest.mark.asyncio +async def test_ocr_model_accuracy_gauge_set_per_kind_after_successful_training(fresh_metrics): + """After /train and /segtrain succeed, ocr_model_accuracy{kind=...} reflects the result.""" + recognition_accuracy = 0.917 + segmentation_accuracy = 0.834 + + async def fake_recognition_to_thread(func, *args, **kwargs): + return _fake_training_result(accuracy=recognition_accuracy) + + async def fake_segmentation_to_thread(func, *args, **kwargs): + return _fake_training_result(accuracy=segmentation_accuracy) + + with patch("main.TRAINING_TOKEN", "secret-token"), \ + patch("main._models_ready", True): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + with patch("main.asyncio.to_thread", side_effect=fake_recognition_to_thread): + rec_resp = await client.post( + "/train", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + assert rec_resp.status_code == 200 + with patch("main.asyncio.to_thread", side_effect=fake_segmentation_to_thread): + seg_resp = await client.post( + "/segtrain", + files={"file": ("training.zip", _minimal_zip(), "application/zip")}, + headers={"X-Training-Token": "secret-token"}, + ) + assert seg_resp.status_code == 200 + + assert fresh_metrics.ocr_model_accuracy.labels(kind="recognition")._value.get() == pytest.approx(recognition_accuracy) + assert fresh_metrics.ocr_model_accuracy.labels(kind="segmentation")._value.get() == pytest.approx(segmentation_accuracy) + + +def test_ocr_models_ready_gauge_defaults_to_zero(): + """A freshly-built OcrMetrics has ocr_models_ready=0 before lifespan runs.""" + metrics = build_metrics(CollectorRegistry()) + assert metrics.ocr_models_ready._value.get() == 0.0 + + +@pytest.mark.asyncio +async def test_ocr_models_ready_gauge_is_one_after_lifespan_startup(fresh_metrics): + """The lifespan flips ocr_models_ready to 1 once load_models / load_spell_checker return. + + ASGITransport does not run lifespan by default, so the lifespan context + manager is driven directly to exercise the startup code path. + """ + assert fresh_metrics.ocr_models_ready._value.get() == 0.0 + with patch("main.kraken_engine.load_models"), \ + patch("main.load_spell_checker"): + async with app.router.lifespan_context(app): + assert fresh_metrics.ocr_models_ready._value.get() == 1.0 + + +@pytest.mark.asyncio +async def test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream(fresh_metrics): + """The guided streaming generator observes ocr_processing_seconds once per page.""" + mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)] + regions = [ + {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a1"}, + {"pageNumber": 2, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, "annotationId": "a2"}, + ] + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.surya_engine.extract_region_text", return_value="text"): + async with ocr_client() as client: + async with client.stream("POST", "/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + "regions": regions, + }) as response: + assert response.status_code == 200 + async for _ in response.aiter_lines(): + pass + + sum_seconds, count = _histogram_count_sum( + fresh_metrics.ocr_processing_seconds, engine="surya" + ) + assert count == 2.0 + assert sum_seconds >= 0.0 + + +@pytest.mark.asyncio +async def test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream(fresh_metrics): + """The guided observation must time engine work only, not the spell-check pass. + + Wall-clock bound rather than a structural `patch("main.time.monotonic")`: + the patched attribute is the *global* `time.monotonic`, which httpx and + asyncio also consume — they exhaust the deterministic sequence before the + request reaches the engine loop. Bound is sized against the failure mode, + not the noise floor: spell-check sleeps 0.05s × 2 regions = 0.1s, so a + timer that accidentally wrapped `correct_text` would observe >= 0.1s. The + 0.09s ceiling catches that bug while leaving ~90ms of slack for slow CI + runners (engine work is instantaneous under the mock). + """ + mock_images = [Image.new("RGB", (100, 100))] + regions = [ + {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a1"}, + {"pageNumber": 1, "x": 0.5, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a2"}, + ] + + def slow_correct(text): + import time as _time + _time.sleep(0.05) + return text + + with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \ + patch("main.preprocess_page", side_effect=lambda img: img), \ + patch("main.kraken_engine.is_available", return_value=True), \ + patch("main.kraken_engine.extract_region_text", return_value="text"), \ + patch("main.correct_text", side_effect=slow_correct): + async with ocr_client() as client: + async with client.stream("POST", "/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "HANDWRITING_KURRENT", + "language": "de", + "regions": regions, + }) as response: + assert response.status_code == 200 + async for _ in response.aiter_lines(): + pass + + sum_seconds, _ = _histogram_count_sum( + fresh_metrics.ocr_processing_seconds, engine="kraken" + ) + assert sum_seconds < 0.09, f"timing must exclude spell-check; got sum={sum_seconds}" + + +@pytest.mark.asyncio +async def test_ocr_jobs_total_not_incremented_when_pdf_download_fails_in_stream(fresh_metrics): + """If `_download_and_convert_pdf` raises, ocr_jobs_total is NOT incremented. + + Mirrors the /ocr endpoint's semantics: the counter only records jobs that + actually started OCR work, not failed downloads. + """ + async def fail_download(url): + raise RuntimeError("synthetic download failure") + + with patch("main._download_and_convert_pdf", new=fail_download): + async with ocr_client(raise_app_exceptions=False) as client: + response = await client.post("/ocr/stream", json={ + "pdfUrl": "http://minio/doc.pdf", + "scriptType": "TYPEWRITER", + "language": "de", + }) + + assert response.status_code == 500 + assert fresh_metrics.ocr_jobs_total.labels( + engine="surya", script_type="TYPEWRITER" + )._value.get() == 0.0 + + +def test_uvicorn_access_log_filter_fails_open_on_short_or_missing_args(): + """The filter must default-allow records when args is None or shorter than expected. + + Locks in fail-open behavior: if uvicorn ever changes its format we keep + forwarding records to the handler rather than silently dropping logs. + """ + import logging as _logging + from main import MetricsPathFilter + + filt = MetricsPathFilter() + none_record = _logging.LogRecord( + name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0, + msg="some message", args=None, exc_info=None, + ) + short_record = _logging.LogRecord( + name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0, + msg="%s %s", args=("a", "b"), exc_info=None, + ) + + assert filt.filter(none_record) is True + assert filt.filter(short_record) is True + + +def test_uvicorn_access_log_filter_skips_metrics_path(): + """The MetricsPathFilter drops uvicorn.access log records that target /metrics.""" + import logging as _logging + from main import MetricsPathFilter + + filt = MetricsPathFilter() + metrics_record = _logging.LogRecord( + name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0, + msg='%s - "%s %s HTTP/%s" %d', + args=("127.0.0.1:1234", "GET", "/metrics", "1.1", 200), + exc_info=None, + ) + health_record = _logging.LogRecord( + name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0, + msg='%s - "%s %s HTTP/%s" %d', + args=("127.0.0.1:1234", "GET", "/health", "1.1", 200), + exc_info=None, + ) + ocr_record = _logging.LogRecord( + name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0, + msg='%s - "%s %s HTTP/%s" %d', + args=("127.0.0.1:1234", "POST", "/ocr", "1.1", 200), + exc_info=None, + ) + + assert filt.filter(metrics_record) is False + assert filt.filter(health_record) is False + assert filt.filter(ocr_record) is True