familienarchiv/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md
Marcel 2df71beb7e
docs: add ADR-023 and glossary entries for OCR metrics
ADR-023 captures why prometheus-fastapi-instrumentator was chosen,
the build_metrics(registry) factory pattern, and the test rebinding
seam. The glossary gains four ops-aligned terms — illegible word,
models-ready gauge, recognition vs segmentation accuracy — so the
metrics documentation in OBSERVABILITY.md has a vocabulary to lean on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:06:44 +02:00


ADR-023: Prometheus Instrumentator and Metrics Registry Injection

Status

Accepted

Context

Until issue #652 the OCR service exposed no /metrics endpoint. The observability stack already scrapes the Spring Boot backend's actuator endpoint, but it had nothing to scrape on the Python side. Without HTTP- and domain-level metrics from ocr-service we cannot answer questions like "what is the share of words rendered as [unleserlich]" or "is the training error rate above its budget" from Grafana.

Two implementation requirements influenced the design:

  1. Counter / gauge isolation in tests. prometheus_client collectors are module-level singletons keyed by name on the global REGISTRY. Re-importing or naively re-instantiating them raises a duplicated-collector error (a ValueError reporting a duplicated timeseries), and state leaks across tests (a .inc() in test A is still readable by test B). A test harness needs a way to swap the active metrics container for a fresh per-test instance.

  2. Minimal blast radius on the request path. We did not want to hand-instrument every endpoint with FastAPI middleware. The prometheus-fastapi-instrumentator library already provides http_requests_total, http_request_duration_seconds, and the /metrics exposition route, all idiomatic Prometheus names.
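The first requirement can be reproduced in a few lines. The following is a minimal sketch, not code from ocr-service: the metric name demo_jobs is hypothetical, and a private CollectorRegistry stands in for the global REGISTRY so the snippet can be re-run safely.

```python
from prometheus_client import CollectorRegistry, Counter

# A private registry stands in for the global REGISTRY so the sketch is re-runnable.
shared = CollectorRegistry()
jobs = Counter("demo_jobs", "Jobs seen (hypothetical metric)", registry=shared)

# Problem 1: a second collector with the same name on the same registry is rejected.
try:
    Counter("demo_jobs", "Jobs seen (hypothetical metric)", registry=shared)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

# Problem 2: state persists for as long as the registry lives.
jobs.inc()  # imagine this increment happens in test A
leaked = shared.get_sample_value("demo_jobs_total")  # test B would still read it

# A fresh registry accepts the same name again and starts from zero.
fresh = CollectorRegistry()
Counter("demo_jobs", "Jobs seen (hypothetical metric)", registry=fresh)
isolated = fresh.get_sample_value("demo_jobs_total")
```

Both problems disappear once each test owns its registry, which is the property the decision below builds on.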

Decision

  • Add prometheus-fastapi-instrumentator==7.0.0 and pin its transitive dependency prometheus-client==0.25.0 explicitly in ocr-service/requirements.txt.
  • Mount the instrumentator once at module load: Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app). This adds the /metrics route and HTTP-level metrics for dashboards without changing any endpoint code; requests to /health and /metrics are themselves excluded from those metrics.
  • Define every domain metric (ocr_jobs_total, ocr_pages_total, ocr_processing_seconds, …) inside a build_metrics(registry) factory in ocr-service/metrics.py that returns a frozen OcrMetrics dataclass. Production code binds the container to the default REGISTRY once: metrics: OcrMetrics = build_metrics(REGISTRY).
  • Tests use a fresh_metrics fixture that builds a new CollectorRegistry() per test and monkeypatches main.metrics with a container bound to it. The endpoint code keeps reading metrics.<name> without knowing whether it is talking to the global registry or a per-test one.
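The factory described above could look roughly like the following sketch. The metric names come from this ADR, but the field names (jobs_total, pages_total, processing_seconds), the help strings, and the choice of Counter versus Histogram are assumptions made for illustration; the real definitions live in ocr-service/metrics.py.

```python
from dataclasses import dataclass

from prometheus_client import CollectorRegistry, Counter, Histogram


@dataclass(frozen=True)
class OcrMetrics:
    """Immutable container of collectors, all bound to one registry."""

    jobs_total: Counter            # field names are hypothetical
    pages_total: Counter
    processing_seconds: Histogram


def build_metrics(registry: CollectorRegistry) -> OcrMetrics:
    """Create every domain metric on the given registry."""
    return OcrMetrics(
        jobs_total=Counter("ocr_jobs_total", "OCR jobs started", registry=registry),
        pages_total=Counter("ocr_pages_total", "Pages processed", registry=registry),
        processing_seconds=Histogram(
            "ocr_processing_seconds", "Time spent per OCR job", registry=registry
        ),
    )


# Production code would pass prometheus_client.REGISTRY here; a private
# registry is used so this sketch can be re-run without duplicate errors.
registry = CollectorRegistry()
metrics = build_metrics(registry)

# Call sites only ever touch metrics.<name>.
metrics.jobs_total.inc()
metrics.pages_total.inc(3)
metrics.processing_seconds.observe(1.5)
```

Because the dataclass is frozen, individual collectors cannot be reassigned by accident; swapping metrics for a test means replacing the whole container.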

Consequences

Positive

  • One reusable factory captures the metric definitions; future metrics go in one place.
  • Tests run with full counter isolation. State cannot leak between tests that use the fresh_metrics fixture, because each one gets its own OcrMetrics instance bound to its own CollectorRegistry.
  • The instrumentator gives us http_* metrics for free, including a Grafana-ready histogram that pairs with the Spring Boot one.

Negative

  • One extra level of indirection: any test that asserts on metric values must remember to monkeypatch main.metrics, not the registry directly. Rebinding the registry alone is harmless but useless, because the dataclass still holds references to the collectors registered on the original registry.
  • prometheus-client is now pinned. Upgrading it requires an explicit bump and re-checking the instrumentator's compatibility range.
  • /metrics is exposed unauthenticated and relies on the Docker internal network for confidentiality. See docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy snippet that must be added if the service ever gets a host-side port mapping.
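The indirection noted in the first bullet can be illustrated with a small sketch. Everything here is a stand-in: a SimpleNamespace plays the role of the main module, handle_job is a hypothetical endpoint body, and a plain attribute assignment takes the place of pytest's monkeypatch.setattr(main, "metrics", ...).

```python
from dataclasses import dataclass
from types import SimpleNamespace

from prometheus_client import CollectorRegistry, Counter


@dataclass(frozen=True)
class OcrMetrics:
    jobs_total: Counter  # hypothetical field name


def build_metrics(registry: CollectorRegistry) -> OcrMetrics:
    return OcrMetrics(
        jobs_total=Counter("ocr_jobs_total", "OCR jobs", registry=registry)
    )


# Stand-in for the real `main` module, bound once at "module load".
main = SimpleNamespace(metrics=build_metrics(CollectorRegistry()))


def handle_job() -> None:
    # Looks up main.metrics at call time, so rebinding the attribute is visible.
    main.metrics.jobs_total.inc()


# Ineffective: a new registry on its own sees nothing, because the container
# still references collectors registered on the original registry.
unused_registry = CollectorRegistry()
handle_job()
unused_value = unused_registry.get_sample_value("ocr_jobs_total")

# Effective: rebind the container itself to one built on the test registry.
test_registry = CollectorRegistry()
main.metrics = build_metrics(test_registry)
handle_job()
test_value = test_registry.get_sample_value("ocr_jobs_total")
```

In this sketch only the second approach makes the increment observable from the test's registry, which is why the fixture patches the container and not the registry.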

Alternatives considered

  • Hand-roll the /metrics endpoint. Rejected: would have meant duplicating what prometheus-fastapi-instrumentator ships, plus middleware for the HTTP histograms.
  • Skip the factory; pass registry as a function argument everywhere. Rejected: clutters every endpoint signature and breaks the symmetry with the Spring Boot side, which also relies on a process-global Micrometer registry.
  • Use a pytest autouse fixture that resets REGISTRY between tests. Rejected: prometheus_client does not expose a clean "unregister all" hook, and we would be relying on private APIs.

References