ADR-023: Prometheus Instrumentator and Metrics Registry Injection
Status
Accepted
Context
Until issue #652 the OCR service exposed no /metrics endpoint. The
observability stack already scrapes the Spring Boot backend's actuator
endpoint, but it had nothing to scrape on the Python side. Without HTTP-
and domain-level metrics from ocr-service we cannot answer questions
like "what is the share of words rendered as [unleserlich]" or
"is the training error rate above its budget" from Grafana.
Two implementation requirements influenced the design:
- Counter / gauge isolation in tests. prometheus_client collectors are module-level singletons keyed by name on the global REGISTRY. Re-importing or naively re-instantiating them raises a duplicated-collector error, and state leaks across tests (a .inc() in test A is still readable by test B). A test harness needs a way to swap the active container for a fresh per-test instance.
- Minimal blast radius on the request path. We did not want to hand-instrument every endpoint with FastAPI middleware. The prometheus-fastapi-instrumentator library already provides http_requests_total, http_request_duration_seconds, and the /metrics exposition route, all idiomatic Prometheus names.
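The first requirement can be reproduced in a few lines. This is a minimal sketch, not code from ocr-service: a private CollectorRegistry stands in for the global REGISTRY so the demo leaves real metrics alone, and the help string and the two function names are made up for illustration.

```python
from prometheus_client import CollectorRegistry, Counter

# Stand-in for the process-global REGISTRY (assumption: behaviour is the same,
# the default registry is simply a module-level CollectorRegistry instance).
shared = CollectorRegistry()

jobs = Counter("ocr_jobs_total", "OCR jobs processed (illustrative help text)", registry=shared)

# Problem 1: defining the same metric name again on the same registry fails.
try:
    Counter("ocr_jobs_total", "OCR jobs processed (illustrative help text)", registry=shared)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)


# Problem 2: state written by one test is still visible to the next one.
def first_case():
    jobs.inc()


def second_case():
    # Reads the counter that first_case() already incremented.
    return shared.get_sample_value("ocr_jobs_total")


first_case()
leaked_value = second_case()
```

Both effects come from the collectors living on a registry that outlasts a single test, which is what the per-test container is meant to avoid.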
Decision
- Add prometheus-fastapi-instrumentator==7.0.0 and pin its transitive dependency prometheus-client==0.25.0 explicitly in ocr-service/requirements.txt.
- Mount the instrumentator once at module load: Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app). This adds /metrics and an HTTP-level dashboard surface without changing any endpoint code.
- Define every domain metric (ocr_jobs_total, ocr_pages_total, ocr_processing_seconds, …) inside a build_metrics(registry) factory in ocr-service/metrics.py that returns a frozen OcrMetrics dataclass. Production code binds the container to the default REGISTRY once: metrics: OcrMetrics = build_metrics(REGISTRY).
- Tests use a fresh_metrics fixture that builds a new CollectorRegistry() per test and monkeypatches main.metrics with a container bound to it. The endpoint code keeps reading metrics.<name> without knowing whether it is talking to the global registry or a per-test one.
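The factory pattern described above might look roughly like the following. This is a sketch under stated assumptions: only the names build_metrics, OcrMetrics, and the three metric names come from this ADR; the field names, help strings, metric types, and the absence of labels are guesses made for illustration.

```python
from dataclasses import dataclass

from prometheus_client import CollectorRegistry, Counter, Histogram


@dataclass(frozen=True)
class OcrMetrics:
    """Immutable container of collectors (field names are hypothetical)."""

    jobs_total: Counter
    pages_total: Counter
    processing_seconds: Histogram


def build_metrics(registry: CollectorRegistry) -> OcrMetrics:
    # Every collector is registered on the registry passed in, never on an
    # implicit default, so a fresh registry always yields fresh state.
    return OcrMetrics(
        jobs_total=Counter("ocr_jobs_total", "OCR jobs processed", registry=registry),
        pages_total=Counter("ocr_pages_total", "Pages processed", registry=registry),
        processing_seconds=Histogram(
            "ocr_processing_seconds", "Time spent per OCR job", registry=registry
        ),
    )


# Two independent containers, as two tests would each receive from a fixture.
reg_a, reg_b = CollectorRegistry(), CollectorRegistry()
metrics_a, metrics_b = build_metrics(reg_a), build_metrics(reg_b)

# Activity recorded through one container does not show up in the other.
metrics_a.jobs_total.inc()
metrics_a.processing_seconds.observe(0.25)
```

Because the same metric names can be registered once per registry, calling the factory repeatedly with new registries never trips the duplicated-collector check.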
Consequences
Positive
- One reusable factory captures the metric definitions; future metrics go in one place.
- Tests that use fresh_metrics run with full counter isolation. State cannot leak between them because each test gets its own CollectorRegistry and its own collectors, not just its own dataclass instance.
- The instrumentator gives us http_* metrics for free, including a Grafana-ready histogram that pairs with the Spring Boot one.
Negative
- One extra level of indirection: any test that asserts on metric values must remember to monkeypatch main.metrics, not the registry directly. Rebinding through the registry is harmless but useless: the dataclass holds references to the original collectors.
- prometheus-client is now pinned. Upgrading it requires an explicit bump and re-checking the instrumentator's compatibility range.
- /metrics is exposed unauthenticated and relies on the Docker internal network for confidentiality. See docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy snippet that must be added if the service ever gets a host-side port mapping.
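The indirection in the first point can be illustrated with a small stand-in. In this sketch a SimpleNamespace plays the role of the main module, a private registry plays the role of REGISTRY, and handle_job is a hypothetical endpoint body; the one-field OcrMetrics is cut down for brevity and is not the real container.

```python
from dataclasses import dataclass
from types import SimpleNamespace

from prometheus_client import CollectorRegistry, Counter


@dataclass(frozen=True)
class OcrMetrics:
    jobs_total: Counter  # hypothetical field name


def build_metrics(registry: CollectorRegistry) -> OcrMetrics:
    return OcrMetrics(
        jobs_total=Counter("ocr_jobs_total", "OCR jobs processed", registry=registry)
    )


# Stand-ins: `main` mimics the module, production_registry mimics REGISTRY.
production_registry = CollectorRegistry()
main = SimpleNamespace(metrics=build_metrics(production_registry))


def handle_job() -> None:
    # Endpoint code looks the container up at call time.
    main.metrics.jobs_total.inc()


# Ineffective: creating a new registry does not change what handle_job() uses,
# because main.metrics still holds the collectors built for the old registry.
test_registry = CollectorRegistry()
handle_job()
seen_without_rebind = test_registry.get_sample_value("ocr_jobs_total")

# Effective: rebind the container itself to one built on the test registry.
main.metrics = build_metrics(test_registry)
handle_job()
seen_with_rebind = test_registry.get_sample_value("ocr_jobs_total")
```

In a real test the last assignment would typically go through pytest's monkeypatch.setattr(main, "metrics", ...) so that the original container is restored when the test ends.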
Alternatives considered
- Hand-roll the /metrics endpoint. Rejected: would have meant duplicating what prometheus-fastapi-instrumentator ships, plus middleware for the HTTP histograms.
- Skip the factory; pass registry as a function argument everywhere. Rejected: clutters every endpoint signature and breaks the symmetry with the Spring Boot side, which also relies on a process-global Micrometer registry.
- Use a pytest autouse fixture that resets REGISTRY between tests. Rejected: prometheus_client does not expose a clean "unregister all" hook, and we would be relying on private APIs.
References
- Issue: #652
- Library: https://github.com/trallnag/prometheus-fastapi-instrumentator
- Code: ocr-service/metrics.py, ocr-service/main.py, ocr-service/test_metrics.py