docs: add ADR-023 and glossary entries for OCR metrics
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m33s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m29s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m33s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m29s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
ADR-023 captures why prometheus-fastapi-instrumentator was chosen, the build_metrics(registry) factory pattern, and the test rebinding seam. The glossary gains four ops-aligned terms — illegible word, models-ready gauge, recognition vs segmentation accuracy — so the metrics documentation in OBSERVABILITY.md has a vocabulary to lean on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -80,6 +80,14 @@ _See also [DocumentStatus lifecycle](#documentstatus-lifecycle)._
|
||||
|
||||
**Sütterlin** — A specific standardized style of Kurrent taught in German schools from 1915 to 1941.
|
||||
|
||||
**Illegible word** — a word whose recognition confidence falls below the configured threshold; replaced with the literal token `[unleserlich]` in the rendered block text and counted in the `ocr_illegible_words_total` Prometheus counter.
|
||||
|
||||
**Models-ready gauge** — the `ocr_models_ready` Prometheus gauge, flipped from `0` to `1` once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker. Used both for the `/health` endpoint and as the supervised signal for the `ocr_models_ready < 1 for 2m` alert.
|
||||
|
||||
**Recognition model accuracy** — the accuracy reported by `ketos train` for the recognition (text-line) model, exposed as `ocr_model_accuracy{kind="recognition"}`. Sourced from `_parse_best_checkpoint` on the highest-scoring checkpoint after training.
|
||||
|
||||
**Segmentation model accuracy** — the accuracy reported by `ketos segtrain` for the baseline layout analysis (`blla`) model, exposed as `ocr_model_accuracy{kind="segmentation"}`. Distinct from recognition accuracy because the two models are trained and improved independently.
|
||||
|
||||
---
|
||||
|
||||
## Other Domain Terms
|
||||
|
||||
@@ -0,0 +1,94 @@
|
||||
# ADR-023: Prometheus Instrumentator and Metrics Registry Injection
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Until issue #652 the OCR service exposed no `/metrics` endpoint. The
|
||||
observability stack already scrapes the Spring Boot backend's actuator
|
||||
endpoint, but it had nothing to scrape on the Python side. Without HTTP-
|
||||
and domain-level metrics from `ocr-service` we cannot answer questions
|
||||
like "what is the share of words rendered as `[unleserlich]`" or
|
||||
"is the training error rate above its budget" from Grafana.
|
||||
|
||||
Two implementation requirements influenced the design:
|
||||
|
||||
1. **Counter / gauge isolation in tests.** `prometheus_client` collectors
|
||||
are module-level singletons keyed by name on the global `REGISTRY`.
|
||||
Re-importing or naively re-instantiating them raises a duplicated-
|
||||
collector error and cross-test state leaks (a `.inc()` in test A is
|
||||
still readable by test B). A test harness needs a way to swap the
|
||||
active container for a fresh per-test instance.
|
||||
|
||||
2. **Minimal blast radius on the request path.** We did not want to
|
||||
hand-instrument every endpoint with FastAPI middleware. The
|
||||
`prometheus-fastapi-instrumentator` library already provides
|
||||
`http_requests_total`, `http_request_duration_seconds`, and the
|
||||
`/metrics` exposition route, all idiomatic Prometheus names.
|
||||
|
||||
## Decision
|
||||
|
||||
- Add `prometheus-fastapi-instrumentator==7.0.0` and pin its transitive
|
||||
dependency `prometheus-client==0.25.0` explicitly in
|
||||
`ocr-service/requirements.txt`.
|
||||
- Mount the instrumentator once at module load:
|
||||
`Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app)`.
|
||||
This adds `/metrics` and an HTTP-level dashboard surface without
|
||||
changing any endpoint code.
|
||||
- Define every domain metric (`ocr_jobs_total`, `ocr_pages_total`,
|
||||
`ocr_processing_seconds`, …) inside a `build_metrics(registry)`
|
||||
factory in `ocr-service/metrics.py` that returns a frozen `OcrMetrics`
|
||||
dataclass. Production code binds the container to the default
|
||||
`REGISTRY` once: `metrics: OcrMetrics = build_metrics(REGISTRY)`.
|
||||
- Tests use a `fresh_metrics` fixture that builds a new
|
||||
`CollectorRegistry()` per test and monkeypatches `main.metrics` with
|
||||
a container bound to it. The endpoint code keeps reading
|
||||
`metrics.<name>` without knowing whether it is talking to the global
|
||||
registry or a per-test one.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**
|
||||
|
||||
- One reusable factory captures the metric definitions; future metrics
|
||||
go in one place.
|
||||
- Tests run with full counter isolation. Cross-test state leakage is
|
||||
impossible because each test sees its own dataclass instance.
|
||||
- The instrumentator gives us `http_*` metrics for free, including a
|
||||
Grafana-ready histogram that pairs with the Spring Boot one.
|
||||
|
||||
**Negative**
|
||||
|
||||
- One extra level of indirection: any test that asserts on metric
|
||||
values must remember to monkeypatch `main.metrics`, not the registry
|
||||
directly. Rebinding through the registry is harmless but useless —
|
||||
the dataclass holds references to the original collectors.
|
||||
- `prometheus-client` is now pinned. Upgrading it requires an explicit
|
||||
bump and re-checking the instrumentator's compatibility range.
|
||||
- `/metrics` is exposed unauthenticated and relies on the Docker
|
||||
internal network for confidentiality. See
|
||||
[docs/OBSERVABILITY.md §Internal-only endpoints](../OBSERVABILITY.md)
|
||||
for the Caddy snippet that must be added if the service ever gets a
|
||||
host-side port mapping.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Hand-roll the `/metrics` endpoint.** Rejected: would have meant
|
||||
duplicating what `prometheus-fastapi-instrumentator` ships, plus
|
||||
middleware for the HTTP histograms.
|
||||
- **Skip the factory; pass `registry` as a function argument
|
||||
everywhere.** Rejected: clutters every endpoint signature and breaks
|
||||
the symmetry with the Spring Boot side, which also relies on a
|
||||
process-global Micrometer registry.
|
||||
- **Use a `pytest` autouse fixture that resets `REGISTRY` between
|
||||
tests.** Rejected: `prometheus_client` does not expose a clean
|
||||
"unregister all" hook, and we would be relying on private APIs.
|
||||
|
||||
## References
|
||||
|
||||
- Issue: [#652](https://git.raddatz.cloud/marcel/familienarchiv/issues/652)
|
||||
- Library: <https://github.com/trallnag/prometheus-fastapi-instrumentator>
|
||||
- Code: `ocr-service/metrics.py`, `ocr-service/main.py`,
|
||||
`ocr-service/test_metrics.py`
|
||||
Reference in New Issue
Block a user