Add instrumentation-overhead NFR to OCR service #656

Open
opened 2026-05-21 17:11:25 +02:00 by marcel · 0 comments

Context

PR #653 added /metrics to the OCR service via
prometheus-fastapi-instrumentator plus a handful of domain counters and
a per-page histogram observation. The instrumentator wraps every request
in a middleware and records bucketed durations; the histogram observations
on /ocr/stream fire from inside the streaming generator.
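To make the second overhead source concrete, here is a minimal sketch of a per-page observation made from inside a streaming generator. It is an illustration only: `BucketedHistogram` is a hand-rolled stand-in, not the Prometheus client, and `stream_pages`, the bucket bounds, and the placeholder "OCR" step are all hypothetical names and values that do not come from PR #653.

```python
import bisect
import time


class BucketedHistogram:
    """Hypothetical stand-in for a Prometheus-style histogram (not the real client)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One non-cumulative counter per finite bucket, plus a final +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value


def stream_pages(pages, histogram, clock=time.perf_counter):
    """Yield processed pages, recording one duration observation per page."""
    for page in pages:
        start = clock()
        result = page.upper()  # placeholder for the real per-page OCR work
        histogram.observe(clock() - start)
        yield result
```

Because the observation runs inside the generator, nothing is recorded until the stream is consumed, and the cost of each `observe` call lands in the streaming path whose latency the budget is meant to protect.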

We have no agreed performance budget for this overhead. Without one, any
future regression (an extra label, a switch to summary-from-histogram, an
on-by-default tracing hook) cannot be evaluated against a target.

Proposal

Adopt a non-functional requirement, document it in
docs/architecture/NFRs.md (or wherever the existing NFRs live; this needs a
small scout pass first), and add a CI check or a periodic load test that
fails when the budget is exceeded.

Suggested budget

Instrumentation overhead — the p95 latency of POST /ocr and
POST /ocr/stream must not increase by more than 5 ms when
Instrumentator(...).expose(app) is mounted, measured against a
baseline run with the instrumentator removed, at a steady 10 req/s on
the dev-host hardware (16 vCPU, 32 GB).

The 5 ms figure is a starting point; calibrate it against one baseline run
and one instrumented run before locking it in.
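The budget comes down to comparing two p95 values. The following is a rough sketch, assuming latency samples in milliseconds have already been collected for a baseline run and an instrumented run. The function names are hypothetical, and the nearest-rank percentile used here is only one of several reasonable definitions; the load-test tool may compute p95 differently.

```python
BUDGET_MS = 5.0  # starting point from the suggested budget; still to be calibrated


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def overhead_ms(baseline_ms, instrumented_ms):
    """p95 increase attributed to instrumentation (can be negative due to noise)."""
    return p95(instrumented_ms) - p95(baseline_ms)


def within_budget(baseline_ms, instrumented_ms, budget_ms=BUDGET_MS):
    """True when the p95 increase does not exceed the budget."""
    return overhead_ms(baseline_ms, instrumented_ms) <= budget_ms
```

A negative or near-zero overhead from a single pair of runs most likely reflects run-to-run noise, which is one reason to calibrate the figure before treating it as a hard limit.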

Acceptance criteria

  • NFR text committed to docs/architecture/NFRs.md (or wherever the
    existing NFRs live).
  • A reproducible load-test script (e.g. scripts/perf/ocr-bench.sh
    using wrk or k6) that produces a JSON report.
  • A documented "how to verify" runbook: which command to run, which
    number to look at, what passes and what fails.
  • (Stretch) A nightly CI job that runs the benchmark on a fixed
    runner and posts a delta to the PR or to Grafana.
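The "how to verify" step could be reduced to a single pass/fail check over two JSON reports. The sketch below assumes a report schema of the form `{"POST /ocr": {"p95_ms": ...}, "POST /ocr/stream": {"p95_ms": ...}}`. That schema, the script name `check_budget.py`, and the function names are assumptions made for illustration; the actual report format depends on the load-test tool chosen.

```python
import json
import sys

BUDGET_MS = 5.0  # starting point from the suggested budget
ENDPOINTS = ("POST /ocr", "POST /ocr/stream")


def check_reports(baseline, instrumented, budget_ms=BUDGET_MS):
    """Return a list of (endpoint, delta_ms, ok) tuples, one per endpoint."""
    results = []
    for endpoint in ENDPOINTS:
        delta = instrumented[endpoint]["p95_ms"] - baseline[endpoint]["p95_ms"]
        results.append((endpoint, delta, delta <= budget_ms))
    return results


def main(argv):
    """Return a process exit code: 0 = pass, 1 = budget broken, 2 = usage error."""
    if len(argv) != 3:
        print("usage: check_budget.py BASELINE.json INSTRUMENTED.json", file=sys.stderr)
        return 2
    with open(argv[1]) as f:
        baseline = json.load(f)
    with open(argv[2]) as f:
        instrumented = json.load(f)
    results = check_reports(baseline, instrumented)
    for endpoint, delta, ok in results:
        print(f"{'PASS' if ok else 'FAIL'} {endpoint}: p95 delta {delta:+.2f} ms")
    return 0 if all(ok for _, _, ok in results) else 1
```

Wired up with `sys.exit(main(sys.argv))` as the script entry point, the exit code gives a CI job a direct pass/fail signal, and the printed delta is the number the runbook would point readers to.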

Out of scope

  • Wiring the budget into Alertmanager (covered by #654's training-error
    alert pattern — the same approach could be reused later).
  • Optimising the existing histogram cardinality.

Related

  • PR #653 (introduced the instrumentation being measured)
  • ADR-023 (docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md)
marcel added the devops and feature labels 2026-05-21 17:11:34 +02:00
Reference: marcel/familienarchiv#656