Add Prometheus alerting rules for OCR service #654

marcel · 2026-05-21T17:10:42+02:00

Context

PR #653 exposes the ocr_* Prometheus metrics at ocr:8000/metrics. The
scrape target is configured, but no alerting rules consume the new signals
yet. Without alerts, a stuck or degraded OCR service stays invisible until a
user reports it.

Scope

Add a Prometheus rule file (most likely
infra/observability/prometheus/rules/ocr.yml), referenced from the
rule_files: list in prometheus.yml, containing the three alerts below.
Alert dispatch (Alertmanager → email / webhook) is out of scope for
this issue; assume the existing receiver chain.
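
The wiring could look roughly like the fragment below. This is a sketch, not the final config: it assumes prometheus.yml sits in infra/observability/prometheus/, so that the relative entry resolves against that directory (Prometheus resolves relative rule_files paths against the directory of the config file).

```yaml
# Fragment of prometheus.yml; the rest of the file stays unchanged.
# Assumption: prometheus.yml lives in infra/observability/prometheus/,
# next to the rules/ directory.
rule_files:
  - rules/ocr.yml
```

If Prometheus runs in a container, the rules/ directory also has to be mounted at the matching path inside the container, otherwise the entry will not match any file.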

Alert 1 — OCR service models not ready

- alert: OcrModelsNotReady
  expr: ocr_models_ready < 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "OCR service lifespan startup has not flipped models_ready to 1"
    description: |
      `ocr_models_ready` has been < 1 for 2 minutes on {{ $labels.instance }}.
      The Kraken model or the spell-checker is failing to load; /ocr returns 503.

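Rationale: ocr_models_ready is set to 1 exactly once, at the end of the
FastAPI lifespan startup. If it stays 0, the container is up but cannot
serve OCR. The expression only matches while the metric is being scraped; a
target that is down entirely returns no series and will not trigger this alert.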

Alert 2 — High skipped-page rate

- alert: OcrSkippedPageRateHigh
  expr: |
    sum(rate(ocr_skipped_pages_total[10m]))
      / (sum(rate(ocr_pages_total[10m])) + sum(rate(ocr_skipped_pages_total[10m])))
      > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 10% of OCR pages skipped over the last 10m"
    description: |
      The engine is raising an exception on >10% of pages. Likely causes: corrupt PDFs,
      a Kraken model regression, or pyvips/cv2 OOM. Investigate in Loki via
      `{job="ocr-service"} |= "Guided OCR failed" or "OCR failed on page"`.

Alert 3 — Training error rate

- alert: OcrTrainingErrorRateHigh
  expr: |
    sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
      / sum(rate(ocr_training_runs_total[1h]))
      > 0.5
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of OCR training runs failed in the last hour"
    description: |
      `ketos train` or `ketos segtrain` is failing on the majority of attempted runs.
      Inspect the latest /train log lines in Grafana → Loki.
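
The three snippets above are list items, so in the rule file they need to sit under a groups: / rules: wrapper for promtool to accept them. A minimal sketch of that layout, with only the first rule written out; the group name "ocr" is an assumption, any unique name works.

```yaml
# Sketch of infra/observability/prometheus/rules/ocr.yml.
# The group name "ocr" is hypothetical.
groups:
  - name: ocr
    rules:
      - alert: OcrModelsNotReady
        expr: ocr_models_ready < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OCR service lifespan startup has not flipped models_ready to 1"
          description: |
            `ocr_models_ready` has been < 1 for 2 minutes on {{ $labels.instance }}.
            The Kraken model or the spell-checker is failing to load; /ocr returns 503.
      # OcrSkippedPageRateHigh and OcrTrainingErrorRateHigh follow here,
      # at the same indentation as the rule above.
```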

Acceptance criteria

- [ ] infra/observability/prometheus/rules/ocr.yml exists with the three rules above.
- [ ] prometheus.yml includes the file via rule_files:.
- [ ] promtool check rules infra/observability/prometheus/rules/ocr.yml passes locally.
- [ ] Grafana → Alerting shows the three rules after the next deploy.
- [ ] DEPLOYMENT.md or OBSERVABILITY.md mentions which alerts exist for the OCR domain.
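
Beyond the syntax check, promtool test rules can exercise the alert logic against synthetic series. This goes further than the criteria above and is only a sketch: the file name ocr_test.yml, its location next to ocr.yml, and the job label value are assumptions, and it expects ocr.yml to contain the OcrModelsNotReady rule exactly as written in this issue, inside a groups: wrapper.

```yaml
# Hypothetical infra/observability/prometheus/rules/ocr_test.yml.
# Run with: promtool test rules infra/observability/prometheus/rules/ocr_test.yml
rule_files:
  - ocr.yml            # resolved relative to this test file

evaluation_interval: 1m

tests:
  # Gauge stuck at 0: the alert should fire once the 2m "for" has elapsed.
  - interval: 1m
    input_series:
      # The job label value is an assumption; instance comes from the issue.
      - series: 'ocr_models_ready{instance="ocr:8000", job="ocr"}'
        values: '0x10'   # 0 at t=0, then ten more samples of 0
    alert_rule_test:
      - eval_time: 1m    # still pending, so no firing alert is expected
        alertname: OcrModelsNotReady
        exp_alerts: []
      - eval_time: 5m    # well past the 2m "for" duration
        alertname: OcrModelsNotReady
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: ocr:8000
              job: ocr
            exp_annotations:
              summary: "OCR service lifespan startup has not flipped models_ready to 1"
              description: |
                `ocr_models_ready` has been < 1 for 2 minutes on ocr:8000.
                The Kraken model or the spell-checker is failing to load; /ocr returns 503.

  # Gauge at 1: the alert should never fire.
  - interval: 1m
    input_series:
      - series: 'ocr_models_ready{instance="ocr:8000", job="ocr"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: OcrModelsNotReady
        exp_alerts: []
```

The expected annotations have to match the rule's rendered annotations exactly, so any later rewording of the summary or description in ocr.yml needs the same change here.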

Related

- PR #653 (introduced the metrics consumed here)
- Issue #652 (parent feature)
- ADR-023 (docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md)


marcel added the devops feature phase-7: monitoring labels 2026-05-21 17:10:46 +02:00

marcel referenced this issue

2026-05-21 17:11:25 +02:00

Add instrumentation-overhead NFR to OCR service #656

marcel referenced this issue

2026-05-21 17:11:46 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue

2026-05-21 17:16:38 +02:00