From 2dbb3c37b45d632922bedae61a3b04d7c18888c2 Mon Sep 17 00:00:00 2001
From: Marcel
Date: Thu, 21 May 2026 17:05:27 +0200
Subject: [PATCH] docs(observability): document ocr metrics, scrape edge, and
 access-log filter

- L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge
  (plus the previously-undrawn Prometheus -> backend edge for symmetry).
- OBSERVABILITY.md gains a full ocr_* metrics table with labels, units,
  and the canonical example queries from issue #652.
- New "Internal-only endpoints" subsection captures the unauthenticated
  /metrics caveat and provides the Caddy block snippet for the case where
  the service ever gets a host port.
- Explicit note that MetricsPathFilter only quiets uvicorn stdout, and
  the OCR metrics must never carry PII or document content.

Co-Authored-By: Claude Sonnet 4.6
---
 docs/OBSERVABILITY.md                   | 69 ++++++++++++++++++++++++-
 docs/architecture/c4/l2-containers.puml |  2 +
 2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md
index b895e849..2b3855d6 100644
--- a/docs/OBSERVABILITY.md
+++ b/docs/OBSERVABILITY.md
@@ -118,11 +118,14 @@ To find a trace for a specific request in staging/production, either increase th
 
 ## Metrics (Prometheus → Grafana)
 
-Prometheus scrapes the backend management endpoint every 15 s:
+Prometheus scrapes two targets every 15 s:
 
 ```
 Target: backend:8081/actuator/prometheus
 Labels: job="spring-boot", application="Familienarchiv"
+
+Target: ocr:8000/metrics
+Labels: job="ocr-service"
 ```
 
 All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.
@@ -146,6 +149,70 @@ jvm_memory_used_bytes{area="heap", application="Familienarchiv"}
 hikaricp_connections_active
 ```
 
+### OCR-service custom metrics
+
+Exposed at `ocr:8000/metrics` by `prometheus-fastapi-instrumentator`. The
+`http_*` metrics describe the FastAPI request layer; the `ocr_*` series are
+domain-specific. **Never label these with PII or document content** — labels
+have unbounded cardinality risk and are visible to anyone with Grafana access.
+
+| Metric | Type | Labels | Unit | What it tracks |
+|---|---|---|---|---|
+| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download |
+| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator |
+| `ocr_skipped_pages_total` | Counter | — | pages | Pages skipped because the engine raised on them |
+| `ocr_words_total` | Counter | — | words | Recognized words summed across every block |
+| `ocr_illegible_words_total` | Counter | — | words | Words below the confidence threshold (rendered as `[unleserlich]`) |
+| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing |
+| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs |
+| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run |
+| `ocr_models_ready` | Gauge | — | 0\|1 | 1 once the lifespan startup has finished loading models |
+
+Canonical example queries (the same ones referenced in issue #652):
+
+```promql
+# OCR throughput by engine
+sum by (engine) (rate(ocr_pages_total[5m]))
+
+# Share of words rendered as [unleserlich]
+sum(rate(ocr_illegible_words_total[5m]))
+  / sum(rate(ocr_words_total[5m]))
+
+# p95 page processing time per engine
+histogram_quantile(0.95, sum by (engine, le) (
+  rate(ocr_processing_seconds_bucket[5m])
+))
+
+# Training error rate
+sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
+  / sum(rate(ocr_training_runs_total[1h]))
+
+# Latest recognition vs segmentation accuracy
+ocr_model_accuracy
+```
+
+### Internal-only endpoints
+
+`/metrics` is exposed by the OCR service over plain HTTP without
+authentication. The container is reachable only on the internal Docker
+network — Caddy never proxies to it directly. If the service is ever
+exposed (e.g. a `ports:` mapping is added), block the endpoint at the
+reverse proxy:
+
+```caddy
+ocr.example.com {
+    @internal_only path /metrics /health
+    respond @internal_only 404
+    reverse_proxy ocr:8000
+}
+```
+
+The `MetricsPathFilter` in `ocr-service/main.py` suppresses uvicorn's
+**stdout** access log lines for `/metrics` and `/health` so the container
+console stays focused on real OCR traffic. Promtail/Loki still receive
+access lines from any other source. Treat the filter as console
+noise-control, not an audit-suppression mechanism.
+
 ## Errors (GlitchTip)
 
 GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.
diff --git a/docs/architecture/c4/l2-containers.puml b/docs/architecture/c4/l2-containers.puml
index 346efe75..8d66a614 100644
--- a/docs/architecture/c4/l2-containers.puml
+++ b/docs/architecture/c4/l2-containers.puml
@@ -43,6 +43,8 @@ Rel(ocr, storage, "Fetches PDF via presigned URL", "HTTP / S3 presigned")
 Rel(mc, storage, "Bootstraps bucket + service account on startup", "MinIO Client CLI")
 Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")
 Rel(backend, tempo, "Sends distributed traces via OTLP", "HTTP / OTLP / port 4318 (archiv-net)")
+Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus")
+Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")
 Rel(grafana, prometheus, "Queries metrics", "HTTP 9090")
 Rel(grafana, loki, "Queries logs", "HTTP 3100")
 Rel(grafana, tempo, "Queries traces", "HTTP 3200")
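
Note on the `MetricsPathFilter` paragraph added to OBSERVABILITY.md above: this patch does not include the code in `ocr-service/main.py`, so the following is only a minimal sketch of how such a filter could work, not the actual implementation. It assumes uvicorn's access lines contain a quoted request line of the form `"GET /path HTTP/1.1"`. The names `MetricsPathFilterSketch`, `QUIET_PATHS`, `install_filter`, and `make_access_record` are hypothetical and exist only for illustration.

```python
import logging
import re

# Hypothetical: the two paths the documentation says are silenced.
QUIET_PATHS = frozenset({"/metrics", "/health"})

# Assumed shape of an access line: ... "GET /metrics HTTP/1.1" 200
_REQUEST_LINE = re.compile(r'"[A-Z]+ (\S+) HTTP/')


class MetricsPathFilterSketch(logging.Filter):
    """Drop access-log records for scrape and probe endpoints (illustrative only)."""

    def filter(self, record: logging.LogRecord) -> bool:
        match = _REQUEST_LINE.search(record.getMessage())
        if match is None:
            # Not an access line we recognise; let it through untouched.
            return True
        # Ignore any query string so "/health?probe=1" is also silenced.
        path = match.group(1).split("?", 1)[0]
        return path not in QUIET_PATHS


def install_filter() -> None:
    """Attach the filter to the logger uvicorn uses for access lines."""
    logging.getLogger("uvicorn.access").addFilter(MetricsPathFilterSketch())


def make_access_record(path: str, status: int = 200) -> logging.LogRecord:
    """Build a record shaped like an access line, for trying the filter out."""
    return logging.LogRecord(
        name="uvicorn.access",
        level=logging.INFO,
        pathname="example",
        lineno=0,
        msg='%s - "%s %s HTTP/%s" %d',
        args=("127.0.0.1:5000", "GET", path, "1.1", status),
        exc_info=None,
    )
```

A filter like this acts only on records passing through the logger it is attached to, which matches the caveat in the documentation: it reduces console noise and does not remove lines that reach Loki from any other source.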