docs(observability): document ocr metrics, scrape edge, and access-log filter
- L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge (plus the previously-undrawn Prometheus -> backend edge for symmetry). - OBSERVABILITY.md gains a full ocr_* metrics table with labels, units, and the canonical example queries from issue #652. - New "Internal-only endpoints" subsection captures the unauthenticated /metrics caveat and provides the Caddy block snippet for the case where the service ever gets a host port. - Explicit note that MetricsPathFilter only quiets uvicorn stdout, and the OCR metrics must never carry PII or document content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -118,11 +118,14 @@ To find a trace for a specific request in staging/production, either increase th
|
||||
|
||||
## Metrics (Prometheus → Grafana)
|
||||
|
||||
Prometheus scrapes the backend management endpoint every 15 s:
|
||||
Prometheus scrapes two targets every 15 s:
|
||||
|
||||
```
|
||||
Target: backend:8081/actuator/prometheus
|
||||
Labels: job="spring-boot", application="Familienarchiv"
|
||||
|
||||
Target: ocr:8000/metrics
|
||||
Labels: job="ocr-service"
|
||||
```
|
||||
|
||||
All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.
|
||||
@@ -146,6 +149,70 @@ jvm_memory_used_bytes{area="heap", application="Familienarchiv"}
|
||||
hikaricp_connections_active
|
||||
```
|
||||
|
||||
### OCR-service custom metrics
|
||||
|
||||
Exposed at `ocr:8000/metrics` by `prometheus-fastapi-instrumentator`. The
|
||||
`http_*` metrics describe the FastAPI request layer; the `ocr_*` series are
|
||||
domain-specific. **Never label these with PII or document content** — labels
|
||||
have unbounded cardinality risk and are visible to anyone with Grafana access.
|
||||
|
||||
| Metric | Type | Labels | Unit | What it tracks |
|
||||
|---|---|---|---|---|
|
||||
| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download |
|
||||
| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator |
|
||||
| `ocr_skipped_pages_total` | Counter | — | pages | Pages skipped because the engine raised on them |
|
||||
| `ocr_words_total` | Counter | — | words | Recognized words summed across every block |
|
||||
| `ocr_illegible_words_total` | Counter | — | words | Words below the confidence threshold (rendered as `[unleserlich]`) |
|
||||
| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing |
|
||||
| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs |
|
||||
| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run |
|
||||
| `ocr_models_ready` | Gauge | — | 0\|1 | 1 once the lifespan startup has finished loading models |
|
||||
|
||||
Canonical example queries (the same ones referenced in issue #652):
|
||||
|
||||
```promql
|
||||
# OCR throughput by engine
|
||||
sum by (engine) (rate(ocr_pages_total[5m]))
|
||||
|
||||
# Share of words rendered as [unleserlich]
|
||||
sum(rate(ocr_illegible_words_total[5m]))
|
||||
/ sum(rate(ocr_words_total[5m]))
|
||||
|
||||
# p95 page processing time per engine
|
||||
histogram_quantile(0.95, sum by (engine, le) (
|
||||
rate(ocr_processing_seconds_bucket[5m])
|
||||
))
|
||||
|
||||
# Training error rate
|
||||
sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
|
||||
/ sum(rate(ocr_training_runs_total[1h]))
|
||||
|
||||
# Latest recognition vs segmentation accuracy
|
||||
ocr_model_accuracy
|
||||
```
|
||||
|
||||
### Internal-only endpoints
|
||||
|
||||
`/metrics` is exposed by the OCR service over plain HTTP without
|
||||
authentication. The container is reachable only on the internal Docker
|
||||
network — Caddy never proxies to it directly. If the service is ever
|
||||
exposed (e.g. a `ports:` mapping is added), block the endpoint at the
|
||||
reverse proxy:
|
||||
|
||||
```caddy
|
||||
ocr.example.com {
|
||||
@internal_only path /metrics /health
|
||||
respond @internal_only 404
|
||||
reverse_proxy ocr:8000
|
||||
}
|
||||
```
|
||||
|
||||
The `MetricsPathFilter` in `ocr-service/main.py` suppresses uvicorn's
|
||||
**stdout** access log lines for `/metrics` and `/health` so the container
|
||||
console stays focused on real OCR traffic. Promtail/Loki still receive
|
||||
access lines from any other source. Treat the filter as console
|
||||
noise-control, not an audit-suppression mechanism.
|
||||
|
||||
## Errors (GlitchTip)
|
||||
|
||||
GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.
|
||||
|
||||
@@ -43,6 +43,8 @@ Rel(ocr, storage, "Fetches PDF via presigned URL", "HTTP / S3 presigned")
|
||||
Rel(mc, storage, "Bootstraps bucket + service account on startup", "MinIO Client CLI")
|
||||
Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")
|
||||
Rel(backend, tempo, "Sends distributed traces via OTLP", "HTTP / OTLP / port 4318 (archiv-net)")
|
||||
Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus")
|
||||
Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")
|
||||
Rel(grafana, prometheus, "Queries metrics", "HTTP 9090")
|
||||
Rel(grafana, loki, "Queries logs", "HTTP 3100")
|
||||
Rel(grafana, tempo, "Queries traces", "HTTP 3200")
|
||||
|
||||
Reference in New Issue
Block a user