Marcel 2dbb3c37b4 docs(observability): document ocr metrics, scrape edge, and access-log filter

- L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge
  (plus the previously-undrawn Prometheus -> backend edge for symmetry).
- OBSERVABILITY.md gains a full ocr_* metrics table with labels, units,
  and the canonical example queries from issue #652.
- New "Internal-only endpoints" subsection captures the unauthenticated
  /metrics caveat and provides the Caddy block snippet for the case
  where the service ever gets a host port.
- Explicit note that MetricsPathFilter only quiets uvicorn stdout, and
  the OCR metrics must never carry PII or document content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-21 17:05:27 +02:00


Observability Guide

Ops reference (starting the stack, env vars, CI wiring) → DEPLOYMENT.md §4. This file is for developers: what signal lives where, how to reach it, and what to look for.

Where to look for what

| I want to… | Go to |
|---|---|
| See the last N log lines from the backend | `docker compose logs --tail=100 backend` |
| Search logs by keyword across time | Grafana → Explore → Loki |
| Understand why an HTTP request failed | Grafana → Explore → Loki → filter by `traceId` → follow link to Tempo |
| See a full distributed trace (DB queries, HTTP calls) | Grafana → Explore → Tempo → search by service or trace ID |
| Check JVM heap / GC / thread count | Grafana → Dashboards → Spring Boot Observability |
| Check HTTP error rate or p95 latency | Grafana → Dashboards → Spring Boot Observability |
| Check host CPU / memory / disk | Grafana → Dashboards → Node Exporter Full |
| See grouped application errors with stack traces | GlitchTip |
| Check if the backend is healthy | `curl http://localhost:8081/actuator/health` (on the server) |
| Check what Prometheus is scraping | `curl http://localhost:9090/api/v1/targets` (on the server) |

Access

| Tool | External URL | Who it's for |
|---|---|---|
| Grafana | `https://grafana.archiv.raddatz.cloud` | Logs, metrics, traces: the primary observability UI |
| GlitchTip | `https://glitchtip.archiv.raddatz.cloud` | Grouped errors with stack traces and release tracking |

Loki, Tempo, and Prometheus have no external URL. They are internal services, accessible only through Grafana (or via SSH tunnel — see below).

Logs (Loki)

Logs reach Loki via Promtail, which reads all Docker container logs from the Docker socket and ships them with labels derived from Docker Compose metadata.

Labels available in every log line

| Label | What it contains | Example |
|---|---|---|
| `job` | Compose service name | `backend`, `frontend`, `db` |
| `compose_service` | Same as `job` | `backend` |
| `compose_project` | Compose project name | `archiv-staging`, `archiv-production` |
| `container_name` | Docker container name | `archiv-staging-backend-1` |
| `filename` | Docker log source | `/var/lib/docker/containers/…` |

Use job in LogQL queries — it is stable across dev, staging, and production. container_name changes between environments.

Common LogQL queries in Grafana Explore

# All backend logs
{job="backend"}

# Backend ERROR and WARN lines only
{job="backend"} |~ "ERROR|WARN"

# All logs for a specific request (paste a traceId from a log line)
{job="backend"} |= "3fa85f64-5717-4562-b3fc-2c963f66afa6"

# Log lines containing a specific exception class
{job="backend"} |~ "DomainException|NullPointerException"

# Frontend logs
{job="frontend"}

# Database (slow query log, if enabled)
{job="db"}

Log → Trace correlation

Spring Boot writes the active traceId into every log line when a request is being processed:

2026-05-16 ... INFO  [Familienarchiv,3fa85f64...,1b2c3d4e] o.r.f.document.DocumentService : ...

In Grafana Explore → Loki, log lines with a traceId field show a Tempo link. Clicking it opens the full trace in Explore → Tempo without copying and pasting IDs.

This linking is configured in the Loki datasource provisioning via the traceId derived field regex. No manual setup required.
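As a rough illustration of what a derived-field regex does, the sketch below pulls the trace ID out of a bracketed `[application,traceId,spanId]` prefix like the one shown above. The pattern, the function name `extract_trace_id`, and the sample line are assumptions for illustration; the regex actually provisioned in the Loki datasource is not shown in this guide and may differ.

```python
import re
from typing import Optional

# Assumed prefix shape: [<application>,<traceId>,<spanId>]
# Illustrative pattern only; the provisioned derived-field regex may differ.
TRACE_FIELD = re.compile(r"\[[^,\]]*,([0-9a-fA-F-]+),[0-9a-fA-F]+\]")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID from the correlation prefix, or None if absent."""
    match = TRACE_FIELD.search(log_line)
    return match.group(1) if match else None


# Hypothetical log line; the timestamp, IDs, and message are made up.
SAMPLE = (
    "2026-05-16 10:00:00 INFO  "
    "[Familienarchiv,4bf92f3577b34da6a3ce929d0e0e4736,00f067aa0ba902b7] "
    "o.r.f.document.DocumentService : loaded document"
)

if __name__ == "__main__":
    print(extract_trace_id(SAMPLE))
```

Lines logged outside a request carry no trace ID, so the function returns `None` for them, which mirrors why the Tempo link only appears on some log lines.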

Traces (Tempo)

The backend sends traces to Tempo via OTLP HTTP (port 4318). Every inbound HTTP request and every JPA query produces a span. Spans are linked into a trace by the trace ID that is propagated with the request.

Finding a trace in Grafana

Option A — from a log line:

1. Grafana → Explore → select Loki datasource
2. Query {job="backend"} and find the failing request
3. Click the Tempo link in the log line (appears when traceId is present)

Option B — by service:

1. Grafana → Explore → select Tempo datasource
2. Query type: Search
3. Service name: familienarchiv-backend
4. Filter by HTTP status, duration, or operation name as needed

Option C — by trace ID:

1. Grafana → Explore → select Tempo datasource
2. Query type: TraceQL or Trace ID
3. Paste the trace ID

What each span type tells you

| Root span name pattern | What it covers |
|---|---|
| `GET /api/documents`, `POST /api/documents` | Full HTTP request lifecycle |
| `SELECT archiv.*` | A single JPA/JDBC query inside that request |
| `HikariPool.getConnection` | Connection pool wait time |

A slow SELECT span inside an HTTP span that is otherwise fast usually points to a missing index or an inefficient query. A slow HikariPool.getConnection span means the request waited for a free connection, which indicates connection pool exhaustion.

Sampling rate

Dev: 100% of requests are traced (management.tracing.sampling.probability: 1.0 in application.yaml)
Staging / Production: 10% (MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1 in docker-compose.prod.yml)

To find a trace for a specific request in staging/production, either increase the sampling rate temporarily or trigger the request multiple times.
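To get a feel for how often a request has to be repeated at a 10% sampling rate, the sketch below computes the chance that at least one of n attempts is traced. It assumes each request is sampled independently with the configured probability; the function names are hypothetical.

```python
import math


def p_at_least_one_trace(sampling_probability: float, attempts: int) -> float:
    """Probability that at least one of `attempts` requests is sampled."""
    return 1.0 - (1.0 - sampling_probability) ** attempts


def attempts_needed(sampling_probability: float, confidence: float) -> int:
    """Smallest number of attempts that reaches the given confidence."""
    if not 0.0 < sampling_probability <= 1.0:
        raise ValueError("sampling_probability must be in (0, 1]")
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    if sampling_probability == 1.0:
        return 1  # Every request is traced.
    ratio = math.log(1.0 - confidence) / math.log(1.0 - sampling_probability)
    return max(1, math.ceil(ratio))


if __name__ == "__main__":
    # Staging / production sampling rate from this guide: 0.1
    print(attempts_needed(0.1, 0.95))
    print(round(p_at_least_one_trace(0.1, 10), 3))
```

Under that independence assumption, 29 repetitions give roughly a 95% chance of capturing at least one trace at 10% sampling, so temporarily raising the sampling rate is usually the quicker route.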

Metrics (Prometheus → Grafana)

Prometheus scrapes two application targets every 15 s (node-exporter and cAdvisor are scraped as well; see the signal summary):

Target: backend:8081/actuator/prometheus
Labels: job="spring-boot", application="Familienarchiv"

Target: ocr:8000/metrics
Labels: job="ocr-service"

All Spring Boot metrics carry the application="Familienarchiv" tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.
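When a target seems to be missing from a dashboard, the JSON returned by `/api/v1/targets` shows whether Prometheus can reach it. The sketch below filters such a response down to the targets that are not healthy. The helper name `unhealthy_targets` and the trimmed sample response are illustrative assumptions; only the fields used here (`activeTargets`, `labels`, `scrapeUrl`, `health`, `lastError`) are read.

```python
import json
from typing import Any, Dict, List, Tuple


def unhealthy_targets(payload: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (job, scrapeUrl, lastError) for each active target not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


# Hypothetical, trimmed response body for illustration only.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "spring-boot"},
       "scrapeUrl": "http://backend:8081/actuator/prometheus",
       "health": "up", "lastError": ""},
      {"labels": {"job": "ocr-service"},
       "scrapeUrl": "http://ocr:8000/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

if __name__ == "__main__":
    for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
        print(f"{job}: {url} -> {error}")
```

On the server, the same function could be fed the parsed output of the `curl` command from the table at the top of this guide.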

Useful Prometheus queries (run on the server or via Grafana Explore → Prometheus)

# HTTP error rate (5xx) as a fraction of all requests
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# p95 response time
histogram_quantile(0.95, sum by (le) (
  rate(http_server_requests_seconds_bucket[5m])
))

# JVM heap used
jvm_memory_used_bytes{area="heap", application="Familienarchiv"}

# Active DB connections
hikaricp_connections_active
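The p95 query above estimates a quantile from cumulative histogram buckets rather than from raw request durations. The sketch below is a simplified Python rendition of that linear interpolation, assuming classic `le` buckets that end with `+Inf`; it is meant to show why the result depends on bucket boundaries, not to reproduce every edge case of PromQL's `histogram_quantile`.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with an le=+Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile lies in the open-ended bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            if count == prev_count:
                return upper
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative bucket counts (seconds, requests).
SAMPLE_BUCKETS = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, SAMPLE_BUCKETS))
```

With these sample buckets the 95th request falls between the 0.5 s and 1.0 s bounds, so the estimate is interpolated inside that bucket; the true p95 could be anywhere in that range.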

OCR-service custom metrics

Exposed at ocr:8000/metrics by prometheus-fastapi-instrumentator. The http_* metrics describe the FastAPI request layer; the ocr_* series are domain-specific. Never put PII or document content into label values: every distinct value creates a new time series, so such labels risk unbounded cardinality, and they are visible to anyone with Grafana access.

| Metric | Type | Labels | Unit | What it tracks |
|---|---|---|---|---|
| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download |
| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator |
| `ocr_skipped_pages_total` | Counter | — | pages | Pages skipped because the engine raised on them |
| `ocr_words_total` | Counter | — | words | Recognized words summed across every block |
| `ocr_illegible_words_total` | Counter | — | words | Words below the confidence threshold (rendered as `[unleserlich]`) |
| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing |
| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs |
| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run |
| `ocr_models_ready` | Gauge | — | 0\|1 | 1 once the lifespan startup has finished loading models |

Canonical example queries (the same ones referenced in issue #652):

# OCR throughput by engine
sum by (engine) (rate(ocr_pages_total[5m]))

# Share of words rendered as [unleserlich]
sum(rate(ocr_illegible_words_total[5m]))
  / sum(rate(ocr_words_total[5m]))

# p95 page processing time per engine
histogram_quantile(0.95, sum by (engine, le) (
  rate(ocr_processing_seconds_bucket[5m])
))

# Training error rate
sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
  / sum(rate(ocr_training_runs_total[1h]))

# Latest recognition vs segmentation accuracy
ocr_model_accuracy
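For a quick check without Grafana, the raw counters can also be read straight from a scraped `/metrics` body. The sketch below sums samples per metric name from Prometheus text exposition output and derives the illegible-word share. It is a deliberately small parser under stated assumptions (one sample per line, no escaped braces in label values); the sample text and its numbers are made up. It yields the share since process start, not the 5-minute rate the PromQL query above returns.

```python
from collections import defaultdict
from typing import Dict, Optional


def sum_samples(exposition_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = defaultdict(float)
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Sample format: name{labels} value [timestamp]
        if "{" in line:
            name, rest = line.split("{", 1)
            value_part = rest.rsplit("}", 1)[1]
        else:
            name, value_part = line.split(None, 1)
        totals[name] += float(value_part.split()[0])
    return dict(totals)


def illegible_share(exposition_text: str) -> Optional[float]:
    """Share of words counted as illegible, or None if no words yet."""
    totals = sum_samples(exposition_text)
    words = totals.get("ocr_words_total", 0.0)
    if words == 0:
        return None
    return totals.get("ocr_illegible_words_total", 0.0) / words


# Hypothetical scrape output; values are illustrative.
SAMPLE_METRICS = """\
# TYPE ocr_words_total counter
ocr_words_total 2000.0
# TYPE ocr_illegible_words_total counter
ocr_illegible_words_total 50.0
# TYPE ocr_pages_total counter
ocr_pages_total{engine="surya"} 12.0
ocr_pages_total{engine="kraken"} 8.0
"""

if __name__ == "__main__":
    print(illegible_share(SAMPLE_METRICS))
```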

Internal-only endpoints

/metrics is exposed by the OCR service over plain HTTP without authentication. The container is reachable only on the internal Docker network — Caddy never proxies to it directly. If the service is ever exposed (e.g. a ports: mapping is added), block the endpoint at the reverse proxy:

ocr.example.com {
    @internal_only path /metrics /health
    respond @internal_only 404
    reverse_proxy ocr:8000
}

The MetricsPathFilter in ocr-service/main.py suppresses uvicorn's stdout access log lines for /metrics and /health so the container console stays focused on real OCR traffic. It only filters what uvicorn prints: access lines for those paths written by any other source still reach Promtail/Loki. Treat the filter as console noise control, not as an audit-suppression mechanism.
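The actual filter lives in ocr-service/main.py and is not reproduced here. As a minimal sketch of the general technique, a `logging.Filter` attached to the `uvicorn.access` logger can drop records by request path. The class name `QuietPathFilter`, the helper `make_access_record`, and the assumed layout of the record arguments (client address, method, path, HTTP version, status) are illustrative and should be checked against the uvicorn version in use.

```python
import logging
from typing import Iterable


class QuietPathFilter(logging.Filter):
    """Drop access-log records whose request path is in quiet_paths."""

    def __init__(self, quiet_paths: Iterable[str] = ("/metrics", "/health")) -> None:
        super().__init__()
        self.quiet_paths = frozenset(quiet_paths)

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Assumption: args are (client_addr, method, path_with_query,
        # http_version, status_code), as uvicorn's access logger passes them.
        if isinstance(args, tuple) and len(args) >= 3:
            path = str(args[2]).split("?", 1)[0]
            return path not in self.quiet_paths
        return True  # Unknown record shape: let it through.


def make_access_record(path: str, status: int = 200) -> logging.LogRecord:
    """Build a record shaped like an access-log entry (demo helper)."""
    return logging.LogRecord(
        name="uvicorn.access",
        level=logging.INFO,
        pathname="demo",
        lineno=0,
        msg='%s - "%s %s HTTP/%s" %d',
        args=("127.0.0.1:50000", "GET", path, "1.1", status),
        exc_info=None,
    )


# Attach to the access logger only; application loggers are untouched.
logging.getLogger("uvicorn.access").addFilter(QuietPathFilter())
```

Because the filter sits on a single logger, it changes nothing about what other components (for example a reverse proxy) record for the same requests.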

Errors (GlitchTip)

GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.

GlitchTip complements Loki: use GlitchTip when you need grouped, de-duplicated errors with stack traces and release attribution; use Loki when you need raw log lines with full context or want to search across all log levels.

Direct API access (debugging only)

Loki and Tempo bind no host ports. To reach them directly from your laptop, use an SSH tunnel through the server:

# Loki API on localhost:3100 (then query via curl or logcli)
ssh -L 3100:172.20.0.x:3100 root@raddatz.cloud
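# Replace 172.20.0.x with the obs-loki container IP. The network name contains
# hyphens, so the template has to use index instead of dot access:
#   docker inspect obs-loki --format '{{(index .NetworkSettings.Networks "archiv-obs-net").IPAddress}}'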

# Tempo API on localhost:3200 (then query via curl or tempo-cli)
ssh -L 3200:172.20.0.x:3200 root@raddatz.cloud

In practice, Grafana Explore covers all common debugging workflows without needing direct API access.
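If direct access is needed anyway, the sketch below builds a request URL for Loki's `query_range` HTTP endpoint against the tunnelled port. The function name and defaults are hypothetical, and `http://localhost:3100` assumes the SSH tunnel from above is open; the resulting URL can be passed to `curl`.

```python
import time
from typing import Optional
from urllib.parse import urlencode

NANOS_PER_MINUTE = 60 * 1_000_000_000


def loki_query_range_url(
    logql: str,
    minutes: int = 15,
    limit: int = 100,
    base_url: str = "http://localhost:3100",
    now_ns: Optional[int] = None,
) -> str:
    """Build a query_range URL covering the last `minutes` minutes."""
    end_ns = now_ns if now_ns is not None else time.time_ns()
    start_ns = end_ns - minutes * NANOS_PER_MINUTE
    params = {
        "query": logql,
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    print(loki_query_range_url('{job="backend"} |~ "ERROR|WARN"'))
```

The `now_ns` parameter exists only to make the function deterministic in tests; in normal use it can be left out.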

Signal summary

| Signal | Source | Transport | Storage | UI |
|---|---|---|---|---|
| Application logs | Spring Boot stdout → Docker log driver | Promtail → Loki push API | Loki | Grafana Explore → Loki |
| Distributed traces | Spring Boot OTel agent | OTLP HTTP → Tempo:4318 | Tempo | Grafana Explore → Tempo |
| JVM + HTTP metrics | Spring Actuator `/actuator/prometheus` | Prometheus pull (15 s) | Prometheus | Grafana dashboards |
| OCR metrics | OCR service `/metrics` (prometheus-fastapi-instrumentator) | Prometheus pull (15 s) | Prometheus | Grafana Explore → Prometheus |
| Host metrics | node-exporter | Prometheus pull | Prometheus | Grafana → Node Exporter Full |
| Container metrics | cAdvisor | Prometheus pull | Prometheus | Grafana (via Prometheus datasource) |
| Application errors | Sentry SDK | HTTP POST → GlitchTip ingest | GlitchTip DB | GlitchTip UI |
