# Observability Guide

Ops reference (starting the stack, env vars, CI wiring) → DEPLOYMENT.md §4. This file is for developers: what signal lives where, how to reach it, and what to look for.
## Where to look for what

| I want to… | Go to |
|---|---|
| See the last N log lines from the backend | `docker compose logs --tail=100 backend` |
| Search logs by keyword across time | Grafana → Explore → Loki |
| Understand why an HTTP request failed | Grafana → Explore → Loki → filter by traceId → follow link to Tempo |
| See a full distributed trace (DB queries, HTTP calls) | Grafana → Explore → Tempo → search by service or trace ID |
| Check JVM heap / GC / thread count | Grafana → Dashboards → Spring Boot Observability |
| Check HTTP error rate or p95 latency | Grafana → Dashboards → Spring Boot Observability |
| Check host CPU / memory / disk | Grafana → Dashboards → Node Exporter Full |
| See grouped application errors with stack traces | GlitchTip |
| Check if the backend is healthy | `curl http://localhost:8081/actuator/health` (on the server) |
| Check what Prometheus is scraping | `curl http://localhost:9090/api/v1/targets` (on the server) |
## Access

| Tool | External URL | What it's for |
|---|---|---|
| Grafana | https://grafana.archiv.raddatz.cloud | Logs, metrics, traces — the primary observability UI |
| GlitchTip | https://glitchtip.archiv.raddatz.cloud | Grouped errors with stack traces and release tracking |
Loki, Tempo, and Prometheus have no external URL. They are internal services, accessible only through Grafana (or via SSH tunnel — see below).
## Logs (Loki)
Logs reach Loki via Promtail, which reads all Docker container logs from the Docker socket and ships them with labels derived from Docker Compose metadata.
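To sanity-check what Promtail is shipping, you can ask Loki for its label index directly. A minimal sketch, assuming the SSH tunnel from the Direct API access section below is already forwarding localhost:3100 to Loki:

```bash
# List every label name Loki has indexed
curl -s http://localhost:3100/loki/api/v1/labels

# List the values of the job label (expect backend, frontend, db, ...)
curl -s http://localhost:3100/loki/api/v1/label/job/values
```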
### Labels available in every log line

| Label | What it contains | Example |
|---|---|---|
| `job` | Compose service name | `backend`, `frontend`, `db` |
| `compose_service` | Same as `job` | `backend` |
| `compose_project` | Compose project name | `archiv-staging`, `archiv-production` |
| `container_name` | Docker container name | `archiv-staging-backend-1` |
| `filename` | Docker log source | `/var/lib/docker/containers/…` |

Use `job` in LogQL queries — it is stable across dev, staging, and production. `container_name` changes between environments.
### Common LogQL queries in Grafana Explore

```logql
# All backend logs
{job="backend"}

# Backend ERROR and WARN lines only
{job="backend"} |~ "ERROR|WARN"

# All logs for a specific request (paste a traceId from a log line)
{job="backend"} |= "3fa85f6457154562b3fc2c963f66afa6"

# Log lines containing a specific exception class
{job="backend"} |~ "DomainException|NullPointerException"

# Frontend logs
{job="frontend"}

# Database (slow query log, if enabled)
{job="db"}
```
### Log → Trace correlation

Spring Boot writes the active traceId into every log line when a request is being processed:

```
2026-05-16 ... INFO [Familienarchiv,3fa85f64...,1b2c3d4e] o.r.f.document.DocumentService : ...
```

In Grafana Explore → Loki, log lines with a traceId field show a Tempo link. Clicking it opens the full trace in Explore → Tempo without copying and pasting IDs.

This linking is configured in the Loki datasource provisioning via the traceId derived field regex. No manual setup required.
## Traces (Tempo)
The backend sends traces to Tempo via OTLP HTTP (port 4318). Every inbound HTTP request and every JPA query produces a span. Spans are linked into traces by the propagated traceId header.
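If traces stop showing up, a quick connectivity check from inside the backend container can rule out networking. A sketch, assuming the staging container name from the labels table above, that curl is available in the backend image, and that the backend can resolve obs-tempo on the shared network (an OTLP HTTP listener typically answers a bare GET with 405, which still proves it is up):

```bash
docker exec archiv-staging-backend-1 \
  curl -s -o /dev/null -w '%{http_code}\n' http://obs-tempo:4318/v1/traces
# 405 → Tempo is listening; connection refused or timeout → networking problem
```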
### Finding a trace in Grafana

Option A — from a log line:

- Grafana → Explore → select the Loki datasource
- Query `{job="backend"}` and find the failing request
- Click the Tempo link in the log line (appears when `traceId` is present)

Option B — by service:

- Grafana → Explore → select the Tempo datasource
- Query type: Search
- Service name: `familienarchiv-backend`
- Filter by HTTP status, duration, or operation name as needed

Option C — by trace ID:

- Grafana → Explore → select the Tempo datasource
- Query type: TraceQL or Trace ID
- Paste the trace ID (for TraceQL, see the sketch after this list)
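For the TraceQL route, one hedged starting point: the service name matches Option B, and `status = error` is the standard TraceQL intrinsic for failed spans, so this should list recent failing backend traces:

```
{resource.service.name = "familienarchiv-backend" && status = error}
```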
### What each span type tells you

| Root span name pattern | What it covers |
|---|---|
| `GET /api/documents`, `POST /api/documents` | Full HTTP request lifecycle |
| `SELECT archiv.*` | A single JPA/JDBC query inside that request |
| `HikariPool.getConnection` | Connection pool wait time |

A slow `SELECT` span inside an otherwise fast HTTP span pinpoints a missing index. A slow `HikariPool.getConnection` span indicates connection pool exhaustion.
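Both patterns can be hunted directly with TraceQL rather than eyeballed. Two sketches, filtering on the intrinsic span `name` and `duration` (the thresholds are arbitrary illustrations, not project conventions):

```
{name =~ "SELECT.*" && duration > 200ms}

{name = "HikariPool.getConnection" && duration > 100ms}
```

The first surfaces slow JDBC spans regardless of which request triggered them; the second surfaces requests that spent real time waiting for a pooled connection.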
### Sampling rate

- Dev: 100% of requests are traced (`management.tracing.sampling.probability: 1.0` in `application.yaml`)
- Staging / Production: 10% (`MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1` in `docker-compose.prod.yml`)

To find a trace for a specific request in staging/production, either increase the sampling rate temporarily or trigger the request multiple times.
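One way to raise it temporarily, sketched under the assumption that `docker-compose.prod.yml` substitutes the value from the shell environment (e.g. `${MANAGEMENT_TRACING_SAMPLING_PROBABILITY:-0.1}`) rather than hardcoding it; if it is hardcoded, edit the file instead:

```bash
# Trace everything while reproducing the issue
MANAGEMENT_TRACING_SAMPLING_PROBABILITY=1.0 \
  docker compose -f docker-compose.prod.yml up -d backend

# Revert to 10% once the trace is captured
MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1 \
  docker compose -f docker-compose.prod.yml up -d backend
```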
## Metrics (Prometheus → Grafana)

Prometheus scrapes the backend management endpoint every 15 s:

- Target: `backend:8081/actuator/prometheus`
- Labels: `job="spring-boot"`, `application="Familienarchiv"`

All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.
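To verify both halves of that pipeline from the server, a quick sketch (ports 9090 and 8081 as in the table at the top; jq assumed installed):

```bash
# Confirm Prometheus sees the target and reports it healthy
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Spot-check that exported metrics carry the application tag
curl -s http://localhost:8081/actuator/prometheus \
  | grep -m1 'application="Familienarchiv"'
```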
### Useful Prometheus queries (run on the server or via Grafana Explore → Prometheus)

```promql
# HTTP error rate (5xx) as a fraction of all requests
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# p95 response time
histogram_quantile(0.95, sum by (le) (
  rate(http_server_requests_seconds_bucket[5m])
))

# JVM heap used
jvm_memory_used_bytes{area="heap", application="Familienarchiv"}

# Active DB connections
hikaricp_connections_active
```
## Errors (GlitchTip)
GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.
GlitchTip complements Loki: use GlitchTip when you need grouped, de-duplicated errors with stack traces and release attribution; use Loki when you need raw log lines with full context or want to search across all log levels.
## Direct API access (debugging only)

Loki and Tempo bind no host ports. To reach them directly from your laptop, use an SSH tunnel through the server:

```bash
# Loki API on localhost:3100 (then query via curl or logcli)
ssh -L 3100:172.20.0.x:3100 root@raddatz.cloud
# Replace 172.20.0.x with the obs-loki container IP:
#   docker inspect obs-loki --format '{{.NetworkSettings.Networks.archiv-obs-net.IPAddress}}'

# Tempo API on localhost:3200 (then query via curl or tempo-cli)
ssh -L 3200:172.20.0.x:3200 root@raddatz.cloud
```
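Once a tunnel is up, the stock Loki HTTP API is reachable locally. For example, a raw LogQL query via curl (the endpoint is standard Loki API; the query itself is just an illustration):

```bash
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="backend"} |= "ERROR"' \
  --data-urlencode 'limit=20'
```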
In practice, Grafana Explore covers all common debugging workflows without needing direct API access.
## Signal summary

| Signal | Source | Transport | Storage | UI |
|---|---|---|---|---|
| Application logs | Spring Boot stdout → Docker log driver | Promtail → Loki push API | Loki | Grafana Explore → Loki |
| Distributed traces | Spring Boot OTel agent | OTLP HTTP → Tempo:4318 | Tempo | Grafana Explore → Tempo |
| JVM + HTTP metrics | Spring Actuator `/actuator/prometheus` | Prometheus pull (15 s) | Prometheus | Grafana dashboards |
| Host metrics | node-exporter | Prometheus pull | Prometheus | Grafana → Node Exporter Full |
| Container metrics | cAdvisor | Prometheus pull | Prometheus | Grafana (via Prometheus datasource) |
| Application errors | Sentry SDK | HTTP POST → GlitchTip ingest | GlitchTip DB | GlitchTip UI |