# Observability Guide

> **Ops reference (starting the stack, env vars, CI wiring) → [DEPLOYMENT.md §4](./DEPLOYMENT.md#4-logs--observability).**
> This file is for developers: what signal lives where, how to reach it, and what to look for.

## Where to look for what

| I want to… | Go to |
|---|---|
| See the last N log lines from the backend | `docker compose logs --tail=100 backend` |
| Search logs by keyword across time | Grafana → Explore → Loki |
| Understand why an HTTP request failed | Grafana → Explore → Loki → filter by `traceId` → follow link to Tempo |
| See a full distributed trace (DB queries, HTTP calls) | Grafana → Explore → Tempo → search by service or trace ID |
| Check JVM heap / GC / thread count | Grafana → Dashboards → Spring Boot Observability |
| Check HTTP error rate or p95 latency | Grafana → Dashboards → Spring Boot Observability |
| Check host CPU / memory / disk | Grafana → Dashboards → Node Exporter Full |
| See grouped application errors with stack traces | GlitchTip |
| Check if the backend is healthy | `curl http://localhost:8081/actuator/health` (on the server) |
| Check what Prometheus is scraping | `curl http://localhost:9090/api/v1/targets` (on the server) |

## Access

| Tool | External URL | Who it's for |
|---|---|---|
| Grafana | `https://grafana.archiv.raddatz.cloud` | Logs, metrics, traces — the primary observability UI |
| GlitchTip | `https://glitchtip.archiv.raddatz.cloud` | Grouped errors with stack traces and release tracking |

Loki, Tempo, and Prometheus have no external URL. They are internal services, accessible only through Grafana (or via SSH tunnel — see below).

## Logs (Loki)

Logs reach Loki via Promtail, which reads all Docker container logs from the Docker socket and ships them with labels derived from Docker Compose metadata.

### Labels available in every log line

| Label | What it contains | Example |
|---|---|---|
| `job` | Compose service name | `backend`, `frontend`, `db` |
| `compose_service` | Same as `job` | `backend` |
| `compose_project` | Compose project name | `archiv-staging`, `archiv-production` |
| `container_name` | Docker container name | `archiv-staging-backend-1` |
| `filename` | Docker log source | `/var/lib/docker/containers/…` |

**Use `job` in LogQL queries** — it is stable across dev, staging, and production. `container_name` changes between environments.

### Common LogQL queries in Grafana Explore

```logql
# All backend logs
{job="backend"}

# Backend ERROR and WARN lines only
{job="backend"} |~ "ERROR|WARN"

# All logs for a specific request (paste a traceId from a log line)
{job="backend"} |= "3fa85f64-5717-4562-b3fc-2c963f66afa6"

# Log lines containing a specific exception class
{job="backend"} |~ "DomainException|NullPointerException"

# Frontend logs
{job="frontend"}

# Database (slow query log, if enabled)
{job="db"}
```

### Log → Trace correlation

Spring Boot writes the active `traceId` into every log line when a request is being processed:

```
2026-05-16 ... INFO [Familienarchiv,3fa85f64...,1b2c3d4e] o.r.f.document.DocumentService : ...
```

In Grafana Explore → Loki, log lines with a `traceId` field show a **Tempo** link. Clicking it opens the full trace in Explore → Tempo without copying and pasting IDs.

This linking is configured in the Loki datasource provisioning via the `traceId` derived field regex. No manual setup required.
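For orientation, such a derived field looks roughly like the snippet below in a Loki datasource provisioning file. This is a minimal sketch, not this repo's actual provisioning: the datasource URL, the Tempo datasource UID, and the regex are illustrative and must match the real log pattern and Tempo datasource.

```yaml
# Sketch of a Loki datasource with a traceId derived field (illustrative values)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100          # assumed internal address; check the real provisioning
    jsonData:
      derivedFields:
        - name: traceId
          # Captures the trace id from the Spring log prefix [App,traceId,spanId];
          # the actual regex used here may differ.
          matcherRegex: '\[Familienarchiv,([0-9a-f-]+),'
          url: '$${__value.raw}'   # hands the captured id to the linked trace datasource
          datasourceUid: tempo     # must match the provisioned Tempo datasource UID
```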
## Traces (Tempo)

The backend sends traces to Tempo via OTLP HTTP (port 4318). Every inbound HTTP request and every JPA query produces a span. Spans are linked into a trace by the propagated trace context (`traceId`).

### Finding a trace in Grafana

**Option A — from a log line:**

1. Grafana → Explore → select *Loki* datasource
2. Query `{job="backend"}` and find the failing request
3. Click the **Tempo** link in the log line (appears when `traceId` is present)

**Option B — by service:**

1. Grafana → Explore → select *Tempo* datasource
2. Query type: **Search**
3. Service name: `familienarchiv-backend`
4. Filter by HTTP status, duration, or operation name as needed

**Option C — by trace ID:**

1. Grafana → Explore → select *Tempo* datasource
2. Query type: **TraceQL** or **Trace ID**
3. Paste the trace ID

### What each span type tells you

| Root span name pattern | What it covers |
|---|---|
| `GET /api/documents`, `POST /api/documents` | Full HTTP request lifecycle |
| `SELECT archiv.*` | A single JPA/JDBC query inside that request |
| `HikariPool.getConnection` | Connection pool wait time |

A slow `SELECT` span inside an otherwise fast HTTP span usually points to a missing index or an expensive query. A slow `HikariPool.getConnection` span points to connection pool exhaustion.

### Sampling rate

- **Dev**: 100% of requests are traced (`management.tracing.sampling.probability: 1.0` in `application.yaml`)
- **Staging / Production**: 10% (`MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1` in `docker-compose.prod.yml`)

To find a trace for a specific request in staging/production, either increase the sampling rate temporarily or trigger the request multiple times.

## Metrics (Prometheus → Grafana)

Prometheus scrapes the backend management endpoint every 15 s:

```
Target: backend:8081/actuator/prometheus
Labels: job="spring-boot", application="Familienarchiv"
```

All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.

### Useful Prometheus queries (run on the server or via Grafana Explore → Prometheus)

```promql
# HTTP error rate (5xx) as a fraction of all requests
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  /
sum(rate(http_server_requests_seconds_count[5m]))

# p95 response time
histogram_quantile(0.95,
  sum by (le) (
    rate(http_server_requests_seconds_bucket[5m])
  )
)

# JVM heap used
jvm_memory_used_bytes{area="heap", application="Familienarchiv"}

# Active DB connections
hikaricp_connections_active
```

## Errors (GlitchTip)

GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.

GlitchTip complements Loki: use GlitchTip when you need **grouped, de-duplicated errors with stack traces and release attribution**; use Loki when you need **raw log lines with full context** or want to search across all log levels.

## Direct API access (debugging only)

Loki and Tempo bind no host ports. To reach them directly from your laptop, use an SSH tunnel through the server:

```bash
# Loki API on localhost:3100 (then query via curl or logcli)
ssh -L 3100:172.20.0.x:3100 root@raddatz.cloud
# Replace 172.20.0.x with the obs-loki container IP:
# docker inspect obs-loki --format '{{ (index .NetworkSettings.Networks "archiv-obs-net").IPAddress }}'

# Tempo API on localhost:3200 (then query via curl or tempo-cli)
ssh -L 3200:172.20.0.x:3200 root@raddatz.cloud
```

In practice, Grafana Explore covers all common debugging workflows without needing direct API access.
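If you do need to script against an API, here is a minimal sketch of a Loki query through the tunnel above. The endpoint and parameters are the standard Loki HTTP API (`/loki/api/v1/query_range` defaults to roughly the last hour); the LogQL is taken from the examples earlier in this guide.

```bash
# Fetch recent backend ERROR lines via the query_range API.
# Assumes the Loki SSH tunnel from the previous block is open on localhost:3100.
curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="backend"} |= "ERROR"' \
  --data-urlencode 'limit=50'
```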
## Signal summary

| Signal | Source | Transport | Storage | UI |
|---|---|---|---|---|
| Application logs | Spring Boot stdout → Docker log driver | Promtail → Loki push API | Loki | Grafana Explore → Loki |
| Distributed traces | Spring Boot OTel instrumentation | OTLP HTTP → Tempo:4318 | Tempo | Grafana Explore → Tempo |
| JVM + HTTP metrics | Spring Actuator `/actuator/prometheus` | Prometheus pull (15 s) | Prometheus | Grafana dashboards |
| Host metrics | node-exporter | Prometheus pull | Prometheus | Grafana → Node Exporter Full |
| Container metrics | cAdvisor | Prometheus pull | Prometheus | Grafana (via Prometheus datasource) |
| Application errors | Sentry SDK | HTTP POST → GlitchTip ingest | GlitchTip DB | GlitchTip UI |
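The container metrics row has no dedicated dashboard listed; those series are queried through the Prometheus datasource. A minimal sketch of two typical queries, assuming standard cAdvisor metric and label names (verify them against what Prometheus actually scrapes in this setup):

```promql
# CPU usage per container for this project's containers
# ("name" is the cAdvisor label holding the Docker container name)
sum by (name) (rate(container_cpu_usage_seconds_total{name=~"archiv-.*"}[5m]))

# Memory working set per container
container_memory_working_set_bytes{name=~"archiv-.*"}
```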