From f2d9bfda6f94f86bdfa1d455d693d6d4b33512c5 Mon Sep 17 00:00:00 2001 From: Marcel Date: Sat, 16 May 2026 15:27:50 +0200 Subject: [PATCH] docs(obs): add OBSERVABILITY.md developer guide and fix stale env var docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - New docs/OBSERVABILITY.md: developer-facing guide with a "where to look for what" table, common LogQL queries, trace exploration workflow, log→trace correlation via traceId links, and a signal summary table - Link from DEPLOYMENT.md §4 (ops section now points to dev guide) and from CLAUDE.md Infrastructure section - Fix stale DEPLOYMENT.md env var table: OTEL_EXPORTER_OTLP_ENDPOINT now documents port 4318 (HTTP) not 4317 (gRPC); add the three new env vars wired in this PR (OTEL_LOGS_EXPORTER, OTEL_METRICS_EXPORTER, MANAGEMENT_METRICS_TAGS_APPLICATION) with their rationale - Fix stale obs-tempo service description (port 4318, not 4317) Co-Authored-By: Claude Sonnet 4.6 --- CLAUDE.md | 4 + docs/DEPLOYMENT.md | 10 ++- docs/OBSERVABILITY.md | 180 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 192 insertions(+), 2 deletions(-) create mode 100644 docs/OBSERVABILITY.md diff --git a/CLAUDE.md b/CLAUDE.md index 5a83c6e8..e968fc82 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -274,6 +274,10 @@ Back button pattern — use the shared `` component from `$lib/share → See [docs/DEPLOYMENT.md](./docs/DEPLOYMENT.md) +## Observability + +→ See [docs/OBSERVABILITY.md](./docs/OBSERVABILITY.md) — where to look for logs, traces, metrics, and errors. + ## API Testing HTTP test files are in `backend/api_tests/` for use with the VS Code REST Client extension. diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md index 6a4871b4..3fda1e71 100644 --- a/docs/DEPLOYMENT.md +++ b/docs/DEPLOYMENT.md @@ -107,7 +107,10 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back | `MAIL_SMTP_AUTH` | SMTP auth enabled | `false` (dev) | YES (prod) | — | | `MAIL_STARTTLS_ENABLE` | STARTTLS enabled | `false` (dev) | YES (prod) | — | | `SPRING_PROFILES_ACTIVE` | Spring profile | `dev,e2e` (compose) | YES | — | -| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP gRPC endpoint for distributed traces (Tempo). Set to `http://tempo:4317` via compose. | `http://localhost:4317` | — | — | +| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP HTTP endpoint for distributed traces (Tempo). Port 4318 = HTTP transport; port 4317 is gRPC-only and causes "Connection reset" with Spring Boot's HttpExporter. | `http://localhost:4318` | — | — | +| `OTEL_LOGS_EXPORTER` | Disable OTLP log export — Promtail captures Docker logs via the logging driver; Tempo does not accept logs. | `none` | — | — | +| `OTEL_METRICS_EXPORTER` | Disable OTLP metric export — Prometheus scrapes `/actuator/prometheus` via pull model; Tempo does not accept metrics. | `none` | — | — | +| `MANAGEMENT_METRICS_TAGS_APPLICATION` | Common tag added to every Micrometer metric. Required by Grafana's Spring Boot Observability dashboard (ID 17175) `label_values(application)` template variable. | `Familienarchiv` | — | — | | `MANAGEMENT_TRACING_SAMPLING_PROBABILITY` | Micrometer tracing sample rate; overridden to `0.0` in test profile. | `0.1` (compose) / `1.0` (dev) | — | — | ### PostgreSQL container @@ -277,6 +280,9 @@ Before the first deploy: rotate `PROD_APP_ADMIN_PASSWORD` to a strong value. Aft ## 4. 
Logs + observability +> **Developer guide (where to look for what, LogQL queries, trace exploration) → [docs/OBSERVABILITY.md](./OBSERVABILITY.md).** +> This section covers the ops side: starting the stack, env vars, and CI wiring. + ### First-response commands ```bash @@ -372,7 +378,7 @@ Current services: | `obs-cadvisor` | `gcr.io/cadvisor/cadvisor:v0.52.1` | Per-container resource metrics | | `obs-loki` | `grafana/loki:3.4.2` | Log aggregation — receives log streams from Promtail. Port 3100 is `expose`-only (not host-bound). | | `obs-promtail` | `grafana/promtail:3.4.2` | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with `container_name`, `compose_service`, `compose_project`, and `job` labels. The `job` label is mapped from the Docker Compose service name (`com.docker.compose.service`) so that Grafana Loki dashboard queries (`{job="backend"}`, `{job="frontend"}`) work out of the box and the "App" variable dropdown is populated. | -| `obs-tempo` | `grafana/tempo:2.7.2` | Distributed trace storage — OTLP gRPC receiver on port 4317, OTLP HTTP on port 4318 (both `archiv-net`-internal). Grafana queries traces on port 3200 (`obs-net`-internal). All ports are `expose`-only (not host-bound). | +| `obs-tempo` | `grafana/tempo:2.7.2` | Distributed trace storage — OTLP HTTP receiver on port 4318 (`archiv-net`-internal; backend sends traces here). Grafana queries traces on port 3200 (`obs-net`-internal). All ports are `expose`-only (not host-bound). | | `obs-grafana` | `grafana/grafana-oss:11.6.1` | Unified observability UI — metrics dashboards, log exploration, trace viewer. Bound to `127.0.0.1:${PORT_GRAFANA:-3001}` on the host. | | `obs-glitchtip` | `glitchtip/glitchtip:v4` | Sentry-compatible error tracker. Receives frontend + backend error events, groups by fingerprint, provides issue UI with stack traces. Bound to `127.0.0.1:${PORT_GLITCHTIP:-3002}`. | | `obs-glitchtip-worker` | `glitchtip/glitchtip:v4` | Celery + beat worker — processes async GlitchTip tasks (event ingestion, notifications, cleanup). | diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md new file mode 100644 index 00000000..b895e849 --- /dev/null +++ b/docs/OBSERVABILITY.md @@ -0,0 +1,180 @@ +# Observability Guide + +> **Ops reference (starting the stack, env vars, CI wiring) → [DEPLOYMENT.md §4](./DEPLOYMENT.md#4-logs--observability).** +> This file is for developers: what signal lives where, how to reach it, and what to look for. 

## Where to look for what

| I want to… | Go to |
|---|---|
| See the last N log lines from the backend | `docker compose logs --tail=100 backend` |
| Search logs by keyword across time | Grafana → Explore → Loki |
| Understand why an HTTP request failed | Grafana → Explore → Loki → filter by `traceId` → follow link to Tempo |
| See a full distributed trace (DB queries, HTTP calls) | Grafana → Explore → Tempo → search by service or trace ID |
| Check JVM heap / GC / thread count | Grafana → Dashboards → Spring Boot Observability |
| Check HTTP error rate or p95 latency | Grafana → Dashboards → Spring Boot Observability |
| Check host CPU / memory / disk | Grafana → Dashboards → Node Exporter Full |
| See grouped application errors with stack traces | GlitchTip |
| Check if the backend is healthy | `curl http://localhost:8081/actuator/health` (on the server) |
| Check what Prometheus is scraping | `curl http://localhost:9090/api/v1/targets` (on the server) |

## Access

| Tool | External URL | Who it's for |
|---|---|---|
| Grafana | `https://grafana.archiv.raddatz.cloud` | Logs, metrics, traces — the primary observability UI |
| GlitchTip | `https://glitchtip.archiv.raddatz.cloud` | Grouped errors with stack traces and release tracking |

Loki, Tempo, and Prometheus have no external URL. They are internal services, accessible only through Grafana (or via SSH tunnel — see below).

## Logs (Loki)

Logs reach Loki via Promtail, which reads all Docker container logs from the Docker socket and ships them with labels derived from Docker Compose metadata.

### Labels available in every log line

| Label | What it contains | Example |
|---|---|---|
| `job` | Compose service name | `backend`, `frontend`, `db` |
| `compose_service` | Same as `job` | `backend` |
| `compose_project` | Compose project name | `archiv-staging`, `archiv-production` |
| `container_name` | Docker container name | `archiv-staging-backend-1` |
| `filename` | Docker log source | `/var/lib/docker/containers/…` |

**Use `job` in LogQL queries** — it is stable across dev, staging, and production. `container_name` changes between environments.

### Common LogQL queries in Grafana Explore

```logql
# All backend logs
{job="backend"}

# Backend ERROR and WARN lines only (regex line filter; LogQL cannot `or` two stream selectors)
{job="backend"} |~ "ERROR|WARN"

# All logs for a specific request (paste a traceId from a log line)
{job="backend"} |= "3fa85f6457174562b3fc2c963f66afa6"

# Log lines containing a specific exception class
{job="backend"} |~ "DomainException|NullPointerException"

# Frontend logs
{job="frontend"}

# Database (slow query log, if enabled)
{job="db"}
```

### Log → Trace correlation

Spring Boot writes the active `traceId` into every log line when a request is being processed:

```
2026-05-16 ... INFO [Familienarchiv,3fa85f64...,1b2c3d4e] o.r.f.document.DocumentService : ...
```

In Grafana Explore → Loki, log lines with a `traceId` field show a **Tempo** link. Clicking it opens the full trace in Explore → Tempo without copying and pasting IDs.

This linking is configured in the Loki datasource provisioning via the `traceId` derived field regex. No manual setup required.

## Traces (Tempo)

The backend sends traces to Tempo via OTLP HTTP (port 4318). Every inbound HTTP request and every JPA query produces a span. Spans are linked into a trace by the trace ID propagated with the request.

### Finding a trace in Grafana

**Option A — from a log line:**
1. 
Grafana → Explore → select *Loki* datasource +2. Query `{job="backend"}` and find the failing request +3. Click the **Tempo** link in the log line (appears when `traceId` is present) + +**Option B — by service:** +1. Grafana → Explore → select *Tempo* datasource +2. Query type: **Search** +3. Service name: `familienarchiv-backend` +4. Filter by HTTP status, duration, or operation name as needed + +**Option C — by trace ID:** +1. Grafana → Explore → select *Tempo* datasource +2. Query type: **TraceQL** or **Trace ID** +3. Paste the trace ID + +### What each span type tells you + +| Root span name pattern | What it covers | +|---|---| +| `GET /api/documents`, `POST /api/documents` | Full HTTP request lifecycle | +| `SELECT archiv.*` | A single JPA/JDBC query inside that request | +| `HikariPool.getConnection` | Connection pool wait time | + +A slow `SELECT` span inside an otherwise fast HTTP span pinpoints a missing index. A slow `HikariPool.getConnection` span indicates connection pool exhaustion. + +### Sampling rate + +- **Dev**: 100% of requests are traced (`management.tracing.sampling.probability: 1.0` in `application.yaml`) +- **Staging / Production**: 10% (`MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1` in `docker-compose.prod.yml`) + +To find a trace for a specific request in staging/production, either increase the sampling rate temporarily or trigger the request multiple times. + +## Metrics (Prometheus → Grafana) + +Prometheus scrapes the backend management endpoint every 15 s: + +``` +Target: backend:8081/actuator/prometheus +Labels: job="spring-boot", application="Familienarchiv" +``` + +All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service. + +### Useful Prometheus queries (run on the server or via Grafana Explore → Prometheus) + +```promql +# HTTP error rate (5xx) as a fraction of all requests +sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) + / sum(rate(http_server_requests_seconds_count[5m])) + +# p95 response time +histogram_quantile(0.95, sum by (le) ( + rate(http_server_requests_seconds_bucket[5m]) +)) + +# JVM heap used +jvm_memory_used_bytes{area="heap", application="Familienarchiv"} + +# Active DB connections +hikaricp_connections_active +``` + +## Errors (GlitchTip) + +GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error. + +GlitchTip complements Loki: use GlitchTip when you need **grouped, de-duplicated errors with stack traces and release attribution**; use Loki when you need **raw log lines with full context** or want to search across all log levels. + +## Direct API access (debugging only) + +Loki and Tempo bind no host ports. To reach them directly from your laptop, use an SSH tunnel through the server: + +```bash +# Loki API on localhost:3100 (then query via curl or logcli) +ssh -L 3100:172.20.0.x:3100 root@raddatz.cloud +# Replace 172.20.0.x with the obs-loki container IP: +# docker inspect obs-loki --format '{{.NetworkSettings.Networks.archiv-obs-net.IPAddress}}' + +# Tempo API on localhost:3200 (then query via curl or tempo-cli) +ssh -L 3200:172.20.0.x:3200 root@raddatz.cloud +``` + +In practice, Grafana Explore covers all common debugging workflows without needing direct API access. 
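
If you do need the raw APIs, both Loki and Tempo expose plain HTTP query endpoints. A minimal sketch, assuming the tunnels above are already open and `jq` is installed locally; `<traceId>` is a placeholder for an ID copied from a log line:

```bash
# Last hour of backend log lines via Loki's query_range API
# (start/end default to one hour ago / now when omitted)
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="backend"}' \
  --data-urlencode 'limit=20' \
  | jq -r '.data.result[].values[][1]'

# Fetch a single trace from Tempo by ID (returns the trace as OTLP-style JSON)
curl -s "http://localhost:3200/api/traces/<traceId>" | jq .
```

`logcli` and `tempo-cli` (mentioned in the tunnel comments above) talk to these same endpoints.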

## Signal summary

| Signal | Source | Transport | Storage | UI |
|---|---|---|---|---|
| Application logs | Spring Boot stdout → Docker log driver | Promtail → Loki push API | Loki | Grafana Explore → Loki |
| Distributed traces | Spring Boot (Micrometer Tracing + OTLP exporter) | OTLP HTTP → Tempo:4318 | Tempo | Grafana Explore → Tempo |
| JVM + HTTP metrics | Spring Actuator `/actuator/prometheus` | Prometheus pull (15 s) | Prometheus | Grafana dashboards |
| Host metrics | node-exporter | Prometheus pull | Prometheus | Grafana → Node Exporter Full |
| Container metrics | cAdvisor | Prometheus pull | Prometheus | Grafana (via Prometheus datasource) |
| Application errors | Sentry SDK | HTTP POST → GlitchTip ingest | GlitchTip DB | GlitchTip UI |
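
A quick way to confirm that the pull-based rows in this table are actually flowing is to ask Prometheus about its scrape targets. A minimal sketch, run on the server (same endpoint as the "Check what Prometheus is scraping" row at the top of this guide; `jq` assumed to be installed):

```bash
# One line per scrape target: job name and health (up / down / unknown)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'

# The same signal as a metric: one `up` series per target, value 1 = healthy
curl -G -s http://localhost:9090/api/v1/query --data-urlencode 'query=up'
```

A target that shows `down` here explains empty panels in the corresponding Grafana dashboard before you start debugging the application itself.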