devops: production observability stack — Prometheus, Loki, Grafana, Alertmanager #498
Context
Issue #497 ships the first production deployment without observability. This is acceptable for the initial go-live, but operations are currently blind: no metrics, no log aggregation, no alerting. This issue adds the full observability stack to `docker-compose.prod.yml` and configures it for the Familienarchiv service topology.

The service definitions for Prometheus, Grafana, Loki, Promtail, and Alertmanager already exist in `docs/infrastructure/production-compose.md`; this issue implements them.

What needs to happen
1. Metrics — Prometheus + Spring Boot Actuator
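As a starting point for the two config changes this section calls for, here is a minimal sketch. The job name `backend`, the service hostname `backend`, the port `8081`, and the 15-second scrape interval are assumptions, not values taken from the repository; the port in particular still has to be verified as noted in this section.

```yaml
# observability/prometheus.yml (sketch)
global:
  scrape_interval: 15s                  # assumption: not specified in this issue

scrape_configs:
  - job_name: backend                   # hypothetical job name
    metrics_path: /actuator/prometheus  # Spring Boot Actuator metrics endpoint
    static_configs:
      - targets: ["backend:8081"]       # assumption: compose service `backend`, management port 8081
```

```yaml
# application.yaml (fragment): add prometheus to the actuator exposure allowlist
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
```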
The backend already exposes a Prometheus-compatible metrics endpoint at `/actuator/prometheus` (via `micrometer-registry-prometheus`; verify the dependency is present in `pom.xml` and add it if not). Prometheus scrapes this on the internal management port (8081 or 8080; verify which is exposed inside the Docker network).

The Prometheus config lives in `observability/prometheus.yml`.

The `/actuator/prometheus` endpoint must be added to the management exposure allowlist in `application.yaml` (currently only `health` is implied).

This endpoint is internal-only: Caddy already blocks `/actuator/*` from external access (per #497).

2. Log aggregation — Loki + Promtail
Promtail ships Docker container logs to Loki. Config reads from the Docker socket and labels logs by container name and compose project.
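A Docker-socket scrape config of this kind might look like the sketch below. It uses Promtail's `docker_sd_configs` discovery and filters on the Compose project label. The Loki URL (service name `loki`, port 3100), the positions file path, and the `container` label name are assumptions; `compose_project` matches the label used in the acceptance criteria.

```yaml
# observability/promtail-config.yml (sketch)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml             # assumption: ephemeral positions file

clients:
  - url: http://loki:3100/loki/api/v1/push  # assumption: Loki reachable as `loki` on the internal network

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          # only discover containers that belong to the production compose project
          - name: label
            values: ["com.docker.compose.project=archiv-production"]
    relabel_configs:
      # container names arrive with a leading slash; strip it
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
      - source_labels: ["__meta_docker_container_label_com_docker_compose_project"]
        target_label: compose_project
```

Filtering at discovery time keeps logs from unrelated containers on the same host out of Loki entirely, rather than dropping them later in the pipeline.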
Promtail config (`observability/promtail-config.yml`): standard Docker scrape config, filtered to the `archiv-production` project only.

Loki config (`observability/loki-config.yml`): local filesystem storage, 30-day retention.

3. Dashboards — Grafana
Grafana is provisioned via config files (no manual dashboard setup):
- `observability/grafana/provisioning/datasources/prometheus.yml`
- `observability/grafana/provisioning/datasources/loki.yml`

Grafana is accessible only on the internal network; it is not reverse-proxied through Caddy in the initial implementation.
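The Prometheus datasource file could be as small as the following sketch. The datasource name, the `isDefault` choice, and the URL (service name `prometheus`, default port 9090) are assumptions.

```yaml
# observability/grafana/provisioning/datasources/prometheus.yml (sketch)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries the datasource, not the browser
    url: http://prometheus:9090   # assumption: compose service `prometheus`
    isDefault: true
```

`loki.yml` would follow the same shape with `type: loki` and, assuming a service named `loki`, `url: http://loki:3100`.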
4. Alerting — Alertmanager
One alert rule to start: backend health endpoint returns non-UP for > 2 minutes → notify via email (uses the same SMTP config as the application).
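One way to express this rule is a Prometheus alerting rule on the built-in `up` metric, sketched below. Note that `up` reports whether the scrape succeeded, so it is a proxy for the health status rather than a direct check of the health endpoint. The rules file name, the alert name, the `severity` label, and the job name `backend` are hypothetical, and the file would need to be referenced from `rule_files` in the Prometheus config.

```yaml
# observability/alert-rules.yml (hypothetical file name)
groups:
  - name: backend
    rules:
      - alert: BackendDown
        expr: up{job="backend"} == 0   # assumption: scrape job is named `backend`
        for: 2m                        # fire only after 2 minutes of continuous failure
        labels:
          severity: critical
        annotations:
          summary: "Backend has not been scrapeable for more than 2 minutes"
```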
Alert config (`observability/alertmanager.yml`): email receiver using `${MAIL_HOST}`, `${MAIL_USERNAME}`, `${MAIL_PASSWORD}` from the prod env.

5. `docker-compose.prod.yml` additions

Add the following services (all with `expose:` only, no `ports:`, so they are reachable on the internal network only):

- `prom/prometheus:v2.51.0` (volume: `prometheus-data`)
- `grafana/grafana:10.4.0` (volume: `grafana-data`)
- `grafana/loki:2.9.0` (volume: `loki-data`)
- `grafana/promtail:2.9.0`
- `prom/alertmanager:v0.27.0`

All images are pinned. Renovate handles future version bumps.
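A possible shape for these additions is sketched below. The service names, mount paths inside the containers, and exposed ports are assumptions based on each image's defaults; the authoritative definitions are the ones in `docs/infrastructure/production-compose.md`.

```yaml
# docker-compose.prod.yml (excerpt, sketch)
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    expose:
      - "9090"

  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana-data:/var/lib/grafana
    expose:
      - "3000"

  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./observability/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki-data:/loki
    expose:
      - "3100"

  promtail:
    image: grafana/promtail:2.9.0
    command: -config.file=/etc/promtail/promtail-config.yml
    volumes:
      - ./observability/promtail-config.yml:/etc/promtail/promtail-config.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # read-only access for container discovery

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./observability/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    expose:
      - "9093"

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
```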
Acceptance criteria
- `pom.xml` includes `micrometer-registry-prometheus` and `/actuator/prometheus` returns metrics when backend is running
- […] `http://localhost:9090` via `docker compose exec`)
- […] `{compose_project="archiv-production"}` returns results)
- `docker compose up` (dev) is unaffected: observability services are prod-only

Effort
M — 1 day. Config file writing and wiring, no code changes except adding the Micrometer dependency.