devops: production observability stack — Prometheus, Loki, Grafana, Alertmanager #498

marcel · 2026-05-10T20:47:48+02:00

Context

Issue #497 ships the first production deployment without observability. This is acceptable for the initial go-live, but operations are currently blind: no metrics, no log aggregation, no alerting. This issue adds the full observability stack to docker-compose.prod.yml and configures it for the Familienarchiv service topology.

The service definitions for Prometheus, Grafana, Loki, Promtail, and Alertmanager already exist in docs/infrastructure/production-compose.md — this issue implements them.

What needs to happen

1. Metrics — Prometheus + Spring Boot Actuator

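Spring Boot Actuator serves a Prometheus-compatible metrics endpoint at /actuator/prometheus once micrometer-registry-prometheus is on the classpath and the endpoint is exposed (verify the dependency is present in pom.xml; add it if not). Prometheus scrapes this endpoint over the internal Docker network. Verify whether Actuator listens on the main server port (8080) or on a separate management port (8081), and set the scrape target to match; the config below assumes 8080.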

Prometheus config (observability/prometheus.yml):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: backend
    static_configs:
      - targets: ['backend:8080']
    metrics_path: /actuator/prometheus

The /actuator/prometheus endpoint must be added to the management exposure allowlist in application.yaml (currently only health is exposed, implied by the Spring Boot default):

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  endpoint:
    prometheus:
      enabled: true

This endpoint is internal-only — Caddy already blocks /actuator/* from external access (per #497).

2. Log aggregation — Loki + Promtail

Promtail ships Docker container logs to Loki. It discovers containers through the Docker socket and labels each log stream with the container name and the compose project.

Promtail config (observability/promtail-config.yml): standard Docker scrape config, filter to archiv-production project only.
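
As a rough sketch of what such a Promtail config could look like, the fragment below uses Docker service discovery with a label filter on the compose project. The listen port, the positions file path, and the label names `container` and `compose_project` are assumptions for illustration (the latter chosen to match the Loki query in the acceptance criteria); check them against the Promtail version in use.

```yaml
# observability/promtail-config.yml (sketch, not the final config)
server:
  http_listen_port: 9080      # assumption: Promtail's own HTTP port, internal only
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # assumption: where read offsets are stored

clients:
  # Loki service name and port taken from the compose table in this issue
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          # Only discover containers that belong to the production project
          - name: label
            values: ["com.docker.compose.project=archiv-production"]
    relabel_configs:
      # Docker reports names as "/name"; strip the leading slash
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
```

With these labels, `{compose_project="archiv-production"}` would select every stream shipped by this job.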

Loki config (observability/loki-config.yml): local filesystem storage, 30-day retention.
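
A minimal single-node sketch of the Loki side, assuming Loki 2.9 with filesystem storage under /loki and retention enforced by the compactor. The directory paths, the schema start date, and the index settings are assumptions; verify them against the Loki 2.9 configuration reference before use, because several of these keys changed in Loki 3.x.

```yaml
# observability/loki-config.yml (sketch for a single-node Loki 2.9 setup)
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki            # assumption: the loki-data volume is mounted here
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules

schema_config:
  configs:
    - from: 2024-01-01          # assumption: any date before the first ingested log
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h             # retention requires a 24h index period

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true       # without this, retention_period is not enforced

limits_config:
  retention_period: 720h        # 30 days
```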

3. Dashboards — Grafana

Grafana is provisioned via config files (no manual dashboard setup):

Prometheus datasource: observability/grafana/provisioning/datasources/prometheus.yml
Loki datasource: observability/grafana/provisioning/datasources/loki.yml
Dashboard: JVM metrics (heap, GC pauses, thread count, HTTP request rate/latency)
Dashboard: Application metrics (document count, active users, OCR job rate — Spring Boot custom metrics if available)

Grafana is accessible only on the internal network — not reverse-proxied through Caddy in the initial implementation.
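
For illustration, the two datasource provisioning files could look roughly like this. The URLs assume the service names and ports from the compose table in this issue; the datasource names and the choice of Prometheus as the default are assumptions. The two files are shown here as two YAML documents in one block.

```yaml
# observability/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy               # Grafana's backend queries Prometheus, not the browser
    url: http://prometheus:9090
    isDefault: true
---
# observability/grafana/provisioning/datasources/loki.yml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Grafana reads these at startup when the provisioning directory is mounted at /etc/grafana/provisioning.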

4. Alerting — Alertmanager

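One alert rule to start: if the backend is not healthy for more than 2 minutes, notify via email (using the same SMTP settings as the application). The rule itself lives in a Prometheus rules file and is evaluated by Prometheus; Alertmanager only routes the resulting notification. Because Prometheus scrapes /actuator/prometheus rather than the health endpoint, "non-UP" needs a concrete definition, for example a failed scrape of the backend target.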
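
One way to express this rule is sketched below. It uses Prometheus's built-in `up` metric for the `backend` job as a stand-in for the health check, which is an assumption: `up` reports scrape success, not the Actuator health status. The rules file name and mount path are hypothetical. The second document shows the additions prometheus.yml would need so that the rule is loaded and alerts reach Alertmanager.

```yaml
# observability/alert-rules.yml (hypothetical file name)
groups:
  - name: backend
    rules:
      - alert: BackendDown
        # up == 0 means the last scrape of the backend target failed
        expr: up{job="backend"} == 0
        for: 2m                 # must stay failing for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Backend scrape has failed for more than 2 minutes"
---
# Additions to observability/prometheus.yml
rule_files:
  - /etc/prometheus/alert-rules.yml   # assumption: mounted next to prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Stopping the backend container, as the acceptance test suggests, makes the scrape fail and should trigger this rule after two minutes.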

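Alert config (observability/alertmanager.yml): email receiver using the MAIL_HOST, MAIL_USERNAME, and MAIL_PASSWORD values from the prod env. Note that Alertmanager does not expand environment variables in its config file, so ${MAIL_HOST}-style references must be rendered into the file at deploy time (for example with envsubst) or supplied another way.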
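
A minimal sketch of the rendered file follows. Every value in capitals or under example.invalid is a placeholder to be filled in at deploy time; the port 587, the receiver name, and the timing values are assumptions rather than settings taken from the application.

```yaml
# observability/alertmanager.yml (sketch; placeholders are substituted before startup)
global:
  smtp_smarthost: 'MAIL_HOST_PLACEHOLDER:587'   # assumption: submission port 587
  smtp_from: 'alerts@example.invalid'           # placeholder sender address
  smtp_auth_username: 'MAIL_USERNAME_PLACEHOLDER'
  smtp_auth_password: 'MAIL_PASSWORD_PLACEHOLDER'
  smtp_require_tls: true

route:
  receiver: ops-email           # single default route: everything goes to email
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops@example.invalid'               # placeholder recipient
        send_resolved: true     # also notify when the backend recovers
```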

5. `docker-compose.prod.yml` additions

Add the following services (all with expose: only, no ports: — internal network access only):

| Service | Image | Port | Named volume |
|---|---|---|---|
| prometheus | `prom/prometheus:v2.51.0` | 9090 | `prometheus-data` |
| grafana | `grafana/grafana:10.4.0` | 3000 | `grafana-data` |
| loki | `grafana/loki:2.9.0` | 3100 | `loki-data` |
| promtail | `grafana/promtail:2.9.0` | none | none (mounts Docker socket read-only) |
| alertmanager | `prom/alertmanager:v0.27.0` | 9093 | none |

All images pinned. Renovate handles future version bumps.
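
To make the wiring concrete, here is a partial sketch of the compose additions for three of the five services. Images, ports, and volume names come from the table in this issue; the container-side paths are the images' usual defaults and the host-side config paths are assumptions, so verify both. Config files are bind-mounted read-only, while persistent data lives in named volumes.

```yaml
# docker-compose.prod.yml (excerpt, sketch)
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    expose:
      - "9090"                  # reachable on the internal network only
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:10.4.0
    expose:
      - "3000"
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana-data:/var/lib/grafana

  promtail:
    image: grafana/promtail:2.9.0
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./observability/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

volumes:
  prometheus-data:
  grafana-data:
```

Loki and Alertmanager would follow the same pattern, with `loki-data` mounted at Loki's data directory.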

Acceptance criteria

pom.xml includes micrometer-registry-prometheus and /actuator/prometheus returns metrics when backend is running
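Prometheus scrapes backend metrics successfully (the backend target shows as UP; since port 9090 is not published, check from inside the container, e.g. by querying http://localhost:9090/api/v1/targets via docker compose exec)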
Loki receives logs from all production containers (query {compose_project="archiv-production"} returns results)
Grafana starts with pre-provisioned Prometheus and Loki datasources — no manual setup required
At least one dashboard is provisioned showing JVM heap, HTTP request rate, and error rate
Alertmanager sends an email when the backend health endpoint is down for > 2 minutes (test by stopping the backend container)
None of the observability services are reachable from the public internet (Caddy does not route to them)
All services with persistent data (prometheus, grafana, loki) use named volumes, with no bind mounts for persistent data
docker compose up (dev) is unaffected — observability services are prod-only

Effort

M (1 day). Config file writing and wiring; no code changes beyond adding the Micrometer dependency and the Actuator exposure setting in application.yaml.


marcel added the P2-medium devops phase-7: monitoring labels 2026-05-10 20:47:54 +02:00

marcel referenced this issue

2026-05-10 20:49:43 +02:00

devops: production deployment — Caddy, staging env, and Gitea Actions CI/CD #497

marcel referenced this issue from a commit

2026-05-10 22:03:49 +02:00

docs: retire overlay narrative; add Caddy to C4 L2 diagram

marcel referenced this issue

2026-05-10 22:04:13 +02:00

feat(infra): production deployment pipeline — Caddy, staging, Gitea Actions (#497) #499

marcel referenced this issue

2026-05-11 12:52:50 +02:00