Add Prometheus + Loki + Grafana monitoring stack #140

Open
opened 2026-03-28 08:53:23 +01:00 by marcel · 1 comment

Why

Without monitoring, production problems are discovered by users rather than by alerts. For a family archive:

  • You won't know the disk is full until uploads start failing.
  • You won't know the backend crashed until someone complains they can't log in.
  • You won't know a Flyway migration is hanging at startup until it times out.

The goal here is just enough observability for a single-VPS deployment, with no external SaaS dependency and minimal operational overhead.

What to do

Add three services to the production compose stack. They run on the internal Docker network and are accessible only through an authenticated Caddy route — never exposed directly to the internet.

Services

Prometheus — scrapes /actuator/prometheus from the backend every 15s. Stores metrics locally with 15-day retention.

Loki + Promtail — collects all Docker container logs via the Docker socket and makes them queryable in Grafana. No log shipping to external services.

Grafana — dashboards and alerts. Pre-provisioned with a Prometheus datasource and a Loki datasource. Accessible at https://your-domain/grafana/ behind HTTP Basic Auth in Caddy (or Grafana's built-in auth).
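
To make the pre-provisioning above concrete, a minimal sketch of a datasource provisioning file, e.g. ./infra/grafana/provisioning/datasources/datasources.yml (the file name is an assumption; the URLs follow from the compose service names and default ports):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # compose service name, default port
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100         # compose service name, default port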

docker-compose.monitoring.yml

Applied as a third overlay: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.monitoring.yml up -d

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    restart: unless-stopped
    volumes:
      - ./infra/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
    networks:
      - archive-net

  loki:
    image: grafana/loki:3.0.0
    restart: unless-stopped
    volumes:
      - loki_data:/loki
    networks:
      - archive-net

  promtail:
    image: grafana/promtail:3.0.0
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./infra/promtail/promtail.yml:/etc/promtail/config.yml:ro
    networks:
      - archive-net

  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./infra/grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SERVER_ROOT_URL: https://your-domain/grafana
      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
    networks:
      - archive-net

volumes:
  prometheus_data:
  loki_data:
  grafana_data:

Caddyfile addition

# Add to the existing server block
handle /grafana/* {
    # optional Caddy-level guard if not relying on Grafana's built-in login;
    # generate the hash with caddy hash-password (env var name is a placeholder)
    basicauth {
        admin {$GRAFANA_BASIC_AUTH_HASH}
    }
    reverse_proxy grafana:3000
}

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: familienarchiv-backend
    static_configs:
      - targets: ['backend:8081']   # management port — internal only
    metrics_path: /actuator/prometheus
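
promtail.yml (sketch)

The compose file above mounts ./infra/promtail/promtail.yml but this issue does not spell it out. A minimal sketch, assuming Promtail's Docker service discovery over the mounted socket (ports, paths, and label names are common defaults, not requirements):

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 10s
    relabel_configs:
      # label each stream with its container name (strip the leading slash)
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container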

Alerting (minimal)

Configure Alertmanager or use Grafana's built-in alerting to send a notification when:

  • Backend is unreachable for > 2 minutes (a rule sketch follows this list)
  • Disk usage on the VPS exceeds 80%
  • No successful backup log entry in the last 26 hours
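
If the Alertmanager route is taken, the first condition could look roughly like the rule below (the job label comes from prometheus.yml above; the file name is an assumption; the disk and backup conditions need extra exporters or log queries and are not sketched here):

groups:
  - name: familienarchiv-alerts
    rules:
      - alert: BackendDown
        expr: up{job="familienarchiv-backend"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Backend scrape target has been down for more than 2 minutes"

For this to fire, the rules file also has to be listed under rule_files in prometheus.yml and an Alertmanager configured under alerting; Grafana-managed alert rules need neither.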

What this is NOT

This issue is not asking for a full observability platform. No OpenTelemetry, no distributed tracing, no Kubernetes-grade setup. Four containers (Prometheus, Loki, Promtail, Grafana), one compose file, zero external dependencies.

Acceptance criteria

  • https://your-domain/grafana/ is accessible and shows a working Grafana instance.
  • Prometheus has at least one active scrape target (backend) with status UP.
  • Loki receives log entries from the backend and frontend containers.
  • At least one alert rule is configured for backend downtime.
  • All monitoring data persists across container restarts via named volumes.
marcel added the devops and phase-7: monitoring labels 2026-03-28 10:46:47 +01:00

Audit-derived scope expansion (2026-05-07)

This issue covers metrics + logs + dashboards. Audit finding F-08 asks to also add distributed tracing — same operational stack, same milestone, same complexity. Folding it in here keeps the observability work coherent.

Why tracing matters here

The architecture has three trust boundaries (frontend SSR → backend → OCR service). Without trace context propagation, debugging a slow OCR call means reading three log files simultaneously and guessing. With OTel:

  • Frontend sets traceparent header in handleFetch.
  • Backend's spring-boot-starter-otel (Boot 4) auto-propagates spans across the OCR RestClient call.
  • OCR service uses opentelemetry-instrumentation-fastapi to receive + emit child spans.
  • All three converge in Tempo / Jaeger.

Suggested AC additions

  • OTel Collector added to docker-compose alongside Prometheus/Loki/Grafana — receives OTLP, fans out to Tempo (traces) + Prometheus (metrics) + Loki (logs). A collector config sketch follows this list.
  • Backend: add spring-boot-starter-otel (Boot 4) — supersedes micrometer-tracing-bridge-otel for Boot 4. Configure:
    management:
      tracing.sampling.probability: 1.0   # tune down later
      otlp.tracing.endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://otel-collector:4318/v1/traces}
    
  • OCR service: add opentelemetry-instrumentation-fastapi to ocr-service/requirements.txt.
  • Frontend SSR: propagate traceparent from incoming request → backend fetch calls. Either @vercel/otel or a small custom hook.
  • Grafana dashboards:
    • Tempo: trace by traceId — paste a trace ID, see end-to-end span tree
    • Loki: logs by traceId — MDC-injected trace ID (#137) makes this work
    • Prometheus: HikariCP saturation — exposes hikaricp_connections_* from /actuator/prometheus
  • Smoke test: hit /api/documents/{id}/ocr/start, find the trace in Tempo, see frontend → backend → OCR → MinIO presigned-URL fetch as one tree.
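
A minimal sketch of the collector configuration referenced in the first bullet, showing only the traces pipeline; metrics and logs fan-out would be added as further pipelines. Ports, the Tempo endpoint, and the file layout are assumptions, not decisions:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp/tempo:
    endpoint: tempo:4317   # Tempo's OTLP gRPC ingest, internal network only
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]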

This is the dynamic-debugging counterpart to the static review's F-08 (Critical). See docs/audits/2026-05-07-pre-prod-architectural-review.md.
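
For completeness, a sketch of how the two extra containers could be wired into docker-compose.monitoring.yml. Image tags are deliberately left unpinned and the config paths are assumptions:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib   # pin a concrete version in the real file
    restart: unless-stopped
    volumes:
      - ./infra/otel/otel-collector.yml:/etc/otelcol-contrib/config.yaml:ro
    networks:
      - archive-net

  tempo:
    image: grafana/tempo   # pin a concrete version in the real file
    restart: unless-stopped
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - ./infra/tempo/tempo.yml:/etc/tempo.yaml:ro
      - tempo_data:/var/tempo
    networks:
      - archive-net

volumes:
  tempo_data: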

Reference: marcel/familienarchiv#140