Add Prometheus + Loki + Grafana monitoring stack #140
## Why
Without monitoring, production problems are discovered by users rather than by alerts. For a family archive, the goal is just enough observability for a single-VPS deployment, with no external SaaS dependency and minimal operational overhead.
## What to do
Add three services to the production compose stack. They run on the internal Docker network and are accessible only through an authenticated Caddy route — never exposed directly to the internet.
### Services
- **Prometheus** — scrapes `/actuator/prometheus` from the backend every 15s. Stores metrics locally with 15-day retention.
- **Loki + Promtail** — collects all Docker container logs via the Docker socket and makes them queryable in Grafana. No log shipping to external services.
- **Grafana** — dashboards and alerts. Pre-provisioned with a Prometheus datasource and a Loki datasource. Accessible at `https://your-domain/grafana/` behind HTTP Basic Auth in Caddy (or Grafana's built-in auth).

### docker-compose.monitoring.yml
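A sketch of what the overlay could look like; image tags, file paths, and volume names are assumptions, not part of the issue:

```yaml
# docker-compose.monitoring.yml — minimal sketch. Image tags, config
# paths, and volume names are illustrative; adjust to the real stack.
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d       # 15-day local retention
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    # no `ports:` — reachable only on the internal Docker network

  loki:
    image: grafana/loki:3.1.0
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:3.1.0
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./monitoring/promtail.yml:/etc/promtail/config.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro  # container log discovery

  grafana:
    image: grafana/grafana:11.1.0
    environment:
      - GF_SERVER_ROOT_URL=https://your-domain/grafana/
      - GF_SERVER_SERVE_FROM_SUB_PATH=true   # served under /grafana/ via Caddy
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  loki-data:
  grafana-data:
```

A matching Promtail config sketch for the Docker-socket collection described in the Loki + Promtail bullet (again, paths and labels are illustrative):

```yaml
# monitoring/promtail.yml — sketch of Docker-socket log collection.
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
```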
Applied as a third overlay:
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.monitoring.yml up -d
```

### Caddyfile addition
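A sketch of the route, assuming the Basic Auth option; the user name and the `grafana:3000` upstream are placeholders. `handle` (rather than `handle_path`) keeps the `/grafana` prefix, which matches `GF_SERVER_SERVE_FROM_SUB_PATH=true` above:

```caddyfile
# Generate the hash with `caddy hash-password`.
handle /grafana/* {
    basic_auth {
        admin REPLACE_WITH_BCRYPT_HASH
    }
    reverse_proxy grafana:3000
}
```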
### prometheus.yml
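The scrape config follows directly from the Prometheus bullet above; only the target host/port (`backend:8080`) is an assumption:

```yaml
# monitoring/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: backend
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['backend:8080']   # assumed container name and port
```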
## Alerting (minimal)
Configure Alertmanager or use Grafana's built-in alerting to send a notification on key failure conditions; at a minimum, when the backend scrape target goes down.
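As one concrete example, a Prometheus rule that fires when the backend target stops answering scrapes; the rule name and threshold are illustrative, not the issue's required alert set:

```yaml
# monitoring/alerts.yml — one illustrative rule, not the full alert set
groups:
  - name: archive-alerts
    rules:
      - alert: BackendDown
        expr: up{job="backend"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Backend has failed Prometheus scrapes for 5 minutes
```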
## What this is NOT
This issue is not asking for a full observability platform. No OpenTelemetry, no distributed tracing, no Kubernetes-grade setup. Three containers, one compose file, zero external dependencies.
## Acceptance criteria
- `https://your-domain/grafana/` is accessible and shows a working Grafana instance.
- Prometheus shows its scrape target (`backend`) with status UP.

## Audit-derived scope expansion (2026-05-07)
This issue covers metrics + logs + dashboards. Audit finding F-08 asks to also add distributed tracing — same operational stack, same milestone, same complexity. Folding it in here keeps the observability work coherent.
### Why tracing matters here
The architecture has three trust boundaries (frontend SSR → backend → OCR service). Without trace context propagation, debugging a slow OCR call means reading three log files simultaneously and guessing. With OTel:
- The SSR frontend propagates the `traceparent` header in `handleFetch`.
- `spring-boot-starter-otel` (Boot 4) auto-propagates spans across the OCR `RestClient` call.
- `opentelemetry-instrumentation-fastapi` lets the OCR service receive + emit child spans.

### Suggested AC additions
- Backend: `spring-boot-starter-otel` (Boot 4) — supersedes `micrometer-tracing-bridge-otel` for Boot 4. Configure the exporter (see the sketch after this list).
- OCR service: add `opentelemetry-instrumentation-fastapi` to `ocr-service/requirements.txt`.
- Frontend: propagate `traceparent` from the incoming request → backend `fetch` calls. Either `@vercel/otel` or a small custom hook (sketched below).
- Grafana panels:
  - **Tempo: trace by traceId** — paste a trace ID, see the end-to-end span tree.
  - **Loki: logs by traceId** — the MDC-injected trace ID (#137) makes this work.
  - **Prometheus: HikariCP saturation** — the backend exposes `hikaricp_connections_*` from `/actuator/prometheus`.
- End-to-end check: hit `/api/documents/{id}/ocr/start`, find the trace in Tempo, and see frontend → backend → OCR → MinIO presigned-URL fetch as one tree.

This is the dynamic-debugging counterpart to the static review's F-08 (Critical). See `docs/audits/2026-05-07-pre-prod-architectural-review.md`.
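A sketch of the backend exporter configuration. Whether the Boot 3-era Micrometer/OTLP property names carry over to `spring-boot-starter-otel` on Boot 4 is an assumption, as is a `tempo` service listening on OTLP/HTTP:

```yaml
# backend application.yml — property names assume the Boot 3-era
# Micrometer/OTLP keys still apply under the Boot 4 starter
management:
  otlp:
    tracing:
      endpoint: http://tempo:4318/v1/traces  # assumed Tempo service name
  tracing:
    sampling:
      probability: 1.0   # sample everything; traffic is tiny at this scale
```

And the "small custom hook" variant on the frontend, assuming a SvelteKit app (implied by `handleFetch`). This only forwards the incoming trace context to backend calls; `@vercel/otel` would additionally create spans:

```typescript
// src/hooks.server.ts — forwards the incoming traceparent so backend
// spans attach to the same trace; creates no spans of its own.
import type { HandleFetch } from '@sveltejs/kit';

export const handleFetch: HandleFetch = async ({ event, request, fetch }) => {
  const traceparent = event.request.headers.get('traceparent');
  if (traceparent) {
    request.headers.set('traceparent', traceparent);
  }
  return fetch(request);
};
```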