feat(observability): add Grafana with provisioned datasources and dashboards #589

Merged
marcel merged 3 commits from feat/issue-577-grafana into main 2026-05-15 04:35:11 +02:00

Summary

  • Adds obs-grafana service (grafana/grafana-oss:11.6.1) to docker-compose.observability.yml on port 127.0.0.1:${PORT_GRAFANA:-3001} (service block sketched below)
  • Provisions Prometheus, Loki, and Tempo datasources with trace-to-logs and log-to-trace correlation
  • Pre-loads 3 dashboards: Node Exporter Full, Spring Boot Observability, Loki logs
  • Updates .env.example with GRAFANA_ADMIN_PASSWORD
  • Updates docs/DEPLOYMENT.md and C4 L2 diagram
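
For orientation, a minimal sketch of the service block as described here and in the reviews below. The GF_SECURITY_ADMIN_PASSWORD mapping and the container-side provisioning path are assumptions, and the healthcheck/depends_on tightened during review are omitted:

obs-grafana:
  image: grafana/grafana-oss:11.6.1
  ports:
    - "127.0.0.1:${PORT_GRAFANA:-3001}:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}   # mapping assumed
    GF_USERS_ALLOW_SIGN_UP: "false"
  volumes:
    # /etc/grafana/provisioning is Grafana's default provisioning path
    - ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
    - grafana_data:/var/lib/grafana
  networks:
    - obs-net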

Closes #577

🤖 Generated with Claude Code (https://claude.com/claude-code)

marcel added 2 commits 2026-05-15 04:05:42 +02:00
Add obs-grafana service (grafana/grafana-oss:11.6.1) to docker-compose.observability.yml.
Datasources (Prometheus, Loki, Tempo) are auto-provisioned via
infra/observability/grafana/provisioning/datasources/datasources.yml with
cross-datasource linking (Loki traceId → Tempo, Tempo → Loki, service map via Prometheus).
Three dashboards are pre-loaded: Node Exporter Full (1860), Spring Boot Observability (17175),
Loki Logs (13639) — datasource template variables replaced with provisioned UIDs.
GRAFANA_ADMIN_PASSWORD added to .env.example.
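
A sketch of how that datasources.yml might express the linking, using the UIDs and the traceId regex quoted in the reviews below. Field names follow Grafana's provisioning schema, but the Loki and Tempo hostnames/ports are assumptions:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://obs-prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    url: http://obs-loki:3100        # hostname and port assumed
    jsonData:
      derivedFields:
        - name: traceId                      # link log lines to Tempo traces
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'             # $$ escapes env interpolation
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://obs-tempo:3200       # hostname and port assumed
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # Tempo → Loki
      serviceMap:
        datasourceUid: prometheus    # service map via Prometheus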

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m7s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 5m41s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
c99321e5cf
Add Grafana row to the observability services table, Grafana access details
(URL, credentials, auto-provisioned datasources, pre-loaded dashboards), and
GRAFANA_ADMIN_PASSWORD to the env vars table in DEPLOYMENT.md.
Update C4 l2-containers.puml: replace placeholder Grafana entry with pinned
image version, expand observability boundary with node_exporter and cadvisor
containers, and add Rel() edges for Grafana → Prometheus, Loki, and Tempo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

🏗️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

What I Checked

  • C4 L2 diagram update
  • DEPLOYMENT.md update
  • Service topology and network isolation
  • Documentation table compliance per persona checklist

What's Done Right

  • C4 L2 diagram updated with Grafana container + three Rel() edges to Prometheus/Loki/Tempo — the required doc update is there
  • DEPLOYMENT.md updated with access URL, credentials, and env var table
  • Grafana correctly placed on obs-net only — no unnecessary archiv-net attachment
  • Datasource provisioning uses UIDs (prometheus, loki, tempo) that match across datasources and dashboard JSONs — correct pattern

⚠️ Concern (non-blocking for this PR, but worth tracking)

The depends_on block uses the simple list form:

depends_on:
  - prometheus
  - loki
  - tempo

Prometheus, Loki, and Tempo all define healthchecks. Using condition: service_healthy would guarantee Grafana only starts when all backends are actually ready — not just when their containers exist. In practice Grafana retries datasource connections, so startup won't fail, but it's inconsistent with the pattern used for Promtail (which uses condition: service_healthy for Loki). Consider tightening this in a follow-up.

Doc Table Pass

| PR trigger | Required update | Status |
|---|---|---|
| New Docker service (Grafana) | l2-containers.puml + DEPLOYMENT.md | ✅ Done |

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

What I Checked

  • Clean code in YAML configs
  • Naming conventions
  • No backend/frontend code changed in this PR

What's Done Right

  • Config files are well-structured and self-documenting
  • Dashboard provider config (dashboards.yml) is minimal and correct — no unnecessary options
  • disableDeletion: true in the dashboard provider prevents operators from accidentally deleting provisioned dashboards via the UI (a sketch of the provider file follows this list)
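
A minimal provider file needs only a handful of keys — a sketch, with the container-side dashboards path assumed:

apiVersion: 1
providers:
  - name: default
    type: file
    disableDeletion: true                 # the option called out above
    options:
      path: /var/lib/grafana/dashboards   # mount point assumed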

Observations

This PR is pure infrastructure — no Java, Svelte, or Python code changed. The YAML configs are clean, short, and readable. The datasource UIDs are stable strings (prometheus, loki, tempo) which is the right approach for cross-datasource linking.

The Loki derivedField regex '"traceId":"(\w+)"' correctly matches the structured JSON tracing output that Spring Boot emits via Micrometer/OTel.

No TDD concerns — this is infrastructure configuration, not testable application logic. Validated via docker compose config.


🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: 🚫 Changes requested

Blockers

1. Missing healthcheck on obs-grafana

Grafana exposes /api/health which returns {"database": "ok"} when ready. Without a healthcheck, other services (or operators running docker compose ps) can't distinguish "container started" from "Grafana is accepting traffic." Every other data-holding service in this stack has a healthcheck — Grafana should too.

Fix:

obs-grafana:
  healthcheck:
    test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health | grep -q ok || exit 1"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

2. depends_on is a simple list — should use condition: service_healthy

Prometheus, Loki, and Tempo all define healthchecks. The simple list form starts Grafana as soon as those containers exist, not when they're healthy. Promtail already uses condition: service_healthy for Loki — Grafana should be consistent.

Fix:

depends_on:
  prometheus:
    condition: service_healthy
  loki:
    condition: service_healthy
  tempo:
    condition: service_healthy

Suggestions (non-blocking)

3. Datasource URLs use container_name (obs-prometheus) not service key (prometheus)

Docker DNS on user-defined networks resolves both, so obs-prometheus:9090 works. But our convention elsewhere is to use service keys as internal hostnames. Minor inconsistency — either use container_name consistently or the service key consistently. The container_name form (obs-prometheus) is defensible, since it matches what docker ps shows.

4. No start_period guard for Grafana startup

On first start Grafana initializes its internal database (SQLite by default) and can take 10–15 seconds. The proposed start_period: 30s above covers this.

What's Done Right

  • Image pinned to grafana/grafana-oss:11.6.1 ✓
  • Port bound to 127.0.0.1 only ✓
  • Provisioning directory mounted :ro ✓
  • grafana_data named volume for persistence ✓
  • GF_USERS_ALLOW_SIGN_UP: "false" — anonymous sign-up disabled ✓
  • GRAFANA_ADMIN_PASSWORD in .env.example with a warning comment ✓

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

What I Checked

  • Exposed attack surface (ports, networks)
  • Default credentials and their documentation
  • Anonymous access configuration
  • Grafana security settings
  • Mount permissions

What's Done Right

  • Port binding 127.0.0.1:${PORT_GRAFANA:-3001}:3000 — Grafana is not reachable from any external interface ✓
  • GF_USERS_ALLOW_SIGN_UP: "false" — no self-registration possible ✓
  • Provisioning directory mounted :ro — Grafana container cannot modify its own provisioning config ✓
  • GRAFANA_ADMIN_PASSWORD=changeme in .env.example with a comment: "change this before exposing Grafana beyond localhost" — appropriate for a dev default ✓
  • Grafana is on obs-net only — the application network (archiv-net) is not attached, so Grafana cannot directly reach application containers ✓

Observations (informational, not blockers)

Anonymous access — Grafana ships with GF_AUTH_ANONYMOUS_ENABLED=false by default, so on 11.6.1 the default is secure. Explicitly setting it is belt-and-suspenders but not required.

Gravatar — Grafana by default makes outbound HTTP requests to Gravatar for user avatars. On a privacy-conscious self-hosted archive, consider GF_SECURITY_DISABLE_GRAVATAR=true. Not a vulnerability, but a privacy consideration for a family archive project.
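
Both knobs are documented Grafana settings and could be pinned explicitly in the service's environment block — a sketch:

environment:
  GF_AUTH_ANONYMOUS_ENABLED: "false"     # explicit, matches the default
  GF_SECURITY_DISABLE_GRAVATAR: "true"   # suppress outbound avatar lookups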

Dashboard JSON provenance — The three dashboard JSONs are downloaded from grafana.com at commit time and committed to the repo. This is the correct pattern — the content is pinned at download time and auditable in git history. No supply chain concern since the files are not fetched at runtime.

Security summary: Grafana is appropriately isolated — localhost-only binding, no anonymous access, read-only provisioning mount, no unnecessary network attachments. For a single-operator self-hosted stack this posture is correct.


🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ✅ Approved

What I Checked

  • Testability of infrastructure configuration
  • Validation approach for Docker Compose config
  • CI/CD integration for the new service

Observations

This PR is pure infrastructure with no application code changes. The appropriate validation is docker compose -f docker-compose.observability.yml config (schema validation) rather than unit or integration tests. Based on the commit message, this was run.

The three provisioned dashboard JSONs are committed to the repository, which means they're auditable and reproducible — no runtime download that could fail or change. This is the right approach for testability (the test is: does docker compose config pass? Does Grafana start and show dashboards?).

The one infrastructure testability gap (pre-existing, not introduced here): there's no CI job that runs docker compose config on the observability stack. This means YAML syntax errors in docker-compose.observability.yml or the provisioning files would only be caught manually. Consider adding a CI step for this in a future DevOps issue.
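
Such a check could be a small job in the existing CI workflow — a sketch; the job name and runner label are hypothetical:

compose-config:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Validate observability stack config
      run: docker compose -f docker-compose.observability.yml config --quiet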

Dashboard datasource variable substitution was done at commit time (replacing ${DS_PROMETHEUS} etc. with the provisioned datasource UIDs). This avoids the common Grafana pitfall of dashboards that show "datasource not found" after provisioning — good QA practice.
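
For illustration, the substitution replaces the export-time template reference with the concrete provisioned UID (panel excerpt; surrounding dashboard JSON omitted):

before (as exported from grafana.com):
  "datasource": "${DS_PROMETHEUS}"

after (as committed):
  "datasource": { "type": "prometheus", "uid": "prometheus" }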


🎨 Leonie Voss — UX Design Lead

Verdict: ✅ Approved

What I Checked

  • Frontend/UI changes in this PR

Result

This PR adds only Docker Compose configuration, Grafana provisioning YAML, and dashboard JSON files. No Svelte components, frontend routes, or UI code were modified.

Grafana's own UI is a third-party application — its UX is outside the scope of this review. The relevant UX decision (port 3001, admin credentials, pre-loaded dashboards) is correctly documented in DEPLOYMENT.md so operators know what they get.

No concerns from a UX perspective.


📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

What I Checked

  • Alignment with Milestone 10 (Observability Stack) intent
  • Acceptance criteria coverage
  • Missing requirements or scope gaps

Requirements Alignment

Issue #577 called for Grafana with provisioned Prometheus, Loki, and Tempo datasources plus pre-imported dashboard JSONs. This PR delivers:

| Requirement | Status |
|---|---|
| Grafana service in observability compose | ✅ |
| Prometheus datasource (default) | ✅ |
| Loki datasource with traceId derivedField → Tempo | ✅ |
| Tempo datasource with tracesToLogsV2 → Loki | ✅ |
| Pre-imported dashboards (3 total) | ✅ |
| Admin password configurable via env var | ✅ |
| DEPLOYMENT.md documentation | ✅ |
| C4 diagram updated | ✅ |

Observations

The dashboard selection (Node Exporter Full, Spring Boot Observability, Loki) maps directly to the three pillars of the observability stack: host metrics, application metrics+traces, and logs. This is a good, practical default set for a self-hosted family archive.

The GF_USERS_ALLOW_SIGN_UP: "false" setting correctly implements the implicit requirement that Grafana access should be admin-controlled, not open to all registered users.

One open question for the future (not a blocker for this PR): the Grafana admin user is admin. If multiple operators need Grafana access, they'd need to create additional accounts manually. A future issue could add LDAP/OAuth SSO via the existing user system — but this is clearly out of scope for Milestone 10.

marcel added 1 commit 2026-05-15 04:09:49 +02:00
fix(observability): add grafana healthcheck and service_healthy depends_on
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m19s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 5m32s
CI / fail2ban Regex (pull_request) Successful in 48s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
457c1d3aee
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit 84f9bbadeb into main 2026-05-15 04:35:11 +02:00
marcel deleted branch feat/issue-577-grafana 2026-05-15 04:35:11 +02:00

Reference: marcel/familienarchiv#589