feat(observability): add Grafana with provisioned datasources and dashboards #589
Summary
- Adds `obs-grafana` service (`grafana/grafana-oss:11.6.1`) to `docker-compose.observability.yml` on port `127.0.0.1:${PORT_GRAFANA:-3001}`
- Adds `GRAFANA_ADMIN_PASSWORD` to `.env.example`
- Updates `docs/DEPLOYMENT.md` and C4 L2 diagram

Closes #577
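For orientation, the added service block probably looks roughly like the sketch below. This is assembled from details scattered through the reviews (image tag, port binding, `:ro` mount, `grafana_data` volume, `obs-net`); the exact keys, paths, and service names are assumptions, not the committed file.

```yaml
# Hypothetical sketch of the obs-grafana service described in this PR.
services:
  obs-grafana:
    image: grafana/grafana-oss:11.6.1
    container_name: obs-grafana
    ports:
      - "127.0.0.1:${PORT_GRAFANA:-3001}:3000"   # localhost-only binding
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"            # no self-registration
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro  # read-only provisioning
      - grafana_data:/var/lib/grafana            # persistent Grafana state
    depends_on:                                  # simple list form (see reviews below)
      - prometheus
      - loki
      - tempo
    networks:
      - obs-net

volumes:
  grafana_data:
```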
🤖 Generated with Claude Code
🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
What I Checked
✅ What's Done Right
- `Rel()` edges to Prometheus/Loki/Tempo — the required doc update is there
- `obs-net` only — no unnecessary `archiv-net` attachment
- Stable datasource UIDs (`prometheus`, `loki`, `tempo`) that match across datasources and dashboard JSONs — correct pattern

⚠️ Concern (non-blocking for this PR, but worth tracking)
The `depends_on` block uses the simple list form.

Prometheus, Loki, and Tempo all define healthchecks. Using `condition: service_healthy` would guarantee Grafana only starts when all backends are actually ready — not just when their containers exist. In practice Grafana retries datasource connections, so startup won't fail, but it's inconsistent with the pattern used for Promtail (which uses `condition: service_healthy` for Loki). Consider tightening this in a follow-up.

✅ Doc Table Pass
- `l2-containers.puml` + `DEPLOYMENT.md`

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
What I Checked
✅ What's Done Right
- The dashboard provider config (`dashboards.yml`) is minimal and correct — no unnecessary options
- `disableDeletion: true` in the dashboard provider prevents operators from accidentally deleting provisioned dashboards via the UI

Observations
This PR is pure infrastructure — no Java, Svelte, or Python code changed. The YAML configs are clean, short, and readable. The datasource UIDs are stable strings (`prometheus`, `loki`, `tempo`), which is the right approach for cross-datasource linking.

The regex in the Loki derivedField, `'"traceId":"(\w+)"'`, is correct for matching Spring Boot's structured JSON tracing output format from Micrometer/OTel.

No TDD concerns — this is infrastructure configuration, not testable application logic. Validated via `docker compose config`.

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: 🚫 Changes requested
Blockers
1. Missing healthcheck on `obs-grafana`

Grafana exposes `/api/health`, which returns `{"database": "ok"}` when ready. Without a healthcheck, other services (or operators running `docker compose ps`) can't distinguish "container started" from "Grafana is accepting traffic." Every other data-holding service in this stack has a healthcheck — Grafana should too.

Fix:
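A minimal sketch of such a healthcheck. Whether the `grafana-oss` image ships `wget` (BusyBox) or `curl` is an assumption; verify inside the container before relying on it.

```yaml
# Sketch only — assumes BusyBox wget exists in the grafana-oss image;
# check with `docker exec obs-grafana which wget` first.
obs-grafana:
  healthcheck:
    test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s   # first start initializes the Grafana database (see point 4)
```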
2. `depends_on` is a simple list — should use `condition: service_healthy`

Prometheus, Loki, and Tempo all define healthchecks. The simple list form starts Grafana as soon as those containers exist, not when they're healthy. Promtail already uses `condition: service_healthy` for Loki — Grafana should be consistent.

Fix:
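A sketch of the long-form `depends_on`. The service keys (`prometheus`, `loki`, `tempo`) are assumptions, inferred from the datasource UID naming used in this stack.

```yaml
# Long-form depends_on: Grafana waits for healthy backends,
# matching the existing Promtail -> Loki pattern.
obs-grafana:
  depends_on:
    prometheus:
      condition: service_healthy
    loki:
      condition: service_healthy
    tempo:
      condition: service_healthy
```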
Suggestions (non-blocking)
3. Datasource URLs use `container_name` (`obs-prometheus`), not the service key (`prometheus`)

Docker DNS on user-defined networks resolves both, so `obs-prometheus:9090` works. But our convention elsewhere is to use service keys as the internal hostname. Minor inconsistency — either pick `container_name` consistently or the service key consistently. The `container_name` form (`obs-prometheus`) is defensible since it matches what `docker ps` shows.

4. No `start_period` guard for Grafana startup

Grafana on first start initializes its SQLite/PostgreSQL database and can take 10–15 seconds. The proposed `start_period: 30s` above covers this.

✅ What's Done Right
- `grafana/grafana-oss:11.6.1` pinned ✓
- `127.0.0.1` only ✓
- Provisioning mounted `:ro` ✓
- `grafana_data` named volume for persistence ✓
- `GF_USERS_ALLOW_SIGN_UP: "false"` — anonymous sign-up disabled ✓
- `GRAFANA_ADMIN_PASSWORD` in `.env.example` with a warning comment ✓

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
What I Checked
✅ What's Done Right
- `127.0.0.1:${PORT_GRAFANA:-3001}:3000` — Grafana is not reachable from any external interface ✓
- `GF_USERS_ALLOW_SIGN_UP: "false"` — no self-registration possible ✓
- Provisioning mounted `:ro` — Grafana container cannot modify its own provisioning config ✓
- `GRAFANA_ADMIN_PASSWORD=changeme` in `.env.example` with a comment: "change this before exposing Grafana beyond localhost" — appropriate for a dev default ✓
- `obs-net` only — the application network (`archiv-net`) is not attached, so Grafana cannot directly reach application containers ✓

Observations (informational, not blockers)
Anonymous access — Grafana defaults to `GF_AUTH_ANONYMOUS_ENABLED=false` since v10. Since this uses v11.6.1, the default is secure. Explicitly setting it is belt-and-suspenders but not required.

Gravatar — Grafana by default makes outbound HTTP requests to Gravatar for user avatars. On a privacy-conscious self-hosted archive, consider `GF_USERS_GRAVATAR_ENABLED=false`. Not a vulnerability, but a privacy consideration for a family archive project.

Dashboard JSON provenance — The three dashboard JSONs are downloaded from grafana.com at commit time and committed to the repo. This is the correct pattern — the content is pinned at download time and auditable in git history. No supply chain concern since the files are not fetched at runtime.
Security summary: Grafana is appropriately isolated — localhost-only binding, no anonymous access, read-only provisioning mount, no unnecessary network attachments. For a single-operator self-hosted stack this posture is correct.
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
What I Checked
Observations
This PR is pure infrastructure with no application code changes. The appropriate validation is `docker compose -f docker-compose.observability.yml config` (schema validation) rather than unit or integration tests. Based on the commit message, this was run.

The three provisioned dashboard JSONs are committed to the repository, which means they're auditable and reproducible — no runtime download that could fail or change. This is the right approach for testability (the test is: does `docker compose config` pass? Does Grafana start and show dashboards?).

The one infrastructure testability gap (pre-existing, not introduced here): there's no CI job that runs `docker compose config` on the observability stack. This means YAML syntax errors in `docker-compose.observability.yml` or the provisioning files would only be caught manually. Consider adding a CI step for this in a future DevOps issue.

Dashboard datasource variable substitution was done at commit time (replacing `${DS_PROMETHEUS}` etc. with the provisioned datasource UIDs). This avoids the common Grafana pitfall of dashboards that show "datasource not found" after provisioning — good QA practice.

🎨 Leonie Voss — UX Design Lead
Verdict: ✅ Approved
What I Checked
Result
This PR adds only Docker Compose configuration, Grafana provisioning YAML, and dashboard JSON files. No Svelte components, frontend routes, or UI code were modified.
Grafana's own UI is a third-party application — its UX is outside the scope of this review. The relevant UX decision (port 3001, admin credentials, pre-loaded dashboards) is correctly documented in DEPLOYMENT.md so operators know what they get.
No concerns from a UX perspective. ✅
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
What I Checked
Requirements Alignment
Issue #577 called for Grafana with provisioned Prometheus, Loki, and Tempo datasources plus pre-imported dashboard JSONs. This PR delivers:
Observations
The dashboard selection (Node Exporter Full, Spring Boot Observability, Loki) maps directly to the three pillars of the observability stack: host metrics, application metrics+traces, and logs. This is a good, practical default set for a self-hosted family archive.
The `GF_USERS_ALLOW_SIGN_UP: "false"` setting correctly implements the implicit requirement that Grafana access should be admin-controlled, not open to all registered users.

One open question for the future (not a blocker for this PR): the Grafana admin user is `admin`. If multiple operators need Grafana access, they'd need to create additional accounts manually. A future issue could add LDAP/OAuth SSO via the existing user system — but this is clearly out of scope for Milestone 10.