devops(observability): add Grafana with provisioned Prometheus, Loki, and Tempo data sources and pre-imported dashboards #577
Context
Grafana is the single UI for all observability signals: metrics (Prometheus), logs (Loki), and traces (Tempo). Data sources and dashboards are provisioned via config files so they are available immediately after `docker compose up` with no manual clicking in the UI.

Depends on: Prometheus issue, Loki issue, Tempo issue (data sources must exist for health checks to pass)
Service to Add
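The service definition itself did not survive in this thread. The following is only a sketch pieced together from the values the reviewers quote further down (image tag, environment variables, port mapping, network, volume, `depends_on`). The service name `grafana` is taken from the acceptance criteria; the container paths are Grafana's defaults, and the provisioning bind mount is an assumption.

```yaml
# Sketch only: reconstructed from values quoted in the review comments,
# not the original spec text.
services:
  grafana:
    image: grafana/grafana-oss:11.6.1
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin}
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_ANALYTICS_REPORTING_ENABLED: "false"
    ports:
      - "${PORT_GRAFANA:-3001}:3000"
    volumes:
      # Persistent Grafana state (SQLite database, preferences).
      - grafana_data:/var/lib/grafana
      # Assumption: provisioning files are bind-mounted read-only at
      # Grafana's default provisioning path.
      - ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - obs-net
    depends_on:
      - prometheus
      - loki
      - tempo

volumes:
  grafana_data:

networks:
  obs-net:
```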
Add `GRAFANA_ADMIN_PASSWORD=changeme` to `.env.example`.

Config Files to Create
- `infra/observability/grafana/provisioning/datasources/datasources.yml`
- `infra/observability/grafana/provisioning/dashboards/dashboards.yml`

Dashboard JSON files
Download the following dashboard JSONs from grafana.com and commit them to `infra/observability/grafana/provisioning/dashboards/`:
- `node-exporter-full.json`
- `cadvisor-docker.json`
- `spring-boot-statistics.json`

Use `curl -s https://grafana.com/api/dashboards/<ID>/revisions/latest/download -o <file>.json` to download. Commit the JSON files directly; do not reference them by ID at runtime.

Acceptance Criteria
- `docker compose -f docker-compose.observability.yml up -d grafana` starts without error
- Grafana UI accessible at `http://localhost:3001` (login: admin / value of `GRAFANA_ADMIN_PASSWORD`)

Definition of Done
`main`

🏗️ Markus Keller, Senior Application Architect
Observations
- `docker-compose.observability.yml` triggers the mandatory update table: `docs/architecture/c4/l2-containers.puml` needs Grafana, Prometheus, Loki, and Tempo added as containers, and `docs/DEPLOYMENT.md` needs the topology diagram updated. The current `l2-containers.puml` has no observability containers.
- `docker-compose.observability.yml` separation (keeping observability out of the main `docker-compose.yml`). This is an architectural decision with lasting operational consequences: why a separate file, why these three data sources, what the upgrade path looks like. `docs/infrastructure/production-compose.md` already references #498 as the follow-up; an ADR formalizes the rationale. The next ADR number is 015 (ADR-014 is `upload-artifact-v3-pin.md`).
- `obs-net` as the Grafana network. The application services run on `archiv-net`. If Grafana is meant to scrape the Spring Boot `/actuator/prometheus` endpoint, those two networks need a bridge or Prometheus needs to reach `archiv-net` directly. The issue does not clarify how Prometheus (in a separate issue) is expected to reach the backend. This is a dependency the implementer must resolve.
- `docker-compose.observability.yml` file naming. The existing prod stack deliberately uses a standalone compose (ADR-009). A separate `docker-compose.observability.yml` is consistent with that pattern. No conflict here.
- `docs/DEPLOYMENT.md §4 Logs + observability` references the observability stack but currently says it is "not yet deployed." Once this issue lands, that section needs updating.

Recommendations
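One way to close the cross-network gap noted in the observations above is to attach Prometheus to both networks. This is a fragment only, not the linked Prometheus issue's actual definition. It assumes the service is named `prometheus` and that the application stack has already created a network reachable under the literal name `archiv-net`; Compose normally prefixes network names with the project name, so the real name may differ.

```yaml
# Fragment for docker-compose.observability.yml (sketch under the assumptions above).
services:
  prometheus:
    networks:
      - obs-net      # queried by Grafana
      - archiv-net   # lets Prometheus scrape the backend's /actuator/prometheus

networks:
  obs-net:
  archiv-net:
    # Assumption: created by the application stack under exactly this name.
    external: true
```

Grafana itself can stay on `obs-net` alone, since it only queries Prometheus, Loki, and Tempo.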
- `obs-net`, provisioned via `infra/observability/`." Context: why a separate file (avoids coupling to app container restart cycles, as noted in `production-compose.md`). Decision: separate compose with explicit network join for Prometheus→backend scrape.
- In `l2-containers.puml`, add Grafana, Prometheus, Loki, and Tempo as containers inside a `System_Boundary(obs, "Observability Stack")` and add relationship lines: Prometheus→backend (scrapes metrics), Loki→Docker log driver (ingests logs), Grafana→all three (queries).
- `archiv-net` (so it can scrape the backend directly), or the backend's management port is exposed on a shared network. The issue spec should specify this explicitly since it says "data sources must exist for health checks to pass."

👨‍💻 Felix Brandt, Senior Fullstack Developer
Observations
- `docker-compose.observability.yml` + provisioning YAML + dashboard JSON files). There is no backend Java code, no SvelteKit route, and no TypeScript to write or test. TDD does not apply here directly.
- `curl -s https://grafana.com/api/dashboards/<ID>/revisions/latest/download` to fetch three dashboard JSONs and commit them. This is a reproducible, documented step, which is fine. The important thing is that the `curl` commands are scripted or documented in a shell script committed alongside the files (e.g., `infra/observability/grafana/provisioning/dashboards/download-dashboards.sh`) so future upgrades are one command, not a hunt through the issue body.
- `datasources.yml` uses `datasourceUid: tempo` as a literal string. In Grafana provisioning, `datasourceUid` in derived fields must match the actual UID of the provisioned data source. When data sources are provisioned via files without an explicit `uid:` field, Grafana auto-generates a UID. The Loki→Tempo derived field `datasourceUid: tempo` will silently fail trace correlation unless either (a) a `uid: tempo` is added to the Tempo data source entry, or (b) the value matches the auto-generated UID. The same applies to `tracesToLogsV2.datasourceUid: loki` and `serviceMap.datasourceUid: prometheus`. Fix: add explicit `uid:` fields to each data source entry matching the string references.
- `disableDeletion: true` in `dashboards.yml` is the right call: it prevents manual dashboard edits from being lost on restart.

Recommendations
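The UID mismatch described in the observations above can be illustrated with a minimal `datasources.yml` sketch in which every cross-reference resolves to an explicit `uid:`. The host names (`prometheus`, `loki`, `tempo`) and the `matcherRegex` are assumptions for illustration; the ports match those listed in the implementation notes later in this thread.

```yaml
# Sketch: explicit uid on every data source so datasourceUid references resolve.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090   # assumption: service name on obs-net
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100         # assumption: service name on obs-net
    jsonData:
      derivedFields:
        - name: traceId
          # Assumption: log lines contain "traceId=<id>".
          matcherRegex: 'traceId=(\w+)'
          # "$$" escapes "$" so Grafana does not treat it as an env variable.
          url: '$${__value.raw}'
          datasourceUid: tempo    # must equal the Tempo uid below

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200        # assumption: service name on obs-net
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki       # must equal the Loki uid above
      serviceMap:
        datasourceUid: prometheus # must equal the Prometheus uid above
      nodeGraph:
        enabled: true
```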
- Add explicit `uid:` fields to each data source in `datasources.yml`.
- Add `infra/observability/grafana/provisioning/dashboards/download-dashboards.sh` with the three `curl` commands from the issue body. Commit the script alongside the downloaded JSON files so dashboard versions are reproducible and upgradeable.

🔒 Nora Steiner, Application Security Engineer
Observations
- `GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin}` with a hardcoded fallback. The `-admin` default means that if `GRAFANA_ADMIN_PASSWORD` is not set in the environment, Grafana starts with the well-known default password. For a local dev stack this is tolerable, but the `.env.example` entry says `GRAFANA_ADMIN_PASSWORD=changeme`, so both defaults are weak and discoverable. More importantly, the issue does not specify whether this compose file will also be used in production (the `depends_on` section references Prometheus, Loki, Tempo, all production dependencies). If this file is used in production and the operator forgets to set the variable, Grafana is exposed with `admin/admin`. Recommendation: fail loudly if unset rather than defaulting to `admin`.
- `3001` conflicts with the staging frontend. Looking at `docs/DEPLOYMENT.md §1`, staging runs on port 3001 (`archiv-staging` project maps `PORT_FRONTEND=3001`). Exposing `"${PORT_GRAFANA:-3001}:3000"` on the same host will cause a port conflict if staging and the observability stack cohabit. The default 3001 is a collision waiting to happen.
- `GF_AUTH_ANONYMOUS_ENABLED: "false"` is good: this prevents unauthenticated access. The acceptance criteria also verify this explicitly, which is correct.
- `GF_ANALYTICS_REPORTING_ENABLED: "false"` is good: it prevents Grafana from phoning home to grafana.com with usage data. This aligns with the self-hosted philosophy.
- `ports: - "${PORT_GRAFANA:-3001}:3000"` without binding to `127.0.0.1`. In the production stack, all ports are explicitly bound to `127.0.0.1` only (see `docker-compose.prod.yml`: `"127.0.0.1:${PORT_BACKEND}:8080"`). If this observability compose is used in production, Grafana should follow the same pattern: `"127.0.0.1:${PORT_GRAFANA:-3001}:3000"`. Without this, Grafana is reachable from any network interface, potentially externally if the host firewall has a gap.
- `curl` URL uses `latest/download`; consider pinning to a specific revision number to prevent supply-chain drift on re-download).

Recommendations
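Taken together, the password, port, and interface-binding points above would change the service definition roughly as sketched here. The service name `grafana` follows the acceptance criteria, and `3002` is just one of the non-colliding defaults suggested in this thread, not a settled value.

```yaml
# Hardening sketch (fragment): fail on a missing password, bind to loopback only.
services:
  grafana:
    environment:
      # ":?" makes Compose abort with this message if the variable is unset or empty.
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}
    ports:
      # Loopback only; reach Grafana via a reverse proxy or SSH tunnel.
      # 3002 is an illustrative default that avoids staging's 3001.
      - "127.0.0.1:${PORT_GRAFANA:-3002}:3000"
```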
- Change `GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin}` to `GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}`. Compose will refuse to start if the variable is missing, rather than silently defaulting to the well-known password.
- `3001` to something that does not collide with staging: `${PORT_GRAFANA:-3002}:3000`, or just remove the default and require it in `.env.example`.
- `127.0.0.1` in any production-facing config: `"127.0.0.1:${PORT_GRAFANA}:3000"`. Add a comment explaining this is accessed via Caddy or SSH tunnel, not directly from the internet.
- `latest/download` to a specific revision number in the `curl` commands and document the pinned revision alongside the JSON file in the commit message.

🔧 Tobias Wendt, DevOps & Platform Engineer
Observations
- `grafana/grafana-oss:11.6.1` is a specific version tag, which is good. Matches the project standard for pinned images.
- `docs/DEPLOYMENT.md §1` states: "Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001)." Grafana defaulting to port 3001 will conflict with the staging frontend on a shared host. The default should be a non-colliding port (e.g., 3002 or a higher port like 9091).
- `obs-net` is a new isolated network. This is structurally fine for a separate compose file, but Prometheus (in a linked issue) will need to reach the Spring Boot backend on `archiv-net` to scrape `/actuator/prometheus`. If Prometheus lives on `obs-net` only, it cannot reach the backend. The implementer needs to add `archiv-net` (or its prod equivalent) to the Prometheus service, or expose the Spring Boot management port on a shared network. This issue spec is silent on this cross-network topology, which is the single biggest implementation risk.
- `grafana_data` named volume is correct. Dashboard state, user preferences, and alerts persist across upgrades. Good pattern, consistent with the project's named-volume approach for persistence.
- `http://localhost:3000/api/health`. Without a healthcheck, `depends_on: condition: service_healthy` cannot be used by downstream services and Compose can't detect an unhealthy Grafana on startup.
- `depends_on` on prometheus/loki/tempo is correct: Grafana provisioning health checks fail at data-source load time if the backends aren't up. This is well-specified.
- `GRAFANA_ADMIN_PASSWORD` in `.env.example`: the issue says "Add `GRAFANA_ADMIN_PASSWORD=changeme` to `.env.example`." This is fine as a task but must actually land in the PR.
- `docs/infrastructure/production-compose.md` already flags that "Loki + Grafana with >30 days retention" may require upgrading from CX32 to CX42 (adding ~12 EUR/mo). The PR description should note this or update the doc to reflect that the observability stack is now deployed.

Recommendations
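The missing healthcheck raised in the observations above could be sketched as follows. The endpoint comes from the review comment; the 30s interval and start period mirror what the delivery summary later reports. The use of `wget` (assumed to be present in the image), the timeout, and the retry count are assumptions.

```yaml
# Healthcheck sketch (fragment) for the Grafana service.
services:
  grafana:
    healthcheck:
      # Assumption: wget is available inside the image.
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 5s        # assumption
      retries: 5         # assumption
      start_period: 30s
    depends_on:
      # Requires each backend to define its own healthcheck.
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy
      tempo:
        condition: service_healthy
```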
- `3001` to `3002` (or omit the default entirely and require it in `.env.example`) to avoid the staging frontend collision.
- `obs-net` reach the Spring Boot backend on `archiv-net`? This is the implementation dependency that will block the "data sources show green" acceptance criterion.
- `docs/infrastructure/production-compose.md §Observability stack — not yet deployed` to remove that section and point to the new compose file once this lands.
- `docs/DEPLOYMENT.md §4 Logs + observability` to describe the new stack.
- `127.0.0.1` in production usage: `"127.0.0.1:${PORT_GRAFANA}:3000"`, consistent with how all other ports are handled in `docker-compose.prod.yml`.

🧪 Sara Holt, Senior QA Engineer
Observations
- `GET /api/datasources/<id>/health`) rather than string-matching the UI. For now, this is a manual verification step; acceptable, but worth noting.
- `docker compose -f docker-compose.observability.yml down grafana && docker compose -f docker-compose.observability.yml up -d grafana`.

Recommendations
- `curl -s http://localhost:3001/api/health | jq .database` should return `"ok"`. Document this in the issue as a quick sanity check alongside the UI walkthrough.
- `GRAFANA_ADMIN_PASSWORD` is set in `.env.example` and that the variable is not committed with a real password anywhere in the repo.

🎨 Leonie Voss, UX Designer & Accessibility Strategist
Observations
This is a backend infrastructure issue — no frontend Svelte components, no Tailwind, no user-facing UI to design. Grafana's UI is owned by the Grafana project and is not part of the Familienarchiv design system.
From a UX standpoint, the one user-facing consideration is discoverability and access: how does the operator (Marcel) know Grafana is running, on which port, and how to log in? The issue covers this in the acceptance criteria ("Grafana UI accessible at `http://localhost:3001`, login: admin / value of `GRAFANA_ADMIN_PASSWORD`"), which is sufficient.

No concerns from a brand, accessibility, or responsive-design angle: Grafana is an internal operator tool, not part of the family-facing application. The senior users (60+) who are the primary audience for accessibility concerns will never interact with Grafana.
One minor note: if a Grafana link is ever surfaced in the Familienarchiv admin panel (e.g., a "View metrics" button for admins), that would require UX review at that time. For now, Grafana is accessed directly by its port — no UI work required.
📋 Elicit — Requirements Engineer
Observations
- `grafana_data`. The spec covers that the volume is named and persistent, but does not specify how large it will grow or what the retention policy is for dashboard state. For a dev environment this is a non-issue. For production, Grafana's SQLite database inside `grafana_data` grows slowly but steadily. No action needed now, but worth noting for the production deployment checklist.

Recommendations
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Networking
- `archiv-net`; the observability services run on `obs-net`. Prometheus must scrape the backend's `/actuator/prometheus` endpoint, but can't do so if it's isolated to `obs-net` only. Options: (a) add the application's `archiv-net` to the Prometheus container definition (simplest, one line); (b) expose the backend management port on a named shared network; (c) use Caddy or a sidecar to bridge. Option (a) is the obvious default for a single-host setup. (Raised by: Markus, Tobias)

Configuration / Security
- `GF_SECURITY_ADMIN_PASSWORD` fallback: silent `admin` default vs. hard fail on unset. Current spec: `${GRAFANA_ADMIN_PASSWORD:-admin}`. If this compose file reaches production and the env var is not set, Grafana starts with the well-known default password. Options: (a) change to `${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}`, so Compose refuses to start if unset; (b) keep the `-admin` default and rely on operator discipline + `.env.example` documentation. Option (a) is the safer choice for any file that could be used in production. (Raised by: Nora, Tobias)
- Default port `3001` collides with the staging frontend. `docs/DEPLOYMENT.md §1` explicitly maps staging to port 3001. The Grafana default `${PORT_GRAFANA:-3001}` will conflict on a shared host. Options: (a) change default to `3002` or `9091`; (b) remove the default entirely and require it in `.env.example` (consistent with how `PORT_BACKEND`, `PORT_FRONTEND`, etc. are handled). Either is fine; just pick one to avoid the collision. (Raised by: Nora, Tobias, Markus)

Implementation complete: branch `feat/issue-577-grafana`

What was implemented
Commit 1: `feat(observability): add Grafana with provisioned datasources and dashboards`
- `obs-grafana` service (`grafana/grafana-oss:11.6.1`) to `docker-compose.observability.yml`
- `127.0.0.1:${PORT_GRAFANA:-3001}:3000` (127.0.0.1-only per security convention)
- `obs-net` only (no `archiv-net`; Grafana only needs to reach Prometheus/Loki/Tempo, not the app)
- `depends_on: prometheus, loki, tempo`
- `grafana_data` volume was already defined from #572
- `infra/observability/grafana/provisioning/datasources/datasources.yml` with:
  - `prometheus`, isDefault: true)
  - `loki`) with `traceId` derived field linking → Tempo
  - `tempo`) with `tracesToLogsV2` → Loki, service map → Prometheus, node graph enabled
- `infra/observability/grafana/provisioning/dashboards/dashboards.yml` (file provider, disableDeletion: true, 30s update interval)
- `node-exporter-full.json` (ID 1860): no template variable substitution needed
- `spring-boot-observability.json` (ID 17175): replaced `${DS_PROMETHEUS}` and `${DS_LOKI}` uid references with `prometheus`/`loki`
- `loki-logs.json` (ID 13639): replaced `"${DS_LOKI}"` string with `{"type": "loki", "uid": "loki"}` object
- `GRAFANA_ADMIN_PASSWORD=changeme` to `.env.example` (observability section)

Commit 2: `docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram`
- `docs/DEPLOYMENT.md`: added `obs-grafana` row to services table, Grafana access details block (URL, credentials, datasources, dashboard list), `PORT_GRAFANA` and `GRAFANA_ADMIN_PASSWORD` to env vars table
- `docs/architecture/c4/l2-containers.puml`: replaced placeholder Grafana entry with pinned image tag, expanded observability boundary with `node_exporter` and `cadvisor` containers that were missing, added `Rel()` edges for Grafana → Prometheus (HTTP 9090), Grafana → Loki (HTTP 3100), Grafana → Tempo (HTTP 3200)

Validation
`docker compose -f docker-compose.observability.yml config` passes cleanly.

✅ Implemented and merged via PR #589
What was delivered
- `obs-grafana` service added to `docker-compose.observability.yml` (`grafana/grafana-oss:11.6.1`, port `127.0.0.1:${PORT_GRAFANA:-3001}:3000`, `obs-net` only)
- `infra/observability/grafana/provisioning/datasources/datasources.yml`):
  - `prometheus`)
  - `loki`)
  - `tempo`)
- `infra/observability/grafana/provisioning/dashboards/dashboards.yml`)
- `/api/health`, 30s interval, 30s start_period)
- `depends_on` with `condition: service_healthy` for prometheus, loki, tempo
- `GRAFANA_ADMIN_PASSWORD` added to `.env.example`
- `docs/DEPLOYMENT.md` updated with access URL, credentials, env var table
- `docs/architecture/c4/l2-containers.puml` updated with Grafana container + 3 Rel edges

Commits
- `f3f8345b` `feat(observability): add Grafana with provisioned datasources and dashboards`
- `c99321e5` `docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram`
- `457c1d3a` `fix(observability): add grafana healthcheck and service_healthy depends_on`