devops(observability): add Grafana with provisioned Prometheus, Loki, and Tempo data sources and pre-imported dashboards #577


marcel · 2026-05-14T15:04:53+02:00

marcel commented

2026-05-14 15:04:53 +02:00

Context

Grafana is the single UI for all observability signals — metrics (Prometheus), logs (Loki), and traces (Tempo). Data sources and dashboards are provisioned via config files so they are available immediately after docker compose up with no manual clicking in the UI.

Depends on: Prometheus issue, Loki issue, Tempo issue (data sources must exist for health checks to pass)

Service to Add

```yaml
# docker-compose.observability.yml

grafana:
  image: grafana/grafana-oss:11.6.1   # pin to latest stable
  container_name: obs-grafana
  restart: unless-stopped
  volumes:
    - ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
    - grafana_data:/var/lib/grafana
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin}
    GF_USERS_ALLOW_SIGN_UP: "false"
    GF_AUTH_ANONYMOUS_ENABLED: "false"
    GF_ANALYTICS_REPORTING_ENABLED: "false"
    GF_SERVER_ROOT_URL: "%(protocol)s://%(domain)s:%(http_port)s/"
  ports:
    - "${PORT_GRAFANA:-3001}:3000"
  networks:
    - obs-net
  depends_on:
    - prometheus
    - loki
    - tempo
```

Add GRAFANA_ADMIN_PASSWORD=changeme to .env.example.

Config Files to Create

`infra/observability/grafana/provisioning/datasources/datasources.yml`

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceId":"([a-f0-9]+)"'
          name: TraceID
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
        filterBySpanID: false
      serviceMap:
        datasourceUid: prometheus
      search:
        hide: false
      nodeGraph:
        enabled: true
```

`infra/observability/grafana/provisioning/dashboards/dashboards.yml`

```yaml
apiVersion: 1

providers:
  - name: default
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true
```

Dashboard JSON files

Download the following dashboard JSONs from grafana.com and commit them to infra/observability/grafana/provisioning/dashboards/:

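| File name | Grafana ID | Purpose |
|-----------|------------|---------|
| `node-exporter-full.json` | [1860](https://grafana.com/grafana/dashboards/1860) | Host CPU / RAM / disk / network |
| `cadvisor-docker.json` | [14282](https://grafana.com/grafana/dashboards/14282) | Per-container resource usage |
| `spring-boot-statistics.json` | [19004](https://grafana.com/grafana/dashboards/19004) | JVM heap, GC, HTTP rates |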

Use curl -s https://grafana.com/api/dashboards/<ID>/revisions/latest/download -o <file>.json to download. Commit the JSON files directly — do not reference them by ID at runtime.

Acceptance Criteria

docker compose -f docker-compose.observability.yml up -d grafana starts without error
Grafana UI accessible at http://localhost:3001 (login: admin / value of GRAFANA_ADMIN_PASSWORD)
All three data sources show green "Data source connected and labels found" status under Connections → Data sources (Configuration → Data Sources in older Grafana versions)
"Node Exporter Full" dashboard loads and displays CPU/RAM graphs (requires node-exporter to be running)
"Spring Boot Statistics" dashboard loads (panels may show "No data" until backend instrumentation issue is complete — that is acceptable)
Trace-to-logs correlation works: open any trace in Grafana Explore → Tempo, click a span, "Related logs" opens the correlated Loki stream
Anonymous access and self-service sign-up are disabled (verify: open a private browser window, confirm the login page appears and offers no sign-up option)

Definition of Done

All acceptance criteria checked
All three config files and three dashboard JSON files committed
Committed on a feature branch, PR opened against main


marcel added this to the Observability Stack — Grafana LGTM + GlitchTip milestone 2026-05-14 15:04:53 +02:00

marcel added the P2-medium devops phase-7: monitoring labels 2026-05-14 15:06:12 +02:00

marcel commented

2026-05-14 15:28:31 +02:00

🏗️ Markus Keller — Senior Application Architect

Observations

New Docker service → doc update required. Adding Grafana to docker-compose.observability.yml triggers the mandatory update table: docs/architecture/c4/l2-containers.puml needs Grafana, Prometheus, Loki, and Tempo added as containers, and docs/DEPLOYMENT.md needs the topology diagram updated. The current l2-containers.puml has no observability containers.
ADR warranted. The issue documents the docker-compose.observability.yml separation (keeping observability out of the main docker-compose.yml). This is an architectural decision with lasting operational consequences — why a separate file, why these three data sources, what the upgrade path looks like. docs/infrastructure/production-compose.md already references #498 as the follow-up; an ADR formalizes the rationale. The next ADR number is 015 (ADR-014 is upload-artifact-v3-pin.md).
Network isolation question. The issue specifies obs-net as the Grafana network. The application services run on archiv-net. Grafana itself does not scrape anything; Prometheus does. For Prometheus to scrape the Spring Boot /actuator/prometheus endpoint, either both stacks must share a network or Prometheus must join archiv-net directly. The issue does not clarify how Prometheus (in a separate issue) is expected to reach the backend. This is a dependency the implementer must resolve.
docker-compose.observability.yml file naming. The existing prod stack deliberately uses a standalone compose (ADR-009). A separate docker-compose.observability.yml is consistent with that pattern. No conflict here.
docs/DEPLOYMENT.md §4 Logs + observability references the observability stack but currently says it is "not yet deployed." Once this issue lands, that section needs updating.

Recommendations

Write ADR-015 before or alongside the implementation: "Observability stack deployed as separate compose file with obs-net, provisioned via infra/observability/." Context: why a separate file (avoids coupling to app container restart cycles, as noted in production-compose.md). Decision: separate compose with explicit network join for Prometheus→backend scrape.
In l2-containers.puml, add Grafana, Prometheus, Loki, and Tempo as containers inside a System_Boundary(obs, "Observability Stack") and add relationship lines: Prometheus→backend (scrapes metrics), Loki→Docker log driver (ingests logs), Grafana→all three (queries).
Clarify network topology in the issue or implementation: either Prometheus joins archiv-net (so it can scrape the backend directly), or the backend's management port is exposed on a shared network. The issue spec should specify this explicitly since it says "data sources must exist for health checks to pass."


marcel commented

2026-05-14 15:28:43 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

No application code changes required. This issue is infrastructure-only (new docker-compose.observability.yml + provisioning YAML + dashboard JSON files). There is no backend Java code, no SvelteKit route, and no TypeScript to write or test. TDD does not apply here directly.
Dashboard JSON download is a manual step. The issue specifies curl -s https://grafana.com/api/dashboards/<ID>/revisions/latest/download to fetch three dashboard JSONs and commit them. This is a reproducible, documented step — fine. The important thing is that the curl commands are scripted or documented in a shell script committed alongside the files (e.g., infra/observability/grafana/provisioning/dashboards/download-dashboards.sh) so future upgrades are one command, not a hunt through the issue body.
datasources.yml uses datasourceUid: tempo as a literal string. In Grafana provisioning, datasourceUid in derived fields must match the actual UID of the provisioned data source. When data sources are provisioned via files without an explicit uid: field, Grafana auto-generates a UID. The Loki→Tempo derived field datasourceUid: tempo will silently fail trace correlation unless either (a) a uid: tempo is added to the Tempo data source entry, or (b) the value matches the auto-generated UID. The same applies to tracesToLogsV2.datasourceUid: loki and serviceMap.datasourceUid: prometheus. Fix: add explicit uid: fields to each data source entry matching the string references.
disableDeletion: true in dashboards.yml is the right call: it keeps provisioned dashboards from being deleted. It does not by itself preserve manual UI edits, so the committed JSON files remain the source of truth.

Recommendations

Add uid: fields to each data source in datasources.yml:
```
- name: Prometheus
  uid: prometheus
  ...
- name: Loki
  uid: loki
  ...
- name: Tempo
  uid: tempo
  ...
```
This makes the cross-referencing in derived fields work deterministically rather than relying on auto-generated UIDs.
Create infra/observability/grafana/provisioning/dashboards/download-dashboards.sh with the three curl commands from the issue body. Commit the script alongside the downloaded JSON files so dashboard versions are reproducible and upgradeable.
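A minimal sketch of what such a script could look like, assuming a POSIX shell and curl on the maintainer's machine. The dashboard IDs and the URL pattern come from the issue body; the `--run` flag, the function names, and the `name:id:revision` entry format are illustrative assumptions. The revisions are left at `latest` here because concrete revision numbers would still need to be looked up on grafana.com.

```shell
#!/bin/sh
# Sketch of download-dashboards.sh (hypothetical layout, not an existing script).

# Entries are "file-name:dashboard-id:revision". IDs are from the issue body;
# replace "latest" with a concrete revision number to pin a dashboard.
DASHBOARDS="node-exporter-full:1860:latest cadvisor-docker:14282:latest spring-boot-statistics:19004:latest"

# Build the grafana.com download URL for a dashboard ID and revision.
dashboard_url() {
  printf 'https://grafana.com/api/dashboards/%s/revisions/%s/download\n' "$1" "${2:-latest}"
}

# Download every listed dashboard into the target directory (needs network access).
download_all() {
  target_dir="${1:-.}"
  for entry in $DASHBOARDS; do
    name="${entry%%:*}"        # text before the first colon
    rest="${entry#*:}"         # "id:revision"
    id="${rest%%:*}"
    revision="${rest#*:}"
    curl -fsSL "$(dashboard_url "$id" "$revision")" -o "$target_dir/$name.json" || return 1
  done
}

# Only touch the network when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  download_all "${2:-.}"
fi
```

A possible invocation would be `./download-dashboards.sh --run infra/observability/grafana/provisioning/dashboards`, followed by committing the resulting JSON files.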


marcel commented

2026-05-14 15:29:00 +02:00

🔒 Nora Steiner — Application Security Engineer

Observations

GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin} with a hardcoded fallback. The :-admin default means that if GRAFANA_ADMIN_PASSWORD is unset or empty, Grafana starts with the well-known default password. For a local dev stack this is tolerable, but the .env.example entry says GRAFANA_ADMIN_PASSWORD=changeme, so both defaults are weak and discoverable. More importantly, the issue does not specify whether this compose file will also be used in production (the depends_on section references Prometheus, Loki, and Tempo, all of which are production dependencies). If this file is used in production and the operator forgets to set the variable, Grafana is exposed with admin/admin. Recommendation: fail loudly if unset rather than defaulting to admin.
Grafana port 3001 conflicts with the staging frontend. Looking at docs/DEPLOYMENT.md §1, staging runs on port 3001 (archiv-staging project maps PORT_FRONTEND=3001). Exposing "${PORT_GRAFANA:-3001}:3000" on the same host will cause a port conflict if staging and the observability stack cohabit. The default 3001 is a collision waiting to happen.
GF_AUTH_ANONYMOUS_ENABLED: "false" is good — this prevents unauthenticated access. The acceptance criteria also verify this explicitly, which is correct.
GF_ANALYTICS_REPORTING_ENABLED: "false" is good — prevents Grafana from phoning home to grafana.com with usage data. This aligns with the self-hosted philosophy.
No network isolation for Grafana's port. The issue spec shows ports: - "${PORT_GRAFANA:-3001}:3000" without binding to 127.0.0.1. In the production stack, all ports are explicitly bound to 127.0.0.1 only (see docker-compose.prod.yml — "127.0.0.1:${PORT_BACKEND}:8080"). If this observability compose is used in production, Grafana should follow the same pattern: "127.0.0.1:${PORT_GRAFANA:-3001}:3000". Without this, Grafana is reachable from any network interface — potentially externally if the host firewall has a gap.
Dashboard JSON files from grafana.com contain community-contributed code. This is standard practice in the Grafana ecosystem, but worth noting: commit the downloaded files and pin the specific revision (the curl URL uses latest/download — consider pinning to a specific revision number to prevent supply-chain drift on re-download).

Recommendations

Change GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin} to GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}. Compose will refuse to start if the variable is missing, rather than silently defaulting to the well-known password.
Change the default port from 3001 to something that does not collide with staging: ${PORT_GRAFANA:-3002}:3000 or just remove the default and require it in .env.example.
Bind the port to 127.0.0.1 in any production-facing config: "127.0.0.1:${PORT_GRAFANA}:3000". Add a comment explaining this is accessed via Caddy or SSH tunnel, not directly from the internet.
Pin the dashboard download revision: change latest/download to a specific revision number in the curl commands and document the pinned revision alongside the JSON file in the commit message.
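Compose interpolation borrows the `:-` and `:?` forms from POSIX shell parameter expansion, so the difference between the two can be tried out in a plain shell. The following is a small illustration under that assumption; the function names are made up for the example and nothing here talks to Docker.

```shell
#!/bin/sh
# Demonstrates the two expansion forms from the recommendation in a POSIX shell.

# ":-" silently substitutes a fallback when the variable is unset or empty.
with_default() {
  ( unset GRAFANA_ADMIN_PASSWORD
    printf '%s\n' "${GRAFANA_ADMIN_PASSWORD:-admin}" )
}

# ":?" aborts with the given message when the variable is unset or empty.
# The subshell keeps the abort from terminating the calling script.
required() {
  ( unset GRAFANA_ADMIN_PASSWORD
    : "${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}"
    printf 'started\n' )
}

with_default                                   # falls back to the weak default
required 2>/dev/null || echo "refused to start"
```

The first call shows the fallback being used without any warning, while the second never reaches its `printf` because the expansion fails first.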


marcel commented

2026-05-14 15:29:18 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Observations

Grafana image is pinned correctly. grafana/grafana-oss:11.6.1 is a specific version tag — good. Matches the project standard for pinned images.
Port 3001 default conflicts with the staging frontend. docs/DEPLOYMENT.md §1 states: "Production and staging cohabit on the same host via docker compose project names: archiv-production (ports 8080/3000) and archiv-staging (ports 8081/3001)." Grafana defaulting to port 3001 will conflict with the staging frontend on a shared host. The default should be a non-colliding port (e.g., 3002 or a higher port like 9091).
obs-net is a new isolated network. This is structurally fine for a separate compose file, but Prometheus (in a linked issue) will need to reach the Spring Boot backend on archiv-net to scrape /actuator/prometheus. If Prometheus lives on obs-net only, it cannot reach the backend. The implementer needs to add archiv-net (or its prod equivalent) to the Prometheus service, or expose the Spring Boot management port on a shared network. This issue spec is silent on this cross-network topology, which is the single biggest implementation risk.
grafana_data named volume is correct. Dashboard state, user preferences, and alerts persist across upgrades. Good pattern, consistent with the project's named-volume approach for persistence.
Healthcheck is absent from the Grafana service spec. Per Tobias's rules, all services should declare a healthcheck. Grafana's built-in health endpoint is http://localhost:3000/api/health. Without a healthcheck, depends_on: condition: service_healthy cannot be used by downstream services and Compose can't detect an unhealthy Grafana on startup.
depends_on on prometheus/loki/tempo gives a sensible start order. Note that the short form only waits for those containers to be started, not to be ready, so data source health checks can still fail briefly after startup until the backends respond.
No GRAFANA_ADMIN_PASSWORD in .env.example — the issue says "Add GRAFANA_ADMIN_PASSWORD=changeme to .env.example." This is fine as a task but must actually land in the PR.
Resource consumption note missing. docs/infrastructure/production-compose.md already flags that "Loki + Grafana with >30 days retention" may require upgrading from CX32 to CX42 (adding ~12 EUR/mo). The PR description should note this or update the doc to reflect that the observability stack is now deployed.

Recommendations

Add a Grafana healthcheck to the service definition:

```yaml
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
  interval: 15s
  timeout: 5s
  retries: 5
  start_period: 30s
```

Change the default port from 3001 to 3002 (or omit the default entirely and require it in .env.example) to avoid the staging frontend collision.
Explicitly document the cross-network topology in the issue or in a comment in the compose file: how does Prometheus on obs-net reach the Spring Boot backend on archiv-net? This is the implementation dependency that will block the "data sources show green" acceptance criterion.
Update docs/infrastructure/production-compose.md §Observability stack — not yet deployed to remove that section and point to the new compose file once this lands.
Update docs/DEPLOYMENT.md §4 Logs + observability to describe the new stack.
Bind the port to 127.0.0.1 in production usage: "127.0.0.1:${PORT_GRAFANA}:3000" — consistent with how all other ports are handled in docker-compose.prod.yml.
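Until a container healthcheck is in place, the same /api/health endpoint can be polled from the host. This is only a sketch: the helper names are hypothetical, and the default port 3002 assumes the non-colliding port suggested above is adopted.

```shell
#!/bin/sh
# Host-side wait loop: run a probe command until it succeeds or retries run out.

# Health endpoint named in the review; the host port is an assumption.
grafana_health_url() {
  printf 'http://localhost:%s/api/health\n' "${PORT_GRAFANA:-3002}"
}

# Usage: wait_for RETRIES DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  retries="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                      # probe succeeded
    fi
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$delay"                # pause only when another attempt follows
    fi
    attempt=$((attempt + 1))
  done
  return 1                          # every attempt failed
}

# Example (needs a running stack):
#   wait_for 5 15 curl -fsS "$(grafana_health_url)" && echo "grafana healthy"
```

The retry count and delay in the example mirror the `retries` and `interval` values of the proposed healthcheck.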


marcel commented

2026-05-14 15:29:34 +02:00

🧪 Sara Holt — Senior QA Engineer

Observations

Acceptance criteria are well-specified and mostly verifiable. The seven AC items cover startup, UI access, data source health, dashboard loading, "No data" tolerance for not-yet-instrumented panels, trace-to-log correlation, and anonymous sign-up verification. This is a solid list.
One AC is ambiguous: "All three data sources show green 'Data source connected and labels found' status." This exact UI string may differ between Grafana versions. The AC is fine for manual verification, but if this is ever automated it should use the Grafana HTTP API (GET /api/datasources/<id>/health) rather than string-matching the UI. For now, this is a manual verification step — acceptable, but worth noting.
Trace-to-logs correlation AC depends on a running trace. "Open any trace in Grafana Explore → Tempo, click a span" — this requires the backend to be instrumented and sending traces. The issue notes the Spring Boot statistics dashboard "may show 'No data' until backend instrumentation issue is complete." But the trace-to-logs AC implies traces will exist. If no traces are flowing yet, this AC cannot be verified. The issue should either (a) note this AC is conditional on the Tempo instrumentation issue being complete, or (b) provide a way to generate a synthetic trace for verification.
No automated test for anonymous sign-up lockout. The AC says "open a private browser window, confirm no sign-up option", which is a manual step. For a security control (anonymous access and self-registration disabled), a Playwright test would give permanent regression coverage:
```typescript
import { test, expect } from '@playwright/test';

// Regression check: the Grafana login page must not expose a sign-up link.
test('grafana does not offer anonymous sign-up', async ({ page }) => {
  await page.goto('http://localhost:3001');
  await expect(page.getByText('Sign up')).not.toBeVisible();
});
```
This is optional given it's an infrastructure component, but worth flagging.
Definition of Done is clear and complete. All AC checked + config files + dashboard JSONs committed + feature branch + PR. No gaps here.
No rollback procedure specified. What happens if Grafana fails to start after the provisioning files are committed? The operator should know: docker compose -f docker-compose.observability.yml down grafana && docker compose -f docker-compose.observability.yml up -d grafana.

Recommendations

Annotate the trace-to-logs correlation AC with a conditional: "(Requires backend tracing instrumentation to be active; skip if Tempo has no data yet)." This prevents a false "failed" status on this issue.
The acceptance criteria verification can be streamlined with a one-liner smoke check for the API health endpoint: curl -s http://localhost:3001/api/health | jq .database should return "ok". Document this in the issue as a quick sanity check alongside the UI walkthrough.
Add a note to the Definition of Done: confirm GRAFANA_ADMIN_PASSWORD is set in .env.example and that the variable is not committed with a real password anywhere in the repo.


marcel commented

2026-05-14 15:29:41 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Observations

This is a backend infrastructure issue — no frontend Svelte components, no Tailwind, no user-facing UI to design. Grafana's UI is owned by the Grafana project and is not part of the Familienarchiv design system.

From a UX standpoint, the one user-facing consideration is discoverability and access: how does the operator (Marcel) know Grafana is running, on which port, and how to log in? The issue covers this in the acceptance criteria ("Grafana UI accessible at http://localhost:3001, login: admin / value of GRAFANA_ADMIN_PASSWORD"), which is sufficient.

No concerns from a brand, accessibility, or responsive-design angle — Grafana is an internal operator tool, not part of the family-facing application. The senior users (60+) who are the primary audience for accessibility concerns will never interact with Grafana.

One minor note: if a Grafana link is ever surfaced in the Familienarchiv admin panel (e.g., a "View metrics" button for admins), that would require UX review at that time. For now, Grafana is accessed directly by its port — no UI work required.


marcel commented

2026-05-14 15:29:56 +02:00

📋 Elicit — Requirements Engineer

Observations

Scope is well-bounded and implementation-ready. The issue provides a concrete service definition, three config file templates, three dashboard IDs, seven acceptance criteria, and a Definition of Done. This is a high-quality spec — no vague requirements.
One acceptance criterion has a hidden dependency not flagged in the issue. AC 6 ("Trace-to-logs correlation works: open any trace in Grafana Explore → Tempo, click a span, 'Related logs' opens the correlated Loki stream") is only verifiable when the backend is instrumented with OpenTelemetry tracing. The issue says the Spring Boot statistics dashboard "may show 'No data'" (acceptable) but does not apply the same conditional reasoning to the trace correlation AC. This creates a situation where the issue could be "done" by the Definition of Done but one AC cannot be verified. Recommendation: add a conditional note to AC 6.
"Depends on: Prometheus issue, Loki issue, Tempo issue" — these dependencies are stated but not linked. Adding Gitea issue numbers to those dependencies (e.g., "Depends on: #NNN Prometheus, #NNN Loki, #NNN Tempo") would make the blocking relationship navigable and allow Gitea's dependency graph to reflect the actual order.
The issue is part of milestone "Observability Stack — Grafana LGTM + GlitchTip" but it is not clear which issue is the integration test / "everything works together" verification. For a multi-issue observability milestone, there should be one umbrella issue or acceptance test that verifies the full stack (Prometheus scrapes backend → Grafana shows live metrics → Loki receives logs → Tempo receives traces). If that issue exists, link to it. If not, consider adding it.
NFR: retention policy for grafana_data. The spec covers that the volume is named and persistent, but does not specify how large it will grow or what the retention policy is for dashboard state. For a dev environment this is a non-issue. For production, Grafana's SQLite database inside grafana_data grows slowly but steadily. No action needed now — worth noting for the production deployment checklist.

Recommendations

Link the three dependency issues by number rather than by name in the "Depends on" section.
Add to AC 6: "(Conditional: only verifiable once backend OpenTelemetry tracing is active. Mark this AC as 'pending' if Tempo has no data yet — it does not block closing the issue.)"
Consider whether the milestone needs a final "end-to-end smoke test" issue that verifies all four components together after each individual issue closes.


marcel commented

2026-05-14 15:30:11 +02:00

🗳️ Decision Queue — Action Required

3 decisions need your input before implementation starts.

Infrastructure / Networking

Cross-network topology: how does Prometheus reach the Spring Boot backend? The app services run on archiv-net; the observability services run on obs-net. Prometheus must scrape the backend's /actuator/prometheus endpoint, but can't do so if it's isolated to obs-net only. Options: (a) add the application's archiv-net to the Prometheus container definition — simplest, one line; (b) expose the backend management port on a named shared network; (c) use Caddy or a sidecar to bridge. Option (a) is the obvious default for a single-host setup. (Raised by: Markus, Tobias)
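Option (a) could look roughly like the fragment below. This is a sketch only, not the merged configuration: the `name:` value is a hypothetical placeholder, because Compose normally prefixes network names with the project name and the real name on the host has to be checked with `docker network ls`.

```yaml
# docker-compose.observability.yml (sketch of option (a), not the merged config)
services:
  prometheus:
    networks:
      - obs-net      # reach Grafana, Loki, and Tempo
      - archiv-net   # reach the backend's /actuator/prometheus endpoint

networks:
  obs-net: {}
  archiv-net:
    # The network is created by the application compose project, so it is
    # declared external here instead of being created a second time.
    external: true
    # Hypothetical name: replace with the actual network name on the host.
    name: archiv-production_archiv-net
```

With this shape only Prometheus straddles both networks; Grafana, Loki, and Tempo stay isolated on `obs-net`.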

Configuration / Security

GF_SECURITY_ADMIN_PASSWORD fallback: silent admin default vs. hard fail on unset. Current spec: ${GRAFANA_ADMIN_PASSWORD:-admin}. If this compose file reaches production and the env var is not set, Grafana starts with the well-known default password. Options: (a) change to ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}, so that Compose refuses to start if the variable is unset; (b) keep the :-admin default and rely on operator discipline + .env.example documentation. Option (a) is the safer choice for any file that could be used in production. (Raised by: Nora, Tobias)
Default port 3001 collides with the staging frontend. docs/DEPLOYMENT.md §1 explicitly maps staging to port 3001. The Grafana default ${PORT_GRAFANA:-3001} will conflict on a shared host. Options: (a) change default to 3002 or 9091; (b) remove the default entirely and require it in .env.example (consistent with how PORT_BACKEND, PORT_FRONTEND, etc. are handled). Either is fine — just pick one to avoid the collision. (Raised by: Nora, Tobias, Markus)
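As a rough illustration of the two configuration decisions above, the fragment below combines the hard-fail password interpolation with a non-colliding port default. It is a sketch under assumptions: `3002` is just one of the candidate ports named above, and the `127.0.0.1` bind follows the convention described for `docker-compose.prod.yml`.

```yaml
# docker-compose.observability.yml (illustrative fragment)
services:
  grafana:
    environment:
      # ":?" makes Compose abort with this message when the variable is
      # unset or empty, instead of silently falling back to "admin".
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set}
    ports:
      # Default moved off 3001, which docs/DEPLOYMENT.md assigns to the
      # staging frontend. 3002 is an assumed choice, not a decided value.
      - "127.0.0.1:${PORT_GRAFANA:-3002}:3000"
```

Running `docker compose -f docker-compose.observability.yml config` without `GRAFANA_ADMIN_PASSWORD` set should then fail early with the message above, which makes the misconfiguration visible before any container starts.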


marcel commented

2026-05-15 04:05:05 +02:00

Implementation complete — branch `feat/issue-577-grafana`

What was implemented

Commit 1: feat(observability): add Grafana with provisioned datasources and dashboards

Added obs-grafana service (grafana/grafana-oss:11.6.1) to docker-compose.observability.yml
- Port bound to 127.0.0.1:${PORT_GRAFANA:-3001}:3000 (127.0.0.1-only per security convention)
- obs-net only (no archiv-net — Grafana only needs to reach Prometheus/Loki/Tempo, not the app)
- depends_on: prometheus, loki, tempo
- grafana_data volume was already defined from #572
Created infra/observability/grafana/provisioning/datasources/datasources.yml with:
- Prometheus (uid: prometheus, isDefault: true)
- Loki (uid: loki) with traceId derived field linking → Tempo
- Tempo (uid: tempo) with tracesToLogsV2 → Loki, service map → Prometheus, node graph enabled
Created infra/observability/grafana/provisioning/dashboards/dashboards.yml (file provider, disableDeletion: true, 30s update interval)
Downloaded and committed 3 dashboard JSONs:
- node-exporter-full.json (ID 1860) — no template variable substitution needed
- spring-boot-observability.json (ID 17175) — replaced ${DS_PROMETHEUS} and ${DS_LOKI} uid references with prometheus / loki
- loki-logs.json (ID 13639) — replaced "${DS_LOKI}" string with {"type": "loki", "uid": "loki"} object
Added GRAFANA_ADMIN_PASSWORD=changeme to .env.example (observability section)
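For orientation, a datasources file matching the description above might look like the following. This is a reconstruction, not the committed file: the `uid` values come from this comment, the URLs and the derived-field regex come from the issue's original spec, and the `serviceMap` and `nodeGraph` keys are the standard Grafana provisioning fields assumed to back "service map → Prometheus" and "node graph enabled".

```yaml
# infra/observability/grafana/provisioning/datasources/datasources.yml (sketch)
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus          # fixed uid so dashboards can reference it directly
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turns a traceId found in a log line into a link to the Tempo trace.
        - name: TraceID
          datasourceUid: tempo
          matcherRegex: '"traceId":"([a-f0-9]+)"'
          url: "$${__value.raw}"   # "$$" escapes "$" in provisioning files

  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # span -> related logs
      serviceMap:
        datasourceUid: prometheus  # service map is built from Prometheus metrics
      nodeGraph:
        enabled: true
```

Pinning the `uid` of each data source is what allows the dashboard JSONs to replace `${DS_PROMETHEUS}` and `${DS_LOKI}` with literal values at commit time.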

Commit 2: docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram

docs/DEPLOYMENT.md: added obs-grafana row to services table, Grafana access details block (URL, credentials, datasources, dashboard list), PORT_GRAFANA and GRAFANA_ADMIN_PASSWORD to env vars table
docs/architecture/c4/l2-containers.puml: replaced placeholder Grafana entry with pinned image tag, expanded observability boundary with node_exporter and cadvisor containers that were missing, added Rel() edges for Grafana → Prometheus (HTTP 9090), Grafana → Loki (HTTP 3100), Grafana → Tempo (HTTP 3200)

Validation

docker compose -f docker-compose.observability.yml config passes cleanly.


marcel referenced a pull request that will close this issue

2026-05-15 04:05:41 +02:00

feat(observability): add Grafana with provisioned datasources and dashboards #589

marcel referenced this issue

2026-05-15 04:08:15 +02:00

feat(observability): add Grafana with provisioned datasources and dashboards #589

marcel closed this issue

2026-05-15 04:35:11 +02:00

marcel commented

2026-05-15 04:35:29 +02:00

✅ Implemented and merged via PR #589

What was delivered

obs-grafana service added to docker-compose.observability.yml (grafana/grafana-oss:11.6.1, port 127.0.0.1:${PORT_GRAFANA:-3001}:3000, obs-net only)
Provisioned datasources (infra/observability/grafana/provisioning/datasources/datasources.yml):
- Prometheus (default, uid: prometheus)
- Loki with traceId → Tempo derived field (uid: loki)
- Tempo with tracesToLogsV2 → Loki correlation + service map + node graph (uid: tempo)
Dashboard provider (infra/observability/grafana/provisioning/dashboards/dashboards.yml)
3 pre-loaded dashboards (datasource template variables substituted at commit time):
- Node Exporter Full
- Spring Boot Observability
- Loki logs
Healthcheck on Grafana (/api/health, 30s interval, 30s start_period)
depends_on with condition: service_healthy for prometheus, loki, tempo
GRAFANA_ADMIN_PASSWORD added to .env.example
docs/DEPLOYMENT.md updated with access URL, credentials, env var table
docs/architecture/c4/l2-containers.puml updated with Grafana container + 3 Rel edges
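The healthcheck and `service_healthy` items above could be expressed along these lines. This is a sketch, not the content of commit `457c1d3a`: the interval and start period are taken from the summary, while `timeout`, `retries`, and the use of `curl` inside the container are assumptions.

```yaml
# docker-compose.observability.yml (illustrative fragment)
services:
  grafana:
    healthcheck:
      # Assumes curl is available in the image; timeout and retries are
      # illustrative values, not taken from the merged commit.
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s
    depends_on:
      # The long form requires each dependency to define its own healthcheck.
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy
      tempo:
        condition: service_healthy
```

With `condition: service_healthy`, Compose delays starting Grafana until all three backends report healthy, so data source provisioning does not race the backends at startup.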

Commits

f3f8345b feat(observability): add Grafana with provisioned datasources and dashboards
c99321e5 docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram
457c1d3a fix(observability): add grafana healthcheck and service_healthy depends_on


Reference: marcel/familienarchiv#577
