Merge pull request 'feat(observability): add Grafana with provisioned datasources and dashboards' (#589) from feat/issue-577-grafana into main
Some checks failed
feat(observability): add Grafana with provisioned datasources and dashboards (#589)
This commit was merged in pull request #589.
@@ -35,6 +35,9 @@ PORT_GRAFANA=3001
PORT_GLITCHTIP=3002
PORT_PROMETHEUS=9090

# Grafana admin password — change this before exposing Grafana beyond localhost
GRAFANA_ADMIN_PASSWORD=changeme

# GlitchTip domain — production: use https://grafana.raddatz.cloud (must match Caddy vhost)
GLITCHTIP_DOMAIN=http://localhost:3002

@@ -136,8 +136,35 @@ services:
      - obs-net # Grafana reaches tempo:3200 over this network

  # --- Dashboards: Grafana ---
  # grafana: (see future issue)
  #

  obs-grafana:
    image: grafana/grafana-oss:11.6.1
    container_name: obs-grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:${PORT_GRAFANA:-3001}:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health | grep -q ok || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
    depends_on:
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy
      tempo:
        condition: service_healthy
    networks:
      - obs-net

  # --- Error Tracking: GlitchTip ---
  # glitchtip: (see future issue)

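The `healthcheck` of the `obs-grafana` service above pipes the body of `/api/health` through `grep -q ok`. A minimal Python sketch of the same idea follows. The function names are hypothetical, and the stricter variant assumes the health endpoint returns a JSON object with a `database` field, which is not shown in this diff.

```python
import json
import urllib.request


def body_reports_ok(body: str) -> bool:
    """Mirror `grep -q ok`: succeed if the response body contains "ok"."""
    return "ok" in body


def database_is_ok(body: str) -> bool:
    """Stricter, hypothetical variant: parse JSON and check the `database` field."""
    try:
        return json.loads(body).get("database") == "ok"
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def probe(url: str = "http://localhost:3000/api/health", timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return body_reports_ok(resp.read().decode("utf-8", "replace"))
    except OSError:
        return False
```

The substring check is as loose as the shell version: any body containing `ok` passes. The JSON variant is stricter, at the cost of depending on the response shape.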
@@ -142,6 +142,8 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `PORT_PROMETHEUS` | Host port for the Prometheus UI (bound to `127.0.0.1` only) | `9090` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3001` | — | — |
| `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
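
Because `GRAFANA_ADMIN_PASSWORD` falls back to `changeme`, a pre-deploy check can catch an unchanged default. The sketch below is one possible way to do that, not part of this PR; the function names are hypothetical, and the fallback mimics the `${GRAFANA_ADMIN_PASSWORD:-changeme}` expression from the Compose file.

```python
import os

DEFAULT_PASSWORD = "changeme"  # default from .env.example and the Compose fallback


def resolve_admin_password(env: dict) -> str:
    """Mimic `${GRAFANA_ADMIN_PASSWORD:-changeme}`: unset or empty uses the default."""
    return env.get("GRAFANA_ADMIN_PASSWORD") or DEFAULT_PASSWORD


def password_problems(env: dict) -> list:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    if resolve_admin_password(env) == DEFAULT_PASSWORD:
        problems.append("GRAFANA_ADMIN_PASSWORD is unset, empty, or still the default")
    return problems


if __name__ == "__main__":
    for problem in password_problems(dict(os.environ)):
        print(f"warning: {problem}")
```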

---

@@ -284,6 +286,25 @@ Current services:
| `obs-loki` | `grafana/loki:3.4.2` | Log aggregation — receives log streams from Promtail. Port 3100 is `expose`-only (not host-bound). |
| `obs-promtail` | `grafana/promtail:3.4.2` | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with `container_name`, `compose_service`, and `compose_project` labels |
| `obs-tempo` | `grafana/tempo:2.7.2` | Distributed trace storage — OTLP gRPC receiver on port 4317, OTLP HTTP on port 4318 (both `archiv-net`-internal). Grafana queries traces on port 3200 (`obs-net`-internal). All ports are `expose`-only (not host-bound). |
| `obs-grafana` | `grafana/grafana-oss:11.6.1` | Unified observability UI — metrics dashboards, log exploration, trace viewer. Bound to `127.0.0.1:${PORT_GRAFANA:-3001}` on the host. |

#### Grafana

| Item | Value |
|---|---|
| URL | `http://localhost:3001` (or `http://localhost:$PORT_GRAFANA`) |
| Username | `admin` |
| Password | `$GRAFANA_ADMIN_PASSWORD` (default: `changeme` — **change before exposing to a network**) |

Datasources are auto-provisioned on first start (Prometheus, Loki, Tempo — no manual setup required). Three dashboards are pre-loaded:

| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host CPU, memory, disk, network |
| Spring Boot Observability | 17175 | JVM metrics, HTTP latency, error rate |
| Loki Logs | 13639 | Log exploration and filtering |

Tempo traces are accessible via Grafana Explore → Tempo datasource, and linked from Loki logs via the `traceId` derived field.
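
To see what the `traceId` derived field picks up, the sketch below applies the same pattern as the `matcherRegex` in the Loki datasource provisioning file, using Python's `re` as a stand-in for Grafana's matcher. The log lines and the helper name are hypothetical.

```python
import re
from typing import Optional

# Same pattern as `matcherRegex` in the Loki datasource provisioning file.
TRACE_ID_PATTERN = re.compile(r'"traceId":"(\w+)"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first captured trace ID, or None if the line has none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical JSON log lines, compact and pretty-printed.
compact = '{"level":"INFO","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","message":"GET /api/documents"}'
spaced = '{"level": "INFO", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"}'

print(extract_trace_id(compact))
print(extract_trace_id(spaced))
```

The pattern allows no whitespace after the colon, so it only matches compact JSON output; a logger that emits `"traceId": "..."` with a space would not produce a link.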

**Loki quick checks** (after ~60 s, run from inside the `obs-loki` container):

@@ -301,7 +322,7 @@ docker exec obs-loki wget -qO- \

**Prefer `compose_service` over `container_name` in LogQL queries** — `container_name` differs between dev (`archive-backend`) and prod (`archiv-production-backend-1`), while `compose_service` is stable (`backend`, `db`, `minio`, etc.).
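
As an illustration of the advice above, this small helper builds a LogQL query keyed on `compose_service`, so the same query works in dev and prod. The helper names are hypothetical and only cover a single label matcher plus an optional line filter.

```python
def _quote(value: str) -> str:
    """Escape backslashes and double quotes for a LogQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def logql_selector(service: str, contains: str = "") -> str:
    """Build a LogQL query that selects streams by the stable `compose_service` label."""
    query = '{compose_service="%s"}' % _quote(service)
    if contains:
        query += ' |= "%s"' % _quote(contains)
    return query


print(logql_selector("backend"))
print(logql_selector("backend", "ERROR"))
```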

-Prometheus port `9090` is bound to `127.0.0.1:${PORT_PROMETHEUS:-9090}` on the host. No other observability ports are host-bound. Full wiring and Grafana dashboards are tracked in issue #581.
+Prometheus port `9090` and Grafana port `3001` are bound to `127.0.0.1` on the host. No other observability ports are host-bound.

---

@@ -17,12 +17,14 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
     Container(mc, "Bucket / Service-Account Init", "MinIO Client (mc)", "One-shot container on startup. Idempotent: creates the archive bucket, the archiv-app service account, and attaches the readwrite policy.")
 }

-System_Boundary(observability, "Observability Stack (docker-compose.observability.yml / archiv-net)") {
-    Container(prometheus, "Prometheus", "prom/prometheus", "Scrapes metrics from backend management port 8081 (/actuator/prometheus). Retention and alert rules TBD — see issue #581.")
+System_Boundary(observability, "Observability Stack (docker-compose.observability.yml)") {
+    Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor. Retention: 30 days.")
+    Container(node_exporter, "Node Exporter", "prom/node-exporter:v1.9.0", "Host-level CPU, memory, disk, and network metrics.")
+    Container(cadvisor, "cAdvisor", "gcr.io/cadvisor/cadvisor:v0.52.1", "Per-container resource metrics.")
     Container(loki, "Loki", "grafana/loki:3.4.2", "Stores log streams from all containers.")
-    Container(promtail, "Promtail", "grafana/promtail:3.4.2", "Ships Docker container logs to Loki via Docker SD")
+    Container(promtail, "Promtail", "grafana/promtail:3.4.2", "Ships Docker container logs to Loki via Docker SD.")
     Container(tempo, "Tempo", "grafana/tempo:2.7.2", "Distributed trace storage. OTLP gRPC receiver on port 4317 (archiv-net). Grafana queries traces on port 3200 (obs-net). All ports internal only.")
-    Container(grafana, "Grafana", "grafana/grafana", "Dashboards and alerting UI. Data sources: Prometheus + Loki + Tempo. Wiring TBD — see issue #581.")
+    Container(grafana, "Grafana", "grafana/grafana-oss:11.6.1", "Unified observability UI — dashboards, logs, traces. Datasources (Prometheus, Loki, Tempo) and three dashboards are auto-provisioned.")
 }

 Rel(user, caddy, "HTTPS", "TLS 1.2/1.3")
@@ -38,5 +40,8 @@ Rel(ocr, storage, "Fetches PDF via presigned URL", "HTTP / S3 presigned")
 Rel(mc, storage, "Bootstraps bucket + service account on startup", "MinIO Client CLI")
 Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")
 Rel(backend, tempo, "Sends distributed traces via OTLP", "gRPC / OTLP / port 4317 (archiv-net)")
+Rel(grafana, prometheus, "Queries metrics", "HTTP 9090")
+Rel(grafana, loki, "Queries logs", "HTTP 3100")
+Rel(grafana, tempo, "Queries traces", "HTTP 3200")

 @enduml

@@ -0,0 +1,10 @@
apiVersion: 1

providers:
  - name: default
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
@@ -0,0 +1,284 @@
{
  "__inputs": [
    {
      "name": "DS_LOKI",
      "label": "Loki",
      "description": "",
      "type": "datasource",
      "pluginId": "loki",
      "pluginName": "Loki"
    }
  ],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "7.1.0"
    },
    {
      "type": "panel",
      "id": "graph",
      "name": "Graph",
      "version": ""
    },
    {
      "type": "panel",
      "id": "logs",
      "name": "Logs",
      "version": ""
    },
    {
      "type": "datasource",
      "id": "loki",
      "name": "Loki",
      "version": "1.0.0"
    }
  ],
  "annotations": {
    "list": [
      {
        "$$hashKey": "object:75",
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "Log Viewer Dashboard for Loki",
  "editable": false,
  "gnetId": 13639,
  "graphTooltip": 0,
  "id": null,
  "iteration": 1608932746420,
  "links": [
    {
      "$$hashKey": "object:59",
      "icon": "bolt",
      "includeVars": true,
      "keepTime": true,
      "tags": [],
      "targetBlank": true,
      "title": "View In Explore",
      "type": "link",
      "url": "/explore?orgId=1&left=[\"now-1h\",\"now\",\"Loki\",{\"expr\":\"{job=\\\"$app\\\"}\"},{\"ui\":[true,true,true,\"none\"]}]"
    },
    {
      "$$hashKey": "object:61",
      "icon": "external link",
      "tags": [],
      "targetBlank": true,
      "title": "Learn LogQL",
      "type": "link",
      "url": "https://grafana.com/docs/loki/latest/logql/"
    }
  ],
  "panels": [
    {
      "aliasColors": {},
      "bars": true,
      "dashLength": 10,
      "dashes": false,
      "datasource": {"type": "loki", "uid": "loki"},
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "links": []
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 3,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 6,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": false,
        "total": false,
        "values": false
      },
      "lines": false,
      "linewidth": 1,
      "nullPointMode": "null",
      "percentage": false,
      "pluginVersion": "7.1.0",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(count_over_time({job=\"$app\"} |= \"$search\" [$__interval]))",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "$$hashKey": "object:168",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        },
        {
          "$$hashKey": "object:169",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "datasource": {"type": "loki", "uid": "loki"},
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "gridPos": {
        "h": 25,
        "w": 24,
        "x": 0,
        "y": 3
      },
      "id": 2,
      "maxDataPoints": "",
      "options": {
        "showLabels": false,
        "showTime": true,
        "sortOrder": "Descending",
        "wrapLogMessage": false
      },
      "targets": [
        {
          "expr": "{job=\"$app\"} |= \"$search\" | logfmt",
          "hide": false,
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "",
      "transparent": true,
      "type": "logs"
    }
  ],
  "refresh": false,
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {},
        "datasource": {"type": "loki", "uid": "loki"},
        "definition": "label_values(job)",
        "hide": 0,
        "includeAll": false,
        "label": "App",
        "multi": false,
        "name": "app",
        "options": [],
        "query": "label_values(job)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "current": {
          "selected": false,
          "text": "",
          "value": ""
        },
        "hide": 0,
        "label": "String Match",
        "name": "search",
        "options": [
          {
            "selected": true,
            "text": "",
            "value": ""
          }
        ],
        "query": "",
        "skipUrlSync": false,
        "type": "textbox"
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {
    "hidden": false,
    "refresh_intervals": [
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ]
  },
  "timezone": "",
  "title": "Logs / App",
  "uid": "sadlil-loki-apps-dashboard",
  "version": 13
}
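The dashboard's panel queries are templates: `$app` comes from the `label_values(job)` variable, `$search` from the "String Match" textbox, and `$__interval` from Grafana itself. The sketch below uses Python's `string.Template` as a rough stand-in for that substitution, to show the LogQL that ends up being sent. It is not Grafana's actual interpolation code, and the variable values are hypothetical.

```python
from string import Template

# Expression copied from the first panel's target in the dashboard JSON.
PANEL_EXPR = 'sum(count_over_time({job="$app"} |= "$search" [$__interval]))'


def render(expr: str, **variables: str) -> str:
    """Substitute $name placeholders; raises KeyError if a variable is missing."""
    return Template(expr).substitute(**variables)


print(render(PANEL_EXPR, app="backend", search="ERROR", __interval="1m"))
```

Both panels select on a `job` label, so the App dropdown is only populated if the log streams in Loki carry that label. Whether they do depends on the Promtail configuration, which is not part of this diff.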
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,38 @@
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://obs-prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    uid: loki
    url: http://obs-loki:3100
    editable: false
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          # "$$" escapes "$" so Grafana's provisioning does not expand it as an environment variable
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://obs-tempo:3200
    editable: false
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        filterBySpanID: false
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
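
The three datasources above reference each other by `uid` (Loki to Tempo, Tempo to Loki and Prometheus), so a typo in one `datasourceUid` silently breaks a link. The sketch below shows one way to check those references. The data is hand-copied from the YAML into plain dicts to stay dependency-free, and the function names are hypothetical.

```python
# Hand-copied subset of the provisioning file: only names, uids, and references.
DATASOURCES = [
    {"name": "Prometheus", "uid": "prometheus", "jsonData": {}},
    {"name": "Loki", "uid": "loki",
     "jsonData": {"derivedFields": [{"name": "TraceID", "datasourceUid": "tempo"}]}},
    {"name": "Tempo", "uid": "tempo",
     "jsonData": {"tracesToLogsV2": {"datasourceUid": "loki"},
                  "serviceMap": {"datasourceUid": "prometheus"}}},
]


def referenced_uids(node):
    """Yield every `datasourceUid` value found anywhere in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasourceUid":
                yield value
            else:
                yield from referenced_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from referenced_uids(item)


def dangling_references(datasources):
    """Return sorted (datasource name, uid) pairs that point at no defined uid."""
    defined = {ds["uid"] for ds in datasources}
    return sorted(
        (ds["name"], uid)
        for ds in datasources
        for uid in referenced_uids(ds.get("jsonData", {}))
        if uid not in defined
    )


print(dangling_references(DATASOURCES))
```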