From 1109ab917bc30aa7c59486e8463c1c9e42c87962 Mon Sep 17 00:00:00 2001
From: Marcel
Date: Fri, 22 May 2026 17:21:27 +0200
Subject: [PATCH] docs(observability): ADR-024 + rotation runbook for grafana_reader
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ADR-024 records the deliberate cross-domain link (obs-grafana joins
archiv-net to query archive-db via the SELECT-only grafana_reader role),
the rejected alternatives (Prometheus exporter, read replica, versioned
migration + flyway repair, hardcoded fallback), and the consequences,
specifically that a Grafana compromise gains TCP reach to archive-db but
is bounded by the role's least-privilege grants.

The DEPLOYMENT.md runbook documents the rotation procedure that
R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD,
restart backend (Flyway re-applies because the resolved checksum
changed), restart obs-grafana (datasource picks up the new env var).
Also calls out the fail-closed startup behavior so operators who hit
IllegalStateException know it is deliberate.

Co-Authored-By: Claude Opus 4.7
---
 docs/DEPLOYMENT.md                            |  25 ++++
 ...na-reads-archive-db-via-bridged-network.md | 123 ++++++++++++++++++
 2 files changed, 148 insertions(+)
 create mode 100644 docs/adr/024-grafana-reads-archive-db-via-bridged-network.md

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index 28169825..c6560a0a 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -430,6 +430,31 @@ docker exec obs-loki wget -qO- \
 
 Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
 
+##### Rotate the `grafana_reader` DB password
+
+The PO Overview dashboard reads `audit_log`, `documents`, and `transcription_blocks` through the SELECT-only `grafana_reader` PostgreSQL role (issue #651, ADR-024).
The role's password is owned by `R__grafana_reader_password.sql`, a Flyway *repeatable* migration that re-runs whenever the resolved `${grafanaDbPassword}` placeholder changes. That makes rotation a two-restart operation, no manual `psql` required.
+
+```bash
+# 1. Generate a new value
+openssl rand -hex 32
+
+# 2. Update both sides:
+#    - Gitea secret GRAFANA_DB_PASSWORD (nightly + release workflows pick it up)
+#    - Local .env on the server / dev machine
+
+# 3. Restart the backend. Flyway sees that R__'s resolved checksum changed and
+#    re-applies it, running ALTER ROLE grafana_reader with the new password.
+docker compose restart backend
+
+# 4. Restart obs-grafana so the provisioned datasource picks up the new env value.
+docker compose -f docker-compose.observability.yml restart obs-grafana
+
+# 5. Verify the dashboard loads: PO Overview's Postgres panels should populate
+#    instead of "Data source error".
+```
+
+If `GRAFANA_DB_PASSWORD` is unset, the backend **refuses to start** (`IllegalStateException`). That is deliberate; see `FlywayConfig.resolveGrafanaDbPassword()` and the rationale in ADR-024.
+
 #### GlitchTip
 
 | Item | Value |
diff --git a/docs/adr/024-grafana-reads-archive-db-via-bridged-network.md b/docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
new file mode 100644
index 00000000..da47cdf3
--- /dev/null
+++ b/docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
@@ -0,0 +1,123 @@
+# ADR-024: Grafana reads archive-db via a bridged network and a SELECT-only role
+
+## Status
+
+Accepted
+
+## Context
+
+Issue #651 (the PO Overview Grafana dashboard) needs aggregates over three
+tables in the main application database (`audit_log`, `documents`, and
+`transcription_blocks`) to answer the operator's four weekly questions: is
+everything working, are people using it, is the archive making progress, is
+OCR working well.
+
+Until now, `obs-grafana` and the rest of the observability stack lived on
+their own Docker network (`obs-net`) and never touched `archiv-net`, where
+`archive-db` runs. The two were intentionally isolated: a compromise of any
+observability container could not pivot to the application database.
+
+The PO Overview's archive-progress and user-activity panels need rolling
+7-day SQL aggregates that cannot be served by Prometheus or Loki. That
+forces a connection from `obs-grafana` to `archive-db` for the first time.
+
+Two implementation requirements shaped the design:
+
+1. **Least privilege on the database side.** The Spring Boot application
+   role (`archiv`) has full read/write on every table. Letting Grafana
+   connect with that role would mean a Grafana compromise becomes an
+   application compromise. The dashboard only needs SELECT on three
+   tables; the role must reflect that and nothing more.
+
+2. **Operational simplicity of secret rotation.** The role's password is
+   shared between the migration that sets it and the Grafana datasource
+   that uses it. A first version of this work put the password in a
+   versioned Flyway migration (V68), which Flyway only applies once,
+   leaving rotation as an out-of-band `psql ALTER ROLE` step that no
+   runbook documented. The shape must support rotation without manual
+   SQL.
+
+## Decision
+
+- Provision a dedicated PostgreSQL role `grafana_reader` with `LOGIN` plus
+  `GRANT SELECT` on `audit_log`, `documents`, `transcription_blocks` only.
+  No INSERT/UPDATE/DELETE on any table, no access to any other table;
+  enforced by the database, locked in by both positive and parameterized
+  negative tests in `GrafanaReaderRoleIntegrationTest`.
+- Split the role's lifecycle across two migrations:
+  - `V68__add_grafana_reader_role.sql`: versioned, immutable, idempotent.
+    Creates the role and applies the grants. Runs exactly once per
+    database, like every other versioned migration.
+  - `R__grafana_reader_password.sql`: Flyway *repeatable* migration that
+    issues `ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'`.
+    Flyway computes the checksum on the resolved content, so any change
+    to `GRAFANA_DB_PASSWORD` flips the checksum and re-applies the
+    migration on the next boot. Rotation becomes "bump env var, restart
+    backend, restart obs-grafana"; see the runbook in
+    `docs/DEPLOYMENT.md §4 → Rotate the grafana_reader DB password`.
+- Resolve the password through Spring's `Environment` rather than a raw
+  `System.getenv()` call, so tests inject via `application.properties`
+  and the resolver is unit-testable with `MockEnvironment`. Fail closed
+  with `IllegalStateException` when the variable is unset, with no fallback
+  string. Same shape as `UserDataInitializer`'s refusal to seed default
+  admin credentials outside dev/test/e2e.
+- Join `obs-grafana` to `archiv-net` in addition to `obs-net`. Only the
+  Grafana container crosses the boundary; Loki, Tempo, Prometheus,
+  GlitchTip, and the worker containers remain `obs-net`-only.
+
+## Consequences
+
+**Positive**
+
+- Database-level least privilege: a Grafana compromise gains SELECT on
+  three tables. Cannot write, cannot read PII tables like `app_users`,
+  `persons`, `notifications`, `document_comments`, `geschichten`. The
+  parameterized PII negative sweep in `GrafanaReaderRoleIntegrationTest`
+  is the regression gate; new sensitive tables get added to that list.
+- Rotation is documented, idempotent, and survives operator turnover.
+  No "the password set on day 1 is the password forever" failure mode.
+- Tests pin down both sides of the boundary: positive grants must hold,
+  write-deny must hold, and the PII negative list must stay empty.
+
+**Negative / trade-offs**
+
+- `obs-net` is no longer fully isolated from `archiv-net`. A Grafana RCE
+  (e.g. via a future Grafana CVE) gains a TCP path to `archive-db`,
+  contained, but not impossible.
The least-privilege role is the
+  mitigation; we accept that mitigation as sufficient for a single
+  bridged container.
+- The backend must hold `GRAFANA_DB_PASSWORD` in its environment forever,
+  so Flyway can resolve the placeholder on every boot. A backend RCE
+  therefore also leaks the Grafana datasource password. Acceptable
+  because that password's blast radius is itself bounded by the
+  least-privilege grants on `grafana_reader`.
+
+## Alternatives considered
+
+- **Prometheus PostgreSQL exporter, no direct connection.** Loses ad-hoc
+  SQL aggregates: the dashboard would need every metric pre-defined as
+  an exporter query, with a redeploy to add a new one. The PO Overview
+  is the type of dashboard that grows panels over time; pre-defining
+  every aggregate is the wrong shape.
+- **Read replica or logical-replication slot dedicated to Grafana.**
+  Real operational cost (extra Postgres instance, replication monitoring,
+  storage doubled) disproportionate to a weekly PO glance.
+- **Versioned migration with `flyway repair` for rotation.** Rejected:
+  conflates schema lifecycle with credential lifecycle, requires manual
+  intervention to rotate, and the repair command's semantics are
+  surprising to operators unfamiliar with Flyway internals.
+- **Hardcoded fallback password when env var is unset.** Rejected as a
+  security blocker: publishes a known credential for a role with read
+  access to user activity and full letter text. The fail-closed
+  behavior is the explicit defense.
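The fail-closed behavior named in the last alternative can be sketched as follows. This is a minimal illustration, not the project's `FlywayConfig.resolveGrafanaDbPassword()`, whose body is not shown in this patch. It assumes a plain `Function<String, String>` lookup as a stand-in for Spring's `Environment`, assumes the property key is `GRAFANA_DB_PASSWORD`, and assumes a blank value is treated the same as an unset one. The class name is hypothetical.

```java
import java.util.Map;
import java.util.function.Function;

public class GrafanaPasswordResolverSketch {

    // Assumed key; the real property name read by FlywayConfig may differ.
    static final String KEY = "GRAFANA_DB_PASSWORD";

    /**
     * Returns the configured password, or throws instead of substituting a
     * default. The lookup parameter stands in for Spring's Environment so the
     * sketch runs without any framework on the classpath.
     */
    public static String resolveGrafanaDbPassword(Function<String, String> lookup) {
        String value = lookup.apply(KEY);
        // Assumption: blank is rejected like unset, so an empty .env entry
        // cannot silently become the role's password.
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(
                KEY + " is not set; refusing to start without a grafana_reader password");
        }
        return value;
    }

    public static void main(String[] args) {
        // Configured case: the value is passed through unchanged.
        Map<String, String> configured = Map.of(KEY, "example-only-value");
        resolveGrafanaDbPassword(configured::get);
        System.out.println("resolved");

        // Unset case: startup is refused rather than falling back.
        Map<String, String> empty = Map.of();
        try {
            resolveGrafanaDbPassword(empty::get);
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Passing the lookup in as a parameter mirrors the ADR's reasoning for preferring `Environment` over `System.getenv()`: a test can supply any value, or none, without touching the process environment.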
+
+## References
+
+- Issue #651: PO Overview Grafana dashboard
+- `backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql`
+- `backend/src/main/resources/db/migration/R__grafana_reader_password.sql`
+- `backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java`
+- `backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java`
+- `infra/observability/grafana/provisioning/datasources/datasources.yml`
+- `docker-compose.observability.yml`: `archiv-net` bridge on `obs-grafana`
+- `docs/DEPLOYMENT.md §4`: rotation runbook