docs(observability): ADR-024 + rotation runbook for grafana_reader
All checks were successful
CI / Backend Unit Tests (push) Successful in 3m35s
CI / fail2ban Regex (push) Successful in 42s
CI / Semgrep Security Scan (push) Successful in 19s
CI / Compose Bucket Idempotency (push) Successful in 1m3s
nightly / deploy-staging (push) Successful in 2m0s
CI / Unit & Component Tests (pull_request) Successful in 3m39s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m53s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (push) Successful in 3m39s
CI / OCR Service Tests (push) Successful in 20s
All checks were successful
CI / Backend Unit Tests (push) Successful in 3m35s
CI / fail2ban Regex (push) Successful in 42s
CI / Semgrep Security Scan (push) Successful in 19s
CI / Compose Bucket Idempotency (push) Successful in 1m3s
nightly / deploy-staging (push) Successful in 2m0s
CI / Unit & Component Tests (pull_request) Successful in 3m39s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m53s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (push) Successful in 3m39s
CI / OCR Service Tests (push) Successful in 20s
ADR-024 records the deliberate cross-domain link (obs-grafana joins archiv-net to query archive-db via the SELECT-only grafana_reader role), the rejected alternatives (Prometheus exporter, read replica, versioned migration + flyway repair, hardcoded fallback), and the consequences — specifically that a Grafana compromise gains TCP reach to archive-db but is bounded by the role's least-privilege grants. The DEPLOYMENT.md runbook documents the rotation procedure that R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD, restart backend (Flyway re-applies because the resolved checksum changed), restart obs-grafana (datasource picks up the new env var). Also calls out the fail-closed startup behavior so operators who hit IllegalStateException know it is deliberate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit was merged in pull request #659.
This commit is contained in:
@@ -430,6 +430,31 @@ docker exec obs-loki wget -qO- \
|
||||
|
||||
Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
|
||||
|
||||
##### Rotate the `grafana_reader` DB password
|
||||
|
||||
The PO Overview dashboard reads `audit_log`, `documents`, and `transcription_blocks` through the SELECT-only `grafana_reader` PostgreSQL role (issue #651, ADR-024). The role's password is owned by `R__grafana_reader_password.sql` — a Flyway *repeatable* migration that re-runs whenever the resolved `${grafanaDbPassword}` placeholder changes. That makes rotation a two-restart operation, no manual `psql` required.
|
||||
|
||||
```bash
|
||||
# 1. Generate a new value
|
||||
openssl rand -hex 32
|
||||
|
||||
# 2. Update both sides:
|
||||
# - Gitea secret GRAFANA_DB_PASSWORD (nightly + release workflows pick it up)
|
||||
# - Local .env on the server / dev machine
|
||||
|
||||
# 3. Restart the backend. Flyway sees that R__'s resolved checksum changed and
|
||||
# re-applies it, issuing ALTER ROLE grafana_reader WITH PASSWORD '<new>'.
|
||||
docker compose restart backend
|
||||
|
||||
# 4. Restart obs-grafana so the provisioned datasource picks up the new env value.
|
||||
docker compose -f docker-compose.observability.yml restart obs-grafana
|
||||
|
||||
# 5. Verify the dashboard loads — PO Overview's Postgres panels should populate
|
||||
# instead of "Data source error".
|
||||
```
|
||||
|
||||
If `GRAFANA_DB_PASSWORD` is unset, the backend **refuses to start** (`IllegalStateException`). That is deliberate — see `FlywayConfig.resolveGrafanaDbPassword()` and the rationale in ADR-024.
|
||||
|
||||
#### GlitchTip
|
||||
|
||||
| Item | Value |
|
||||
|
||||
123
docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
Normal file
123
docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
Normal file
@@ -0,0 +1,123 @@
|
||||
# ADR-024: Grafana reads archive-db via a bridged network and a SELECT-only role
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Issue #651 (the PO Overview Grafana dashboard) needs aggregates over three
|
||||
tables in the main application database — `audit_log`, `documents`, and
|
||||
`transcription_blocks` — to answer the operator's four weekly questions: is
|
||||
everything working, are people using it, is the archive making progress, is
|
||||
OCR working well.
|
||||
|
||||
Until now, `obs-grafana` and the rest of the observability stack lived on
|
||||
their own Docker network (`obs-net`) and never touched `archiv-net`, where
|
||||
`archive-db` runs. The two were intentionally isolated: a compromise of any
|
||||
observability container could not pivot to the application database.
|
||||
|
||||
The PO Overview's archive-progress and user-activity panels need rolling
|
||||
7-day SQL aggregates that cannot be served by Prometheus or Loki. That
|
||||
forces a connection from `obs-grafana` to `archive-db` for the first time.
|
||||
|
||||
Two implementation requirements shaped the design:
|
||||
|
||||
1. **Least privilege on the database side.** The Spring Boot application
|
||||
role (`archiv`) has full read/write on every table. Letting Grafana
|
||||
connect with that role would mean a Grafana compromise becomes an
|
||||
application compromise. The dashboard only needs SELECT on three
|
||||
tables; the role must reflect that and nothing more.
|
||||
|
||||
2. **Operational simplicity of secret rotation.** The role's password is
|
||||
shared between the migration that sets it and the Grafana datasource
|
||||
that uses it. A first version of this work put the password in a
|
||||
versioned Flyway migration (V68), which Flyway only applies once —
|
||||
leaving rotation as an out-of-band `psql ALTER ROLE` step that no
|
||||
runbook documented. The shape must support rotation without manual
|
||||
SQL.
|
||||
|
||||
## Decision
|
||||
|
||||
- Provision a dedicated PostgreSQL role `grafana_reader` with `LOGIN` plus
|
||||
`GRANT SELECT` on `audit_log`, `documents`, `transcription_blocks` only.
|
||||
No INSERT/UPDATE/DELETE on any table, no access to any other table —
|
||||
enforced by the database, locked in by both positive and parameterized
|
||||
negative tests in `GrafanaReaderRoleIntegrationTest`.
|
||||
- Split the role's lifecycle across two migrations:
|
||||
- `V68__add_grafana_reader_role.sql` — versioned, immutable, idempotent.
|
||||
Creates the role and applies the grants. Runs exactly once per
|
||||
database, like every other versioned migration.
|
||||
- `R__grafana_reader_password.sql` — Flyway *repeatable* migration that
|
||||
issues `ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'`.
|
||||
Flyway computes the checksum on the resolved content, so any change
|
||||
to `GRAFANA_DB_PASSWORD` flips the checksum and re-applies the
|
||||
migration on the next boot. Rotation becomes "bump env var, restart
|
||||
backend, restart obs-grafana" — see the runbook in
|
||||
`docs/DEPLOYMENT.md §4 → Rotate the grafana_reader DB password`.
|
||||
- Resolve the password through Spring's `Environment` rather than a raw
|
||||
`System.getenv()` call, so tests inject via `application.properties`
|
||||
and the resolver is unit-testable with `MockEnvironment`. Fail closed
|
||||
with `IllegalStateException` when the variable is unset — no fallback
|
||||
string. Same shape as `UserDataInitializer`'s refusal to seed default
|
||||
admin credentials outside dev/test/e2e.
|
||||
- Join `obs-grafana` to `archiv-net` in addition to `obs-net`. Only the
|
||||
Grafana container crosses the boundary; Loki, Tempo, Prometheus,
|
||||
GlitchTip, and the worker containers remain `obs-net`-only.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**
|
||||
|
||||
- Database-level least privilege: a Grafana compromise gains SELECT on
|
||||
three tables. Cannot write, cannot read PII tables like `app_users`,
|
||||
`persons`, `notifications`, `document_comments`, `geschichten`. The
|
||||
parameterized PII negative sweep in `GrafanaReaderRoleIntegrationTest`
|
||||
is the regression gate; new sensitive tables get added to that list.
|
||||
- Rotation is documented, idempotent, and survives operator turnover.
|
||||
No "the password set on day 1 is the password forever" failure mode.
|
||||
- Tests pin down both sides of the boundary: positive grants must hold,
|
||||
write-deny must hold, and the PII negative list must stay empty.
|
||||
|
||||
**Negative / trade-offs**
|
||||
|
||||
- `obs-net` is no longer fully isolated from `archiv-net`. A Grafana RCE
|
||||
(e.g. via a future Grafana CVE) gains a TCP path to `archive-db` —
|
||||
contained, but not impossible. The least-privilege role is the
|
||||
mitigation; we accept that mitigation as sufficient for a single
|
||||
bridged container.
|
||||
- The backend must hold `GRAFANA_DB_PASSWORD` in its environment forever,
|
||||
so Flyway can resolve the placeholder on every boot. A backend RCE
|
||||
therefore also leaks the Grafana datasource password. Acceptable
|
||||
because that password's blast radius is itself bounded by the
|
||||
least-privilege grants on `grafana_reader`.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Prometheus PostgreSQL exporter, no direct connection.** Loses ad-hoc
|
||||
SQL aggregates — the dashboard would need every metric pre-defined as
|
||||
an exporter query, with a redeploy to add a new one. The PO Overview
|
||||
is the type of dashboard that grows panels over time; pre-defining
|
||||
every aggregate is the wrong shape.
|
||||
- **Read replica or logical-replication slot dedicated to Grafana.**
|
||||
Real operational cost (extra Postgres instance, replication monitoring,
|
||||
storage doubled) disproportionate to a weekly PO glance.
|
||||
- **Versioned migration with `flyway repair` for rotation.** Rejected:
|
||||
conflates schema lifecycle with credential lifecycle, requires manual
|
||||
intervention to rotate, and the repair command's semantics are
|
||||
surprising to operators unfamiliar with Flyway internals.
|
||||
- **Hardcoded fallback password when env var is unset.** Rejected as a
|
||||
security blocker: publishes a known credential for a role with read
|
||||
access to user activity and full letter text. The fail-closed
|
||||
behavior is the explicit defense.
|
||||
|
||||
## References
|
||||
|
||||
- Issue #651 — PO Overview Grafana dashboard
|
||||
- `backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql`
|
||||
- `backend/src/main/resources/db/migration/R__grafana_reader_password.sql`
|
||||
- `backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java`
|
||||
- `backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java`
|
||||
- `infra/observability/grafana/provisioning/datasources/datasources.yml`
|
||||
- `docker-compose.observability.yml` — `archiv-net` bridge on `obs-grafana`
|
||||
- `docs/DEPLOYMENT.md §4` — rotation runbook
|
||||
Reference in New Issue
Block a user