docs(observability): ADR-024 + rotation runbook for grafana_reader
All checks were successful
CI / Backend Unit Tests (push) Successful in 3m35s
CI / fail2ban Regex (push) Successful in 42s
CI / Semgrep Security Scan (push) Successful in 19s
CI / Compose Bucket Idempotency (push) Successful in 1m3s
nightly / deploy-staging (push) Successful in 2m0s
CI / Unit & Component Tests (pull_request) Successful in 3m39s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m53s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (push) Successful in 3m39s
CI / OCR Service Tests (push) Successful in 20s
All checks were successful
CI / Backend Unit Tests (push) Successful in 3m35s
CI / fail2ban Regex (push) Successful in 42s
CI / Semgrep Security Scan (push) Successful in 19s
CI / Compose Bucket Idempotency (push) Successful in 1m3s
nightly / deploy-staging (push) Successful in 2m0s
CI / Unit & Component Tests (pull_request) Successful in 3m39s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m53s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (push) Successful in 3m39s
CI / OCR Service Tests (push) Successful in 20s
ADR-024 records the deliberate cross-domain link (obs-grafana joins archiv-net to query archive-db via the SELECT-only grafana_reader role), the rejected alternatives (Prometheus exporter, read replica, versioned migration + flyway repair, hardcoded fallback), and the consequences — specifically that a Grafana compromise gains TCP reach to archive-db but is bounded by the role's least-privilege grants. The DEPLOYMENT.md runbook documents the rotation procedure that R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD, restart backend (Flyway re-applies because the resolved checksum changed), restart obs-grafana (datasource picks up the new env var). Also calls out the fail-closed startup behavior so operators who hit IllegalStateException know it is deliberate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit was merged in pull request #659.
This commit is contained in:
@@ -430,6 +430,31 @@ docker exec obs-loki wget -qO- \
|
|||||||
|
|
||||||
Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
|
Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
|
||||||
|
|
||||||
|
##### Rotate the `grafana_reader` DB password
|
||||||
|
|
||||||
|
The PO Overview dashboard reads `audit_log`, `documents`, and `transcription_blocks` through the SELECT-only `grafana_reader` PostgreSQL role (issue #651, ADR-024). The role's password is owned by `R__grafana_reader_password.sql` — a Flyway *repeatable* migration that re-runs whenever the resolved `${grafanaDbPassword}` placeholder changes. That makes rotation a two-restart operation, no manual `psql` required.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Generate a new value
|
||||||
|
openssl rand -hex 32
|
||||||
|
|
||||||
|
# 2. Update both sides:
|
||||||
|
# - Gitea secret GRAFANA_DB_PASSWORD (nightly + release workflows pick it up)
|
||||||
|
# - Local .env on the server / dev machine
|
||||||
|
|
||||||
|
# 3. Restart the backend. Flyway sees that R__'s resolved checksum changed and
|
||||||
|
# re-applies it, issuing ALTER ROLE grafana_reader WITH PASSWORD '<new>'.
|
||||||
|
docker compose restart backend
|
||||||
|
|
||||||
|
# 4. Restart obs-grafana so the provisioned datasource picks up the new env value.
|
||||||
|
docker compose -f docker-compose.observability.yml restart obs-grafana
|
||||||
|
|
||||||
|
# 5. Verify the dashboard loads — PO Overview's Postgres panels should populate
|
||||||
|
# instead of "Data source error".
|
||||||
|
```
|
||||||
|
|
||||||
|
If `GRAFANA_DB_PASSWORD` is unset, the backend **refuses to start** (`IllegalStateException`). That is deliberate — see `FlywayConfig.resolveGrafanaDbPassword()` and the rationale in ADR-024.
|
||||||
|
|
||||||
#### GlitchTip
|
#### GlitchTip
|
||||||
|
|
||||||
| Item | Value |
|
| Item | Value |
|
||||||
|
|||||||
123
docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
Normal file
123
docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
Normal file
@@ -0,0 +1,123 @@
|
|||||||
|
# ADR-024: Grafana reads archive-db via a bridged network and a SELECT-only role
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Issue #651 (the PO Overview Grafana dashboard) needs aggregates over three
|
||||||
|
tables in the main application database — `audit_log`, `documents`, and
|
||||||
|
`transcription_blocks` — to answer the operator's four weekly questions: is
|
||||||
|
everything working, are people using it, is the archive making progress, is
|
||||||
|
OCR working well.
|
||||||
|
|
||||||
|
Until now, `obs-grafana` and the rest of the observability stack lived on
|
||||||
|
their own Docker network (`obs-net`) and never touched `archiv-net`, where
|
||||||
|
`archive-db` runs. The two were intentionally isolated: a compromise of any
|
||||||
|
observability container could not pivot to the application database.
|
||||||
|
|
||||||
|
The PO Overview's archive-progress and user-activity panels need rolling
|
||||||
|
7-day SQL aggregates that cannot be served by Prometheus or Loki. That
|
||||||
|
forces a connection from `obs-grafana` to `archive-db` for the first time.
|
||||||
|
|
||||||
|
Two implementation requirements shaped the design:
|
||||||
|
|
||||||
|
1. **Least privilege on the database side.** The Spring Boot application
|
||||||
|
role (`archiv`) has full read/write on every table. Letting Grafana
|
||||||
|
connect with that role would mean a Grafana compromise becomes an
|
||||||
|
application compromise. The dashboard only needs SELECT on three
|
||||||
|
tables; the role must reflect that and nothing more.
|
||||||
|
|
||||||
|
2. **Operational simplicity of secret rotation.** The role's password is
|
||||||
|
shared between the migration that sets it and the Grafana datasource
|
||||||
|
that uses it. A first version of this work put the password in a
|
||||||
|
versioned Flyway migration (V68), which Flyway only applies once —
|
||||||
|
leaving rotation as an out-of-band `psql ALTER ROLE` step that no
|
||||||
|
runbook documented. The shape must support rotation without manual
|
||||||
|
SQL.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
- Provision a dedicated PostgreSQL role `grafana_reader` with `LOGIN` plus
|
||||||
|
`GRANT SELECT` on `audit_log`, `documents`, `transcription_blocks` only.
|
||||||
|
No INSERT/UPDATE/DELETE on any table, no access to any other table —
|
||||||
|
enforced by the database, locked in by both positive and parameterized
|
||||||
|
negative tests in `GrafanaReaderRoleIntegrationTest`.
|
||||||
|
- Split the role's lifecycle across two migrations:
|
||||||
|
- `V68__add_grafana_reader_role.sql` — versioned, immutable, idempotent.
|
||||||
|
Creates the role and applies the grants. Runs exactly once per
|
||||||
|
database, like every other versioned migration.
|
||||||
|
- `R__grafana_reader_password.sql` — Flyway *repeatable* migration that
|
||||||
|
issues `ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'`.
|
||||||
|
Flyway computes the checksum on the resolved content, so any change
|
||||||
|
to `GRAFANA_DB_PASSWORD` flips the checksum and re-applies the
|
||||||
|
migration on the next boot. Rotation becomes "bump env var, restart
|
||||||
|
backend, restart obs-grafana" — see the runbook in
|
||||||
|
`docs/DEPLOYMENT.md §4 → Rotate the grafana_reader DB password`.
|
||||||
|
- Resolve the password through Spring's `Environment` rather than a raw
|
||||||
|
`System.getenv()` call, so tests inject via `application.properties`
|
||||||
|
and the resolver is unit-testable with `MockEnvironment`. Fail closed
|
||||||
|
with `IllegalStateException` when the variable is unset — no fallback
|
||||||
|
string. Same shape as `UserDataInitializer`'s refusal to seed default
|
||||||
|
admin credentials outside dev/test/e2e.
|
||||||
|
- Join `obs-grafana` to `archiv-net` in addition to `obs-net`. Only the
|
||||||
|
Grafana container crosses the boundary; Loki, Tempo, Prometheus,
|
||||||
|
GlitchTip, and the worker containers remain `obs-net`-only.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
**Positive**
|
||||||
|
|
||||||
|
- Database-level least privilege: a Grafana compromise gains SELECT on
|
||||||
|
three tables. Cannot write, cannot read PII tables like `app_users`,
|
||||||
|
`persons`, `notifications`, `document_comments`, `geschichten`. The
|
||||||
|
parameterized PII negative sweep in `GrafanaReaderRoleIntegrationTest`
|
||||||
|
is the regression gate; new sensitive tables get added to that list.
|
||||||
|
- Rotation is documented, idempotent, and survives operator turnover.
|
||||||
|
No "the password set on day 1 is the password forever" failure mode.
|
||||||
|
- Tests pin down both sides of the boundary: positive grants must hold,
|
||||||
|
write-deny must hold, and the PII negative list must stay empty.
|
||||||
|
|
||||||
|
**Negative / trade-offs**
|
||||||
|
|
||||||
|
- `obs-net` is no longer fully isolated from `archiv-net`. A Grafana RCE
|
||||||
|
(e.g. via a future Grafana CVE) gains a TCP path to `archive-db` —
|
||||||
|
contained, but not impossible. The least-privilege role is the
|
||||||
|
mitigation; we accept that mitigation as sufficient for a single
|
||||||
|
bridged container.
|
||||||
|
- The backend must hold `GRAFANA_DB_PASSWORD` in its environment forever,
|
||||||
|
so Flyway can resolve the placeholder on every boot. A backend RCE
|
||||||
|
therefore also leaks the Grafana datasource password. Acceptable
|
||||||
|
because that password's blast radius is itself bounded by the
|
||||||
|
least-privilege grants on `grafana_reader`.
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
- **Prometheus PostgreSQL exporter, no direct connection.** Loses ad-hoc
|
||||||
|
SQL aggregates — the dashboard would need every metric pre-defined as
|
||||||
|
an exporter query, with a redeploy to add a new one. The PO Overview
|
||||||
|
is the type of dashboard that grows panels over time; pre-defining
|
||||||
|
every aggregate is the wrong shape.
|
||||||
|
- **Read replica or logical-replication slot dedicated to Grafana.**
|
||||||
|
Real operational cost (extra Postgres instance, replication monitoring,
|
||||||
|
storage doubled) disproportionate to a weekly PO glance.
|
||||||
|
- **Versioned migration with `flyway repair` for rotation.** Rejected:
|
||||||
|
conflates schema lifecycle with credential lifecycle, requires manual
|
||||||
|
intervention to rotate, and the repair command's semantics are
|
||||||
|
surprising to operators unfamiliar with Flyway internals.
|
||||||
|
- **Hardcoded fallback password when env var is unset.** Rejected as a
|
||||||
|
security blocker: publishes a known credential for a role with read
|
||||||
|
access to user activity and full letter text. The fail-closed
|
||||||
|
behavior is the explicit defense.
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Issue #651 — PO Overview Grafana dashboard
|
||||||
|
- `backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql`
|
||||||
|
- `backend/src/main/resources/db/migration/R__grafana_reader_password.sql`
|
||||||
|
- `backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java`
|
||||||
|
- `backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java`
|
||||||
|
- `infra/observability/grafana/provisioning/datasources/datasources.yml`
|
||||||
|
- `docker-compose.observability.yml` — `archiv-net` bridge on `obs-grafana`
|
||||||
|
- `docs/DEPLOYMENT.md §4` — rotation runbook
|
||||||
Reference in New Issue
Block a user