ADR-024 records the deliberate cross-domain link (obs-grafana joins archiv-net to query archive-db via the SELECT-only grafana_reader role), the rejected alternatives (Prometheus exporter, read replica, versioned migration + flyway repair, hardcoded fallback), and the consequences — specifically that a Grafana compromise gains TCP reach to archive-db but is bounded by the role's least-privilege grants. The DEPLOYMENT.md runbook documents the rotation procedure that R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD, restart backend (Flyway re-applies because the resolved checksum changed), restart obs-grafana (datasource picks up the new env var). Also calls out the fail-closed startup behavior so operators who hit IllegalStateException know it is deliberate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ADR-024: Grafana reads archive-db via a bridged network and a SELECT-only role
Status
Accepted
Context
Issue #651 (the PO Overview Grafana dashboard) needs aggregates over three
tables in the main application database — audit_log, documents, and
transcription_blocks — to answer the operator's four weekly questions: is
everything working, are people using it, is the archive making progress, is
OCR working well.
Until now, obs-grafana and the rest of the observability stack lived on
their own Docker network (obs-net) and never touched archiv-net, where
archive-db runs. The two were intentionally isolated: a compromise of any
observability container could not pivot to the application database.
The PO Overview's archive-progress and user-activity panels need rolling
7-day SQL aggregates that cannot be served by Prometheus or Loki. That
forces a connection from obs-grafana to archive-db for the first time.
Two implementation requirements shaped the design:
- Least privilege on the database side. The Spring Boot application role (archiv) has full read/write on every table. Letting Grafana connect with that role would mean a Grafana compromise becomes an application compromise. The dashboard only needs SELECT on three tables; the role must reflect that and nothing more.
- Operational simplicity of secret rotation. The role's password is shared between the migration that sets it and the Grafana datasource that uses it. A first version of this work put the password in a versioned Flyway migration (V68), which Flyway only applies once, leaving rotation as an out-of-band psql ALTER ROLE step that no runbook documented. The shape must support rotation without manual SQL.
Decision
- Provision a dedicated PostgreSQL role grafana_reader with LOGIN plus GRANT SELECT on audit_log, documents, transcription_blocks only. No INSERT/UPDATE/DELETE on any table, no access to any other table: enforced by the database, locked in by both positive and parameterized negative tests in GrafanaReaderRoleIntegrationTest.
- Split the role's lifecycle across two migrations:
  - V68__add_grafana_reader_role.sql: versioned, immutable, idempotent. Creates the role and applies the grants. Runs exactly once per database, like every other versioned migration.
  - R__grafana_reader_password.sql: Flyway repeatable migration that issues ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'. Flyway computes the checksum on the resolved content, so any change to GRAFANA_DB_PASSWORD flips the checksum and re-applies the migration on the next boot. Rotation becomes "bump env var, restart backend, restart obs-grafana"; see the runbook in docs/DEPLOYMENT.md §4 → Rotate the grafana_reader DB password.
- Resolve the password through Spring's Environment rather than a raw System.getenv() call, so tests inject via application.properties and the resolver is unit-testable with MockEnvironment. Fail closed with IllegalStateException when the variable is unset, with no fallback string. Same shape as UserDataInitializer's refusal to seed default admin credentials outside dev/test/e2e.
- Join obs-grafana to archiv-net in addition to obs-net. Only the Grafana container crosses the boundary; Loki, Tempo, Prometheus, GlitchTip, and the worker containers remain obs-net-only.
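The fail-closed resolution described in the list above can be sketched without a Spring context. This is a minimal illustration, not the code in FlywayConfig.java: the class name, the lookup-function parameter (standing in for Spring's Environment), and the choice to treat a blank value like an unset one are all assumptions made for the example.

```java
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of a fail-closed password resolver. The property lookup is passed in
 * as a function so the logic runs without Spring; in the application it would
 * presumably be backed by Spring's Environment (assumption).
 */
final class GrafanaPasswordResolverSketch {

    // Property name taken from the env var named in this ADR.
    static final String PROPERTY = "GRAFANA_DB_PASSWORD";

    private GrafanaPasswordResolverSketch() {}

    /** Returns the configured password or throws; never substitutes a default. */
    static String resolve(Function<String, String> lookup) {
        String value = lookup.apply(PROPERTY);
        // Assumption: a blank value is treated the same as an unset one.
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(
                PROPERTY + " is not set; refusing to start with a fallback password");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> configured = Map.of(PROPERTY, "example-only");
        System.out.println("resolved length: " + resolve(configured::get).length());

        Map<String, String> empty = Map.of();
        try {
            resolve(empty::get);
        } catch (IllegalStateException expected) {
            System.out.println("fail-closed: " + expected.getMessage());
        }
    }
}
```

Passing the lookup in as a function mirrors the testability argument above: a test can supply a fixed value or none at all without touching the process environment.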
Consequences
Positive
- Database-level least privilege: a Grafana compromise gains SELECT on three tables. It cannot write and cannot read PII tables like app_users, persons, notifications, document_comments, geschichten. The parameterized PII negative sweep in GrafanaReaderRoleIntegrationTest is the regression gate; new sensitive tables get added to that list.
- Rotation is documented, idempotent, and survives operator turnover. No "the password set on day 1 is the password forever" failure mode.
- Tests pin down both sides of the boundary: positive grants must hold, write-deny must hold, and grafana_reader must have no access to any table on the PII negative list.
Negative / trade-offs
- obs-net is no longer fully isolated from archiv-net. A Grafana RCE (e.g. via a future Grafana CVE) gains a TCP path to archive-db, so a pivot to the database is contained, but no longer impossible. The least-privilege role is the mitigation; we accept that mitigation as sufficient for a single bridged container.
- The backend must hold GRAFANA_DB_PASSWORD in its environment forever, so Flyway can resolve the placeholder on every boot. A backend RCE therefore also leaks the Grafana datasource password. Acceptable because that password's blast radius is itself bounded by the least-privilege grants on grafana_reader.
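Why the backend has to resolve the placeholder on every boot follows from the checksum-on-resolved-content behavior: the re-apply decision can only be made once the current password has been substituted into the script. The sketch below illustrates that idea only. It uses java.util.zip.CRC32 as a stand-in checksum and plain string replacement as a stand-in for placeholder resolution; it does not claim to reproduce Flyway's internal routine, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrates "checksum over resolved content" for a repeatable migration. */
final class ResolvedChecksumSketch {

    // Statement quoted from this ADR, with the placeholder still unresolved.
    static final String TEMPLATE =
        "ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}';";

    private ResolvedChecksumSketch() {}

    /** Substitutes the placeholder, as would happen before the checksum is taken. */
    static String resolve(String template, String password) {
        return template.replace("${grafanaDbPassword}", password);
    }

    /** Stand-in checksum over the resolved SQL text (assumption: CRC32 over UTF-8 bytes). */
    static long checksum(String resolvedSql) {
        CRC32 crc = new CRC32();
        crc.update(resolvedSql.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** True when the resolved script differs from what was last applied. */
    static boolean needsReapply(long lastAppliedChecksum, String template, String password) {
        return checksum(resolve(template, password)) != lastAppliedChecksum;
    }

    public static void main(String[] args) {
        long applied = checksum(resolve(TEMPLATE, "old-example"));
        System.out.println("same password:    " + needsReapply(applied, TEMPLATE, "old-example"));
        System.out.println("rotated password: " + needsReapply(applied, TEMPLATE, "new-example"));
    }
}
```

With an unchanged password the resolved text, and therefore the checksum, is identical and nothing is re-applied; a rotated password changes the resolved text and triggers the ALTER ROLE again on the next boot.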
Alternatives considered
- Prometheus PostgreSQL exporter, no direct connection. Loses ad-hoc SQL aggregates: the dashboard would need every metric pre-defined as an exporter query, with a redeploy to add a new one. The PO Overview is the type of dashboard that grows panels over time; pre-defining every aggregate is the wrong shape.
- Read replica or logical-replication slot dedicated to Grafana. Real operational cost (extra Postgres instance, replication monitoring, storage doubled) disproportionate to a weekly PO glance.
- Versioned migration with flyway repair for rotation. Rejected: conflates schema lifecycle with credential lifecycle, requires manual intervention to rotate, and the repair command's semantics are surprising to operators unfamiliar with Flyway internals.
- Hardcoded fallback password when env var is unset. Rejected as a security blocker: publishes a known credential for a role with read access to user activity and full letter text. The fail-closed behavior is the explicit defense.
References
- Issue #651: PO Overview Grafana dashboard
- backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql
- backend/src/main/resources/db/migration/R__grafana_reader_password.sql
- backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java
- backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java
- infra/observability/grafana/provisioning/datasources/datasources.yml
- docker-compose.observability.yml: archiv-net bridge on obs-grafana
- docs/DEPLOYMENT.md §4: rotation runbook