familienarchiv/docs/adr/024-grafana-reads-archive-db-via-bridged-network.md
Marcel 1109ab917b
docs(observability): ADR-024 + rotation runbook for grafana_reader
ADR-024 records the deliberate cross-domain link (obs-grafana joins
archiv-net to query archive-db via the SELECT-only grafana_reader role),
the rejected alternatives (Prometheus exporter, read replica, versioned
migration + flyway repair, hardcoded fallback), and the consequences —
specifically that a Grafana compromise gains TCP reach to archive-db
but is bounded by the role's least-privilege grants.

The DEPLOYMENT.md runbook documents the rotation procedure that
R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD,
restart backend (Flyway re-applies because the resolved checksum
changed), restart obs-grafana (datasource picks up the new env var).
Also calls out the fail-closed startup behavior so operators who hit
IllegalStateException know it is deliberate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 17:21:27 +02:00


ADR-024: Grafana reads archive-db via a bridged network and a SELECT-only role

Status

Accepted

Context

Issue #651 (the PO Overview Grafana dashboard) needs aggregates over three tables in the main application database — audit_log, documents, and transcription_blocks — to answer the operator's four weekly questions: is everything working, are people using it, is the archive making progress, is OCR working well.

Until now, obs-grafana and the rest of the observability stack lived on their own Docker network (obs-net) and never touched archiv-net, where archive-db runs. The two were intentionally isolated: a compromise of any observability container could not pivot to the application database.

The PO Overview's archive-progress and user-activity panels need rolling 7-day SQL aggregates that cannot be served by Prometheus or Loki. That forces a connection from obs-grafana to archive-db for the first time.

Two implementation requirements shaped the design:

  1. Least privilege on the database side. The Spring Boot application role (archiv) has full read/write on every table. Letting Grafana connect with that role would mean a Grafana compromise becomes an application compromise. The dashboard only needs SELECT on three tables; the role must reflect that and nothing more.

  2. Operational simplicity of secret rotation. The role's password is shared between the migration that sets it and the Grafana datasource that uses it. A first version of this work put the password in a versioned Flyway migration (V68), which Flyway only applies once — leaving rotation as an out-of-band psql ALTER ROLE step that no runbook documented. The shape must support rotation without manual SQL.

Decision

  • Provision a dedicated PostgreSQL role grafana_reader with LOGIN plus GRANT SELECT on audit_log, documents, transcription_blocks only. No INSERT/UPDATE/DELETE on any table, no access to any other table — enforced by the database, locked in by both positive and parameterized negative tests in GrafanaReaderRoleIntegrationTest.
  • Split the role's lifecycle across two migrations:
    • V68__add_grafana_reader_role.sql — versioned, immutable, idempotent. Creates the role and applies the grants. Runs exactly once per database, like every other versioned migration.
    • R__grafana_reader_password.sql — Flyway repeatable migration that issues ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'. Flyway computes the checksum on the resolved content, so any change to GRAFANA_DB_PASSWORD flips the checksum and re-applies the migration on the next boot. Rotation becomes "bump env var, restart backend, restart obs-grafana" — see the runbook in docs/DEPLOYMENT.md §4 → Rotate the grafana_reader DB password.
  • Resolve the password through Spring's Environment rather than a raw System.getenv() call, so tests inject via application.properties and the resolver is unit-testable with MockEnvironment. Fail closed with IllegalStateException when the variable is unset — no fallback string. Same shape as UserDataInitializer's refusal to seed default admin credentials outside dev/test/e2e.
  • Join obs-grafana to archiv-net in addition to obs-net. Only the Grafana container crosses the boundary; Loki, Tempo, Prometheus, GlitchTip, and the worker containers remain obs-net-only.
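The fail-closed resolution described above can be sketched as follows. This is a minimal, dependency-free illustration, not the project's actual code: the class name GrafanaPasswordResolver is hypothetical, a plain Function<String, String> stands in for Spring's Environment lookup, and treating a blank value the same as an unset one is an assumption.

```java
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of a fail-closed secret resolver. The real code resolves
 * through Spring's Environment; a Function<String, String> stands in for that
 * lookup here so the example runs without any dependencies.
 */
final class GrafanaPasswordResolver {

    // Variable name taken from the ADR text.
    static final String PROPERTY = "GRAFANA_DB_PASSWORD";

    private GrafanaPasswordResolver() {}

    static String resolve(Function<String, String> lookup) {
        String value = lookup.apply(PROPERTY);
        // Assumption: a blank value is treated like an unset one.
        if (value == null || value.isBlank()) {
            // No fallback string: abort instead of running with a known credential.
            throw new IllegalStateException(PROPERTY + " is not set; refusing to start");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> configured = Map.of(PROPERTY, "s3cret");
        System.out.println(resolve(configured::get).length()); // prints 6

        Map<String, String> empty = Map.of();
        try {
            resolve(empty::get);
        } catch (IllegalStateException e) {
            System.out.println("startup aborted: " + e.getMessage());
        }
    }
}
```

Passing the lookup as a function is what makes the resolver testable in isolation: a test supplies a lambda or a map, which plays the role MockEnvironment plays in the real test suite.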

Consequences

Positive

  • Database-level least privilege: a Grafana compromise gains SELECT on three tables and nothing else. The role cannot write anywhere and cannot read PII tables such as app_users, persons, notifications, document_comments, or geschichten. The parameterized PII negative sweep in GrafanaReaderRoleIntegrationTest is the regression gate; new sensitive tables get added to that list.
  • Rotation is documented, idempotent, and survives operator turnover. No "the password set on day 1 is the password forever" failure mode.
  • Tests pin down both sides of the boundary: positive grants must hold, write-deny must hold, and the role must hold no privileges on any table in the PII negative list.
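Why rotation is idempotent can be illustrated with a small model of the checksum comparison. This is a sketch under stated assumptions, not Flyway's implementation: the class and method names are hypothetical, and CRC32 is used only as a stand-in for whatever checksum Flyway computes over the resolved script. The template string mirrors the ALTER ROLE statement quoted in the Decision section.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical model of "re-apply only when the resolved content changed". */
final class RepeatableMigrationChecksum {

    // Statement as quoted in the ADR, with the placeholder still unresolved.
    static final String TEMPLATE =
            "ALTER ROLE grafana_reader WITH PASSWORD '${grafanaDbPassword}'";

    private RepeatableMigrationChecksum() {}

    // Substitute the placeholder, as happens before the script is checksummed.
    static String resolve(String password) {
        return TEMPLATE.replace("${grafanaDbPassword}", password);
    }

    // CRC32 is a stand-in; the point is only that the input is the RESOLVED text.
    static long checksum(String resolvedSql) {
        CRC32 crc = new CRC32();
        crc.update(resolvedSql.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // True when the checksum recorded at the last run no longer matches.
    static boolean needsReapply(long recordedChecksum, String currentPassword) {
        return checksum(resolve(currentPassword)) != recordedChecksum;
    }

    public static void main(String[] args) {
        long recorded = checksum(resolve("old-password"));
        System.out.println(needsReapply(recorded, "old-password")); // false
        System.out.println(needsReapply(recorded, "new-password")); // true
    }
}
```

A restart with an unchanged password leaves the checksum equal, so nothing is re-applied; only a changed GRAFANA_DB_PASSWORD triggers the ALTER ROLE again.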

Negative / trade-offs

  • obs-net is no longer fully isolated from archiv-net. A Grafana RCE (e.g. via a future Grafana CVE) gains a TCP path to archive-db. The exposure is contained but real: the least-privilege role is the mitigation, and we accept it as sufficient for a single bridged container.
  • The backend must keep GRAFANA_DB_PASSWORD in its environment permanently, because Flyway resolves the placeholder on every boot. A backend RCE therefore also leaks the Grafana datasource password. This is acceptable because that password's blast radius is itself bounded by the least-privilege grants on grafana_reader.

Alternatives considered

  • Prometheus PostgreSQL exporter, no direct connection. Loses ad-hoc SQL aggregates — the dashboard would need every metric pre-defined as an exporter query, with a redeploy to add a new one. The PO Overview is the type of dashboard that grows panels over time; pre-defining every aggregate is the wrong shape.
  • Read replica or logical-replication slot dedicated to Grafana. Real operational cost (extra Postgres instance, replication monitoring, storage doubled) disproportionate to a weekly PO glance.
  • Versioned migration with flyway repair for rotation. Rejected: conflates schema lifecycle with credential lifecycle, requires manual intervention to rotate, and the repair command's semantics are surprising to operators unfamiliar with Flyway internals.
  • Hardcoded fallback password when env var is unset. Rejected as a security blocker: publishes a known credential for a role with read access to user activity and full letter text. The fail-closed behavior is the explicit defense.

References

  • Issue #651 — PO Overview Grafana dashboard
  • backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql
  • backend/src/main/resources/db/migration/R__grafana_reader_password.sql
  • backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java
  • backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java
  • infra/observability/grafana/provisioning/datasources/datasources.yml
  • docker-compose.observability.yml — archiv-net bridge on obs-grafana
  • docs/DEPLOYMENT.md §4 — rotation runbook