As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651
Context
The product owner opens Grafana once a week before a stakeholder meeting and needs to answer three questions without digging into raw metrics:
Audience: Lightly technical — can read "HTTP 5xx: 3" or "p95: 240 ms" but does not need JVM heap graphs.
Default time range: Last 7 days. Refresh: Manual (not a live monitor).
User Story
Acceptance Criteria
Row 1 — System Health
Given the observability stack is running,
When the PO opens the dashboard,
Then they see 7 panels across two sub-rows:
Sub-row 1a — Application health (Prometheus + Loki):
- up{job="spring-boot"}, displayed as a green "UP" / red "DOWN" stat
- sum(increase(http_server_requests_seconds_count{status=~"5.."}[$__range]))
- histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[$__range])) by (le))
- count_over_time({compose_service="backend"} | json | level="ERROR" [$__range])
Sub-row 1b — Infrastructure (node-exporter):
- 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[$__range])) * 100)
- (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
Given a metric crosses a threshold,
Then the panel colour changes according to these rules:
Row 2 — User Activity
Given the PostgreSQL datasource is wired (see Infrastructure below),
When the PO views the dashboard,
Then they see 4 panels sourced from audit_log:
- SELECT COUNT(DISTINCT actor_id) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'
- SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'
- SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind IN ('LOGIN_FAILED', 'LOGIN_RATE_LIMITED')
- SELECT DATE_TRUNC('day', happened_at) AS day, COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS' GROUP BY day ORDER BY day
Threshold for failed logins: green = 0, amber = 1–3, red > 3.
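The failed-login rule above can be read as ordered threshold steps, where each step's value is the lower bound from which its colour applies. As far as I know this is also how Grafana's absolute thresholds are defined, so "red > 3" on a whole-number count becomes a step starting at 4. The sketch below only illustrates that reading; the class, record, and method names are hypothetical and nothing here is part of the dashboard JSON.

```java
import java.util.List;

/** Hypothetical helper that mimics step-based thresholds; not part of the dashboard. */
final class ThresholdSketch {

    /** One threshold step: the colour applies from lowerBound (inclusive) upwards. */
    record Step(double lowerBound, String colour) {}

    // Failed-login rule from the criteria: green = 0, amber = 1-3, red > 3.
    // For whole-number counts, "red > 3" is equivalent to a step starting at 4.
    static final List<Step> FAILED_LOGINS = List.of(
            new Step(Double.NEGATIVE_INFINITY, "green"),
            new Step(1, "amber"),
            new Step(4, "red"));

    /** Returns the colour of the last step whose lower bound is <= value (steps ascending). */
    static String colourFor(double value, List<Step> steps) {
        String colour = steps.get(0).colour();
        for (Step step : steps) {
            if (value >= step.lowerBound()) {
                colour = step.colour();
            }
        }
        return colour;
    }

    public static void main(String[] args) {
        for (int failed : new int[] {0, 1, 3, 4}) {
            System.out.println(failed + " -> " + colourFor(failed, FAILED_LOGINS));
        }
    }
}
```

Writing the boundaries out this way makes the edge cases explicit (3 is still amber, 4 is the first red value) before they are encoded in the panel configuration.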
Row 3 — Archive Progress
Given the PostgreSQL datasource is wired,
When the PO views the dashboard,
Then they see 5 panels:
- transcription_blocks: text IS NOT NULL AND text != '' vs. text IS NULL OR text = ''. Shows % complete + "+N blocks this week" badge from audit_log TEXT_SAVED.
- documents: COUNT(*) WHERE status != 'PLACEHOLDER'. Sub-label shows +N this week from audit_log FILE_UPLOADED.
- audit_log: COUNT(*) WHERE kind = 'FILE_UPLOADED', scoped to range
- audit_log: COUNT(*) WHERE kind = 'TEXT_SAVED', scoped to range
- audit_log: COUNT(*) WHERE kind = 'BLOCK_REVIEWED', scoped to range
Row 4 — OCR Health (requires #652)
Given the OCR service exposes /metrics (see #652),
When the PO views the dashboard,
Then they see 4 panels sourced from Prometheus job="ocr-service":
- sum(increase(ocr_jobs_total[$__range]))
- sum(increase(ocr_skipped_pages_total[$__range])) / sum(increase(ocr_pages_total[$__range])) — shown as %, threshold: green < 1%, amber 1–5%, red > 5%
- sum(increase(ocr_illegible_words_total[$__range])) / sum(increase(ocr_words_total[$__range])) — shown as %, lower is better
- ocr_models_ready — 1 = ready, 0 = not ready
Infrastructure Changes
1. Flyway migration — grafana_reader role
New migration (next available version number):
2. Grafana datasource provisioning
Add to infra/observability/grafana/provisioning/datasources/datasources.yml:
3. Dashboard JSON
New file: infra/observability/grafana/provisioning/dashboards/po-overview.json
Provisioned automatically via the existing dashboards.yml file-provider (no config change needed there). Row 4 panels can be added in a follow-up commit once #652 is merged.
4. Environment variable
Add GRAFANA_DB_PASSWORD to:
- .env.example (with a placeholder comment)
- docker-compose.observability.yml (passed into obs-grafana via environment:)
- docs/DEPLOYMENT.md (env var reference table)
Non-Functional Requirements
- grafana_reader has SELECT only on the three named tables; no write access, no access to app_users or other sensitive tables.
- docker compose up.
- GRAFANA_DB_PASSWORD env var must default to a non-empty placeholder in .env.example so a missing var causes an obvious error at startup, not a silent query failure.
Out of Scope
🏗️ Markus Keller — Application Architect
Observations
- grafana_reader role placement is correct. Putting this in a Flyway migration (not a one-off script) means it gets tested in CI via Testcontainers. The GRANT scope is correctly minimal: only audit_log, documents, transcription_blocks — not app_users.
- infra/observability/grafana/provisioning/dashboards/dashboards.yml already has a file provider pointing at /etc/grafana/provisioning/dashboards, with updateIntervalSeconds: 30. No config change needed — the issue states this correctly.
- PASSWORD '…' as a literal placeholder. At migration time, Flyway does not have access to env vars — the password must either be set via a follow-up ALTER ROLE call using a connection that can read the env var, or the migration needs a different approach entirely.
Recommendations
- NOLOGIN in the migration, then run a separate startup script (or a docker-compose command hook) that does ALTER ROLE grafana_reader WITH LOGIN PASSWORD '...' using $GRAFANA_DB_PASSWORD.
- System.getenv("GRAFANA_DB_PASSWORD") — but this is heavier for a one-time role creation.
Option 1 is simpler and follows the pattern used in infra/minio/bootstrap.sh for MinIO credentials.
- GRAFANA_DB_PASSWORD to obs.env (the non-secret block in infra/observability/obs.env) as a comment referencing where the real value comes from, consistent with how other secrets are handled in that file.
- obs-grafana now talks to archive-db) and a new env var. Per the doc update rules: update docs/architecture/c4/l2-containers.puml (new external connection: Grafana → PostgreSQL) and docs/DEPLOYMENT.md (new env var table row).
- uid field. The dashboard JSON should include a stable "uid": "po-overview" field — without it, Grafana auto-generates a uid on first import and re-imports can create duplicates if the file is reprovisioned.
Open Decisions
- ALTER ROLE. Option (a) is more secure; option (b) is simpler. Which approach fits the operational model?
🔒 Nora Steiner — Application Security Engineer
Observations
- grafana_reader scope is well-defined. NFR-SEC-01 correctly restricts the role to SELECT on three named tables. No app_users access. This is least-privilege done right.
- GRAFANA_DB_PASSWORD is only passed to obs-grafana (as the issue specifies for docker-compose.observability.yml), the backend service that runs Flyway will not have it. A migration like CREATE ROLE grafana_reader WITH LOGIN PASSWORD '…' with a placeholder in git is a credential leak waiting to happen — someone will substitute a real password and commit it.
- grafana_reader on the archive-db network. The observability compose file currently joins archiv-net only for Promtail and GlitchTip. Adding obs-grafana to archiv-net (needed to reach archive-db:5432) increases the surface of that network. This is a deliberate, acceptable tradeoff for a single-operator deployment, but should be noted.
- sslmode: disable in the datasource config. The issue specifies sslmode: disable. This is fine within the same Docker network (traffic never leaves the host), but should be a conscious choice, not a default. Document it as intentional.
- editable: false on the datasource. Correct — prevents interactive credentials leakage via the Grafana UI.
- GRAFANA_DB_PASSWORD default in .env.example. NFR-OPS-02 correctly requires a non-empty placeholder so startup fails loudly. The current .env.example does not have this var at all — it must be added.
Recommendations
- NOLOGIN and no password; a separate idempotent init script (run at deploy time, not at app startup) sets the password from $GRAFANA_DB_PASSWORD.
- GRAFANA_DB_PASSWORD to the backend's env in docker-compose.observability.yml (not just obs-grafana) and have the Flyway migration read it via a Java callback — heavier but fully integrated.
- docker-compose.observability.yml next to obs-grafana's network list explaining why it joins archiv-net (same reason Promtail does: direct container-name DNS resolution to archive-db).
- audit_log.happened_at if one does not already exist. The Row 2 and Row 3 queries all filter on happened_at >= NOW() - INTERVAL '7 days'. Without an index, these are full-table scans. At current archive scale this is fine, but the index is cheap insurance. I checked the migrations: V62__index_fk_columns.sql adds FK indexes but I did not see a happened_at index. Verify before shipping.
- GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}. The :-changeme default in the existing compose file is a weak credential that production must override. The issue adds GRAFANA_DB_PASSWORD — make sure its default in .env.example is a clearly fake placeholder (# REQUIRED — generate with openssl rand -hex 32), not a guessable string.
Open Decisions
- obs-grafana join archiv-net or should PostgreSQL be exposed on obs-net? Joining archiv-net gives Grafana direct container-name DNS (archive-db). The alternative — exposing PostgreSQL on obs-net — widens the database's network exposure. archiv-net membership for Grafana is the better option, but it is a deliberate choice that should be noted in the compose file.
🧪 Sara Holt — QA Engineer
Observations
- grafana_reader role, the datasource connection) can and should be validated.
- AuditKind.java — LOGIN_SUCCESS, LOGIN_FAILED, LOGIN_RATE_LIMITED, FILE_UPLOADED, TEXT_SAVED, and BLOCK_REVIEWED all exist. The Row 2 and Row 3 SQL queries reference the correct enum string values.
- transcription_blocks.text column. The coverage query relies on text IS NOT NULL AND text != '' to determine "filled" blocks. The TranscriptionBlock entity has @Column(columnDefinition = "TEXT") for the text field with no explicit nullable = false, meaning NULL is a valid DB state. The query logic matches this.
Recommendations
- MigrationIntegrationTest (if it exists — there's a Testcontainers setup in AuditLogQueryRepositoryIntegrationTest) should be extended to verify that the grafana_reader role exists and has the correct grants after all migrations run. Even a simple SELECT has_table_privilege('grafana_reader', 'audit_log', 'SELECT') assertion in a Testcontainers-based test would catch a broken migration.
- docker compose -f docker-compose.observability.yml up, visit Grafana at localhost:3003, open the PO Overview dashboard, and verify all panels show data (not 'No data')." This is a manual step, but making it explicit prevents shipping a wired-but-broken dashboard.
- +N this week badge on the Transcription block coverage panel is described as coming from audit_log TEXT_SAVED. Verify this is not a separate SQL panel (which Grafana cannot combine into a single stat panel) vs. a Grafana transform — this is a dashboard design detail that needs resolving before the JSON is written. A single stat panel cannot natively show both a computed percentage and a badge from a separate query without a Grafana transformation step.
Open Decisions
- +N this week badge implementation. Grafana stat panels can show a single value. Showing "75% (+12 blocks this week)" requires either two separate panels or a Grafana Transformation that computes a composite string. The issue describes it as a single panel — is this a two-panel layout or does it accept a simplified single-value display?
👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations
- http_server_requests_seconds_count, http_server_requests_seconds_bucket, and up{job="spring-boot"} are what the existing Spring Boot observability stack emits. I verified by checking infra/observability/prometheus/prometheus.yml context — the backend scrape job is already wired.
- audit_log queries filter on happened_at and kind. The AuditKind enum values (LOGIN_SUCCESS, LOGIN_FAILED, LOGIN_RATE_LIMITED, FILE_UPLOADED, TEXT_SAVED, BLOCK_REVIEWED) are all confirmed in AuditKind.java. The column names (happened_at, actor_id, kind) match the AuditLog entity.
- transcription_blocks coverage query. The query compares text IS NOT NULL AND text != '' (filled) vs text IS NULL OR text = '' (empty). This is a correct expression of the entity's nullable text column. However, Grafana's PostgreSQL datasource requires the progress bar visualisation to be achieved via bar gauge panel type with a single value; the issue describes "progress bar" which maps cleanly to that type.
- V67__recreate_spring_session_tables.sql. The new grafana_reader migration should be V68__add_grafana_reader_role.sql.
Recommendations
- $GRAFANA_DB_PASSWORD. Flyway SQL files do not support ${ENV_VAR} substitution by default. Flyway's placeholders feature requires opt-in configuration in application.yaml (spring.flyway.placeholders.*). Either:
- GRAFANA_DB_PASSWORD as spring.flyway.placeholders.grafanaDbPassword — clean, but requires Spring config change.
Option 1 is the cleaner developer experience — the password is set atomically in the migration. Note that PostgreSQL has no CREATE ROLE IF NOT EXISTS syntax, so idempotency needs an explicit existence check, for example a DO block that looks the role up in pg_roles before creating it.
- IF NOT EXISTS block makes the migration re-runnable safely if the role already exists in the database.
- __inputs section. When exporting a dashboard from Grafana for provisioning, the JSON includes an __inputs block that references datasource UIDs. Make sure the postgres datasource UID in the JSON matches the uid: postgres defined in the provisioned datasource. A mismatch causes "No data" for all PostgreSQL panels on first load.
🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
- grafana/grafana-oss:11.6.1), healthchecks exist, named volumes are used for persistence, and the provisioning pattern is already established. This issue builds cleanly on top of what exists.
- dashboards.yml uses a file provider scanning /etc/grafana/provisioning/dashboards with updateIntervalSeconds: 30. Dropping po-overview.json there is all that's needed — confirmed.
- obs-grafana does not currently join archiv-net. Looking at the compose file, obs-grafana is only on obs-net. Adding the PostgreSQL datasource requires Grafana to reach archive-db:5432. The compose file needs obs-grafana to also join archiv-net — same pattern as obs-promtail and obs-glitchtip.
- GRAFANA_DB_PASSWORD is not yet in .env.example. Confirmed — it's absent. The issue correctly identifies this as a required addition.
- GRAFANA_DB_PASSWORD in obs.env either. The non-secret block at infra/observability/obs.env does not have this variable. It should have a commented-out reference explaining that the real value comes from CI secrets (same pattern as GRAFANA_ADMIN_PASSWORD which is in the main .env.example).
- archive-db service name is correct. The main docker-compose.yml names the Postgres service db but the container name is archive-db (from container_name: archive-db). The datasource URL ${POSTGRES_HOST:-archive-db}:5432 uses the container name, which is the correct approach for cross-Compose-file DNS resolution on a shared network.
Recommendations
- archiv-net to obs-grafana's networks block in docker-compose.observability.yml:
- GRAFANA_DB_PASSWORD into obs-grafana in docker-compose.observability.yml:
- GRAFANA_DB_PASSWORD to .env.example with a clear generation instruction:
- disableDeletion: true on the dashboard provider is correct. This prevents the PO from accidentally deleting the provisioned dashboard via the Grafana UI. Keep it.
- updateIntervalSeconds: 30 vs. 0 for provisioned dashboards. With 30, Grafana polls for JSON file changes every 30 seconds. For a static provisioned dashboard this is fine, but if you're iterating on the JSON during development, set it to 10 locally for faster feedback.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Observations
- histogram_quantile(0.95, ...) as a panel title would not be. The dashboard JSON must use human-readable panel titles throughout.
Recommendations
No open decisions from the UX side — the audience, scope, and panel types are well-specified.
📋 Elicit — Requirements Engineer
Observations
- BLOCK_REVIEWED audit kind is confirmed. I checked AuditKind.java — this enum value exists and is ROLLUP_ELIGIBLE. The Row 3 query is grounded in real data.
- GRAFANA_DB_PASSWORD env var must default to a non-empty placeholder in .env.example so a missing var causes an obvious error at startup." This is an infrastructure startup-failure requirement, not just a documentation requirement. It implies the application or compose setup should validate that the var is set before Grafana tries to use it — and "obvious error at startup" needs a concrete implementation path.
- % complete + +N blocks this week badge from audit_log TEXT_SAVED." A Grafana stat panel shows one value. A badge with a secondary metric requires either two separate panels or a Grafana Transformation. This is not resolvable without a design decision.
- documents count filter status != 'PLACEHOLDER'. The DocumentStatus lifecycle in CLAUDE.md lists PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED. The query correctly excludes PLACEHOLDER documents (Excel import stubs with no file). This is the right semantic for "total documents in the archive."
Recommendations
- THEN they see a bar gauge showing % completion AND a stat panel showing +N blocks transcribed this week. This removes the ambiguity about whether a single panel can show both values.
- grafana_reader role. The Flyway migration and network change are infrastructure, but the security boundary deserves an explicit acceptance criterion: GIVEN the grafana_reader role exists, WHEN it attempts SELECT on app_users, THEN it receives permission denied. This can be verified with a one-line psql check in the smoke test.
Open Decisions
- % complete and +N this week. Grafana stat panels cannot display two independent metrics in a single panel without a Transformation step. Decide: (a) one panel with a Grafana Transformation concatenating both values into a display string; (b) two adjacent panels — a bar gauge for % and a stat for weekly delta. Option (b) is simpler to build and easier for the PO to read.
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Database
- GRAFANA_DB_PASSWORD get into the Flyway SQL migration? Flyway SQL files cannot interpolate env vars by default. Three viable options: (a) Flyway placeholder substitution via spring.flyway.placeholders.grafanaDbPassword — clean, atomic, requires one Spring config line; (b) Flyway migration creates the role with NOLOGIN, a separate idempotent startup script sets the password at deploy time — simpler, no Spring config change; (c) hardcode a placeholder in the migration and document that production must run ALTER ROLE manually — least safe. Option (a) is recommended by Felix; option (b) is recommended by Markus and Nora as the simpler, more secure operational pattern. (Raised by: Markus, Felix, Nora, Elicit)
Dashboard Design
- Transcription block coverage panel: single panel or two panels? The spec describes one panel showing both % complete and +N blocks this week. Grafana stat panels display one value. Showing both requires either: (a) two adjacent panels — a bar gauge for % and a stat for the weekly delta (simpler, more readable for the PO); (b) one panel with a Grafana Transformation that computes a composite string (more compact but harder to build and harder to read). Option (a) is recommended. (Raised by: Sara, Elicit, Leonie)
- Row 4 placeholder in the initial JSON? Two approaches: (a) omit Row 4 entirely from the initial JSON and add it in the follow-up commit after #652 lands — cleaner JSON, no confusion; (b) include a collapsed empty row panel titled "OCR Health (blocked by #652)" so the PO sees the planned structure — more transparent but adds placeholder complexity. The issue already states "Row 4 in a follow-up commit," so option (a) is the default unless transparency for the PO is a priority. (Raised by: Tobias, Leonie)
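To make option (a) of the first decision more concrete: the idea is to read GRAFANA_DB_PASSWORD once, fail loudly if it is missing, and hand it to Flyway as a placeholder named grafanaDbPassword. The sketch below uses only the JDK and deliberately leaves Flyway itself out. The class and method names are hypothetical, and substitute() is a naive stand-in for Flyway's own replacement of the default ${name} placeholder syntax, not its actual implementation.

```java
import java.util.Map;

/** Hypothetical helper; class and method names are illustrative only. */
final class GrafanaPlaceholderSketch {

    static final String ENV_VAR = "GRAFANA_DB_PASSWORD";
    static final String PLACEHOLDER = "grafanaDbPassword";

    /** Builds the placeholder map, failing loudly if the variable is missing or blank. */
    static Map<String, String> placeholdersFrom(Map<String, String> env) {
        String password = env.get(ENV_VAR);
        if (password == null || password.isBlank()) {
            throw new IllegalStateException(ENV_VAR + " must be set to a non-empty value");
        }
        return Map.of(PLACEHOLDER, password);
    }

    /** Naive stand-in for placeholder substitution: replaces each ${name} with its value. */
    static String substitute(String sql, Map<String, String> placeholders) {
        String result = sql;
        for (Map.Entry<String, String> entry : placeholders.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // A fake environment keeps the sketch runnable; real code would pass System.getenv().
        Map<String, String> placeholders = placeholdersFrom(Map.of(ENV_VAR, "example-only"));
        String sql = "ALTER ROLE grafana_reader WITH LOGIN PASSWORD '${grafanaDbPassword}'";
        System.out.println(substitute(sql, placeholders));
    }
}
```

Throwing on a blank value is one way to get the "obvious error at startup" that the non-functional requirements ask for. Because the value ends up inside a quoted SQL literal, a password containing a single quote would need escaping or should be rejected up front.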
We do the recommendations, but this ticket will be worked on after the OCR service exposes its metrics, so no placeholder.
Implementation shipped as PR #659.
Decisions resolved (from the three-item Decision Queue):
- FlywayConfig reads GRAFANA_DB_PASSWORD from env and passes it via .placeholders(Map.of("grafanaDbPassword", …)). Migration V68 wraps role creation in DO $$ IF NOT EXISTS $$ so it's idempotent (CREATE first run, ALTER thereafter).
- Blocks Transcribed This Week stat carries the weekly delta.
Commits (atomic, conventional prefixes):
- f0b801f1 feat(observability): create grafana_reader read-only DB role
- f9d4d9a2 feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD
- 1564ffea feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend
- 336ef20b feat(observability): provision Grafana PostgreSQL datasource
- 93eed612 chore(observability): document GRAFANA_DB_PASSWORD in env files
- 99c9612a feat(observability): add PO Overview Grafana dashboard
- 2fa1ce3e docs(deployment): document GRAFANA_DB_PASSWORD across env tables
- 5d191b22 docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard
Test evidence
- GrafanaReaderRoleIntegrationTest — 4/4 green (positive grants on audit_log / documents / transcription_blocks, negative grant on app_users covering NFR-SEC-01).
- AuditLogQueryRepositoryIntegrationTest — still 4/4 green (placeholder change didn't break existing Flyway runs).
Manual smoke (open Grafana, confirm panels populate) is left for the reviewer / operator per Sara's recommendation; the privilege assertions in the PR description show the exact psql checks to run.
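The checks from the PR description are not reproduced here. As a rough illustration of the privilege matrix that NFR-SEC-01 implies, the sketch below generates has_table_privilege statements (the PostgreSQL function named in Sara's recommendation) together with the result each one would be expected to return. The class and method names are hypothetical, and the INSERT checks are an assumption about how "no write access" might be verified.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical generator for privilege checks; not taken from the PR. */
final class PrivilegeCheckSketch {

    static final String ROLE = "grafana_reader";
    static final List<String> READABLE_TABLES =
            List.of("audit_log", "documents", "transcription_blocks");
    static final String FORBIDDEN_TABLE = "app_users";

    /** Builds one has_table_privilege statement for the role, table, and privilege. */
    static String check(String table, String privilege) {
        return "SELECT has_table_privilege('" + ROLE + "', '" + table + "', '" + privilege + "');";
    }

    /** Maps each check statement to the boolean it is expected to return. */
    static Map<String, Boolean> expectedChecks() {
        Map<String, Boolean> checks = new LinkedHashMap<>();
        for (String table : READABLE_TABLES) {
            checks.put(check(table, "SELECT"), true);   // read access granted
            checks.put(check(table, "INSERT"), false);  // assumed proxy for "no write access"
        }
        checks.put(check(FORBIDDEN_TABLE, "SELECT"), false); // sensitive table stays closed
        return checks;
    }

    public static void main(String[] args) {
        expectedChecks().forEach((sql, expected) ->
                System.out.println(sql + "  -- expect " + (expected ? "t" : "f")));
    }
}
```

Each printed statement can be pasted into psql as a superuser during the manual smoke test; a result that differs from the expected t or f points at a grant that is too wide or too narrow.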