As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

New Issue

marcel · 2026-05-21T15:12:05+02:00

marcel commented

2026-05-21 15:12:05 +02:00

Context

The product owner opens Grafana once a week before a stakeholder meeting and needs to answer three questions without digging into raw metrics:

Is everything working? — backend up, error rate, latency, infrastructure load
Are people using it? — logins, active users, engagement trend
Is the archive making progress? — documents uploaded, transcription block coverage
Is OCR working well? — jobs processed, page error rate, model quality

Audience: Lightly technical — can read "HTTP 5xx: 3" or "p95: 240 ms" but does not need JVM heap graphs.
Default time range: Last 7 days. Refresh: Manual (not a live monitor).

⚠️ Blocked by #652 — Row 4 (OCR Health) requires the OCR service to expose /metrics first. Rows 1–3 can be implemented independently.

User Story

As a product owner, I want a single Grafana dashboard showing system health, user activity, archive progress, and OCR quality for the past 7 days, so that I can give a confident status update in a stakeholder meeting without needing developer assistance.

Acceptance Criteria

Row 1 — System Health

Given the observability stack is running,
When the PO opens the dashboard,
Then they see 7 panels across two sub-rows:

Sub-row 1a — Application health (Prometheus + Loki):

up{job="spring-boot"} displayed as a green "UP" / red "DOWN" stat
HTTP 5xx error count over the selected range: sum(increase(http_server_requests_seconds_count{status=~"5.."}[$__range]))
p95 response time: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[$__range])) by (le))
Error log line count: count_over_time({compose_service="backend"} | json | level="ERROR" [$__range])

Sub-row 1b — Infrastructure (node-exporter):

CPU usage % with a progress bar: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[$__range])) * 100)
RAM used GB / total GB with % bar: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Storage used GB / total GB with % bar: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

Given a metric crosses a threshold,
Then the panel colour changes according to these rules:

Metric	Green	Amber	Red
Backend status	UP	—	DOWN
HTTP 5xx count	0	1–5	> 5
p95 latency	< 500 ms	500 ms–2 s	> 2 s
CPU %	< 70%	70–85%	> 85%
RAM %	< 70%	70–85%	> 85%
Storage %	< 70%	70–80%	> 80%

Row 2 — User Activity

Given the PostgreSQL datasource is wired (see Infrastructure below),
When the PO views the dashboard,
Then they see 4 panels sourced from audit_log:

Panel	Query
Active users this week	`SELECT COUNT(DISTINCT actor_id) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'`
Total logins this week	`SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'`
Failed login attempts (amber)	`SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind IN ('LOGIN_FAILED', 'LOGIN_RATE_LIMITED')`
Logins per day (bar chart)	`SELECT DATE_TRUNC('day', happened_at) AS day, COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS' GROUP BY day ORDER BY day`

Threshold for failed logins: green = 0, amber = 1–3, red > 3.

Row 3 — Archive Progress

Given the PostgreSQL datasource is wired,
When the PO views the dashboard,
Then they see 5 panels:

Panel	Source	Query / Description
Transcription block coverage	`transcription_blocks`	Progress bar: blocks where `text IS NOT NULL AND text != ''` vs. `text IS NULL OR text = ''`. Shows `%` complete + "+N blocks this week" badge from `audit_log` `TEXT_SAVED`.
Total documents	`documents`	`COUNT(*) WHERE status != 'PLACEHOLDER'`. Sub-label shows `+N this week` from `audit_log` `FILE_UPLOADED`.
Uploads this week	`audit_log`	`COUNT(*) WHERE kind = 'FILE_UPLOADED'` scoped to range
Blocks transcribed this week	`audit_log`	`COUNT(*) WHERE kind = 'TEXT_SAVED'` scoped to range
Blocks reviewed this week	`audit_log`	`COUNT(*) WHERE kind = 'BLOCK_REVIEWED'` scoped to range

Row 4 — OCR Health (requires #652)

Given the OCR service exposes /metrics (see #652),
When the PO views the dashboard,
Then they see 4 panels sourced from Prometheus job="ocr-service":

Panel	Type	Query
OCR jobs this week	Stat	`sum(increase(ocr_jobs_total[$__range]))`
OCR page error rate	Stat (red > 5%)	`sum(increase(ocr_skipped_pages_total[$__range])) / sum(increase(ocr_pages_total[$__range]))` — shown as %, threshold: green < 1%, amber 1–5%, red > 5%
Illegible word rate	Stat	`sum(increase(ocr_illegible_words_total[$__range])) / sum(increase(ocr_words_total[$__range]))` — shown as %, lower is better
OCR service status	Stat (green/red)	`ocr_models_ready` — 1 = ready, 0 = not ready

Infrastructure Changes

1. Flyway migration — `grafana_reader` role

New migration (next available version number):

CREATE ROLE grafana_reader WITH LOGIN PASSWORD '…';  -- password from GRAFANA_DB_PASSWORD env var
GRANT CONNECT ON DATABASE familienarchiv TO grafana_reader;
GRANT USAGE ON SCHEMA public TO grafana_reader;
GRANT SELECT ON audit_log, documents, transcription_blocks TO grafana_reader;

2. Grafana datasource provisioning

Add to infra/observability/grafana/provisioning/datasources/datasources.yml:

- name: PostgreSQL
  type: postgres
  uid: postgres
  url: ${POSTGRES_HOST:-archive-db}:5432
  database: ${POSTGRES_DB}
  user: grafana_reader
  secureJsonData:
    password: ${GRAFANA_DB_PASSWORD}
  jsonData:
    sslmode: disable
    postgresVersion: 1600
  editable: false

3. Dashboard JSON

New file: infra/observability/grafana/provisioning/dashboards/po-overview.json

Provisioned automatically via the existing dashboards.yml file-provider (no config change needed there). Row 4 panels can be added in a follow-up commit once #652 is merged.

4. Environment variable

Add GRAFANA_DB_PASSWORD to:

.env.example (with a placeholder comment)
docker-compose.observability.yml (passed into obs-grafana via environment:)
docs/DEPLOYMENT.md (env var reference table)

Non-Functional Requirements

NFR-SEC-01: grafana_reader has SELECT only on the three named tables; no write access, no access to app_users or other sensitive tables.
NFR-PERF-01: All dashboard panels must load within 5 seconds on a local network when the observability stack is healthy.
NFR-OPS-01: The dashboard is provisioned as code (JSON file); it must not require manual Grafana UI steps to appear after docker compose up.
NFR-OPS-02: The GRAFANA_DB_PASSWORD env var must default to a non-empty placeholder in .env.example so a missing var causes an obvious error at startup, not a silent query failure.

Out of Scope

Alerting / push notifications (PagerDuty, email) — dashboard only
Per-user activity breakdown — aggregate totals are sufficient
Per-sender OCR model accuracy — developer/ops concern, not PO-level
Mobile / responsive Grafana layout

## Context The product owner opens Grafana once a week before a stakeholder meeting and needs to answer three questions without digging into raw metrics: 1. **Is everything working?** — backend up, error rate, latency, infrastructure load 2. **Are people using it?** — logins, active users, engagement trend 3. **Is the archive making progress?** — documents uploaded, transcription block coverage 4. **Is OCR working well?** — jobs processed, page error rate, model quality **Audience:** Lightly technical — can read "HTTP 5xx: 3" or "p95: 240 ms" but does not need JVM heap graphs. **Default time range:** Last 7 days. **Refresh:** Manual (not a live monitor). > ⚠️ **Blocked by #652** — Row 4 (OCR Health) requires the OCR service to expose `/metrics` first. Rows 1–3 can be implemented independently. --- ## User Story > As a product owner, I want a single Grafana dashboard showing system health, user activity, archive progress, and OCR quality for the past 7 days, so that I can give a confident status update in a stakeholder meeting without needing developer assistance. --- ## Acceptance Criteria ### Row 1 — System Health **Given** the observability stack is running, **When** the PO opens the dashboard, **Then** they see 7 panels across two sub-rows: **Sub-row 1a — Application health** (Prometheus + Loki): - `up{job="spring-boot"}` displayed as a green "UP" / red "DOWN" stat - HTTP 5xx error count over the selected range: `sum(increase(http_server_requests_seconds_count{status=~"5.."}[$__range]))` - p95 response time: `histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[$__range])) by (le))` - Error log line count: `count_over_time({compose_service="backend"} | json | level="ERROR" [$__range])` **Sub-row 1b — Infrastructure** (node-exporter): - CPU usage % with a progress bar: `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[$__range])) * 100)` - RAM used GB / total GB with % bar: `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` - Storage used GB / total GB with % bar: `(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100` **Given** a metric crosses a threshold, **Then** the panel colour changes according to these rules: | Metric | Green | Amber | Red | |--------|-------|-------|-----| | Backend status | UP | — | DOWN | | HTTP 5xx count | 0 | 1–5 | > 5 | | p95 latency | < 500 ms | 500 ms–2 s | > 2 s | | CPU % | < 70% | 70–85% | > 85% | | RAM % | < 70% | 70–85% | > 85% | | Storage % | < 70% | 70–80% | > 80% | --- ### Row 2 — User Activity **Given** the PostgreSQL datasource is wired (see Infrastructure below), **When** the PO views the dashboard, **Then** they see 4 panels sourced from `audit_log`: | Panel | Query | |-------|-------| | Active users this week | `SELECT COUNT(DISTINCT actor_id) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'` | | Total logins this week | `SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'` | | Failed login attempts (amber) | `SELECT COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind IN ('LOGIN_FAILED', 'LOGIN_RATE_LIMITED')` | | Logins per day (bar chart) | `SELECT DATE_TRUNC('day', happened_at) AS day, COUNT(*) FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS' GROUP BY day ORDER BY day` | **Threshold for failed logins:** green = 0, amber = 1–3, red > 3. --- ### Row 3 — Archive Progress **Given** the PostgreSQL datasource is wired, **When** the PO views the dashboard, **Then** they see 5 panels: | Panel | Source | Query / Description | |-------|--------|---------------------| | Transcription block coverage | `transcription_blocks` | Progress bar: blocks where `text IS NOT NULL AND text != ''` vs. `text IS NULL OR text = ''`. Shows `%` complete + "+N blocks this week" badge from `audit_log` `TEXT_SAVED`. | | Total documents | `documents` | `COUNT(*) WHERE status != 'PLACEHOLDER'`. Sub-label shows `+N this week` from `audit_log` `FILE_UPLOADED`. | | Uploads this week | `audit_log` | `COUNT(*) WHERE kind = 'FILE_UPLOADED'` scoped to range | | Blocks transcribed this week | `audit_log` | `COUNT(*) WHERE kind = 'TEXT_SAVED'` scoped to range | | Blocks reviewed this week | `audit_log` | `COUNT(*) WHERE kind = 'BLOCK_REVIEWED'` scoped to range | --- ### Row 4 — OCR Health _(requires #652)_ **Given** the OCR service exposes `/metrics` (see #652), **When** the PO views the dashboard, **Then** they see 4 panels sourced from Prometheus `job="ocr-service"`: | Panel | Type | Query | |-------|------|-------| | OCR jobs this week | Stat | `sum(increase(ocr_jobs_total[$__range]))` | | OCR page error rate | Stat (red > 5%) | `sum(increase(ocr_skipped_pages_total[$__range])) / sum(increase(ocr_pages_total[$__range]))` — shown as %, threshold: green < 1%, amber 1–5%, red > 5% | | Illegible word rate | Stat | `sum(increase(ocr_illegible_words_total[$__range])) / sum(increase(ocr_words_total[$__range]))` — shown as %, lower is better | | OCR service status | Stat (green/red) | `ocr_models_ready` — 1 = ready, 0 = not ready | --- ## Infrastructure Changes ### 1. Flyway migration — `grafana_reader` role New migration (next available version number): ```sql CREATE ROLE grafana_reader WITH LOGIN PASSWORD '…'; -- password from GRAFANA_DB_PASSWORD env var GRANT CONNECT ON DATABASE familienarchiv TO grafana_reader; GRANT USAGE ON SCHEMA public TO grafana_reader; GRANT SELECT ON audit_log, documents, transcription_blocks TO grafana_reader; ``` ### 2. Grafana datasource provisioning Add to `infra/observability/grafana/provisioning/datasources/datasources.yml`: ```yaml - name: PostgreSQL type: postgres uid: postgres url: ${POSTGRES_HOST:-archive-db}:5432 database: ${POSTGRES_DB} user: grafana_reader secureJsonData: password: ${GRAFANA_DB_PASSWORD} jsonData: sslmode: disable postgresVersion: 1600 editable: false ``` ### 3. Dashboard JSON New file: `infra/observability/grafana/provisioning/dashboards/po-overview.json` Provisioned automatically via the existing `dashboards.yml` file-provider (no config change needed there). Row 4 panels can be added in a follow-up commit once #652 is merged. ### 4. Environment variable Add `GRAFANA_DB_PASSWORD` to: - `.env.example` (with a placeholder comment) - `docker-compose.observability.yml` (passed into `obs-grafana` via `environment:`) - `docs/DEPLOYMENT.md` (env var reference table) --- ## Non-Functional Requirements - **NFR-SEC-01:** `grafana_reader` has `SELECT` only on the three named tables; no write access, no access to `app_users` or other sensitive tables. - **NFR-PERF-01:** All dashboard panels must load within 5 seconds on a local network when the observability stack is healthy. - **NFR-OPS-01:** The dashboard is provisioned as code (JSON file); it must not require manual Grafana UI steps to appear after `docker compose up`. - **NFR-OPS-02:** The `GRAFANA_DB_PASSWORD` env var must default to a non-empty placeholder in `.env.example` so a missing var causes an obvious error at startup, not a silent query failure. --- ## Out of Scope - Alerting / push notifications (PagerDuty, email) — dashboard only - Per-user activity breakdown — aggregate totals are sufficient - Per-sender OCR model accuracy — developer/ops concern, not PO-level - Mobile / responsive Grafana layout

marcel added the P2-medium devops feature phase-7: monitoring labels 2026-05-21 15:12:09 +02:00

marcel referenced this issue

2026-05-21 15:15:37 +02:00

As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652

marcel referenced this issue

2026-05-21 15:17:22 +02:00

As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652

marcel commented

2026-05-21 15:19:42 +02:00

🏗️ Markus Keller — Application Architect

Observations

Architecture is appropriate. This is purely additive infrastructure work — new Flyway migration, new datasource provisioning file, new dashboard JSON, new env var. No domain boundaries are crossed. The decision to provision as code (NFR-OPS-01) is the right call.
grafana_reader role placement is correct. Putting this in a Flyway migration (not a one-off script) means it gets tested in CI via Testcontainers. The GRANT scope is correctly minimal: only audit_log, documents, transcription_blocks — not app_users.
Dependency ordering at the architecture level. The issue correctly splits into "Rows 1–3 now, Row 4 after #652" — this is the right staged delivery pattern. The JSON file should be structured so Row 4 panels can be appended in a follow-up commit without touching the existing panels.
Dashboard as code path is already wired. infra/observability/grafana/provisioning/dashboards/dashboards.yml already has a file provider pointing at /etc/grafana/provisioning/dashboards, with updateIntervalSeconds: 30. No config change needed — the issue states this correctly.
The Flyway migration has a password hardening gap. The migration script contains PASSWORD '…' as a literal placeholder. At migration time, Flyway does not have access to env vars — the password must either be set via a follow-up ALTER ROLE call using a connection that can read the env var, or the migration needs a different approach entirely.

Recommendations

Do not embed the password in the Flyway SQL migration directly. Flyway SQL files cannot interpolate env vars. Two safe options:
1. Create the role with NOLOGIN in the migration, then run a separate startup script (or a docker-compose command hook) that does ALTER ROLE grafana_reader WITH LOGIN PASSWORD '...' using $GRAFANA_DB_PASSWORD.
2. Use a Flyway Java-based migration that reads System.getenv("GRAFANA_DB_PASSWORD") — but this is heavier for a one-time role creation.
  Option 1 is simpler and follows the pattern used in infra/minio/bootstrap.sh for MinIO credentials.
Add GRAFANA_DB_PASSWORD to obs.env (the non-secret block in infra/observability/obs.env) as a comment referencing where the real value comes from, consistent with how other secrets are handled in that file.
Documentation required. This PR adds a new Docker service dependency (obs-grafana now talks to archive-db) and a new env var. Per the doc update rules: update docs/architecture/c4/l2-containers.puml (new external connection: Grafana → PostgreSQL) and docs/DEPLOYMENT.md (new env var table row).
Dashboard uid field. The dashboard JSON should include a stable "uid": "po-overview" field — without it, Grafana auto-generates a uid on first import and re-imports can create duplicates if the file is reprovisioned.

Open Decisions

Flyway migration vs. startup script for role creation. The issue specifies a Flyway migration but Flyway SQL cannot interpolate env vars. Choose: (a) Flyway migration creates the role without a password, separate script sets the password at startup; (b) hardcode a weak placeholder in the migration and document that production must override it via ALTER ROLE. Option (a) is more secure; option (b) is simpler. Which approach fits the operational model?

## 🏗️ Markus Keller — Application Architect ### Observations - **Architecture is appropriate.** This is purely additive infrastructure work — new Flyway migration, new datasource provisioning file, new dashboard JSON, new env var. No domain boundaries are crossed. The decision to provision as code (NFR-OPS-01) is the right call. - **`grafana_reader` role placement is correct.** Putting this in a Flyway migration (not a one-off script) means it gets tested in CI via Testcontainers. The GRANT scope is correctly minimal: only `audit_log`, `documents`, `transcription_blocks` — not `app_users`. - **Dependency ordering at the architecture level.** The issue correctly splits into "Rows 1–3 now, Row 4 after #652" — this is the right staged delivery pattern. The JSON file should be structured so Row 4 panels can be appended in a follow-up commit without touching the existing panels. - **Dashboard as code path is already wired.** `infra/observability/grafana/provisioning/dashboards/dashboards.yml` already has a `file` provider pointing at `/etc/grafana/provisioning/dashboards`, with `updateIntervalSeconds: 30`. No config change needed — the issue states this correctly. - **The Flyway migration has a password hardening gap.** The migration script contains `PASSWORD '…'` as a literal placeholder. At migration time, Flyway does not have access to env vars — the password must either be set via a follow-up `ALTER ROLE` call using a connection that can read the env var, or the migration needs a different approach entirely. ### Recommendations - **Do not embed the password in the Flyway SQL migration directly.** Flyway SQL files cannot interpolate env vars. Two safe options: 1. Create the role with `NOLOGIN` in the migration, then run a separate startup script (or a `docker-compose` `command` hook) that does `ALTER ROLE grafana_reader WITH LOGIN PASSWORD '...'` using `$GRAFANA_DB_PASSWORD`. 2. Use a Flyway Java-based migration that reads `System.getenv("GRAFANA_DB_PASSWORD")` — but this is heavier for a one-time role creation. Option 1 is simpler and follows the pattern used in `infra/minio/bootstrap.sh` for MinIO credentials. - **Add `GRAFANA_DB_PASSWORD` to `obs.env`** (the non-secret block in `infra/observability/obs.env`) as a comment referencing where the real value comes from, consistent with how other secrets are handled in that file. - **Documentation required.** This PR adds a new Docker service dependency (`obs-grafana` now talks to `archive-db`) and a new env var. Per the doc update rules: update `docs/architecture/c4/l2-containers.puml` (new external connection: Grafana → PostgreSQL) and `docs/DEPLOYMENT.md` (new env var table row). - **Dashboard `uid` field.** The dashboard JSON should include a stable `"uid": "po-overview"` field — without it, Grafana auto-generates a uid on first import and re-imports can create duplicates if the file is reprovisioned. ### Open Decisions - **Flyway migration vs. startup script for role creation.** The issue specifies a Flyway migration but Flyway SQL cannot interpolate env vars. Choose: (a) Flyway migration creates the role without a password, separate script sets the password at startup; (b) hardcode a weak placeholder in the migration and document that production must override it via `ALTER ROLE`. Option (a) is more secure; option (b) is simpler. Which approach fits the operational model?

marcel commented

2026-05-21 15:20:04 +02:00

🔒 Nora Steiner — Application Security Engineer

Observations

grafana_reader scope is well-defined. NFR-SEC-01 correctly restricts the role to SELECT on three named tables. No app_users access. This is least-privilege done right.
The main risk is how the password reaches the migration. Flyway SQL migrations run at application startup with the application's database credentials. If GRAFANA_DB_PASSWORD is only passed to obs-grafana (as the issue specifies for docker-compose.observability.yml), the backend service that runs Flyway will not have it. A migration like CREATE ROLE grafana_reader WITH LOGIN PASSWORD '…' with a placeholder in git is a credential leak waiting to happen — someone will substitute a real password and commit it.
grafana_reader on the archive-db network. The observability compose file currently joins archiv-net only for Promtail and GlitchTip. Adding obs-grafana to archiv-net (needed to reach archive-db:5432) increases the surface of that network. This is a deliberate, acceptable tradeoff for a single-operator deployment, but should be noted.
sslmode: disable in the datasource config. The issue specifies sslmode: disable. This is fine within the same Docker network (traffic never leaves the host), but should be a conscious choice, not a default. Document it as intentional.
editable: false on the datasource. Correct — prevents interactive credentials leakage via the Grafana UI.
GRAFANA_DB_PASSWORD default in .env.example. NFR-OPS-02 correctly requires a non-empty placeholder so startup fails loudly. The current .env.example does not have this var at all — it must be added.

Recommendations

Never embed the real password in the Flyway SQL file. Use one of these safe patterns:
1. Flyway migration creates the role with NOLOGIN and no password; a separate idempotent init script (run at deploy time, not at app startup) sets the password from $GRAFANA_DB_PASSWORD.
2. Add GRAFANA_DB_PASSWORD to the backend's env in docker-compose.observability.yml (not just obs-grafana) and have the Flyway migration read it via a Java callback — heavier but fully integrated.
Network exposure is explicit and acceptable for single-operator self-hosted, but add a comment in docker-compose.observability.yml next to obs-grafana's network list explaining why it joins archiv-net (same reason Promtail does: direct container-name DNS resolution to archive-db).
Add an index to audit_log.happened_at if one does not already exist. The Row 2 and Row 3 queries all filter on happened_at >= NOW() - INTERVAL '7 days'. Without an index, these are full-table scans. At current archive scale this is fine, but the index is cheap insurance. I checked the migrations: V62__index_fk_columns.sql adds FK indexes but I did not see a happened_at index. Verify before shipping.
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}. The :-changeme default in the existing compose file is a weak credential that production must override. The issue adds GRAFANA_DB_PASSWORD — make sure its default in .env.example is a clearly fake placeholder (# REQUIRED — generate with openssl rand -hex 32), not a guessable string.

Open Decisions

Network topology: should obs-grafana join archiv-net or should PostgreSQL be exposed on obs-net? Joining archiv-net gives Grafana direct container-name DNS (archive-db). The alternative — exposing PostgreSQL on obs-net — widens the database's network exposure. archiv-net membership for Grafana is the better option, but it is a deliberate choice that should be noted in the compose file.

## 🔒 Nora Steiner — Application Security Engineer ### Observations - **`grafana_reader` scope is well-defined.** NFR-SEC-01 correctly restricts the role to `SELECT` on three named tables. No `app_users` access. This is least-privilege done right. - **The main risk is how the password reaches the migration.** Flyway SQL migrations run at application startup with the application's database credentials. If `GRAFANA_DB_PASSWORD` is only passed to `obs-grafana` (as the issue specifies for `docker-compose.observability.yml`), the backend service that runs Flyway will not have it. A migration like `CREATE ROLE grafana_reader WITH LOGIN PASSWORD '…'` with a placeholder in git is a credential leak waiting to happen — someone will substitute a real password and commit it. - **`grafana_reader` on the `archive-db` network.** The observability compose file currently joins `archiv-net` only for Promtail and GlitchTip. Adding `obs-grafana` to `archiv-net` (needed to reach `archive-db:5432`) increases the surface of that network. This is a deliberate, acceptable tradeoff for a single-operator deployment, but should be noted. - **`sslmode: disable` in the datasource config.** The issue specifies `sslmode: disable`. This is fine within the same Docker network (traffic never leaves the host), but should be a conscious choice, not a default. Document it as intentional. - **`editable: false` on the datasource.** Correct — prevents interactive credentials leakage via the Grafana UI. - **`GRAFANA_DB_PASSWORD` default in `.env.example`.** NFR-OPS-02 correctly requires a non-empty placeholder so startup fails loudly. The current `.env.example` does not have this var at all — it must be added. ### Recommendations - **Never embed the real password in the Flyway SQL file.** Use one of these safe patterns: 1. Flyway migration creates the role with `NOLOGIN` and no password; a separate idempotent init script (run at deploy time, not at app startup) sets the password from `$GRAFANA_DB_PASSWORD`. 2. Add `GRAFANA_DB_PASSWORD` to the backend's env in `docker-compose.observability.yml` (not just `obs-grafana`) and have the Flyway migration read it via a Java callback — heavier but fully integrated. - **Network exposure is explicit and acceptable** for single-operator self-hosted, but add a comment in `docker-compose.observability.yml` next to `obs-grafana`'s network list explaining why it joins `archiv-net` (same reason Promtail does: direct container-name DNS resolution to `archive-db`). - **Add an index to `audit_log.happened_at`** if one does not already exist. The Row 2 and Row 3 queries all filter on `happened_at >= NOW() - INTERVAL '7 days'`. Without an index, these are full-table scans. At current archive scale this is fine, but the index is cheap insurance. I checked the migrations: `V62__index_fk_columns.sql` adds FK indexes but I did not see a `happened_at` index. Verify before shipping. - **`GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}`.** The `:-changeme` default in the existing compose file is a weak credential that production must override. The issue adds `GRAFANA_DB_PASSWORD` — make sure its default in `.env.example` is a clearly fake placeholder (`# REQUIRED — generate with openssl rand -hex 32`), not a guessable string. ### Open Decisions - **Network topology: should `obs-grafana` join `archiv-net` or should PostgreSQL be exposed on `obs-net`?** Joining `archiv-net` gives Grafana direct container-name DNS (`archive-db`). The alternative — exposing PostgreSQL on `obs-net` — widens the database's network exposure. `archiv-net` membership for Grafana is the better option, but it is a deliberate choice that should be noted in the compose file.

marcel commented

2026-05-21 15:20:24 +02:00

🧪 Sara Holt — QA Engineer

Observations

No automated tests are defined for the dashboard itself. This is expected — Grafana dashboard JSON is not unit-testable in the traditional sense. But the infrastructure that supports it (the Flyway migration, the grafana_reader role, the datasource connection) can and should be validated.
NFR-PERF-01 ("all panels load within 5 seconds") is untestable as written. There is no test strategy for this. It is currently a manual acceptance criterion.
The acceptance criteria are panel-level specs, not behavior tests. The Given-When-Then criteria describe what panels exist, but do not describe a test that verifies them. This is acceptable for infrastructure work, but the test coverage gap should be acknowledged.
AuditKind values are confirmed. I checked AuditKind.java — LOGIN_SUCCESS, LOGIN_FAILED, LOGIN_RATE_LIMITED, FILE_UPLOADED, TEXT_SAVED, and BLOCK_REVIEWED all exist. The Row 2 and Row 3 SQL queries reference the correct enum string values.
transcription_blocks.text column. The coverage query relies on text IS NOT NULL AND text != '' to determine "filled" blocks. The TranscriptionBlock entity has @Column(columnDefinition = "TEXT") for the text field with no explicit nullable = false, meaning NULL is a valid DB state. The query logic matches this.

Recommendations

Add a Flyway migration integration test assertion. The existing MigrationIntegrationTest (if it exists — there's a Testcontainers setup in AuditLogQueryRepositoryIntegrationTest) should be extended to verify that the grafana_reader role exists and has the correct grants after all migrations run. Even a simple SELECT has_table_privilege('grafana_reader', 'audit_log', 'SELECT') assertion in a Testcontainers-based test would catch a broken migration.
Define a smoke test acceptance criterion. Add one explicit test step to the Definition of Done: "After docker compose -f docker-compose.observability.yml up, visit Grafana at localhost:3003, open the PO Overview dashboard, and verify all panels show data (not 'No data')." This is a manual step, but making it explicit prevents shipping a wired-but-broken dashboard.
The +N this week badge on the Transcription block coverage panel is described as coming from audit_log TEXT_SAVED. Verify this is not a separate SQL panel (which Grafana cannot combine into a single stat panel) vs. a Grafana transform — this is a dashboard design detail that needs resolving before the JSON is written. A single stat panel cannot natively show both a computed percentage and a badge from a separate query without a Grafana transformation step.
Row 4 as a follow-up commit. Confirm that the dashboard JSON structure uses rows or panel groups so Row 4 can be appended without touching the JSON structure of Rows 1–3. This prevents a merge conflict between this issue and #652.

Open Decisions

The +N this week badge implementation. Grafana stat panels can show a single value. Showing "75% (+12 blocks this week)" requires either two separate panels or a Grafana Transformation that computes a composite string. The issue describes it as a single panel — is this a two-panel layout or does it accept a simplified single-value display?

## 🧪 Sara Holt — QA Engineer ### Observations - **No automated tests are defined for the dashboard itself.** This is expected — Grafana dashboard JSON is not unit-testable in the traditional sense. But the *infrastructure* that supports it (the Flyway migration, the `grafana_reader` role, the datasource connection) can and should be validated. - **NFR-PERF-01 ("all panels load within 5 seconds") is untestable as written.** There is no test strategy for this. It is currently a manual acceptance criterion. - **The acceptance criteria are panel-level specs, not behavior tests.** The Given-When-Then criteria describe what panels exist, but do not describe a test that verifies them. This is acceptable for infrastructure work, but the test coverage gap should be acknowledged. - **AuditKind values are confirmed.** I checked `AuditKind.java` — `LOGIN_SUCCESS`, `LOGIN_FAILED`, `LOGIN_RATE_LIMITED`, `FILE_UPLOADED`, `TEXT_SAVED`, and `BLOCK_REVIEWED` all exist. The Row 2 and Row 3 SQL queries reference the correct enum string values. - **`transcription_blocks.text` column.** The coverage query relies on `text IS NOT NULL AND text != ''` to determine "filled" blocks. The `TranscriptionBlock` entity has `@Column(columnDefinition = "TEXT")` for the text field with no explicit `nullable = false`, meaning `NULL` is a valid DB state. The query logic matches this. ### Recommendations - **Add a Flyway migration integration test assertion.** The existing `MigrationIntegrationTest` (if it exists — there's a Testcontainers setup in `AuditLogQueryRepositoryIntegrationTest`) should be extended to verify that the `grafana_reader` role exists and has the correct grants after all migrations run. Even a simple `SELECT has_table_privilege('grafana_reader', 'audit_log', 'SELECT')` assertion in a Testcontainers-based test would catch a broken migration. - **Define a smoke test acceptance criterion.** Add one explicit test step to the Definition of Done: "After `docker compose -f docker-compose.observability.yml up`, visit Grafana at `localhost:3003`, open the PO Overview dashboard, and verify all panels show data (not 'No data')." This is a manual step, but making it explicit prevents shipping a wired-but-broken dashboard. - **The `+N this week` badge on the Transcription block coverage panel** is described as coming from `audit_log` `TEXT_SAVED`. Verify this is not a separate SQL panel (which Grafana cannot combine into a single stat panel) vs. a Grafana transform — this is a dashboard design detail that needs resolving before the JSON is written. A single stat panel cannot natively show both a computed percentage and a badge from a separate query without a Grafana transformation step. - **Row 4 as a follow-up commit.** Confirm that the dashboard JSON structure uses rows or panel groups so Row 4 can be appended without touching the JSON structure of Rows 1–3. This prevents a merge conflict between this issue and #652. ### Open Decisions - **The `+N this week` badge implementation.** Grafana stat panels can show a single value. Showing "75% (+12 blocks this week)" requires either two separate panels or a Grafana Transformation that computes a composite string. The issue describes it as a single panel — is this a two-panel layout or does it accept a simplified single-value display?

marcel commented

2026-05-21 15:20:44 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

This is pure infrastructure — no application code changes. No Spring Boot services, no SvelteKit routes, no TypeScript types, and no Python code are touched. The TDD lens applies to the Flyway migration (run-against-real-Postgres contract) rather than unit tests.
PromQL queries are correctly specified. All four sub-row 1a metrics are standard Micrometer/Spring Boot Prometheus metric names. http_server_requests_seconds_count, http_server_requests_seconds_bucket, and up{job="spring-boot"} are what the existing Spring Boot observability stack emits. I verified by checking infra/observability/prometheus/prometheus.yml context — the backend scrape job is already wired.
PostgreSQL queries are syntactically correct. All audit_log queries filter on happened_at and kind. The AuditKind enum values (LOGIN_SUCCESS, LOGIN_FAILED, LOGIN_RATE_LIMITED, FILE_UPLOADED, TEXT_SAVED, BLOCK_REVIEWED) are all confirmed in AuditKind.java. The column names (happened_at, actor_id, kind) match the AuditLog entity.
transcription_blocks coverage query. The query compares text IS NOT NULL AND text != '' (filled) vs text IS NULL OR text = '' (empty). This is a correct expression of the entity's nullable text column. However, Grafana's PostgreSQL datasource requires the progress bar visualisation to be achieved via bar gauge panel type with a single value; the issue describes "progress bar" which maps cleanly to that type.
Next Flyway version is V68. The latest migration is V67__recreate_spring_session_tables.sql. The new grafana_reader migration should be V68__add_grafana_reader_role.sql.

Recommendations

The Flyway migration cannot directly interpolate $GRAFANA_DB_PASSWORD. Flyway SQL files do not support ${ENV_VAR} substitution by default. Flyway's placeholders feature requires opt-in configuration in application.yaml (spring.flyway.placeholders.*). Either:
1. Enable Flyway placeholder substitution and pass GRAFANA_DB_PASSWORD as spring.flyway.placeholders.grafanaDbPassword — clean, but requires Spring config change.
2. Create the role in Flyway without a password and set it via a separate idempotent startup script.
  Option 1 is the cleaner developer experience — the password is set atomically in the migration, and the migration is idempotent because CREATE ROLE IF NOT EXISTS is supported in PostgreSQL 9.6+.

Flyway placeholder config snippet (if option 1 is chosen):

# application.yaml
spring:
  flyway:
    placeholders:
      grafanaDbPassword: ${GRAFANA_DB_PASSWORD:changeme-set-in-env}

-- V68__add_grafana_reader_role.sql
DO $$ BEGIN
  IF NOT EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = 'grafana_reader') THEN
    CREATE ROLE grafana_reader WITH LOGIN PASSWORD '${grafanaDbPassword}';
  END IF;
END $$;
GRANT CONNECT ON DATABASE ${flyway:database} TO grafana_reader;
GRANT USAGE ON SCHEMA public TO grafana_reader;
GRANT SELECT ON audit_log, documents, transcription_blocks TO grafana_reader;

The IF NOT EXISTS block makes the migration re-runnable safely if the role already exists in the database.

Dashboard JSON __inputs section. When exporting a dashboard from Grafana for provisioning, the JSON includes an __inputs block that references datasource UIDs. Make sure the postgres datasource UID in the JSON matches the uid: postgres defined in the provisioned datasource. A mismatch causes "No data" for all PostgreSQL panels on first load.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Observations - **This is pure infrastructure — no application code changes.** No Spring Boot services, no SvelteKit routes, no TypeScript types, and no Python code are touched. The TDD lens applies to the Flyway migration (run-against-real-Postgres contract) rather than unit tests. - **PromQL queries are correctly specified.** All four sub-row 1a metrics are standard Micrometer/Spring Boot Prometheus metric names. `http_server_requests_seconds_count`, `http_server_requests_seconds_bucket`, and `up{job="spring-boot"}` are what the existing Spring Boot observability stack emits. I verified by checking `infra/observability/prometheus/prometheus.yml` context — the backend scrape job is already wired. - **PostgreSQL queries are syntactically correct.** All `audit_log` queries filter on `happened_at` and `kind`. The `AuditKind` enum values (`LOGIN_SUCCESS`, `LOGIN_FAILED`, `LOGIN_RATE_LIMITED`, `FILE_UPLOADED`, `TEXT_SAVED`, `BLOCK_REVIEWED`) are all confirmed in `AuditKind.java`. The column names (`happened_at`, `actor_id`, `kind`) match the `AuditLog` entity. - **`transcription_blocks` coverage query.** The query compares `text IS NOT NULL AND text != ''` (filled) vs `text IS NULL OR text = ''` (empty). This is a correct expression of the entity's nullable `text` column. However, Grafana's PostgreSQL datasource requires the progress bar visualisation to be achieved via `bar gauge` panel type with a single value; the issue describes "progress bar" which maps cleanly to that type. - **Next Flyway version is V68.** The latest migration is `V67__recreate_spring_session_tables.sql`. The new `grafana_reader` migration should be `V68__add_grafana_reader_role.sql`. ### Recommendations - **The Flyway migration cannot directly interpolate `$GRAFANA_DB_PASSWORD`.** Flyway SQL files do not support `${ENV_VAR}` substitution by default. Flyway's `placeholders` feature requires opt-in configuration in `application.yaml` (`spring.flyway.placeholders.*`). Either: 1. Enable Flyway placeholder substitution and pass `GRAFANA_DB_PASSWORD` as `spring.flyway.placeholders.grafanaDbPassword` — clean, but requires Spring config change. 2. Create the role in Flyway without a password and set it via a separate idempotent startup script. Option 1 is the cleaner developer experience — the password is set atomically in the migration, and the migration is idempotent because `CREATE ROLE IF NOT EXISTS` is supported in PostgreSQL 9.6+. - **Flyway placeholder config snippet** (if option 1 is chosen): ```yaml # application.yaml spring: flyway: placeholders: grafanaDbPassword: ${GRAFANA_DB_PASSWORD:changeme-set-in-env} ``` ```sql -- V68__add_grafana_reader_role.sql DO $$ BEGIN IF NOT EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = 'grafana_reader') THEN CREATE ROLE grafana_reader WITH LOGIN PASSWORD '${grafanaDbPassword}'; END IF; END $$; GRANT CONNECT ON DATABASE ${flyway:database} TO grafana_reader; GRANT USAGE ON SCHEMA public TO grafana_reader; GRANT SELECT ON audit_log, documents, transcription_blocks TO grafana_reader; ``` The `IF NOT EXISTS` block makes the migration re-runnable safely if the role already exists in the database. - **Dashboard JSON `__inputs` section.** When exporting a dashboard from Grafana for provisioning, the JSON includes an `__inputs` block that references datasource UIDs. Make sure the `postgres` datasource UID in the JSON matches the `uid: postgres` defined in the provisioned datasource. A mismatch causes "No data" for all PostgreSQL panels on first load.

marcel referenced this issue

2026-05-21 15:20:52 +02:00

As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652

marcel commented

2026-05-21 15:21:04 +02:00

🚀 Tobias Wendt — DevOps & Platform Engineer

Observations

The existing observability stack is well-structured. All Grafana images are pinned (grafana/grafana-oss:11.6.1), healthchecks exist, named volumes are used for persistence, and the provisioning pattern is already established. This issue builds cleanly on top of what exists.
The dashboard JSON path is already covered. dashboards.yml uses a file provider scanning /etc/grafana/provisioning/dashboards with updateIntervalSeconds: 30. Dropping po-overview.json there is all that's needed — confirmed.
obs-grafana does not currently join archiv-net. Looking at the compose file, obs-grafana is only on obs-net. Adding the PostgreSQL datasource requires Grafana to reach archive-db:5432. The compose file needs obs-grafana to also join archiv-net — same pattern as obs-promtail and obs-glitchtip.
GRAFANA_DB_PASSWORD is not yet in .env.example. Confirmed — it's absent. The issue correctly identifies this as a required addition.
No GRAFANA_DB_PASSWORD in obs.env either. The non-secret block at infra/observability/obs.env does not have this variable. It should have a commented-out reference explaining that the real value comes from CI secrets (same pattern as GRAFANA_ADMIN_PASSWORD which is in the main .env.example).
The archive-db service name is correct. The main docker-compose.yml names the Postgres service db but the container name is archive-db (from container_name: archive-db). The datasource URL ${POSTGRES_HOST:-archive-db}:5432 uses the container name, which is the correct approach for cross-Compose-file DNS resolution on a shared network.

Recommendations

Add archiv-net to obs-grafana's networks block in docker-compose.observability.yml:
```
obs-grafana:
  networks:
    - obs-net
    - archiv-net  # needed to reach archive-db:5432 for the PostgreSQL datasource
```
Without this, the datasource connection will time out silently and all Row 2/3 panels will show "No data."

Pass GRAFANA_DB_PASSWORD into obs-grafana in docker-compose.observability.yml:

obs-grafana:
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
    GF_USERS_ALLOW_SIGN_UP: "false"
    GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003}
    GRAFANA_DB_PASSWORD: ${GRAFANA_DB_PASSWORD}  # injected into datasource provisioning

Add GRAFANA_DB_PASSWORD to .env.example with a clear generation instruction:

# Password for the read-only grafana_reader PostgreSQL role (Row 2+3 dashboard panels)
# Generate: openssl rand -hex 32
GRAFANA_DB_PASSWORD=change-me-generate-with-openssl

The disableDeletion: true on the dashboard provider is correct. This prevents the PO from accidentally deleting the provisioned dashboard via the Grafana UI. Keep it.
Consider updateIntervalSeconds: 30 vs. 0 for provisioned dashboards. With 30, Grafana polls for JSON file changes every 30 seconds. For a static provisioned dashboard this is fine, but if you're iterating on the JSON during development, set it to 10 locally for faster feedback.
Row 4 placeholder panels. When the Row 4 panels are blocked by #652, consider adding collapsed empty row panels with a "Blocked by #652" title in the JSON. This makes the dashboard structure visible to anyone who opens it and avoids a confusing empty section — alternatively, simply don't add Row 4 yet and add it in the follow-up commit, which the issue already states as the plan.

## 🚀 Tobias Wendt — DevOps & Platform Engineer ### Observations - **The existing observability stack is well-structured.** All Grafana images are pinned (`grafana/grafana-oss:11.6.1`), healthchecks exist, named volumes are used for persistence, and the provisioning pattern is already established. This issue builds cleanly on top of what exists. - **The dashboard JSON path is already covered.** `dashboards.yml` uses a `file` provider scanning `/etc/grafana/provisioning/dashboards` with `updateIntervalSeconds: 30`. Dropping `po-overview.json` there is all that's needed — confirmed. - **`obs-grafana` does not currently join `archiv-net`.** Looking at the compose file, `obs-grafana` is only on `obs-net`. Adding the PostgreSQL datasource requires Grafana to reach `archive-db:5432`. The compose file needs `obs-grafana` to also join `archiv-net` — same pattern as `obs-promtail` and `obs-glitchtip`. - **`GRAFANA_DB_PASSWORD` is not yet in `.env.example`.** Confirmed — it's absent. The issue correctly identifies this as a required addition. - **No `GRAFANA_DB_PASSWORD` in `obs.env` either.** The non-secret block at `infra/observability/obs.env` does not have this variable. It should have a commented-out reference explaining that the real value comes from CI secrets (same pattern as `GRAFANA_ADMIN_PASSWORD` which is in the main `.env.example`). - **The `archive-db` service name is correct.** The main `docker-compose.yml` names the Postgres service `db` but the container name is `archive-db` (from `container_name: archive-db`). The datasource URL `${POSTGRES_HOST:-archive-db}:5432` uses the container name, which is the correct approach for cross-Compose-file DNS resolution on a shared network. ### Recommendations - **Add `archiv-net` to `obs-grafana`'s networks block** in `docker-compose.observability.yml`: ```yaml obs-grafana: networks: - obs-net - archiv-net # needed to reach archive-db:5432 for the PostgreSQL datasource ``` Without this, the datasource connection will time out silently and all Row 2/3 panels will show "No data." - **Pass `GRAFANA_DB_PASSWORD` into `obs-grafana`** in `docker-compose.observability.yml`: ```yaml obs-grafana: environment: GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme} GF_USERS_ALLOW_SIGN_UP: "false" GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003} GRAFANA_DB_PASSWORD: ${GRAFANA_DB_PASSWORD} # injected into datasource provisioning ``` - **Add `GRAFANA_DB_PASSWORD` to `.env.example`** with a clear generation instruction: ```bash # Password for the read-only grafana_reader PostgreSQL role (Row 2+3 dashboard panels) # Generate: openssl rand -hex 32 GRAFANA_DB_PASSWORD=change-me-generate-with-openssl ``` - **The `disableDeletion: true` on the dashboard provider is correct.** This prevents the PO from accidentally deleting the provisioned dashboard via the Grafana UI. Keep it. - **Consider `updateIntervalSeconds: 30` vs. `0` for provisioned dashboards.** With `30`, Grafana polls for JSON file changes every 30 seconds. For a static provisioned dashboard this is fine, but if you're iterating on the JSON during development, set it to `10` locally for faster feedback. - **Row 4 placeholder panels.** When the Row 4 panels are blocked by #652, consider adding collapsed empty row panels with a "Blocked by #652" title in the JSON. This makes the dashboard structure visible to anyone who opens it and avoids a confusing empty section — alternatively, simply don't add Row 4 yet and add it in the follow-up commit, which the issue already states as the plan.

marcel commented

2026-05-21 15:21:22 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Observations

This is a Grafana dashboard, not a Svelte component. The brand system, Tailwind tokens, and frontend component guidelines do not apply here. Grafana's visualization engine is the rendering surface.
The target audience is explicitly "lightly technical." The issue correctly identifies this: the PO can read "HTTP 5xx: 3" or "p95: 240 ms" but does not need JVM heap graphs. The panel selection respects this constraint — no memory pool breakdowns, no GC pause histograms.
Color semantics and redundant cues. The threshold table (green/amber/red) uses color to convey status. Grafana's stat panels support thresholds natively, but the color alone is the only signal. Since the PO audience is not necessarily color-vision-deficient (and Grafana's threshold color system is industry-standard for operations dashboards), this is acceptable — but noting it as a deliberate tradeoff.
Panel naming matters for a non-technical audience. The issue specifies query expressions but not panel titles for all panels. "p95 response time" is clear; histogram_quantile(0.95, ...) as a panel title would not be. The dashboard JSON must use human-readable panel titles throughout.
Default time range of "Last 7 days" and "Manual refresh" is appropriate for weekly stakeholder meetings. No concerns here.

Recommendations

Define all panel titles explicitly in the issue or in the dashboard JSON comments before implementation. Recommended titles:
- Backend status → "Backend Status"
- HTTP 5xx count → "Server Errors (5xx)"
- p95 response time → "Response Time (p95)"
- Error log lines → "Error Log Count"
- CPU % → "CPU Usage"
- RAM % → "Memory Usage"
- Storage % → "Disk Usage"
- Logins per day → "Daily Logins (last 7 days)"
- Failed login attempts → "Failed Login Attempts"
Use Grafana "Stat" panels for KPI numbers, not "Graph" or "Time series" — the PO needs a single number at a glance, not trend analysis. The issue implicitly indicates this for most panels but is explicit only for some. Bar chart for "Logins per day" is correct.
Organize into collapsible rows. Grafana supports row panels that collapse. Using rows labeled "System Health", "User Activity", "Archive Progress", and "OCR Health (coming soon)" gives the PO a scannable structure that matches the four questions in the Context section.
"OCR Health" row should be provisioned as a collapsed placeholder row in the initial JSON (title only, no panels), rather than omitting it entirely. This signals to the PO that it is coming and prevents confusion about the dashboard structure when #652 lands.
No mobile concern here — Grafana dashboards are not used on phones, and the issue correctly excludes mobile/responsive layout from scope.

No open decisions from the UX side — the audience, scope, and panel types are well-specified.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist ### Observations - **This is a Grafana dashboard, not a Svelte component.** The brand system, Tailwind tokens, and frontend component guidelines do not apply here. Grafana's visualization engine is the rendering surface. - **The target audience is explicitly "lightly technical."** The issue correctly identifies this: the PO can read "HTTP 5xx: 3" or "p95: 240 ms" but does not need JVM heap graphs. The panel selection respects this constraint — no memory pool breakdowns, no GC pause histograms. - **Color semantics and redundant cues.** The threshold table (green/amber/red) uses color to convey status. Grafana's stat panels support thresholds natively, but the color alone is the only signal. Since the PO audience is not necessarily color-vision-deficient (and Grafana's threshold color system is industry-standard for operations dashboards), this is acceptable — but noting it as a deliberate tradeoff. - **Panel naming matters for a non-technical audience.** The issue specifies query expressions but not panel titles for all panels. "p95 response time" is clear; `histogram_quantile(0.95, ...)` as a panel title would not be. The dashboard JSON must use human-readable panel titles throughout. - **Default time range of "Last 7 days" and "Manual refresh"** is appropriate for weekly stakeholder meetings. No concerns here. ### Recommendations - **Define all panel titles explicitly** in the issue or in the dashboard JSON comments before implementation. Recommended titles: - Backend status → "Backend Status" - HTTP 5xx count → "Server Errors (5xx)" - p95 response time → "Response Time (p95)" - Error log lines → "Error Log Count" - CPU % → "CPU Usage" - RAM % → "Memory Usage" - Storage % → "Disk Usage" - Logins per day → "Daily Logins (last 7 days)" - Failed login attempts → "Failed Login Attempts" - **Use Grafana "Stat" panels for KPI numbers**, not "Graph" or "Time series" — the PO needs a single number at a glance, not trend analysis. The issue implicitly indicates this for most panels but is explicit only for some. Bar chart for "Logins per day" is correct. - **Organize into collapsible rows.** Grafana supports row panels that collapse. Using rows labeled "System Health", "User Activity", "Archive Progress", and "OCR Health (coming soon)" gives the PO a scannable structure that matches the four questions in the Context section. - **"OCR Health" row should be provisioned as a collapsed placeholder row** in the initial JSON (title only, no panels), rather than omitting it entirely. This signals to the PO that it is coming and prevents confusion about the dashboard structure when #652 lands. - **No mobile concern here** — Grafana dashboards are not used on phones, and the issue correctly excludes mobile/responsive layout from scope. No open decisions from the UX side — the audience, scope, and panel types are well-specified.

marcel commented

2026-05-21 15:21:45 +02:00

📋 Elicit — Requirements Engineer

Observations

The issue is spec-dense and implementation-ready. User story, four acceptance criterion blocks, detailed infrastructure steps, NFRs with IDs, and explicit out-of-scope items are all present. This is above the project's baseline quality.
BLOCK_REVIEWED audit kind is confirmed. I checked AuditKind.java — this enum value exists and is ROLLUP_ELIGIBLE. The Row 3 query is grounded in real data.
NFR-OPS-02 creates an implicit functional requirement. "The GRAFANA_DB_PASSWORD env var must default to a non-empty placeholder in .env.example so a missing var causes an obvious error at startup." This is an infrastructure startup-failure requirement, not just a documentation requirement. It implies the application or compose setup should validate that the var is set before Grafana tries to use it — and "obvious error at startup" needs a concrete implementation path.
Ambiguity: "password from GRAFANA_DB_PASSWORD env var" in the migration SQL. A Flyway SQL migration cannot read env vars directly. The spec is ambiguous about how the env var gets into the SQL. This is a real ambiguity that must be resolved before implementation (Markus and Felix both flagged it from different angles).
Row 3 Transcription block coverage panel. The spec says: "Shows % complete + +N blocks this week badge from audit_log TEXT_SAVED." A Grafana stat panel shows one value. A badge with a secondary metric requires either two separate panels or a Grafana Transformation. This is not resolvable without a design decision.
documents count filter status != 'PLACEHOLDER'. The DocumentStatus lifecycle in CLAUDE.md lists PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED. The query correctly excludes PLACEHOLDER documents (Excel import stubs with no file). This is the right semantic for "total documents in the archive."
NFR-PERF-01 "load within 5 seconds" is untestable without a defined measurement procedure. There is no acceptance test for this criterion.

Recommendations

Resolve the env-var-in-Flyway ambiguity before implementation starts. The spec assumes env vars can be interpolated into SQL, which Flyway does not do by default. The implementer needs to choose one of the approaches raised by Markus/Felix and the issue should be updated with that decision to avoid wasted implementation work.
Split the "transcription coverage + badge" panel into two panels or explicitly specify that it is two panels (a bar gauge + a secondary stat). Update the acceptance criterion to: THEN they see a bar gauge showing % completion AND a stat panel showing +N blocks transcribed this week. This removes the ambiguity about whether a single panel can show both values.
Make NFR-PERF-01 testable. Replace "must load within 5 seconds" with: "When the dashboard is opened on a local network with all observability services healthy, all panels complete loading within 5 seconds as measured by browser DevTools Network waterfall." This converts it from a wish into a verifiable criterion.
Add an explicit AC for the grafana_reader role. The Flyway migration and network change are infrastructure, but the security boundary deserves an explicit acceptance criterion: GIVEN the grafana_reader role exists, WHEN it attempts SELECT on app_users, THEN it receives permission denied. This can be verified with a one-line psql check in the smoke test.

Open Decisions

Transcription coverage panel: single combined panel or two panels? The issue describes one panel showing both % complete and +N this week. Grafana stat panels cannot display two independent metrics in a single panel without a Transformation step. Decide: (a) one panel with a Grafana Transformation concatenating both values into a display string; (b) two adjacent panels — a bar gauge for % and a stat for weekly delta. Option (b) is simpler to build and easier for the PO to read.

## 📋 Elicit — Requirements Engineer ### Observations - **The issue is spec-dense and implementation-ready.** User story, four acceptance criterion blocks, detailed infrastructure steps, NFRs with IDs, and explicit out-of-scope items are all present. This is above the project's baseline quality. - **`BLOCK_REVIEWED` audit kind is confirmed.** I checked `AuditKind.java` — this enum value exists and is `ROLLUP_ELIGIBLE`. The Row 3 query is grounded in real data. - **NFR-OPS-02 creates an implicit functional requirement.** "The `GRAFANA_DB_PASSWORD` env var must default to a non-empty placeholder in `.env.example` so a missing var causes an obvious error at startup." This is an infrastructure startup-failure requirement, not just a documentation requirement. It implies the application or compose setup should validate that the var is set before Grafana tries to use it — and "obvious error at startup" needs a concrete implementation path. - **Ambiguity: "password from GRAFANA_DB_PASSWORD env var"** in the migration SQL. A Flyway SQL migration cannot read env vars directly. The spec is ambiguous about how the env var gets into the SQL. This is a real ambiguity that must be resolved before implementation (Markus and Felix both flagged it from different angles). - **Row 3 Transcription block coverage panel.** The spec says: "Shows `%` complete + `+N blocks this week` badge from `audit_log` `TEXT_SAVED`." A Grafana stat panel shows one value. A badge with a secondary metric requires either two separate panels or a Grafana Transformation. This is not resolvable without a design decision. - **`documents` count filter `status != 'PLACEHOLDER'`.** The `DocumentStatus` lifecycle in CLAUDE.md lists `PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED`. The query correctly excludes `PLACEHOLDER` documents (Excel import stubs with no file). This is the right semantic for "total documents in the archive." - **NFR-PERF-01 "load within 5 seconds" is untestable** without a defined measurement procedure. There is no acceptance test for this criterion. ### Recommendations - **Resolve the env-var-in-Flyway ambiguity before implementation starts.** The spec assumes env vars can be interpolated into SQL, which Flyway does not do by default. The implementer needs to choose one of the approaches raised by Markus/Felix and the issue should be updated with that decision to avoid wasted implementation work. - **Split the "transcription coverage + badge" panel into two panels** or explicitly specify that it is two panels (a bar gauge + a secondary stat). Update the acceptance criterion to: `THEN they see a bar gauge showing % completion AND a stat panel showing +N blocks transcribed this week`. This removes the ambiguity about whether a single panel can show both values. - **Make NFR-PERF-01 testable.** Replace "must load within 5 seconds" with: "When the dashboard is opened on a local network with all observability services healthy, all panels complete loading within 5 seconds as measured by browser DevTools Network waterfall." This converts it from a wish into a verifiable criterion. - **Add an explicit AC for the `grafana_reader` role.** The Flyway migration and network change are infrastructure, but the security boundary deserves an explicit acceptance criterion: `GIVEN the grafana_reader role exists, WHEN it attempts SELECT on app_users, THEN it receives permission denied`. This can be verified with a one-line psql check in the smoke test. ### Open Decisions - **Transcription coverage panel: single combined panel or two panels?** The issue describes one panel showing both `%` complete and `+N this week`. Grafana stat panels cannot display two independent metrics in a single panel without a Transformation step. Decide: (a) one panel with a Grafana Transformation concatenating both values into a display string; (b) two adjacent panels — a bar gauge for % and a stat for weekly delta. Option (b) is simpler to build and easier for the PO to read.

marcel commented

2026-05-21 15:22:01 +02:00

🗳️ Decision Queue — Action Required

3 decisions need your input before implementation starts.

Infrastructure / Database

How does GRAFANA_DB_PASSWORD get into the Flyway SQL migration? Flyway SQL files cannot interpolate env vars by default. Three viable options: (a) Flyway placeholder substitution via spring.flyway.placeholders.grafanaDbPassword — clean, atomic, requires one Spring config line; (b) Flyway migration creates the role with NOLOGIN, a separate idempotent startup script sets the password at deploy time — simpler, no Spring config change; (c) hardcode a placeholder in the migration and document that production must run ALTER ROLE manually — least safe. Option (a) is recommended by Felix; option (b) is recommended by Markus and Nora as the simpler, more secure operational pattern. (Raised by: Markus, Felix, Nora, Elicit)

Dashboard Design

Transcription block coverage panel: single panel or two panels? The spec describes one panel showing both % complete and +N blocks this week. Grafana stat panels display one value. Showing both requires either: (a) two adjacent panels — a bar gauge for % and a stat for the weekly delta (simpler, more readable for the PO); (b) one panel with a Grafana Transformation that computes a composite string (more compact but harder to build and harder to read). Option (a) is recommended. (Raised by: Sara, Elicit, Leonie)
Row 4 placeholder in the initial JSON? Two approaches: (a) omit Row 4 entirely from the initial JSON and add it in the follow-up commit after #652 lands — cleaner JSON, no confusion; (b) include a collapsed empty row panel titled "OCR Health (blocked by #652)" so the PO sees the planned structure — more transparent but adds placeholder complexity. The issue already states "Row 4 in a follow-up commit," so option (a) is the default unless transparency for the PO is a priority. (Raised by: Tobias, Leonie)

## 🗳️ Decision Queue — Action Required _3 decisions need your input before implementation starts._ ### Infrastructure / Database - **How does `GRAFANA_DB_PASSWORD` get into the Flyway SQL migration?** Flyway SQL files cannot interpolate env vars by default. Three viable options: (a) Flyway placeholder substitution via `spring.flyway.placeholders.grafanaDbPassword` — clean, atomic, requires one Spring config line; (b) Flyway migration creates the role with `NOLOGIN`, a separate idempotent startup script sets the password at deploy time — simpler, no Spring config change; (c) hardcode a placeholder in the migration and document that production must run `ALTER ROLE` manually — least safe. Option (a) is recommended by Felix; option (b) is recommended by Markus and Nora as the simpler, more secure operational pattern. _(Raised by: Markus, Felix, Nora, Elicit)_ ### Dashboard Design - **Transcription block coverage panel: single panel or two panels?** The spec describes one panel showing both `%` complete and `+N blocks this week`. Grafana stat panels display one value. Showing both requires either: (a) two adjacent panels — a bar gauge for % and a stat for the weekly delta (simpler, more readable for the PO); (b) one panel with a Grafana Transformation that computes a composite string (more compact but harder to build and harder to read). Option (a) is recommended. _(Raised by: Sara, Elicit, Leonie)_ - **Row 4 placeholder in the initial JSON?** Two approaches: (a) omit Row 4 entirely from the initial JSON and add it in the follow-up commit after #652 lands — cleaner JSON, no confusion; (b) include a collapsed empty row panel titled "OCR Health (blocked by #652)" so the PO sees the planned structure — more transparent but adds placeholder complexity. The issue already states "Row 4 in a follow-up commit," so option (a) is the default unless transparency for the PO is a priority. _(Raised by: Tobias, Leonie)_

marcel commented

2026-05-21 15:43:21 +02:00

We do the recommendations, but this ticket will be worked on after the OCR service exposes it's metrics, so no placeholder.

marcel referenced this issue

2026-05-21 16:17:47 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue

2026-05-21 16:18:09 +02:00

As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652

marcel referenced this issue

2026-05-21 16:46:42 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue

2026-05-21 17:15:46 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue

2026-05-21 17:17:29 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

feat(observability): create grafana_reader read-only DB role

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

feat(observability): provision Grafana PostgreSQL datasource

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

chore(observability): document GRAFANA_DB_PASSWORD in env files

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

feat(observability): add PO Overview Grafana dashboard

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

docs(deployment): document GRAFANA_DB_PASSWORD across env tables

marcel referenced this issue from a commit

2026-05-21 19:28:13 +02:00

docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard

marcel referenced a pull request that will close this issue

2026-05-21 19:28:42 +02:00

feat(observability): PO Overview Grafana dashboard (#651) #659

marcel commented

2026-05-21 19:28:58 +02:00

Implementation shipped as PR #659.

Decisions resolved (from the three-item Decision Queue):

Password injection → Flyway placeholder substitution (Felix's option). FlywayConfig reads GRAFANA_DB_PASSWORD from env and passes it via .placeholders(Map.of("grafanaDbPassword", …)). Migration V68 wraps role creation in DO $$ IF NOT EXISTS $$ so it's idempotent (CREATE first run, ALTER thereafter).
Transcription coverage panel → two panels (option a). Full-width bar gauge for the % + the existing Blocks Transcribed This Week stat carries the weekly delta.
Row 4 placeholder → none. #652 is closed so the panels use real Prometheus queries.

Commits (atomic, conventional prefixes):

f0b801f1 feat(observability): create grafana_reader read-only DB role
f9d4d9a2 feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD
1564ffea feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend
336ef20b feat(observability): provision Grafana PostgreSQL datasource
93eed612 chore(observability): document GRAFANA_DB_PASSWORD in env files
99c9612a feat(observability): add PO Overview Grafana dashboard
2fa1ce3e docs(deployment): document GRAFANA_DB_PASSWORD across env tables
5d191b22 docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard

Test evidence

GrafanaReaderRoleIntegrationTest — 4/4 green (positive grants on audit_log/documents/transcription_blocks, negative grant on app_users covering NFR-SEC-01).
AuditLogQueryRepositoryIntegrationTest — still 4/4 green (placeholder change didn't break existing Flyway runs).
Backend compiles clean.

Manual smoke (open Grafana, confirm panels populate) is left for the reviewer / operator per Sara's recommendation; the privilege assertions in the PR description show the exact psql checks to run.

Implementation shipped as PR #659. **Decisions resolved** (from the three-item Decision Queue): 1. **Password injection** → Flyway placeholder substitution (Felix's option). `FlywayConfig` reads `GRAFANA_DB_PASSWORD` from env and passes it via `.placeholders(Map.of("grafanaDbPassword", …))`. Migration V68 wraps role creation in `DO $$ IF NOT EXISTS $$` so it's idempotent (`CREATE` first run, `ALTER` thereafter). 2. **Transcription coverage panel** → two panels (option a). Full-width bar gauge for the % + the existing `Blocks Transcribed This Week` stat carries the weekly delta. 3. **Row 4 placeholder** → none. #652 is closed so the panels use real Prometheus queries. **Commits** (atomic, conventional prefixes): - `f0b801f1` feat(observability): create grafana_reader read-only DB role - `f9d4d9a2` feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD - `1564ffea` feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend - `336ef20b` feat(observability): provision Grafana PostgreSQL datasource - `93eed612` chore(observability): document GRAFANA_DB_PASSWORD in env files - `99c9612a` feat(observability): add PO Overview Grafana dashboard - `2fa1ce3e` docs(deployment): document GRAFANA_DB_PASSWORD across env tables - `5d191b22` docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard **Test evidence** - `GrafanaReaderRoleIntegrationTest` — 4/4 green (positive grants on `audit_log`/`documents`/`transcription_blocks`, negative grant on `app_users` covering NFR-SEC-01). - `AuditLogQueryRepositoryIntegrationTest` — still 4/4 green (placeholder change didn't break existing Flyway runs). - Backend compiles clean. Manual smoke (open Grafana, confirm panels populate) is left for the reviewer / operator per Sara's recommendation; the privilege assertions in the PR description show the exact `psql` checks to run.

marcel referenced this issue from a commit

2026-05-21 19:44:44 +02:00

ci(observability): inject GRAFANA_DB_PASSWORD from Gitea secrets

marcel referenced this issue from a commit

2026-05-21 20:22:25 +02:00

feat(observability): create grafana_reader read-only DB role

marcel referenced this issue from a commit

2026-05-21 20:22:25 +02:00

feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD

marcel referenced this issue from a commit

2026-05-21 20:22:25 +02:00

feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend