observability: add /api/dashboard/activity p95 latency panel to Grafana #291
Background
Deferred from PR #288 during review cycle 1 (Tobias Wendt).
The rollup query + partial covering index introduced in #285 makes /api/dashboard/activity a new hot path consumed by both /chronik and the dashboard side-rail. If the V49 index is missing on some environment, or a future query regression hits, the symptom is a slow feed, before anything else breaks.
Concern
No Grafana panel for /api/dashboard/activity request rate + latency. A silent regression would go unnoticed until users complain.
Scope
Reference
backend/src/main/resources/db/migration/V49__add_audit_log_rollup_index.sql
🏗️ Tobias Wendt — DevOps & Platform Engineer
Observations
spring-boot-starter-actuator at pom.xml:35). Prometheus scrape endpoint is NOT enabled yet: it needs the micrometer-registry-prometheus dependency + management.endpoints.web.exposure.include + management.endpoint.prometheus.enabled: true.
docs/infrastructure/production-compose.md (lines 65–95) with pinned versions (prom/prometheus:v2.51.0, grafana/grafana:10.4.0, grafana/loki:2.9.0, grafana/promtail:2.9.0). The ./observability/ directory is referenced for provisioning config but does not exist in the repo yet; it needs creation with prometheus.yml + Grafana provisioning YAML + dashboard JSON.
observability/ folder, no dashboards in the repo. This is the first observability PR: setup work, not just a panel addition.
http.server.requests histogram metrics by default once Prometheus registry is wired. Per-URI latency is available via the uri label (templated: /api/dashboard/activity, not /api/dashboard/activity?limit=40).
Recommendations
micrometer-registry-prometheus to pom.xml, enable /actuator/prometheus behind the internal management port (8081, per architect guidance; never the public port). Add an application-prod.yaml override.
observability/ directory structure:
sum(rate(http_server_requests_seconds_count{uri="/api/dashboard/activity"}[5m]))
histogram_quantile(0.50 | 0.95 | 0.99, sum(rate(http_server_requests_seconds_bucket{uri="/api/dashboard/activity"}[5m])) by (le))
sum(rate(http_server_requests_seconds_count{uri="/api/dashboard/activity", status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count{uri="/api/dashboard/activity"}[5m]))
/api/dashboard/resume and /api/dashboard/pulse while you're there: cheap, and these all share the rollup/audit query path.
/actuator/prometheus at Caddy. Management port 8081 is internal-only (Prometheus scrapes it over the compose network). The public backend reverse proxy must not forward /actuator/*; this was already called out as a security requirement.
/api/dashboard/activity?beforeAt=...&beforeDocId=...&beforeKind=... (once #290 lands): make sure Micrometer templates the URI as /api/dashboard/activity, not the full querystring. Default behavior in Spring Boot 4 does this correctly; verify with a live scrape before merge.
Open Decisions
/api/dashboard/activity, and the alert.
Option B: do it properly (same as A, plus extend to all dashboard endpoints, JVM metrics panel, and a log aggregation check).
Option C: split into two issues. This one stays "just the panel," and a separate issue sets up the Prometheus pipeline.
Recommend C if you want a clean commit history; A if you want to ship faster. B is premature without a clear operational pain point.
🏛️ Markus Keller — Senior Application Architect
Observations
http_server_requests_seconds) is an application-framework concern. The rollup query itself emits no metrics today; Micrometer's Spring Data JPA integration is not wired.
docs/infrastructure/production-compose.md. This issue should reuse that split, not negotiate it.
Recommendations
/chronik slow?" That's an end-to-end question. SQL-level timing for the rollup query specifically is a follow-up if the HTTP panel ever fires a slow-alert and we need to narrow down.
MetricsService wrapper or project-specific timer interface. Inject MeterRegistry if a non-HTTP custom metric ever gets justified; until then, lean on Actuator defaults.
docs/runbooks/slow-activity-feed.md with: Symptom → Immediate check (is V49 present? \d+ idx_audit_log_rollup) → Common causes → Resolution → Escalation. Same rhythm as an ADR but operational.
Open Decisions
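For readers less familiar with how the p50/p95/p99 values in an HTTP latency panel are derived: histogram_quantile never sees individual request timings, only cumulative bucket counts (the _bucket series keyed by le), and it interpolates linearly inside the bucket that contains the requested rank. The Java sketch below mirrors that interpolation under simplifying assumptions (classic buckets starting at 0, a final +Inf bucket). The class name, method name, and sample numbers are invented for illustration and are not part of this repo or of Prometheus. It is also worth confirming in a live scrape that the _bucket series exist at all, since Micrometer typically publishes them only when histogram publishing is enabled for http.server.requests.

```java
/** Illustrative only: a simplified version of the interpolation behind histogram_quantile. */
final class HistogramQuantileSketch {

    /**
     * @param q                quantile in [0, 1], e.g. 0.95
     * @param upperBounds      bucket upper bounds ("le"), ascending; the last must be +Infinity
     * @param cumulativeCounts cumulative observation count per bucket, same order as upperBounds
     * @return estimated quantile value, or NaN if the input is unusable
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || n != cumulativeCounts.length || !Double.isInfinite(upperBounds[n - 1])) {
            return Double.NaN;
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0 || q < 0 || q > 1) {
            return Double.NaN;
        }

        // Rank of the requested observation among all observations.
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }

        // Rank falls into the +Inf bucket: report the highest finite bound.
        if (i == n - 1) {
            return upperBounds[n - 2];
        }

        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double previous = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - previous;
        if (inBucket == 0) {
            return lower;
        }
        // Linear interpolation inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - previous) / inBucket;
    }

    public static void main(String[] args) {
        // Hypothetical bucket layout in seconds, with made-up counts.
        double[] le = {0.1, 0.25, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 95, 99, 100};
        System.out.println("p95 estimate (seconds): " + quantile(0.95, le, counts));
    }
}
```

Because the estimate is interpolated, it can only be as precise as the bucket boundaries around the value of interest, which is worth keeping in mind when choosing a threshold for a p95 alert.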
🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
application.yaml currently has only management.health.mail.enabled: false, with no management.endpoints.web.exposure.include override. Spring Boot 4 default exposes only /actuator/health on the web, which is safe.
/actuator/prometheus broadens the attack surface. If reachable from the internet, it leaks:
Recommendations
docs/infrastructure/production-compose.md. Enforce it at the application level:
/actuator/* on the public domain. Add an explicit 404:
.env, never hardcoded in alertmanager.yml committed to the repo.
{uri="/api/documents/{id}"} is safe; raw paths like /api/documents/550e8400-... would create high-cardinality metrics and leak document IDs. Spring Boot's default URI tagger templates correctly; verify after wiring.
Open Decisions
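To illustrate why the templated uri label matters, the hypothetical Java sketch below collapses raw request targets to a template by dropping the query string and replacing UUID-shaped path segments with {id}, then compares the number of distinct label values either way. This is not how Spring Boot or Micrometer derive the label (the framework is expected to use the matched route pattern rather than a regex). The class, the method names, and the sample UUID are invented purely to show the cardinality and ID-leak difference.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: shows how templating bounds the number of distinct uri label values. */
final class UriLabelCardinalitySketch {

    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    /** Drops the query string and replaces UUID-shaped segments with the literal "{id}". */
    static String toTemplate(String requestTarget) {
        int queryStart = requestTarget.indexOf('?');
        String path = (queryStart >= 0) ? requestTarget.substring(0, queryStart) : requestTarget;
        return UUID_SEGMENT.matcher(path).replaceAll(Matcher.quoteReplacement("{id}"));
    }

    /** Collects the distinct label values that the given requests would produce. */
    static Set<String> distinctLabels(List<String> requestTargets, boolean templated) {
        Set<String> labels = new LinkedHashSet<>();
        for (String target : requestTargets) {
            labels.add(templated ? toTemplate(target) : target);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up request targets; the UUID is a placeholder, not a real document ID.
        List<String> targets = List.of(
                "/api/dashboard/activity?limit=40",
                "/api/dashboard/activity?limit=20",
                "/api/documents/123e4567-e89b-12d3-a456-426614174000");
        System.out.println("raw labels:       " + distinctLabels(targets, false));
        System.out.println("templated labels: " + distinctLabels(targets, true));
    }
}
```

With raw targets every new query string or document ID becomes its own time series; with templates the label set stays fixed at the number of routes, and no identifiers end up in the metrics.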
🧪 Sara Holt — Senior QA Engineer
Observations
spring-boot-starter-actuator-test is already in pom.xml:84; @AutoConfigureObservability or direct MeterRegistry access in tests is supported.
Recommendations
/actuator/prometheus returns 200 with the expected metric name.
/actuator/prometheus is NOT reachable on the public port.
api-performance.json via Grafana's Dashboard → Import UI; confirm no validation errors.
promtool test rules rules_test.yml in CI. Catches the case where someone tweaks the expression and accidentally disables the alert.
Open Decisions
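The "expected metric name" check above could be backed by a small assertion over the scrape body. The sketch below is a hypothetical, dependency-free Java helper that scans Prometheus text exposition output for a sample of a given metric whose uri label has an exact value. The class name, the helper, and the sample scrape text are assumptions for illustration; fetching the body from the management port is left to whatever test client the project already uses.

```java
/** Illustrative only: checks a Prometheus text-format scrape body for an expected series. */
final class PrometheusScrapeCheckSketch {

    /**
     * Returns true if the body contains a sample line for the given metric name
     * whose uri label equals the given value exactly.
     * Simplification: label values containing escaped quotes are not handled.
     */
    static boolean hasSeries(String scrapeBody, String metric, String uri) {
        String expectedLabel = "uri=\"" + uri + "\"";
        for (String line : scrapeBody.split("\\R")) {
            if (line.startsWith("#")) {
                continue; // skip HELP / TYPE comment lines
            }
            if (line.startsWith(metric + "{") && line.contains(expectedLabel)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written sample; a real scrape carries more labels and more series.
        String body = "# TYPE http_server_requests_seconds histogram\n"
                + "http_server_requests_seconds_count{method=\"GET\",status=\"200\","
                + "uri=\"/api/dashboard/activity\"} 12\n";
        System.out.println(hasSeries(body, "http_server_requests_seconds_count",
                "/api/dashboard/activity"));
    }
}
```

Because the label is matched including its closing quote, a raw value such as /api/dashboard/activity?limit=40 would not satisfy a check for /api/dashboard/activity, so the same helper also gives a rough signal that URI templating is working.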
👨💻 Felix Brandt — Senior Fullstack Developer
No concerns from my angle: this is observability wiring, not application code. Checked: the backend changes needed are Spring Boot / Actuator configuration, which is Tobias's lane; the only "app code" touch is adding micrometer-registry-prometheus to pom.xml and an application-prod.yaml snippet, both of which are one-line mechanical changes with no business logic implications.
Happy to write the actuator_prometheus_exposes_http_server_requests + actuator_prometheus_returns_404_on_public_port tests Sara outlined; those go in the existing @WebMvcTest slice pattern we use for DashboardControllerTest and require no new test infrastructure.
🎨 Leonie Voss — UX/Design Lead
No concerns from my angle — this issue is internal operator tooling, not a user-facing surface. Checked: no spec reference, no UI, no user workflow. Grafana has its own design system that the family archive team cannot and should not influence.
One operational UX note since Tobias mentioned a runbook: when that runbook gets written, keep the voice friendly and instructive rather than terse. The "user" of that runbook is the same Marcel who debugs at 23:00 on a Sunday — future-you deserves clear steps, not telegram-style bullet points.
🗳️ Decision Queue — Action Required
1 decision needs your input before implementation starts.
Infrastructure
/api/dashboard/activity, and the p95 alert. Ships fast.
/api/dashboard/resume and /api/dashboard/pulse, JVM metrics panel, log aggregation check. 1–2 days more work.