fix(obs): wire Prometheus metrics and Loki job label for Grafana dashboards #606

Merged
marcel merged 9 commits from worktree-fix+issue-604-obs-wiring into main 2026-05-16 16:00:36 +02:00

9 Commits

Author SHA1 Message Date
Marcel
e61409773e docs(c4): fix Tempo OTLP transport in l2-containers diagram
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m1s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m37s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 58s
CI / Unit & Component Tests (push) Successful in 3m1s
CI / OCR Service Tests (push) Successful in 16s
CI / Backend Unit Tests (push) Successful in 2m38s
CI / fail2ban Regex (push) Successful in 39s
CI / Compose Bucket Idempotency (push) Successful in 58s
Port 4317 is gRPC; the backend uses HttpExporter (HTTP/1.1) and sends
to port 4318. Update Container description and Rel label to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:48:06 +02:00
Marcel
7713a03cd5 docs(obs): add OBSERVABILITY.md developer guide and fix stale env var docs
- New docs/OBSERVABILITY.md: developer-facing guide with a "where to look
  for what" table, common LogQL queries, trace exploration workflow,
  log→trace correlation via traceId links, and a signal summary table
- Link from DEPLOYMENT.md §4 (ops section now points to dev guide) and
  from CLAUDE.md Infrastructure section
- Fix stale DEPLOYMENT.md env var table: OTEL_EXPORTER_OTLP_ENDPOINT
  now documents port 4318 (HTTP) not 4317 (gRPC); add the three new
  env vars wired in this PR (OTEL_LOGS_EXPORTER, OTEL_METRICS_EXPORTER,
  MANAGEMENT_METRICS_TAGS_APPLICATION) with their rationale
- Fix stale obs-tempo service description (port 4318, not 4317)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:48:06 +02:00
Marcel
cea94ce260 fix(obs): disable OTLP metric export (Prometheus scrapes pull-model)
Tempo only handles traces; sending metrics to /v1/metrics returns 404.
Prometheus already scrapes Spring Boot metrics via the pull-model at
/actuator/prometheus, so OTLP metric push is redundant and noisy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
45a992f5a8 fix(obs): fix OTLP transport port and add application metrics tag
- Change OTEL default endpoint from port 4317 (gRPC) to 4318 (HTTP) to
  match Spring Boot's HttpExporter; sending HTTP/1.1 to a gRPC listener
  caused "Connection reset" errors
- Add otel.logs.exporter=none: Promtail captures Docker logs via the
  logging driver; sending logs to Tempo's OTLP endpoint (which only
  handles traces) produced 404 errors
- Add management.metrics.tags.application to every metric so Grafana's
  Spring Boot Observability dashboard (ID 17175) can filter by the
  application label_values() template variable
- Add MANAGEMENT_METRICS_TAGS_APPLICATION and OTEL_LOGS_EXPORTER env
  vars to docker-compose.prod.yml; production Tempo endpoint already
  uses 4318
- Add MANAGEMENT_TRACING_SAMPLING_PROBABILITY to prod compose with
  0.1 default to avoid 100% trace sampling in production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
bd57310bbf docs(obs): document promtail job label mapping in DEPLOYMENT.md
The job label (derived from the Docker Compose service name) is what
powers {job="backend"} queries in Loki dashboards and populates the
Grafana "App" variable dropdown. Operators need to know this mapping
when writing custom Loki queries.

Addresses @markus non-blocker suggestion from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
c2d092f435 docs(adr): add ADR-017 — Spring Boot 4.0 management port shares main security filter chain
Documents the architectural decision behind the dedicated management
SecurityFilterChain, the discovery that SB4+Jetty removed the isolated
management child-context security, and the consequences for actuator
endpoint exposure.

Addresses @markus blocker from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
e19bd60984 fix(obs): add management security chain and split Prometheus IT tests
- Add @Order(1) managementFilterChain scoped to /actuator/** with explicit
  401 entry point, blocking all non-public actuator paths without the
  form-login redirect that the main chain uses for browser clients.
- Split single combined test into two focused assertions
  (prometheus_endpoint_returns_200_without_credentials,
   prometheus_endpoint_returns_jvm_metrics).
- Add negative regression test: actuator_metrics_requires_authentication
  verifies that /actuator/metrics returns 401 without credentials.

Addresses reviewer concerns from @sara (missing negative test, split
assertions) and @nora (dedicated management security layer).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
2aa0ff9e70 fix(obs): wire Prometheus endpoint for Spring Boot 4.0
Four Spring Boot 4.0-specific issues prevented /actuator/prometheus from working:

1. spring-boot-starter-micrometer-metrics missing — Spring Boot 4.0 splits
   Micrometer metrics export (including the Prometheus scrape endpoint) out of
   spring-boot-starter-actuator into its own starter. Added dependency.

2. management.prometheus.metrics.export.enabled not set — Spring Boot 4.0
   defaults metrics export to false (opt-in). Added the property to
   application.yaml.

3. SecurityConfig did not permit /actuator/prometheus — Spring Boot 4.0
   with Jetty serves the management port (8081) via the same security filter
   chain as the main port (8080). The previous commit's exclusion of
   ManagementWebSecurityAutoConfiguration was a no-op (that class no longer
   exists in Spring Boot 4.0); removed it and added the correct permitAll()
   rule. Updated the architecture comment in application.yaml to reflect the
   true filter-chain behaviour.

4. Reverted invalid FamilienarchivApplication.java change from the prior
   commit (ManagementWebSecurityAutoConfiguration import compiled against a
   class that does not exist in the Spring Boot 4.0 BOM).

Also adds ActuatorPrometheusIT — an integration test that asserts the
/actuator/prometheus endpoint returns 200 with jvm_memory_used_bytes without
credentials, serving as regression protection against future Spring Boot
upgrades silently breaking metrics collection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00
Marcel
5dd74df293 fix(obs): wire Prometheus metrics and Loki job label for Grafana dashboards
Three root causes confirmed via live server investigation (issue #604):

1. ManagementWebSecurityAutoConfiguration applied HTTP Basic auth to the
   management port (8081), causing Prometheus to receive 401 HTML responses
   instead of metrics. Excluded the auto-config — the Docker network
   (archiv-net) provides the security boundary for this internal port.

2. promtail-config.yml had no `job` relabel rule. Grafana's Loki dashboards
   query {job="$app"} which matched nothing; logs were in Loki under
   compose_service but invisible to every dashboard panel.

3. prometheus.yml had a stale comment claiming the spring-boot target would
   be DOWN until micrometer-registry-prometheus was added — it has been
   present in pom.xml for some time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:46:45 +02:00