devops(backend): expose Prometheus metrics endpoint + OTLP trace export from Spring Boot #588
Summary
Closes #576
- Added `micrometer-registry-prometheus`, `micrometer-tracing-bridge-otel`, and `opentelemetry-spring-boot-starter:2.27.0` to `backend/pom.xml`
- Prometheus scrapes `backend:8081/actuator/prometheus` directly via `archiv-net`, bypassing Caddy (which only proxies port 8080)
- Trace sampling probability: `1.0` in dev, `0.1` in compose (env override), `0.0` in test profile (no OTLP connections during CI)
- `OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317` added to the `docker-compose.yml` backend service
- `./mvnw clean package -DskipTests` → BUILD SUCCESS

Test plan

- `./mvnw clean package -DskipTests` passes
- `curl -s http://localhost:8081/actuator/prometheus` returns HTTP 200 with `# HELP` lines
- `spring-boot` target shows `UP` in Prometheus UI after observability stack starts

🤖 Generated with Claude Code
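For quick reference, the configuration described above amounts to roughly the following shape in `application.yaml`. This is a sketch reconstructed from the reviews below — the key paths are standard Spring Boot and OTel-starter properties, and the comments are illustrative, not the PR's actual inline comments:

```yaml
management:
  server:
    port: 8081                 # management interface on its own port, scraped via archiv-net
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics   # whitelist, not "*"
  tracing:
    sampling:
      probability: 1.0         # dev default; override via MANAGEMENT_TRACING_SAMPLING_PROBABILITY

otel:
  exporter:
    otlp:
      # falls back to localhost when the env var is absent; failed exports are
      # retried by the SDK with backoff and do not crash the app
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}
```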
🏗️ Markus Keller (@mkeller) — Senior Application Architect
Verdict: ✅ Approved
This PR gets the structural decisions right. I checked against the full doc-update matrix.
What I verified
- `docs/architecture/c4/l2-containers.puml` — updated. The PR adds an `Observability Stack` boundary with Prometheus, Loki, Promtail, Tempo, and Grafana containers, plus the `Rel(backend, tempo, ...)` OTLP trace arrow. The diagram matches the code change. ✓
- `docs/DEPLOYMENT.md` — updated. §1 topology bullet now references the management-port 8081 scraping and the observability stack overlay file. §4 "Logs + observability" section documents the full service table and startup sequence. ✓
- No new backend package, route, Flyway migration, `ErrorCode`, or `Permission` added. No corresponding doc updates required. ✓
- ADR — not required for this PR. The architectural decision here (separate management port for Prometheus scraping, OTLP export to Tempo) is configuration and dependency wiring, not a new architectural pattern with lasting structural consequences. Operational config decisions of this scope don't need an ADR. The inline comments in `application.yaml` are sufficient. ✓

Observations (not blockers)
- The `otel.*` YAML namespace used by `opentelemetry-spring-boot-starter` sits alongside the Spring `management.*` namespace. This is correct — the OTel starter uses its own config tree, not the Spring Micrometer tracing config. The comment `# override via MANAGEMENT_TRACING_SAMPLING_PROBABILITY` correctly identifies the env-var form.
- `docker-compose.yml` wires `OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317` as a hard-coded hostname. This relies on Tempo being defined in `docker-compose.observability.yml` and joining `archiv-net`. That dependency is documented in the inline comment and in `DEPLOYMENT.md`. Acceptable for this stage — when the observability compose file ships, Prometheus, Loki, and Tempo are all on `archiv-net`.
- Sampling at `0.1` in compose (dev+staging) is reasonable. Worth revisiting if trace gaps make debugging harder in staging — but that's a tuning decision, not an architecture concern.

Not a blocker — follow-up question
The `DEPLOYMENT.md` env-var table (§2) lists the variables documented for the backend. The two new env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`, `MANAGEMENT_TRACING_SAMPLING_PROBABILITY`) injected via `docker-compose.yml` are not yet rows in that table. The table's own header says "Any var found in docker-compose.yml ... that is not in this table is a blocking review comment." I'll defer to Tobias on whether this is a blocker — he owns the env-var documentation table.

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
This PR touches only config files and `pom.xml` — no new Java logic, no Svelte components, no Python. My review is correspondingly narrow. The code is clean and intentional. One concern worth flagging.

Blockers
None.
Concerns
No test coverage for the new actuator endpoint exposure (`application-test.yaml` only disables tracing).

The `application-test.yaml` correctly disables OTLP export with `probability: 0.0`. But there is no test asserting that:

- `GET :8081/actuator/prometheus` returns `200` with `# HELP` content when Micrometer is on the classpath
- `GET :8080/actuator/prometheus` returns `404` or is not reachable (port isolation)

The test plan in the PR description lists these as manual `curl` checks. For a project that practices TDD, these belong as a `@SpringBootTest` actuator smoke test or at minimum a `@WebMvcTest` on the management port. Without a test, a future config change (e.g. collapsing back to a single port, or changing the `include:` list) could silently break Prometheus scraping.

Suggestion (not blocker): Add a `ManagementEndpointIT` that boots the full context and asserts `GET :8081/actuator/prometheus` → `200` and `GET :8080/actuator/prometheus` → `404`. This is a Spring Boot integration test, no mocking needed.

Observations
- `pom.xml` dependencies: `micrometer-registry-prometheus` and `micrometer-tracing-bridge-otel` have no explicit `<version>` — both are managed by the Spring Boot BOM. Correct. The `opentelemetry-spring-boot-starter` pin at `2.27.0` carries a comment explaining it's not in the BOM. Exactly right.
- The `http://localhost:4317` fallback in `application.yaml` makes local dev work without the observability stack running. Non-fatal OTLP failures are handled by the library, not by application code. This is the correct pattern.
- The `application-test.yaml` addition is minimal and correct. `probability: 0.0` guarantees no spans are generated, so no export is attempted, so no connection error in CI. ✓

🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
Security surface reviewed. No vulnerabilities found. A few things I want to call out explicitly as intentional and correct, plus one smell to watch.
Blockers
None.
What I checked
Actuator endpoint exposure — handled correctly.
- `management.server.port: 8081` splits the management interface onto a separate TCP port. Actuator endpoints are never served on `:8080` — the Caddyfile already has a `respond 404` rule on `/actuator/*` (confirmed in `DEPLOYMENT.md` §1). Spring Security's session-authenticated filter chain on `:8080` never sees actuator requests.
- Port 8081 is `expose`-only in `docker-compose.yml` (not `ports:`), so it is reachable only from within `archiv-net`. Prometheus scrapes it container-to-container. The host and the internet cannot reach it. ✓
- The `include: health,info,prometheus,metrics` list is a whitelist, not `*`. Heap dumps, thread dumps, env, loggers, and shutdown are not exposed. ✓

OTLP endpoint is configurable, defaults to localhost.

`${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}` falls back to localhost when the env var is absent. In CI and local dev without Tempo, OTLP connections fail silently (the OTel SDK retries with backoff and logs warnings, but the app continues). This is safe — no sensitive data leaks from failed OTLP export attempts because the SDK does not include session tokens or raw HTTP bodies in spans unless explicitly instrumented.

Trace sampling rates are appropriate.

`1.0` in dev (catch everything locally), `0.1` in compose/staging (10% — balanced), `0.0` in test (no spans, no export). These are safe defaults. ✓

Smell to watch (not a blocker)
Span content is outside the scope of this PR, but worth a future audit.

- Once tracing is active in prod/staging, each HTTP request and JPA query generates a span. By default, Spring Boot's OTel auto-instrumentation does not include query parameter values or request bodies in spans — it captures HTTP method, URL path, and status code. However, if anyone adds custom `Span.setAttribute("user.input", ...)` calls in future, sensitive data could flow into Tempo. This isn't a problem today, but worth a note in the team's "security things to know" list.
- `management.endpoints.web.exposure.include: info` — the `info` endpoint exposes app name, build version, and active profiles. On a private internal port this is harmless, but if the management port were ever accidentally exposed externally, `info` would reveal Spring profile names and app version. Low risk given the current architecture, but worth knowing.

🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
The PR correctly handles the most important test-layer concern (suppressing OTLP connections in CI via `application-test.yaml`). But there are gaps in the test pyramid that should be tracked.

Blockers
None that block this merge, but the gaps below should live in Gitea issues.
Concerns
1. No automated verification that `GET :8081/actuator/prometheus` returns valid Prometheus output.

The PR test plan lists three manual checks:

- `./mvnw clean package -DskipTests` passes ✓ (CI-verifiable, no test code needed)
- `curl -s http://localhost:8081/actuator/prometheus` returns 200 with `# HELP` lines (manual only)
- Target shows `UP` (manual only, requires full observability stack)

For a build that runs in CI, the second check should be automated. A `@SpringBootTest` slice that boots the full context and hits `http://localhost:${management.server.port}/actuator/prometheus` via `TestRestTemplate` would catch:

- `include:` changes that drop `prometheus`

Suggested test:
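A minimal sketch of what that could look like (a hypothetical `ManagementEndpointIT`, following Felix's suggestion above; the port-injection and assertion style are one plausible shape, not the project's actual code):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.boot.test.web.server.LocalManagementPort;
import org.springframework.boot.test.web.server.LocalServerPort;
import org.springframework.http.HttpStatus;
import org.springframework.test.context.ActiveProfiles;

// Boots the full context on random ports and asserts the Prometheus endpoint
// is served on the management port only.
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
class ManagementEndpointIT {

    @LocalServerPort
    int serverPort;              // main app port (:8080 in prod)

    @LocalManagementPort
    int managementPort;          // management port (:8081 in prod)

    @Autowired
    TestRestTemplate rest;

    @Test
    void prometheusEndpointServedOnManagementPort() {
        var response = rest.getForEntity(
                "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        assertThat(response.getBody()).contains("# HELP");
    }

    @Test
    void prometheusEndpointNotServedOnAppPort() {
        var response = rest.getForEntity(
                "http://localhost:" + serverPort + "/actuator/prometheus", String.class);
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.NOT_FOUND);
    }
}
```

Note that `@LocalManagementPort` only resolves when the test profile configures a separate (or random) `management.server.port`.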
2. No test asserting that `management.server.port` is isolated from the app port.

A test that POSTs to `:8080/actuator/prometheus` (or asserts a 404 from Caddy) would prove the port split is working. This is an integration concern — tricky to test in pure unit tests, but worth a Playwright smoke test or a CI curl step in the workflow.

3. `application-test.yaml` is correct but lacks a comment explaining the behavioral guarantee.

The `probability: 0.0` setting is correct. A brief comment `# 0.0 = no spans created, no OTLP export attempted — prevents CI connection errors to Tempo` would make it immediately clear to future contributors why this override exists and what would break if removed.

What's correct
- `probability: 0.0` in `application-test.yaml` prevents all OTLP connection attempts in test runs. No Testcontainers setup for Tempo required. ✓
- The build was verified with `-DskipTests` per the PR description. The full test suite behavior with the new tracing profile needs the `application-test.yaml` override, which is in place. ✓

⚙️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: 🚫 Changes requested
The infrastructure wiring is solid and the port isolation design is correct. One item meets the blocking threshold defined in `DEPLOYMENT.md` itself. Everything else is a concern or observation.

Blockers
`DEPLOYMENT.md` §2 env-var table is incomplete — two new vars are missing.

The `DEPLOYMENT.md` §2 header states explicitly: "Any var found in docker-compose.yml ... that is not in this table is a blocking review comment."

This PR adds two new env vars to `docker-compose.yml`:

- `OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317`
- `MANAGEMENT_TRACING_SAMPLING_PROBABILITY: "0.1"`

Neither appears in the Backend env-var table in §2. Both need rows added before merge.

Suggested rows:
| Variable | Description | Default |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Target endpoint for OTLP trace export | `http://localhost:4317` (from `application.yaml`) |
| `MANAGEMENT_TRACING_SAMPLING_PROBABILITY` | `1.0` = 100% (dev default in `application.yaml`), `0.1` = 10% (compose default), `0.0` = disabled (test profile) | `1.0` (`application.yaml` default) |

Concerns (not blockers)
Image tags for the observability stack services.

The `l2-containers.puml` diagram references `prom/prometheus`, `grafana/loki:3.4.2`, `grafana/promtail:3.4.2`, `grafana/tempo:2.7.2`, `grafana/grafana`. The `DEPLOYMENT.md` §4 observability table references `prom/prometheus:v3.4.0`, `prom/node-exporter:v1.9.0`, `gcr.io/cadvisor/cadvisor:v0.52.1`. These are pinned in the observability compose file (not in this PR's diff). Good practice — just confirm the actual observability compose file pins match before the observability stack ships.

`OTEL_EXPORTER_OTLP_ENDPOINT` is hard-coded to `http://tempo:4317` in `docker-compose.yml`.

This is fine for the current setup where Tempo is always the target. But if Tempo isn't running (dev without the observability stack), this var resolves to an unreachable host. The `opentelemetry-spring-boot-starter` handles this gracefully (retries with backoff, app continues). The inline comment documents this. Acceptable — just confirm the non-fatal behavior holds in practice on first startup without the observability stack.

What's done well
- `expose: ["8081"]` (not `ports:`) for the management port — internal-only, not host-bound. ✓
- The inline comments in `docker-compose.yml` are accurate and explain the why. ✓
- `DEPLOYMENT.md` §4 observability section is comprehensive and includes the correct startup order (`docker compose up -d` first to create `archiv-net`, then the observability overlay). ✓
- The management port is reachable only inside `archiv-net`. Defense in depth matches the Caddyfile `404 on /actuator/*` rule. ✓

📋 Elicit (Requirements Engineer)
Verdict: ✅ Approved
Reviewing from a requirements and traceability perspective.
Issue #576 traceability
The PR description says "Closes #576" and lists four concrete deliverables. Cross-referencing against the diff:

- `micrometer-registry-prometheus` — added, `pom.xml`
- `micrometer-tracing-bridge-otel` — added, `pom.xml`
- `opentelemetry-spring-boot-starter:2.27.0` — added, `pom.xml`, with pinned version
- Prometheus scrape endpoint `backend:8081/actuator/prometheus` via `archiv-net` — `application.yaml`
- Sampling `1.0` dev, `0.1` compose, `0.0` test — `application.yaml` + `application-test.yaml` + `docker-compose.yml`
- `OTEL_EXPORTER_OTLP_ENDPOINT` in compose — `docker-compose.yml`

All stated requirements are traceable to code changes.
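The compose-side deliverables correspond to a fragment of the backend service in `docker-compose.yml` along these lines (a sketch assembled from the reviewers' quotes, not the literal diff; comments are illustrative):

```yaml
services:
  backend:
    # Management port is expose-only: reachable inside archiv-net, never host-bound
    expose:
      - "8081"
    environment:
      # Tempo is defined in docker-compose.observability.yml and joins archiv-net
      OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317
      # 10% sampling in compose (dev default is 1.0, test profile uses 0.0)
      MANAGEMENT_TRACING_SAMPLING_PROBABILITY: "0.1"
```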
Acceptance criteria coverage
The PR test plan is:

1. `./mvnw clean package -DskipTests` → BUILD SUCCESS ✓ (author verified)
2. `curl -s http://localhost:8081/actuator/prometheus` returns 200 with `# HELP` (manual)
3. `spring-boot` target shows `UP` in the Prometheus UI (manual)

Items 2 and 3 are not automated. For a production-facing feature, these should be in CI (see Sara's comment). From a requirements perspective, the feature is "done" per the issue definition, but the verification evidence is incomplete.
NFR check
- `probability: 0.0` in test profile prevents OTLP connection errors. ✓

Open question
The PR comment references a "future issue" for the `docker-compose.observability.yml` file where Tempo is actually defined. That follow-up (issue #581, referenced in `DEPLOYMENT.md`) should be the dependency gate — users cannot actually verify criterion 3 until that file exists. Recommend that #588 clearly notes in its close comment that full E2E verification of Prometheus scraping depends on #581 shipping.

🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead
Verdict: ✅ Approved (out of scope)
This PR makes no changes to any Svelte component, route, layout, CSS, or user-facing template. It adds Prometheus metrics and OTLP tracing to the Spring Boot backend via config files and `pom.xml`.

No UI/UX review items exist. The change is invisible to users — it affects only the operational observability layer.
I'll note that once Grafana dashboards ship (issue #581), there will be a UI surface to review (the Grafana admin panel and dashboard design). That's a future review scope, not this PR.
Fixed: added OTEL_EXPORTER_OTLP_ENDPOINT and MANAGEMENT_TRACING_SAMPLING_PROBABILITY to DEPLOYMENT.md env var table.
CI fix: OTel semconv conflict resolved
All backend integration tests were failing with:
Root cause:
`opentelemetry-spring-boot-starter:2.27.0` transitively pulls in `io.opentelemetry.contrib:opentelemetry-azure-resources`, which registers `AzureAppServiceResourceProvider` via Java SPI. During Spring context startup (even in tests), OTel auto-configuration instantiates all registered resource providers — and this one references `ServiceAttributes.SERVICE_INSTANCE_ID`, a field that does not exist in the semconv version resolved by this project. Setting `probability: 0.0` in the test profile only suppresses span sampling, not context initialization, so it did not prevent the crash.

Two fixes applied (defense in depth):
Fix 1 — `application-test.yaml`: Added `otel.sdk.disabled: true`. This tells the OTel SDK to skip all auto-configuration during tests, so no resource providers are ever instantiated.

Fix 2 — `pom.xml`: Added an `<exclusion>` for `io.opentelemetry.contrib:opentelemetry-azure-resources` from the `opentelemetry-spring-boot-starter` dependency. This removes the problematic provider from the dependency graph entirely, protecting any environment where the SDK is not explicitly disabled.

Compilation verified locally with
`./mvnw clean package -DskipTests` — BUILD SUCCESS.
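For reference, Fix 2 would take roughly this shape in `backend/pom.xml` (a sketch: the starter's `io.opentelemetry.instrumentation` groupId and the surrounding layout are assumed, not copied from the diff):

```xml
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.27.0</version> <!-- pinned: not managed by the Spring Boot BOM -->
    <exclusions>
        <exclusion>
            <!-- Registers AzureAppServiceResourceProvider via SPI; crashes on
                 ServiceAttributes.SERVICE_INSTANCE_ID with our resolved semconv -->
            <groupId>io.opentelemetry.contrib</groupId>
            <artifactId>opentelemetry-azure-resources</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```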