devops(backend): expose Prometheus metrics endpoint + OTLP trace export from Spring Boot #576

Closed
opened 2026-05-14 15:04:35 +02:00 by marcel · 9 comments
Owner

Context

Spring Boot Actuator is already active (the Docker healthcheck calls /actuator/health). This issue adds two new capabilities to the backend:

  1. Prometheus scrape endpoint — /actuator/prometheus so Prometheus can pull JVM, HTTP, and database metrics
  2. OpenTelemetry trace export — every inbound HTTP request produces a trace sent to Tempo via OTLP gRPC

Both are wired through Micrometer, which is already the standard in Spring Boot 3+/4.

Depends on: Tempo issue (Tempo must exist in the compose network for traces to land somewhere, though OTLP failures are non-fatal)

Part A — Prometheus metrics

backend/pom.xml — add dependency

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <!-- version managed by Spring Boot BOM -->
</dependency>

backend/src/main/resources/application.yml — expose the endpoint

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    prometheus:
      enabled: true

The /actuator/prometheus endpoint must not be reachable through the Caddy reverse proxy (Prometheus scrapes it directly from inside archiv-net). Verify that application-prod.yml still blocks actuator exposure via the public interface, or that Caddy continues to block /actuator/*.
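
A hypothetical sketch of such a Caddyfile rule (the @actuator matcher name is taken from the review discussion below; the project's actual Caddyfile may differ):

# Caddyfile excerpt (sketch, not the project's actual config)
@actuator path /actuator/*
respond @actuator 404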

Part B — OpenTelemetry tracing

backend/pom.xml — add dependencies

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>

backend/src/main/resources/application.yml

management:
  tracing:
    sampling:
      probability: 1.0   # 100% in dev; set to 0.1 in prod via env override

otel:
  service:
    name: familienarchiv-backend
  exporter:
    otlp:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}

docker-compose.yml — add env var to backend service

environment:
  OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317

The default http://localhost:4317 ensures the app starts cleanly when no observability stack is running (e.g., in CI). OTLP export failures are non-fatal by default in OpenTelemetry.

Acceptance Criteria

  • curl -s http://localhost:8080/actuator/prometheus returns HTTP 200 with Prometheus text format (lines beginning with # HELP and # TYPE)
  • The spring-boot scrape target in Prometheus UI (http://localhost:9090/targets) shows state UP
  • JVM metrics are present: curl -s http://localhost:9090/api/v1/query?query=jvm_memory_used_bytes returns data
  • HTTP request metrics are present: curl -s http://localhost:9090/api/v1/query?query=http_server_requests_seconds_count returns data after making any API call
  • After making an API request, a trace appears in Tempo: query http://localhost:3200/api/search and confirm familienarchiv-backend service is listed
  • ./mvnw test passes — no test regressions
  • Application starts cleanly in CI (no observability stack) — OTLP failure is logged as a warning, not a startup error

Definition of Done

  • All acceptance criteria checked
  • pom.xml and application.yml changes committed
  • docker-compose.yml updated with OTEL_EXPORTER_OTLP_ENDPOINT
  • Committed on a feature branch, PR opened against main
marcel added this to the Observability Stack — Grafana LGTM + GlitchTip milestone 2026-05-14 15:04:35 +02:00
marcel added the P2-medium, devops, phase-7: monitoring labels 2026-05-14 15:06:11 +02:00
Author
Owner

🏗️ Markus Keller — Application Architect

Observations

  • The issue is well-scoped: both additions (Prometheus scrape endpoint, OTLP traces) are config/dependency changes only — no new domain code, no layer violations. That's the right shape for this work.
  • The DEPLOYMENT.md already documents the intended topology: Prometheus scrapes the backend on port 8081 (management port, not 8080), and Caddy blocks /actuator/* externally. The issue's acceptance criteria, however, describe scraping on localhost:8080/actuator/prometheus — that contradicts the documented production architecture.
  • The issue implicitly leaves management.server.port at 8080 (same as the app port). The production docs say management should be on port 8081 so the main port handles only app traffic and the management port is never routed through Caddy. This split should be reflected in the YAML snippets.
  • opentelemetry-spring-boot-starter is not version-managed by the Spring Boot 4 BOM — unlike micrometer-tracing-bridge-otel, which is. The issue omits the <version> tag for opentelemetry-spring-boot-starter. This will cause a build failure without an explicit version or a <dependencyManagement> entry (a BOM-import sketch follows this list). Check the OpenTelemetry Spring Boot starter release matrix for Spring Boot 4 compatibility before implementing.
  • The docs/architecture/c4/l2-containers.puml diagram needs to be updated: Prometheus and Tempo are new containers in the topology. Per the architect's update table, new Docker services require both l2-containers.puml and docs/DEPLOYMENT.md updates.
  • No new ADR is needed here — this is wiring existing infrastructure standards (Micrometer, OTLP), not a novel architectural decision.
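
A sketch of the <dependencyManagement> route mentioned above, assuming the OTel instrumentation BOM artifact (io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom); the exact version must still come from the release matrix:

<!-- sketch: importing the OTel instrumentation BOM makes the starter version managed -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>io.opentelemetry.instrumentation</groupId>
            <artifactId>opentelemetry-instrumentation-bom</artifactId>
            <version>2.x.y</version> <!-- pin per the release matrix -->
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>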

Recommendations

  • Split the management port in application.yml:
    management:
      server:
        port: 8081
      endpoints:
        web:
          exposure:
            include: health,info,prometheus,metrics
    
    This aligns with the production topology already documented in DEPLOYMENT.md and ensures the Caddy respond @actuator 404 rule never needs to cover port 8081 (since it's not routed through Caddy at all).
  • Fix the acceptance criterion for the Prometheus scrape target to use port 8081, not 8080, to match the actual topology.
  • Look up the correct version for opentelemetry-spring-boot-starter against the Spring Boot 4 / OTel instrumentation release matrix. As of early 2026, Spring Boot 4 requires OpenTelemetry instrumentation 2.x — pin the exact version explicitly in pom.xml.
  • Update docs/architecture/c4/l2-containers.puml to add Prometheus and Tempo as containers. This is a doc obligation on the architecture update table, not optional.
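
A rough C4-PlantUML sketch of that diagram update; the element aliases (backend) and exact macro usage are assumptions against the existing l2-containers.puml:

' sketch for docs/architecture/c4/l2-containers.puml (aliases assumed)
Container(prometheus, "Prometheus", "prom/prometheus", "Scrapes backend metrics from :8081/actuator/prometheus")
Container(tempo, "Tempo", "grafana/tempo", "Receives OTLP traces on gRPC :4317")
Rel(prometheus, backend, "Scrapes metrics", "HTTP :8081")
Rel(backend, tempo, "Exports traces", "OTLP gRPC :4317")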
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • The implementation is entirely config and POM changes — no business logic involved. Clean scope.
  • The opentelemetry-spring-boot-starter dependency in the issue body is missing a <version> tag. The Spring Boot BOM manages micrometer-tracing-bridge-otel, but NOT the OTel Spring Boot starter (it lives in the io.opentelemetry.instrumentation group, outside the Spring BOM). A build will fail without an explicit version.
  • The ApplicationContextTest loads the full Spring context with @SpringBootTest. Adding the OTel autoconfiguration may cause this test to attempt an OTLP connection at startup (depending on how the starter wires its exporters). The issue's AC says "Application starts cleanly in CI — OTLP failure is logged as a warning, not a startup error." That behavior needs verification — it depends on whether the Spring Boot 4 + OTel starter combination fails fast or degrades gracefully when the OTLP endpoint is unreachable.
  • The application-dev.yaml currently only has spring.jpa.show-sql: true and springdoc overrides. Adding otel.* config to application.yaml with a ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317} default is correct — that default handles the no-observability-stack case.
  • The tracing sampling probability: the issue sets 1.0 in application.yml (base config) with a note to override to 0.1 in prod. This should live in application.yml as 1.0 for dev and be overridden via environment variable MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1 in the production compose. Avoid a separate application-prod.yml profile — the team doesn't use that profile pattern (only dev is defined).

Recommendations

  • Verify OTel starter graceful failure before merging. Write a test or at minimum do a manual check that ./mvnw test passes end-to-end with the starter present and no OTLP endpoint reachable. The ApplicationContextTest uses WebEnvironment.NONE, so HTTP context is absent — confirm the OTel exporter initialization doesn't block startup in that mode. (A test-profile kill-switch sketch follows this list.)
  • Pin the OTel starter version explicitly. Check https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases for the version compatible with Spring Boot 4, then add:
    <dependency>
        <groupId>io.opentelemetry.instrumentation</groupId>
        <artifactId>opentelemetry-spring-boot-starter</artifactId>
        <version>2.x.y</version> <!-- pin exact version here -->
    </dependency>
    
  • Override sampling in docker-compose.yml rather than a prod profile:
    environment:
      MANAGEMENT_TRACING_SAMPLING_PROBABILITY: "0.1"
    
    This is consistent with how all other prod config is managed in this project (env var overrides, not profile YAML files).
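
On the test-isolation point above, a supplementary knob worth knowing: the OTel starter exposes an SDK kill switch. A minimal sketch, assuming the pinned starter version honors the standard otel.sdk.disabled property (verify before relying on it):

# backend/src/test/resources/application-test.yaml (sketch)
otel:
  sdk:
    disabled: true   # turns the whole OTel SDK into a no-op, so no exporter is created at all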
Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Observations

Prometheus endpoint access control (CWE-200: Exposure of Sensitive Information)

The current SecurityConfig has this rule:

auth.requestMatchers("/actuator/health").permitAll();

Only /actuator/health is permitted without authentication. The new /actuator/prometheus endpoint will be protected by the standard "authenticated" catch-all — meaning it requires a valid session to scrape. Prometheus does not support session-based auth.

Two options exist: (1) add /actuator/prometheus to permitAll() in SecurityConfig, or (2) move management endpoints to a separate port (8081) and configure Spring Security to allow unauthenticated access on the management port only. Option 2 is strongly preferred — it exposes the scrape endpoint only on the internal Docker network port, never through the session-authenticated main API.

The issue body says "The /actuator/prometheus endpoint must not be reachable through the Caddy reverse proxy" — but says nothing about how Spring Security will handle the scrape request. This is a gap.

OTLP export — data exposure

OTLP traces will include HTTP request attributes: URL paths, HTTP method, status codes, and potentially query parameters. Spring Boot's OTel auto-instrumentation does not sanitize query parameters from span attributes by default. If any API endpoint includes sensitive data in query parameters (e.g., search terms, file names), those values will appear in Tempo traces.

Heap/environment exposure is not the concern here — the issue correctly blocks /actuator/* via Caddy and limits exposure to health, info, prometheus, metrics. /actuator/heapdump and /actuator/env are not included. This is good.

Recommendations

  • Use the management port split (8081) rather than permitAll in SecurityConfig. Configure:
    management:
      server:
        port: 8081
    
    Then Spring Boot's management endpoints run on a separate port that is not routed through either Caddy or the session-authenticated security filter chain. No changes needed to SecurityConfig — management port is unreachable from the public internet by network topology.
  • Verify OTel span attribute sanitization. Check whether any GET endpoints accept sensitive query parameters. If so, configure the OTel SDK to redact them:
    otel:
      instrumentation:
        http:
          server:
            capture-request-parameters: []  # default; do not add sensitive params here
    
    The default (empty list) is safe — query params are not captured in span attributes unless explicitly listed.
  • Add a security test confirming /actuator/prometheus returns 401 or is unreachable without credentials on the main port (8080). This prevents a future SecurityConfig change from accidentally reopening it.
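
A minimal sketch of such a test, assuming the 8081 management split is in place; the class name is hypothetical and import packages follow current Spring Boot conventions:

import static org.assertj.core.api.Assertions.assertThat;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

@SpringBootTest
@AutoConfigureMockMvc
class ActuatorSecurityTest {

    @Autowired
    MockMvc mockMvc;

    @Test
    void prometheusEndpointIsNotServedOnMainPort() throws Exception {
        // With management.server.port=8081, the main port must not answer 200 here;
        // expect 401 (blocked by the security chain) or 404 (endpoint not mapped on this port).
        var status = mockMvc.perform(get("/actuator/prometheus"))
                .andReturn().getResponse().getStatus();
        assertThat(status).isNotEqualTo(200);
    }
}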
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Observations

  • The acceptance criteria are verification steps (curl commands, Prometheus UI checks, Tempo queries) — these are manual smoke tests, not automated test cases. For a purely infrastructure/config issue that's acceptable, but one AC is testable automatically and should be.
  • AC: ./mvnw test passes — no test regressions. This is the only automated AC and it's the most important one. The risk is that the OTel Spring Boot starter autoconfigures an OTLP exporter that attempts a connection during ApplicationContextTest. If the exporter fails non-fatally (logs a warning), tests pass. If it throws during context initialization, they fail. This needs explicit validation before the PR is merged — it's the primary regression risk in this issue.
  • No new unit or integration tests are proposed. That's appropriate here — there is no business logic to test. The acceptance criteria are infrastructure-level and properly verified by observation.
  • The CI workflow runs ./mvnw clean test without a tempo or prometheus service. The issue's CI acceptance criterion ("OTLP failure is logged as a warning, not a startup error") will be verified implicitly by the CI run — but only if that CI run actually passes. If it doesn't, the issue isn't done.
  • Missing from the AC: verify the OTel starter doesn't add significant startup time to the test suite. The opentelemetry-spring-boot-starter initializes a TracerProvider at startup. On first run with no cached beans, this could add 2-5 seconds to ApplicationContextTest. Worth checking.

Recommendations

  • Add one integration test that verifies the Prometheus endpoint returns 200 with the correct content type when accessed on the management port (or with appropriate auth if on the main port):
    // Imports follow Spring Boot 3.x package conventions; verify locations under Spring Boot 4.
    import static org.assertj.core.api.Assertions.assertThat;

    import org.junit.jupiter.api.Test;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.boot.test.web.server.LocalManagementPort;
    import org.springframework.context.annotation.Import;
    import org.springframework.http.HttpStatus;
    import org.springframework.web.client.RestClient;

    // PostgresContainerConfig is the project's existing Testcontainers config (same package assumed).
    // management.server.port=0 requests a random management port so CI runs don't bind a fixed 8081.
    @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT,
            properties = "management.server.port=0")
    @Import(PostgresContainerConfig.class)
    class ActuatorPrometheusIT {
        @LocalManagementPort
        int managementPort;

        @Test
        void prometheus_endpoint_returns_prometheus_text_format() {
            // given: a running application with a separate management port
            // when: GET /actuator/prometheus
            // then: 200 OK with Content-Type text/plain;version=0.0.4
            var client = RestClient.create("http://localhost:" + managementPort);
            var response = client.get().uri("/actuator/prometheus").retrieve().toBodilessEntity();
            assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        }
    }
    
    This turns AC #1 from a manual curl check into a permanent regression test.
  • Run ./mvnw test locally (or observe CI) after adding the OTel starter to confirm ApplicationContextTest still completes within the existing time budget. If startup time increases by more than ~3 seconds, flag it.
  • The existing AC coverage is otherwise solid for an infrastructure issue. The Prometheus UI and Tempo manual checks are appropriate for the observability stack validation.
Author
Owner

⚙️ Tobias Wendt — DevOps & Platform Engineer

Observations

  • This is the backend half of the observability stack. It's a clean, focused scope — POM + YAML + one compose env var. Good.
  • The docker-compose.yml already has known issues I'll call out: minio/minio:latest and minio/mc use :latest tags (not this issue, but flagged for awareness). The Tempo service from the companion issue needs a pinned tag when it lands — grafana/tempo:2.4.1 not :latest.
  • The OTEL_EXPORTER_OTLP_ENDPOINT default of http://localhost:4317 will fail silently in development when Tempo isn't running — which is intentional and correct. Confirmed: OTel exporter failures are non-fatal by default. Good.
  • The issue doesn't specify management port. The production architecture documented in docs/DEPLOYMENT.md already states management port 8081 for Prometheus scraping. The Prometheus scrape_configs in the companion issue likely already targets port 8081. The YAML snippet in this issue points to port 8080 implicitly. This mismatch needs to be resolved.
  • No depends_on update needed for the backend service in docker-compose.yml — Tempo doesn't need to be healthy before the backend starts (OTLP failures are non-fatal, and the backend has no functional dependency on Tempo).
  • CI impact: The ./mvnw clean test step in .gitea/workflows/ci.yml runs without Tempo or Prometheus services. The ApplicationContextTest will attempt OTel initialization. If the starter's default behavior is to log a warning when OTLP is unreachable (confirmed by OTel SDK design), CI will continue to pass. If not, CI will break on the first push.

Recommendations

  • Confirm the management port before writing the YAML. Add to application.yml:
    management:
      server:
        port: 8081
    
    Then the compose OTEL_EXPORTER_OTLP_ENDPOINT env var is for traces (port 4317 on Tempo), not the Prometheus scrape — those are two separate concerns on two separate ports. Make this explicit in the issue and in code comments.
  • Set sampling probability via env var override in compose, not a new profile:
    # docker-compose.yml — backend service
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317
      MANAGEMENT_TRACING_SAMPLING_PROBABILITY: "0.1"  # prod rate; dev can override to 1.0 locally
    
  • Verify the Prometheus scrape job in the companion issue targets backend:8081 (management port), not backend:8080. If the Prometheus config hasn't been written yet, note this dependency explicitly (a sketch of the expected job follows this list).
  • Update docs/infrastructure/production-compose.md — the management port 8081 exposure and the new env vars need to be reflected there for the next person doing a production deployment.
  • The bind-mount on ./data/postgres in docker-compose.yml is a known issue (named volumes preferred for production), but out of scope for this issue. Flagged for awareness only.
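
The expected scrape job, sketched here for alignment (job name taken from this issue's acceptance criteria; the companion issue owns the actual file):

# prometheus.yml (sketch; companion issue owns this file)
scrape_configs:
  - job_name: spring-boot
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["backend:8081"]   # management port inside archiv-net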
Author
Owner

📋 Elicit — Requirements Engineer

Observations

  • The issue is well-structured for a devops/infrastructure ticket: clear context, two distinct parts (A + B), explicit acceptance criteria, and a Definition of Done. This meets the Definition of Ready for this project.
  • One ambiguity in the ACs: AC #2 says "The spring-boot scrape target in Prometheus UI shows state UP" — but the scrape target name spring-boot presupposes a specific Prometheus scrape_configs job name that hasn't been defined yet (it belongs to the companion Prometheus configuration issue). This AC is only verifiable once that issue is also implemented. The dependency on the Tempo issue is stated, but the Prometheus config dependency is not.
  • AC #1 ("curl http://localhost:8080/actuator/prometheus returns HTTP 200") is ambiguous once a management port split is introduced (port 8081 vs 8080). If the management port is 8081, this curl should target 8081 instead.
  • The "Depends on: Tempo issue" is stated but the Tempo issue number is not linked. For traceability in the backlog, the actual issue number should be referenced (e.g., #575 or whatever the Tempo issue is). Gitea will auto-link it.
  • The Definition of Done mentions committing pom.xml, application.yml, and docker-compose.yml — this is correct and complete for the scope.
  • The issue does not address what happens to the ApplicationContextTest test profile (@ActiveProfiles("test")) — whether an application-test.yaml needs to disable OTLP export explicitly for the test suite. This is an unresolved edge case.

Recommendations

  • Link the Tempo dependency by issue number, not just prose: add "Depends on: #NNN" with the actual number so Gitea tracks the relationship.
  • Update AC #1 and AC #2 to reference port 8081 once the management port decision is resolved, or add a note that the port used in curl commands should match whatever management.server.port is configured to.
  • Add one acceptance criterion covering the test suite isolation case: "When running ./mvnw test with no OTLP endpoint, the OTel exporter logs a warning (not ERROR) and the test suite completes successfully." This is currently implied but not explicitly stated in the ACs.
  • Consider adding an application-test.yaml that explicitly sets management.tracing.sampling.probability: 0.0 for the test profile to prevent any trace export attempts during tests:
    # application-test.yaml
    management:
      tracing:
        sampling:
          probability: 0.0
    
    This makes the CI behavior deterministic and independent of the OTel SDK's default graceful-failure behavior.
Author
Owner

🎨 Leonie Voss — UX Designer & Accessibility Strategist

No UX or frontend concerns from my angle on this issue.

This is a pure backend observability change — Prometheus metrics endpoint and OTLP trace export. No user-visible interface, no new routes, no Svelte components, no design tokens, no interaction patterns. The work has zero impact on the frontend user experience.

The one downstream UX benefit worth noting: once Grafana dashboards are wired up (Phase 7 milestone), the observability data this issue provides will enable a future admin-facing metrics panel. When that feature is specced, I'll want to review the dashboard UI for the dual-audience design constraints (seniors + millennials). That's a future issue, not this one.

Author
Owner

🗳️ Decision Queue — Action Required

2 decisions need your input before implementation starts.

Infrastructure

  • Management port: 8081 (split) or 8080 (single port)? — The production docs (docs/DEPLOYMENT.md) already state port 8081 for Prometheus scraping. The issue YAML snippets imply port 8080 (no management.server.port config). If you use port 8081: Caddy never needs to see management traffic, Spring Security doesn't need permitAll on /actuator/prometheus, and the Prometheus scrape_configs in the companion issue must target port 8081. If you stay on 8080: you must add /actuator/prometheus to permitAll() in SecurityConfig (auth would otherwise block the scrape). All four reviewers (Markus, Felix, Nora, Tobias) independently recommend port 8081. The only cost is adding management.server.port: 8081 to application.yml and confirming the Prometheus scrape job targets the right port. (Raised by: Markus, Nora, Tobias, Felix)

Dependencies

  • OTel Spring Boot starter version — which exact version for Spring Boot 4? — opentelemetry-spring-boot-starter (io.opentelemetry.instrumentation group) is NOT version-managed by the Spring Boot BOM. The issue omits a <version> tag, which will cause a build failure. You need to look up the correct version from the OpenTelemetry Java Instrumentation releases that is compatible with Spring Boot 4 / Spring Framework 7 before writing the POM change. Once resolved, pin it explicitly in pom.xml. (Raised by: Markus, Felix)
Author
Owner

Implementation complete — branch feat/issue-576-backend-instrumentation

What was done

Commit: b3e49a95 — devops(backend): expose Prometheus metrics endpoint + OTLP trace export from Spring Boot

Part A — Prometheus metrics (backend/pom.xml + application.yaml)

  • Added micrometer-registry-prometheus (version managed by Spring Boot BOM)
  • Exposed health, info, prometheus, metrics endpoints
  • Management port split to 8081 — as recommended by all four reviewers (Markus, Felix, Nora, Tobias). This means:
    • Caddy never sees management traffic (it only routes :8080 app traffic)
    • Spring Security's session-authenticated filter chain on :8080 is bypassed — no permitAll change to SecurityConfig needed
    • Prometheus scrapes backend:8081 directly inside archiv-net — matches the existing infra/observability/prometheus/prometheus.yml scrape target already pointing at backend:8081
  • Added expose: "8081" to the backend service in docker-compose.yml for intra-network reachability
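
For reference, the compose delta for Part A looks roughly like this (service name assumed):

# docker-compose.yml, backend service (sketch)
backend:
  expose:
    - "8081"   # management port; reachable inside archiv-net only, never published to the host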

Part B — OpenTelemetry tracing (backend/pom.xml + application.yaml + docker-compose.yml)

  • Added micrometer-tracing-bridge-otel (BOM-managed)
  • Added opentelemetry-spring-boot-starter pinned to 2.27.0 — latest stable release with Spring Boot 4 / Spring Framework 7 support (not in Spring Boot BOM; explicit version required)
  • Configured otel.service.name: familienarchiv-backend and OTLP endpoint with ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317} default (CI-safe fallback)
  • Sampling probability: 1.0 in base config; overridden to 0.1 in docker-compose.yml via MANAGEMENT_TRACING_SAMPLING_PROBABILITY env var (no extra profile YAML needed)
  • Added OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317 to docker-compose.yml backend environment, pointing at the Tempo service (future Tempo issue)

CI safety — backend/src/test/resources/application-test.yaml

  • Set management.tracing.sampling.probability: 0.0 in the test profile — prevents any OTLP connection attempts during ./mvnw test, making CI behaviour deterministic regardless of the OTel SDK's graceful-failure behaviour

Build verification

./mvnw clean package -DskipTests → BUILD SUCCESS ✅

Files changed

  • backend/pom.xml — 3 new dependencies
  • backend/src/main/resources/application.yaml — management port, endpoint exposure, tracing config, OTel config
  • backend/src/test/resources/application-test.yaml — tracing disabled for tests
  • docker-compose.yml — OTLP endpoint, sampling probability, management port expose