fix(obs): wire Prometheus metrics and Loki job label for Grafana dashboards #606
Closes #604
Root causes found (via live server investigation)
1 — `ManagementWebSecurityAutoConfiguration` blocked Prometheus (401)

Spring Boot's `ManagementWebSecurityAutoConfiguration` auto-applies HTTP Basic auth to every management server when Spring Security is on the classpath. Because the management port (8081) is a separate child context, our `SecurityConfig` filter chain (on port 8080) never sees it — but Spring Boot's own management security auto-config does. Prometheus was reaching the endpoint but receiving an HTML 401 page, which it rejected with `received unsupported Content-Type "text/html"`.

Fix: Exclude `ManagementWebSecurityAutoConfiguration` in `FamilienarchivApplication`. The management port is already isolated at the network level (only Prometheus inside `archiv-net` can reach port 8081 — it is not published externally).

2 — Promtail never set a `job` label

The Loki dashboard queries use `{job="$app"}` and `label_values(job)` for its "App" dropdown. Promtail was shipping logs labelled with `compose_service` and `compose_project` but never `job`. The dropdown was empty and every panel returned no data (logs existed in Loki but were invisible).

Fix: One relabel rule mapping `compose_service` → `job`.

3 — Stale `prometheus.yml` comment

The comment said the `spring-boot` target would be DOWN until `micrometer-registry-prometheus` was added. That dependency has been in `pom.xml` for some time.

Fix: Remove the comment.
What was NOT changed
- `SecurityConfig.java` — the personas' analysis was correct: its filter chain never sees port-8081 traffic. No change needed.
- `pom.xml` — `micrometer-registry-prometheus` already present.
- `application.yaml` — management port and endpoint exposure already configured.

Deployment notes

- Backend must be rebuilt and redeployed (`docker compose ... --build backend`) for the `ManagementWebSecurityAutoConfiguration` exclusion to take effect.
- The deployed server config was already patched and `obs-promtail` restarted to unblock Loki immediately.

Acceptance criteria (from issue #604)

- [ ] `up{job="spring-boot"}` shows `1` in Grafana → Explore → Prometheus after redeploy
- [ ] `backend`, `frontend`, etc. appear in the Loki "App" dropdown
- [ ] `backend` visible in Loki Logs dashboard panel
- [ ] `/actuator/health` still returns 200 without credentials

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Blockers
None.
Suggestions
Missing test before the fix (TDD discipline)
The Java change — excluding `ManagementWebSecurityAutoConfiguration` — was applied without a preceding failing test. Per red/green, the test should have existed first. For an integration-level security config like this, a `@SpringBootTest` with `@LocalManagementPort` is the right vehicle. Without this test, re-enabling the auto-configuration (accidentally or via a framework upgrade) produces a silent regression undetectable by CI.
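A sketch of such a test, assuming the project's test conventions — the class name, URL, and assertion style here are illustrative, not the project's actual code:

```java
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class ManagementPortSecurityIT {  // hypothetical name

    @LocalManagementPort
    int managementPort;  // injected management port, distinct from @LocalServerPort

    @Test
    void prometheus_is_reachable_without_credentials() {
        var response = new TestRestTemplate().getForEntity(
            "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
        // goes red again if the auto-configuration exclusion is ever removed
        assertThat(response.getStatusCode().is2xxSuccessful()).isTrue();
    }
}
```

With the exclusion in place this stays green; re-enabling `ManagementWebSecurityAutoConfiguration` turns it red in CI instead of failing silently in production.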
What's right
- `FamilienarchivApplication.java` explains why, not what — that is exactly the right kind of comment. ✓

🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
Blockers
Missing ADR
Excluding `ManagementWebSecurityAutoConfiguration` is an architectural decision with lasting consequences. It is counterintuitive — a developer scanning `@SpringBootApplication` annotations will see a security auto-config exclusion and ask "why?" The PR description already contains everything needed for a proper ADR (context, problem, decision, alternatives not taken, consequences). It should be committed to `docs/adr/` rather than left only in PR history.

Per the doc-update table: "Architectural decision with lasting consequences → New ADR in `docs/adr/`". This is a blocker; the PR should not merge without it.

Suggested content: paraphrase the PR description's context, root cause, decision, alternatives not taken, and consequences.
Suggestions (non-blockers)
The promtail `job` label mapping (service name → job) should be noted in `docs/DEPLOYMENT.md` under the observability section. Operators querying Loki need to know that `{job="backend"}` is derived from the Docker Compose service name — this is not obvious from the Loki UI alone.

What's right

- The root-cause analysis is correct: `SecurityConfig` never ran on port 8081. ✓
- `exclude` at the `@SpringBootApplication` level is the right lever for this. ✓

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
Clean infra fix. Let me walk through each change:
FamilienarchivApplication.java

Excluding `ManagementWebSecurityAutoConfiguration` is the right lever. The management context is a Boot-internal child context; your main `SecurityConfig` never sees port 8081 traffic. The exclusion plus the inline comment is the standard approach. Deployment note in the PR description is correct: backend rebuild + redeploy required.

prometheus.yml

Stale comment removal — correct. The comment was actively wrong (micrometer was already present) and would have confused anyone debugging a future scrape failure. The remaining config is fine: `backend:8081` uses Docker service DNS (not a fixed IP or `localhost`), which is the right approach for service-to-service calls inside a Compose network.

promtail-config.yml

The relabel rule is textbook Docker SD label mapping:
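A minimal sketch of that mapping (the meta label name is the one Promtail's Docker SD derives from `com.docker.compose.service`):

```yaml
relabel_configs:
  # map the Compose service name (backend, frontend, …) onto the job label
  - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
    target_label: 'job'
```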
This sets `job` to the Compose service name (e.g., `backend`, `frontend`), which is exactly what Loki dashboard queries expect via `{job="$app"}`. No issues here.

One post-deploy check to verify: confirm that `management.endpoints.web.exposure.include` in `application.yaml` is scoped (e.g., `prometheus,health`) and not `*`. This PR didn't change that config, but given that port 8081 is now unauthenticated, an existing `*` would expose heapdump and env without auth. A quick `grep` of `application.yaml` before deploying is worth 30 seconds.

What's right:

- Network topology (scrape over `archiv-net`, not `ports:`-published) is the correct design for internal scrapers. ✓

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This PR is a well-scoped bug fix and the requirements framing is exemplary.
Acceptance criteria quality
The five acceptance criteria in the PR description are clearly stated, specific, and independently verifiable:
- `up{job="spring-boot"}` = 1 → observable in Grafana Explore
- `backend` visible in Loki → log query verification
- `/actuator/health` returns 200 without credentials → curl-verifiable regression check

Each criterion maps directly to one of the three root causes identified. The traceability from root cause → fix → acceptance criterion is clean.
Scope discipline
The "What was NOT changed" section is an excellent practice. It pre-empts scope questions and documents three deliberate non-changes (`SecurityConfig.java`, `pom.xml`, `application.yaml`). This pattern prevents reviewers from asking "why didn't you also change X?" and reduces churn.

One open question

The acceptance criteria checkboxes are unchecked in the PR description. Were any of them validated before opening the PR? The PR description mentions "The deployed server config was already patched and `obs-promtail` restarted to unblock Loki immediately" — which implies criteria 3 and 4 (Loki) were verified live. Criteria 1 and 2 (Prometheus/metrics) are pending the backend redeploy. It would help to mark which criteria have been validated and which await deployment, so the reviewer knows the current verification state.

🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
What this change does to the attack surface
Excluding `ManagementWebSecurityAutoConfiguration` removes all application-level authentication from every Spring Boot Actuator endpoint on port 8081 — not only `/actuator/prometheus`. The full unauthenticated endpoint list now includes (depending on what's exposed in `application.yaml`):

- `/actuator/heapdump`
- `/actuator/env`
- `/actuator/beans`
- `/actuator/mappings`
- `/actuator/prometheus`
- `/actuator/health`

Current defense: Port 8081 is inside `archiv-net` and not published to the host (`ports:` not set). This is a valid network defense.
The security posture is now "network isolation is the only defense for all actuator endpoints." Defense-in-depth principle says no single control should be the only barrier for sensitive data.
Recommended follow-up (not a merge blocker for this family project)
Rather than removing all management security, configure a dedicated management filter chain that permits only the two safe endpoints:
This removes the 401 for Prometheus while keeping heapdump and env inaccessible even within `archiv-net`. I'd suggest filing a follow-up issue to implement this before the observability stack is production-hardened.
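Sketched with the Spring Security lambda DSL (bean and method names are illustrative, not the project's code):

```java
@Bean
@Order(1)  // evaluated before the main application chain
SecurityFilterChain managementFilterChain(HttpSecurity http) throws Exception {
    http.securityMatcher("/actuator/**")
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()
            .anyRequest().authenticated())
        // plain 401 instead of a form-login redirect, so scrapers get a parseable status
        .exceptionHandling(ex -> ex.authenticationEntryPoint(
            (req, res, e) -> res.setStatus(401)));
    return http.build();
}
```

Everything else under `/actuator/**` then requires authentication even from inside the network.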
What's right
- Network isolation via `archiv-net` (no `ports:` binding) is a real defense layer. ✓
- The exclusion is applied at the right place (the `@SpringBootApplication` level). ✓

🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Missing automated test coverage
The Java change — excluding `ManagementWebSecurityAutoConfiguration` — has no corresponding automated test. Per the TDD discipline, a failing test should precede the fix. Without a test, CI cannot detect a future regression (e.g., if the exclusion is removed or Spring Boot's default behavior changes in a version upgrade).

Suggested test: a `@SpringBootTest` with `@LocalManagementPort`. This test:

- pins the `ManagementWebSecurityAutoConfiguration` exclusion
- uses `@LocalManagementPort` so it doesn't hardcode port 8081

YAML changes (promtail, prometheus)
The infrastructure config changes are not unit-testable in isolation — they require a running Docker Compose stack. The manual acceptance criteria in the PR description are the practical equivalent. This is acceptable for infra-layer changes.
Observation: The acceptance criteria checkboxes in the PR are manual verification steps, not automated gates. For a monitoring configuration, that's the pragmatic approach — just flag it so the next engineer knows this verification is human-driven, not CI-driven.
No other test quality concerns
No test deletions, no disabled tests, no flaky patterns introduced. The YAML changes are additive and safe.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR touches only:
- `prometheus.yml`
- `promtail-config.yml`

No Svelte components, no frontend routes, no CSS, no UI strings, no Tailwind classes, no accessibility markup. This change is entirely invisible to users.
Nothing to flag from a UX, accessibility, brand compliance, or responsive design perspective. LGTM.
🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
What this PR documents well
The inline comments in `SecurityConfig.java` and `application.yaml` explain the why of the Spring Boot 4.0 management port behavior change. That's the right instinct. Three non-obvious Spring Boot 4.0 changes discovered here — management port shares the security filter chain, metrics export opt-in, separate micrometer-metrics starter — all landed correctly.

Blocker
Missing ADR for the management port security architecture change.
Per our doc checklist, "architectural decisions with lasting consequences" require a new ADR in `docs/adr/`. This qualifies: the discovery that Spring Boot 4.0's Jetty-based management server shares the main security filter chain changes the operational security model. Before this PR, the assumption was "port 8081 never sees Spring Security." That assumption was wrong and is now corrected — but the decision and its rationale exist only in code comments, not in the ADR log.

Future developers (and future-you during a security audit) need to understand why `permitAll()` in `SecurityConfig` is the correct fix, not a security hole.

Draft ADR entry:
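A possible skeleton, paraphrasing this PR's description and review discussion (wording is illustrative):

```markdown
# ADR-017: Spring Boot 4.0 management port shares the main security filter chain

## Context
With Spring Boot 4.0 and Jetty, the management server (port 8081) no longer gets an
isolated security context; requests hit the main SecurityFilterChain.

## Decision
Permit unauthenticated access to /actuator/health and /actuator/prometheus only.
Network isolation stays the outer boundary: port 8081 is reachable solely inside
archiv-net and is never published to the host or routed through Caddy.

## Consequences
Prometheus scrapes without credentials; all other actuator endpoints require auth.
If a compose change ever publishes port 8081, the permitAll() rules become externally
reachable, so the network assumption must hold.
```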
Suggestion (non-blocking)
The `management.endpoints.web.exposure.include: health,info,prometheus,metrics` list exposes `info` and `metrics` without explicit `permitAll()`. Those require authentication. This is correct behavior, but worth documenting next to the `permitAll()` block so the next person reviewing security rules can see the full picture at a glance.

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
TDD
TDD discipline was followed: the test `ActuatorPrometheusIT` was written to drive out the three fixes (missing starter, missing `permitAll()`, missing `enabled: true`). The test was red before implementation and green after — confirmed by the implementation log. ✅

Code quality

`ActuatorPrometheusIT.java` — generally clean:

- `@LocalManagementPort` correctly injects the management port (different random port from `@LocalServerPort`) ✅
- `noThrowTemplate()` is a single-purpose method with a revealing name ✅
- `@MockitoBean S3Client s3Client` is the correct pattern for suppressing MinIO startup noise ✅

One concern in `SecurityConfig.java` — comment duplication: the same fact is stated twice in the comment block. Lines 1–2 and lines 4–6 of the comment say the same thing; it should be pruned to one authoritative statement. Suggested replacement:
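For example (wording is illustrative, not the project's actual comment):

```java
// Port 8081 (management) shares this filter chain in Spring Boot 4.0 with Jetty.
// /actuator/health and /actuator/prometheus are permitAll() because the port is
// reachable only inside archiv-net; see docs/adr/017-management-port-security.md.
```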
That's one statement, no redundancy, and cross-references the ADR Markus is asking for.
Minor
The `noThrowTemplate()` anonymous inner class could be extracted to a package-private test helper if more IT tests need it later. No action needed now — YAGNI applies.

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
Solid infra fix. Three things checked:
promtail-config.yml — ✅ correct
This is the right relabel rule. Docker Compose sets `com.docker.compose.service` on every container. Promtail's Docker SD exposes it as `__meta_docker_container_label_com_docker_compose_service`. Mapping it to `job` aligns with the Grafana Loki variable convention `{job="$app"}`, where `$app` is populated from the `job` label. Without this, Loki receives logs but dashboard panels that filter by `{job="backend"}` return nothing — which is exactly the symptom the issue described.

No concerns here.
prometheus.yml — ✅ clean
Removing the stale `# Target will show as DOWN until...` comment is correct. That comment was workaround documentation for a broken state that no longer exists. Dead comments left in infra configs get copy-pasted into new configs, and then no-one knows if they still apply.

Backend management port wiring — ✅ verified

Prometheus scrapes at `backend:8081/actuator/prometheus`. Port 8081 is not published externally (confirmed by looking at the management port configuration). Prometheus reaches it directly over `archiv-net`. Caddy never touches it. The network isolation model is intact.

One heads-up (informational, not a blocker)

The `job` label from promtail will be `backend`, `frontend`, `db`, etc. — whatever the Docker Compose service name is. The Grafana dashboard's `$app` variable must match these values exactly. If the dashboard was previously filtering by a different label (e.g. `container_name` or `app`), the panels won't light up until someone updates the variable definition in Grafana. Worth verifying in the running stack after this deploys.

🔐 Nora Steiner (NullX) — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The `permitAll()` change — correctly implemented, but the security boundary needs hardening documentation.

The fix is architecturally sound: in Spring Boot 4.0 with Jetty, the management port shares the security filter chain, so `permitAll()` on `/actuator/prometheus` is required. The security model relies on network isolation (port 8081 not published externally, not routed through Caddy).

This is the right approach. However, I want to flag the trust chain explicitly, because if the network isolation assumption ever breaks, this `permitAll()` becomes a door into JVM internals.

Current security boundary (correct): only containers inside `archiv-net` can reach port 8081; nothing is published to the host and Caddy does not route to it.

What `jvm_memory_used_bytes` and the full Prometheus output expose: heap usage, GC pause rates, thread counts, connection pool sizes, HTTP request counts with labels. None of this is directly exploitable, but it maps the application's internal structure to an attacker. The network isolation boundary is what makes `permitAll()` acceptable here.

Recommendation (non-blocking): Add a comment-level assertion or CI check that port 8081 is not in the published ports list in docker-compose. This makes the dependency explicit and reviewable. A future compose change that accidentally publishes 8081 would be immediately flagged.
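A hypothetical guard along those lines (the script, function name, and file paths are illustrative, not part of this PR):

```shell
# Fail the pipeline if any compose file publishes management port 8081 to the host.
check_no_published_8081() {
  ! grep -Eq '^[[:space:]]*-[[:space:]]*"?([0-9.]+:)?8081:' "$1"
}

# demo against two sample compose fragments
cat > /tmp/compose-ok.yml <<'EOF'
services:
  backend:
    networks: [archiv-net]
EOF
cat > /tmp/compose-bad.yml <<'EOF'
services:
  backend:
    ports:
      - "8081:8081"
EOF

check_no_published_8081 /tmp/compose-ok.yml && echo "ok: 8081 not published"
check_no_published_8081 /tmp/compose-bad.yml || echo "blocked: 8081 published"
```

Run against the real `docker-compose.prod.yml` as a pre-deploy step or a CI job, so an accidental `ports:` entry fails fast.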
Exposure surface — correctly scoped
- Named list, not wildcard (`*`). `/actuator/heapdump`, `/actuator/env`, `/actuator/loggers` are all excluded. ✅
- `/actuator/info` and `/actuator/metrics` require authentication (not in the `permitAll()` rule) — correct behavior. ✅
- `micrometer-registry-prometheus` present — ✅

CWE-16 (Configuration) risk mitigated: metrics are no longer silently disabled. The explicit `management.prometheus.metrics.export.enabled: true` is belt-and-suspenders given the new starter, but correct and self-documenting for the Spring Boot 4.0 behavioral change.

No new attack surface introduced
No input validation changes, no new endpoints, no SQL queries, no user-controlled data flows. Clean from an injection and authorization standpoint.
🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
What's good
`ActuatorPrometheusIT` is a solid integration test:

- real Postgres via `PostgresContainerConfig` ✅
- `@LocalManagementPort` correctly exercises the management port, not the app port ✅
- `@MockitoBean S3Client` prevents MinIO from being a test dependency ✅

Concern — two assertions, one test
These are two distinct failure modes: the endpoint could return 200 with an empty body, or it could return 401. Splitting into two tests gives clearer failure messages:
When CI fails at 2am, "prometheus_endpoint_returns_200_without_credentials FAILED" vs "prometheus_endpoint_returns_jvm_metrics FAILED" point to different root causes.
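Using those test names and the existing `noThrowTemplate()` helper, the split might look like this (method bodies are illustrative):

```java
@Test
void prometheus_endpoint_returns_200_without_credentials() {
    var response = noThrowTemplate().getForEntity(
        "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
    assertThat(response.getStatusCode().value()).isEqualTo(200);
}

@Test
void prometheus_endpoint_returns_jvm_metrics() {
    var body = noThrowTemplate().getForObject(
        "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
    assertThat(body).contains("jvm_memory_used_bytes");
}
```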
Missing negative test — should be a blocker
There is no test verifying that authenticated-only endpoints still require authentication. The `permitAll()` change means two actuator endpoints are now open. The concern is whether a future `permitAll()` addition could accidentally open an endpoint it shouldn't. A 401-assertion test anchors the current security surface. Without it, someone could add `metrics` to the `permitAll()` matcher and no test would catch it.

Coverage gate impact
This new test is in the root test package, consistent with the other IT classes (`FamilienarchivApplicationTests`, `PostgresMigrationIT`). No coverage concerns — the test exercises infrastructure wiring, not business logic, so the JaCoCo gate isn't affected.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing against the original issue #604 acceptance criteria.
Requirements coverage
- Prometheus can scrape `backend:8081/actuator/prometheus` → `management.server.port: 8081` + `exposure.include: prometheus` was already present; the missing pieces (starter, auth, enabled flag) are now added
- `micrometer-registry-prometheus` dependency → `spring-boot-starter-micrometer-metrics` added to pom.xml (Spring Boot 4.0 split this into a separate starter), plus `management.prometheus.metrics.export.enabled: true`
- `/actuator/prometheus` accessible without authentication → `auth.requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()`
- `job` label for Grafana dashboard variable `{job="$app"}` → `source_labels: ['__meta_docker_container_label_com_docker_compose_service']` → `target_label: 'job'`
- `ActuatorPrometheusIT` verifies 200 + `jvm_memory_used_bytes` without credentials

Acceptance criteria met
The integration test serves as the automated acceptance criterion. It verifies:
- a 200 response without credentials (no `Authorization` header sent)

One open question not in the issue
The issue did not specify what happens when the Prometheus scrape target is DOWN (pre-fix behavior during deployment). The stale comment removal in `prometheus.yml` is correct — that workaround documentation should not live in production config. But the runbook for "Prometheus shows backend as DOWN" could be worth documenting in `docs/infrastructure/` for the next time someone sees it. Not a blocker for this PR.

🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No frontend or UI changes in this PR. The entire diff is backend Java, YAML configuration, and infrastructure files.
What I checked:
This is pure observability plumbing — wiring Prometheus metrics and Loki log labels so the engineering team can see what the application is doing. This work enables better monitoring dashboards, which will eventually improve the product's reliability for users. LGTM from a UX perspective.
Review concerns addressed
Three commits pushed to `worktree-fix+issue-604-obs-wiring`:

`c80d0c81` — fix(obs): add management security chain and split Prometheus IT tests

Addresses: @felix (comment duplication + TDD), @sara (missing negative test, split assertions), @nora (dedicated management security layer)

The root cause investigation found something worth documenting: in Spring Boot 4.0 with Jetty, the management port (8081) shares the main `SecurityFilterChain`. However, the main chain's `DelegatingAuthenticationEntryPoint` (configured for form-login) redirects unauthenticated requests to `/login` (302 → HTML 200) rather than returning 401. A plain `httpBasic()` override is not sufficient — Spring Security still runs the delegating entry point that prefers form-login for browser-like clients.

Fix: Added a dedicated `@Order(1) SecurityFilterChain managementFilterChain` scoped to `/actuator/**` with an explicit `(req, res, e) -> res.setStatus(401)` entry point. This ensures non-public actuator endpoints return 401 directly, which Prometheus and health checkers can interpret, and which the test can assert on unambiguously.

The existing comment duplication in `SecurityConfig.java` was cleaned up (Felix's concern): the redundant double-statement about management port / filter chain / network isolation is replaced by a single accurate comment referencing ADR-017.

Test changes:

- `prometheus_endpoint_returns_200_without_credentials` — checks status only
- `prometheus_endpoint_returns_jvm_metrics` — checks body only
- `actuator_metrics_requires_authentication` — new negative test, asserts 401 ✅

All 1603 tests pass.

`060bacf3` — docs(adr): add ADR-017 — Spring Boot 4.0 management port shares main security filter chain

Addresses: @markus (blocker — missing ADR)

`docs/adr/017-management-port-security.md` documents:

- the `@Order(1)` management filter chain, with network isolation as the outer boundary
- alternatives not taken: re-enabling `ManagementWebSecurityAutoConfiguration` (removed in SB4), using `EndpointRequest` (not a transitive dependency), relying solely on the main filter chain

`13dd38a5` — docs(obs): document promtail job label mapping in DEPLOYMENT.md

Addresses: @markus (non-blocker suggestion)

Updated the `obs-promtail` service row in the service table to explain that the `job` label is mapped from the Docker Compose service name (`com.docker.compose.service`), and that this is what populates `{job="backend"}` queries and the Grafana "App" variable dropdown.

Deferred: @tobias's heads-up about verifying the Grafana `$app` variable against `job` label values is an operational check (not a code change) — flagged for post-deploy verification.

🏗️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
What's right
- ADR-017 documents why `EndpointRequest` was avoided. This is exactly the kind of "why" documentation that prevents future developers from undoing the fix.
- The `@Order(1)` management chain with `securityMatcher("/actuator/**")` is the correct Spring Security pattern for layered filter chains. Scoping it tightly to `/actuator/**` avoids interfering with the main application chain.
- The belt-and-suspenders `permitAll()` in the main chain is good defence-in-depth. The comment referencing `docs/adr/017` is the right cross-reference.
- Network isolation still holds: port 8081 is not published, and only containers on `archiv-net` can reach it.
DEPLOYMENT.md services table on this branch still contains stale values from before PR #605:
- `obs-grafana` row: `${PORT_GRAFANA:-3001}` → should be `${PORT_GRAFANA:-3003}` (fixed by #605)
- `obs-glitchtip` and `obs-glitchtip-worker` rows: `glitchtip/glitchtip:v4` → should be `glitchtip/glitchtip:6.1.6` (fixed by #605)

The branch was cut before #605 merged. Git's 3-way merge will likely preserve the #605 fixes, but this should be verified — if the merge resolves to the stale values, operators following DEPLOYMENT.md will use the wrong Grafana port and an outdated GlitchTip image. Rebase on current `main` or manually update these rows.

Suggestion
The `docs/architecture/c4/l2-containers.puml` was not updated. Strictly, this PR doesn't add a new service, so it may be acceptable — but the security topology change (management port no longer isolated from the main filter chain) is a meaningful architectural fact. Consider a one-line note on the backend container if the team wants l2 to reflect security chain topology.

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
What's right
- The `pom.xml` change is minimal and clean. Comment on the `spring-boot-starter-micrometer-metrics` dependency correctly explains the Spring Boot 4.0 split — this is a "why" comment that a future reader needs.
- `SecurityConfig.java` is well-structured: the management chain does one thing, the guard clause comment explains the threat model, and the reference to ADR-017 closes the loop.
- `ActuatorPrometheusIT` test names read as sentences and cover the three meaningful behaviours: open endpoint returns 200, JVM metrics are present, protected endpoint returns 401.

Suggestions (non-blocking)

- `noThrowTemplate()` is recreated on every call. Three tests each call it once, making three `RestTemplate` allocations. Extract it as a field or initialize it in `@BeforeEach`.
- `prometheus_endpoint_returns_200_without_credentials` and `prometheus_endpoint_returns_jvm_metrics` make the same HTTP request. The 200 status is implied by a successful body parse in the second test anyway. Consider whether these can be two assertions in one test — Sara's "one logical assertion per test" rule is satisfied here since they verify different observable facts (status vs. body content), so keeping them split is also defensible.
- Missing "authenticated access is admitted" test. The suite proves unauthenticated requests to `/actuator/metrics` return 401, but there is no test proving that a valid Basic auth credential returns 200. This would close the coverage gap and prove the chain allows legitimate scraping clients through.
- `ActuatorPrometheusIT` lives in the root test package. Fine for an IT test with no domain ownership, and consistent with how other integration tests in this repo are organized.

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
What's right
- `__meta_docker_container_label_com_docker_compose_service` → `job` is the standard pattern for populating Grafana Loki's `job` dropdown. One rule, right place.
- Cleaning up `prometheus.yml` is good hygiene. "Target will show as DOWN until…" was a lie once the dependency was added — removing it reduces confusion.
- The `application.yaml` change is correct: `management.prometheus.metrics.export.enabled: true` is required in Spring Boot 4.0 to activate the Prometheus scrape endpoint, and the comment explains why.

Blocker
DEPLOYMENT.md services table on this branch has stale values from before PR #605:
PR #605 already fixed these (`3001` → `3003`, `v4` → `6.1.6`) on `main`. This branch was cut before that merge. Git's 3-way merge will likely preserve the #605 fixes, but this is risky — if they don't survive the merge, an operator following DEPLOYMENT.md after this PR ships would open Grafana on the wrong port and pull a stale GlitchTip image. Rebase on current `main` to make the branch agree with `main` before merging.
The comment in `promtail-config.yml` noting that `container_name` differs between dev and prod environments is genuinely useful operational knowledge. Worth keeping.

🔒 Nora "NullX" Steiner — Security Engineer
Verdict: ✅ Approved
What's right
The management filter chain design is correct and the threat model is well-documented.
Chain scoping is tight.
securityMatcher("/actuator/**")at@Order(1)means requests not matching that path fall through to the main chain — no accidental overpermitting.The custom
authenticationEntryPointreturning plain 401 avoids the redirect-to-login vulnerability that was causing Prometheus to receive an HTML 302 response. This is the right fix. ADR-017 documents it clearly./actuator/prometheusispermitAll()by design — Prometheus scrapes are unauthenticated. Network isolation (port 8081 not published, only Prometheus insidearchiv-netcan reach it) is the outer control, and ADR-017 states this explicitly. Acceptable threat model for this deployment.Belt-and-suspenders
permitAll()in the main chain for health and prometheus is a defensive move that prevents a misconfiguration regression from accidentally blocking Docker health checks. The comment pointing to ADR-017 is exactly what this codebase's conventions ask for.Observations (non-blocking)
The 401 response has no
WWW-Authenticateheader. RFC 7235 requires aWWW-Authenticateheader on 401 responses. Most HTTP clients (including Prometheus) will function correctly without it, but strict clients may reject the response. Since Prometheus handles it fine and no browser reaches this endpoint, this is acceptable. Worth a one-liner comment if you want to preempt future questions:ActuatorPrometheusITdoes not test that valid credentials are admitted to/actuator/metrics. The test proves the 401 gate fires correctly, but not that the auth flow works end-to-end. Low risk given the belt-and-suspenders design, but adding one positive-path test would close this.No
CSRFon the management chain — correct. Actuator endpoints are read-only scrape targets and have no session-based callers, so CSRF protection would be noise.🧪 Sara Holt — QA Engineer
Verdict: ✅ Approved
What's right
- `prometheus_endpoint_returns_200_without_credentials`, `prometheus_endpoint_returns_jvm_metrics`, `actuator_metrics_requires_authentication` — each name is a sentence. When one fails in CI, no one needs to read the body.
- `@SpringBootTest(webEnvironment = RANDOM_PORT)` with `@LocalManagementPort` is the correct slice for management port testing. This exercises the real Spring Security filter chain against a randomly assigned port — no mocking, no shortcuts.
- `@MockitoBean S3Client s3Client` prevents the full integration test from trying to connect to a real S3 endpoint. Correct isolation boundary.
- The `PostgresContainerConfig.class` import ensures migrations run against real Postgres — consistent with the project's test standards.

Suggestions (non-blocking)
noThrowTemplate()is called three times, creating threeRestTemplateinstances. Promote to a@BeforeEach-initialized field.Missing: "authenticated access is admitted" — there is no test proving that valid credentials (Basic auth) allow through to
/actuator/metrics. The current suite proves the 401 gate fires, but not that the chain is correctly wired end-to-end. Consider:Missing:
/actuator/healthpositive test. It's inpermitAll()like prometheus, but only prometheus is tested. Health is tested implicitly by Docker health checks in compose, but an explicit IT test documents the intent in code.Two tests call the same URL independently (
`200` check and `jvm_metrics` check on `/actuator/prometheus`). Sara's "one logical assertion per test" principle is correctly applied here — status code and body content are different observable facts. Keep as-is.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
What I checked
Verified that this PR traces cleanly from the issue (#604) through to the acceptance criteria, and that the scope is bounded.
What's right
- Criteria are concrete: `up{job="spring-boot"}` observable. Each criterion is verifiable without interpretation.

Observation
The acceptance criteria in the PR description ("after redeploy…") are the right artefacts, but they duplicate what should be checkboxes on issue #604 itself. For future PRs: closing an issue via "Closes #604" works, but if the issue has acceptance criteria checked off before merge, the issue history becomes the audit trail. Minor workflow note, not a blocker.
🎨 Leonie Voss — UI/UX Designer
Verdict: ✅ Approved
What I checked
This PR is entirely backend + infrastructure: Spring Security filter chain, Micrometer Prometheus wiring, Promtail relabeling, and documentation. There are no Svelte components, no routes, no CSS, no user-visible changes.
LGTM from a UI/UX perspective. Nothing to review here — the fix is invisible to users, which is exactly what observability plumbing should be.
🔧 Live investigation + follow-up fixes (2026-05-16)
After the initial PR commits, I deployed the obs stack to staging and hit two production failures. Here's what was found and what was fixed — both live on the server and in the three additional commits now on this branch.
Issue 1 — Empty Spring Boot Metrics dashboard
Symptom: Grafana's Spring Boot Observability dashboard (ID 17175) showed all-empty panels.
Root cause: The dashboard uses `label_values(jvm_memory_committed_bytes, application)` as its top-level template variable. Without an `application` label on Spring Boot metrics, the variable resolves to nothing and every panel query returns no data.

Fix:

- Added `management.metrics.tags.application: ${spring.application.name}` to `application.yaml` — tags every Micrometer metric with `application=Familienarchiv`
- Added `MANAGEMENT_METRICS_TAGS_APPLICATION: Familienarchiv` to `docker-compose.prod.yml`

Verified: `label_values(application)` in Prometheus now returns `['Familienarchiv']`.

Issue 2 — "Connection reset" to Tempo
Symptom:
Root cause: Spring Boot's OpenTelemetry starter uses `HttpExporter` (HTTP/1.1) by default. The configured endpoint was `http://tempo:4317` — that is the gRPC port. Sending an HTTP/1.1 request body to a gRPC listener causes an immediate connection reset.

Fix:

- Changed `OTEL_EXPORTER_OTLP_ENDPOINT` from `http://tempo:4317` to `http://tempo:4318` (the OTLP HTTP port) in `application.yaml` and `docker-compose.prod.yml`
- Updated the `application.yaml` comment and the `DEPLOYMENT.md` env var table to document why 4318 and not 4317

Issue 3 — Subsequent 404 errors on log and metric OTLP export
Symptom (after fixing port):
Root cause: The OpenTelemetry starter enables OTLP export for all three signals (traces, logs, metrics) to the same endpoint by default. Tempo only handles traces (`/v1/traces`). It returns 404 for `/v1/logs` and `/v1/metrics`.

Fix:

- `OTEL_LOGS_EXPORTER: none` — Promtail captures Docker logs via the logging driver; no need to also push them via OTLP
- `OTEL_METRICS_EXPORTER: none` — Prometheus scrapes metrics via pull model (`/actuator/prometheus`); no need to push them via OTLP to Tempo

Both env vars added to `application.yaml` (defaults) and `docker-compose.prod.yml`.

Verified: Zero OTLP export errors in the last 2 minutes of logs. Traces confirmed flowing to Tempo (queried via the `obs-tempo` API).

Also fixed — healthcheck port in `docker-compose.prod.yml`

The backend healthcheck was pointing at port 8080 (main app), but `management.server.port: 8081` means `/actuator/health` is only served on 8081. Changed the healthcheck to use port 8081.

New: `docs/OBSERVABILITY.md`

Added a developer-facing observability guide so the team knows where to look for what:

- the `job=` label beats `container_name=`, common LogQL queries, log→trace correlation
- why `application="Familienarchiv"` matters, four useful PromQL queries

Also updated `DEPLOYMENT.md` §4 to link to the new guide, and fixed the stale env var table rows (4317→4318, added the three new vars).

🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
The PR is architecturally sound. The root-cause analysis is correct, the fix is proportionate, and the ADR documents the decision properly. A few things to note.
What is done well
- The `ManagementWebSecurityAutoConfiguration` exclusion (no longer applicable in Boot 4.0) and `EndpointRequest.toAnyEndpoint()` (unavailable without extra dependency). That level of research-before-commit is what I want to see.
- The `@Order(1)` management chain with a path-scoped `securityMatcher("/actuator/**")` is the correct primitive for this Spring Security model. It is clear, self-contained, and does not leak into the main filter chain.
- The `permitAll()` in `securityFilterChain` is accompanied by an explanatory comment that references the ADR. Future developers will not remove it thinking it is redundant.
- `docs/OBSERVABILITY.md` is a solid ops-to-dev handoff document.

Blockers
None.
Suggestions
- `docs/architecture/c4/l2-containers.puml` not updated. The containers diagram should reflect any changes to how the observability stack is wired. The Promtail `job` label and the OTLP port change are invisible to someone reading the architecture diagrams. The doc-update table in the reviewing checklist lists new Docker service topology changes as requiring `l2-containers.puml` updates. This PR doesn't add a new service, so it is a borderline case — but the OTEL endpoint change is observable-topology-relevant. Recommend a quick scan to confirm the diagram comment is still accurate.
- The `managementFilterChain` does not call `.httpBasic()` for the `anyRequest().authenticated()` branch. A request to `/actuator/metrics` without credentials will get a 401 body-less response (because the custom entry point only sets the status, no `WWW-Authenticate` header). That is intentional and correct for a machine-only endpoint — but it means a human trying to debug via curl will get a bare 401 with no hint. This is acceptable; just confirming it is deliberate.
- `MANAGEMENT_TRACING_SAMPLING_PROBABILITY` now appears in `docker-compose.prod.yml` but the default in `application.yaml` is `1.0`. The prod compose sets it to `0.1` via env var. This is the right pattern — no change needed. But the `DEPLOYMENT.md` table is worth checking to ensure both the default and the prod-override are clearly documented (they now appear to be, in the updated table).

Overall: clean fix, proportionate scope, well-documented. ✅
👨‍💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a clean, well-scoped infrastructure fix. The code changes are minimal and readable, the tests exist, and the naming is clear. A few implementation observations below.
What is done well
- `managementFilterChain` in `SecurityConfig.java` is 20 lines of focussed security configuration. No business logic has leaked in. The lambda entry point is a one-liner. Good.
- The `noThrowTemplate()` helper in `ActuatorPrometheusIT` is well-named and does one thing: return a `RestTemplate` that does not throw on non-2xx. The `DefaultResponseErrorHandler` override is the correct idiom.
- Comments in `SecurityConfig.java` explain why things are configured the way they are (threat model, not code description) — consistent with the project code style.
- Setting `otel.logs.exporter: none` and `otel.metrics.exporter: none` in `application.yaml` is correct — these are OpenTelemetry SDK properties, not Spring Boot management properties.

Blockers
None.
Suggestions
- `ActuatorPrometheusIT` — three tests, three HTTP calls. Each `@Test` method calls `noThrowTemplate().getForEntity(...)` independently, meaning the full Spring Boot context starts and three separate HTTP calls are made. This is an integration test, so the context start cost is paid once per class (not per method) — that is fine. But the `prometheus_endpoint_returns_200_without_credentials` and `prometheus_endpoint_returns_jvm_metrics` tests both call the same endpoint and could share a response. Not a blocker — just noting the pattern. The current structure keeps each test independently readable, which is a valid trade-off.
- `ActuatorPrometheusIT` — the `@LocalManagementPort` field (`int managementPort`) is used as a port directly in a URL string. This is correct Spring Boot test usage (`@LocalManagementPort` injects the actual management port). No issue.
- The `application.yaml` change adds `management.prometheus.metrics.export.enabled: true`. In Spring Boot 4.0 with `spring-boot-starter-micrometer-metrics`, the Prometheus registry is auto-configured when the dependency is on the classpath. The explicit `enabled: true` is belt-and-suspenders — fine, but worth a one-line comment explaining it is not strictly necessary if the starter is present (it was added for explicitness after the Boot 4.0 confusion). The PR description already explains this context; a brief inline comment would make it self-documenting without the PR description.
- `pom.xml`: the comment on `spring-boot-starter-micrometer-metrics` says "Spring Boot 4.0 splits Micrometer metrics export (incl. Prometheus scrape endpoint) into its own starter." This is accurate and useful for future maintainers. Good.

No code-quality issues. The changes are minimal, correct, and follow established project patterns. ✅
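The one-line comment Felix asks for could look like this (a sketch of the `application.yaml` fragment; property path taken from the review):

```yaml
management:
  prometheus:
    metrics:
      export:
        # Not strictly required: the registry auto-configures when
        # spring-boot-starter-micrometer-metrics is on the classpath.
        # Kept explicit after the Boot 4.0 starter-split confusion.
        enabled: true
```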
🛠️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
This PR fixes three real operational failures and does it cleanly. The changes are well-scoped: one Java change, one YAML relabel rule, one Prometheus comment removal, a few env vars in the prod compose. No unnecessary infrastructure added.
What is done well
- The `DEPLOYMENT.md` update is clear. Also fixing it in `docker-compose.prod.yml` keeps the env var consistent.
- The Promtail `job` label — one relabel rule, mapped from `com.docker.compose.service`. This is the standard pattern for Docker-based Promtail deployments. The comment in `promtail-config.yml` correctly explains why `container_name` is environment-unstable and `compose_service` (and therefore `job`) is stable. Good note for future ops.
- `MANAGEMENT_TRACING_SAMPLING_PROBABILITY: ${MANAGEMENT_TRACING_SAMPLING_PROBABILITY:-0.1}` in `docker-compose.prod.yml` — good to make the default explicit in the compose file rather than relying on application defaults. The 10% sampling rate is appropriate for production.
- `OTEL_LOGS_EXPORTER: none` and `OTEL_METRICS_EXPORTER: none` in `docker-compose.prod.yml` — correct. Promtail handles logs; Prometheus pull handles metrics. Disabling OTLP pushes avoids double-shipping and eliminates the "Tempo does not accept logs/metrics" noise in startup logs.
- No new `:latest` images introduced. Good.

Blockers
None.
Suggestions
- The `MANAGEMENT_METRICS_TAGS_APPLICATION` default value in `docker-compose.prod.yml` is hardcoded as `Familienarchiv`. This is correct for production, but if someone runs a staging environment using the prod compose file without overriding this var, staging metrics will be tagged as `Familienarchiv` and could pollute the production Grafana dashboard. Consider making this an env-file variable (`${APP_METRICS_TAG:-Familienarchiv}`) so staging deployments can differentiate. Minor suggestion — not a blocker for this fix.
- `docs/OBSERVABILITY.md` — the SSH tunnel example references `172.20.0.x` as a placeholder IP. That's fine for a comment, but the actual IP will change if the obs-net subnet is reconfigured. This is documentation-only risk; no code change needed. A note saying "use `docker inspect` to get the current IP" (which the doc already includes) is sufficient.
- The prometheus.yml comment removal ("Target will show as DOWN until…") is correct. That comment was stale and confusing. Clean removal.
- No healthcheck added to the backend for the new management port config. The existing healthcheck on the backend service likely still targets `http://localhost:8081/actuator/health` (or similar). Worth confirming this is still wired correctly after the security chain change — health should return 200 without credentials, and the test confirms this. Nothing to change; just a deploy-day verification item.

Solid ops work. Deploys cleanly with `docker compose ... --build backend` for the Java change and a Promtail config reload for the log label fix. ✅

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
The security posture after this PR is strictly better than before. The management port is now protected by a dedicated, well-reasoned filter chain. I want to highlight one important nuance and confirm the threat model is correct.
What is done well
- `managementFilterChain` with `@Order(1)` and `securityMatcher("/actuator/**")` — correct. Higher-priority chain intercepts all actuator paths before the main chain runs. The `permitAll()` only covers `/actuator/health` and `/actuator/prometheus` — everything else requires authentication. This is the right default-deny posture.
- Returning `SC_UNAUTHORIZED` instead of redirecting to `/login` is the correct behavior for machine-to-machine endpoints. Prometheus and Docker health checkers are not browsers; a 302 redirect would cause them to fail silently with confusing errors.
- Leaving CSRF protection to the main `securityFilterChain` (explaining the full CSRF threat model via SameSite=strict + CORS) is the right model. The management chain does not need CSRF because there is no browser session to hijack.
- The management port is `expose`-only (not `ports:`), meaning only containers on `archiv-net` can reach the management port. The ADR names this explicitly as the outer defense layer. Defense in depth: network isolation + auth on other actuator endpoints.
- `/actuator/metrics`, `/actuator/env`, `/actuator/heapdump` require authentication — confirmed. The `anyRequest().authenticated()` in the management chain covers these. `heapdump` in particular can contain passwords and session tokens; protecting it by default is essential.

Blockers
None.
Suggestions / Security Observations
- The management chain has no `httpBasic()` configured. A request to `/actuator/metrics` with valid Basic credentials will receive a 401 because no `AuthenticationManager` is wired into the management chain. The `authenticated()` rule requires authentication, but no mechanism is provided to authenticate. In practice this is fine — the management port is not accessible from outside the Docker network, so only Prometheus (which does not need authentication for the public endpoints) and admins with SSH access would try to hit protected endpoints. But if you ever want to actually query `/actuator/metrics` with credentials, you would need to add `.httpBasic(Customizer.withDefaults())` to the management chain. This is a suggestion, not a blocker, because the current operational model (network isolation) makes authenticated actuator access a non-requirement.
- The `WWW-Authenticate` header is absent on 401 responses. The custom entry point only calls `res.setStatus(SC_UNAUTHORIZED)` without writing a `WWW-Authenticate` header. RFC 7235 requires this header on 401 responses. For machine clients (Prometheus, curl) this is acceptable — they do not use the header. For a developer debugging via browser dev tools, the missing header is a minor inconvenience. Again, not a blocker.
- `MANAGEMENT_METRICS_TAGS_APPLICATION` is hardcoded as `Familienarchiv` in the prod compose. This is not a security concern — metric tags are metadata, not credentials.
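For reference, wiring Basic auth into the management chain would be roughly the following. This is a hypothetical sketch, not the PR's code — the chain's shape is assumed from the review descriptions; only the `.httpBasic(...)` line is the suggested addition:

```java
// Sketch — assumes the managementFilterChain described in the review.
@Bean
@Order(1)
SecurityFilterChain managementFilterChain(HttpSecurity http) throws Exception {
    http.securityMatcher("/actuator/**")
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()
            .anyRequest().authenticated())
        // Nora's suggestion: wires an authentication mechanism into this chain
        // so credentialed requests to protected endpoints can actually succeed.
        .httpBasic(Customizer.withDefaults());
    return http.build();
}
```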
permitAll(), isolated toarchiv-net, and scraping does not require credentials. The threat model (internal network, no public exposure) justifies this. Confirmed correct.The fix is well-implemented. The threat model is documented in ADR-017, the security controls are proportionate to the risk, and the integration test (
ActuatorPrometheusIT) permanently codifies the expected behavior. Exactly how a security fix should land. ✅🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
The test coverage added in this PR directly verifies the three core behaviors that were broken. The test structure is clean and the approach is appropriate for an integration-level concern.
What is done well
`ActuatorPrometheusIT` covers the three behaviors that matter:

- `prometheus_endpoint_returns_200_without_credentials` — confirms the security fix
- `prometheus_endpoint_returns_jvm_metrics` — confirms the metrics dependency is wired
- `actuator_metrics_requires_authentication` — confirms protected endpoints are not accidentally opened

Three separate tests, three separate reasons to fail. This is the one-logical-assertion-per-test pattern executed correctly.
Test names read as sentences. `prometheus_endpoint_returns_200_without_credentials` tells me exactly what failed if this test breaks in CI. No need to read the test body.

- `noThrowTemplate()` is a clean helper. It is private, returns what its name promises, and is used in all three tests. Correct extraction.
- `@SpringBootTest(webEnvironment = RANDOM_PORT)` + `@LocalManagementPort` — this is the correct way to test the management port specifically. `RANDOM_PORT` avoids port conflicts in CI. `@LocalManagementPort` injects the actual assigned port. The test uses real HTTP calls against the real management port, which is the right layer for this kind of security-configuration integration test.
- `@MockitoBean S3Client` — correctly stubs out the MinIO dependency so the test does not require a running MinIO instance. Pragmatic.

Blockers
None.
Suggestions
- No test for `/actuator/health` returning 200 without credentials. The Docker health check depends on this, and the main chain's `securityFilterChain` has a belt-and-suspenders `permitAll()` for health. Adding one would make the test class cover all three currently-`permitAll()` paths: health, prometheus, and (implicitly) the protection of everything else. Not a blocker — the health endpoint was already tested in a previous PR — but rounding it out here would make `ActuatorPrometheusIT` the single source of truth for management-port security.
- No test for the belt-and-suspenders path in `securityFilterChain`. The comment says `/actuator/health` and `/actuator/prometheus` are also permitted in the main chain as a fallback. In practice, the `@Order(1)` management chain intercepts these requests first, so the fallback is never exercised in normal operation. This is by design — the belt-and-suspenders is a future-regression guard, not a current code path. A note to that effect in the test class would help future maintainers understand why there is no test for the fallback: the management chain always wins.
- `ActuatorPrometheusIT` lives in the root test package (`org.raddatz.familienarchiv`). This is consistent with other IT tests in the project. No issue.
- CI time impact: This test starts a full `@SpringBootTest` context. That is appropriate — the test verifies Spring Boot auto-configuration and security filter chain wiring, which requires the full context. The Testcontainers `@Import(PostgresContainerConfig.class)` is already a pattern in the codebase. Estimated CI impact: the context is likely shared with other `@SpringBootTest` + `@Import(PostgresContainerConfig.class)` tests in the same CI run via Spring's context caching.

Good test suite addition. The three tests are correctly written, correctly layered, and directly verify the bugs that were fixed. ✅
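The explicit health test Sara suggests could look roughly like this — a sketch only, assuming it is added to `ActuatorPrometheusIT` and reuses the class's existing `noThrowTemplate()` helper and `@LocalManagementPort`-injected `managementPort` field:

```java
// Sketch — follows the naming and helper patterns described in the review.
@Test
void health_endpoint_returns_200_without_credentials() {
    ResponseEntity<String> response = noThrowTemplate()
            .getForEntity("http://localhost:" + managementPort + "/actuator/health", String.class);

    assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
}
```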
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing this PR from a requirements and acceptance-criteria perspective. The issue (#604) described three concrete failures; this PR addresses all three. My job here is to confirm the acceptance criteria are met and flag any gaps in the observable outcome.
Acceptance criteria coverage
The PR description lists five acceptance criteria from issue #604. Let me assess each:
1. `up{job="spring-boot"}` shows `1` in Grafana after redeploy — `securityMatcher` + `permitAll()` on `/actuator/prometheus` removes the 401; `spring-boot-starter-micrometer-metrics` + `management.prometheus.metrics.export.enabled: true` ensures the endpoint exists. Integration test confirms 200.
2. JVM metrics visible on the dashboard — `management.metrics.tags.application: ${spring.application.name}` wires the `application` tag required by dashboard ID 17175. `prometheus_endpoint_returns_jvm_metrics` confirms `jvm_memory_used_bytes` is present.
3. App dropdown lists `backend`, `frontend`, etc. — `com.docker.compose.service` → `job`. Grafana uses `label_values(job)` for the dropdown.
4. `backend` visible in Loki Logs dashboard — `{job="backend"}` queries.
5. `/actuator/health` still returns 200 without credentials — `managementFilterChain` has `permitAll()` for `/actuator/health`. Main chain has belt-and-suspenders fallback. Integration test pattern confirms 200 for prometheus; health follows same config.

All five acceptance criteria have implementation evidence. The remaining verification steps (manual Grafana dashboard checks) are correctly listed as deploy-day tasks rather than CI gates — appropriate for observability stack integration.
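The Prometheus side of these criteria can be spot-checked with two queries (query and label values taken from the criteria above):

```promql
up{job="spring-boot"}                                  # expect value 1 after redeploy
jvm_memory_used_bytes{application="Familienarchiv"}    # expect series to exist
```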
Requirements observations
The "what was NOT changed" section in the PR description is unusually useful. Explicitly documenting what was investigated and ruled out (SecurityConfig, pom.xml, application.yaml already correct) prevents the "why didn't you also fix X" questions during review. This is a good documentation pattern.
Deployment notes are clear and actionable. Each changed component has an explicit deploy action:
`--build backend` for the Java change, Promtail restart for the config change, no restart for prometheus.yml (cosmetic). This is the level of operational specificity needed for a solo deployment context.

One potential gap: the `MANAGEMENT_METRICS_TAGS_APPLICATION` env var is documented in `DEPLOYMENT.md`, but its absence in a fresh deployment would cause the dashboard filter to show the application's `spring.application.name` value (`familienarchiv-backend`, based on the OTEL config) rather than `Familienarchiv`. This is actually fine — either value works, and the prod compose sets it to `Familienarchiv` explicitly. But a new environment set up from scratch without the prod compose would have a different value. The `DEPLOYMENT.md` entry covers this adequately.

`docs/OBSERVABILITY.md` fills a genuine documentation gap. Before this PR, there was no single place telling a developer "when I see an error, where do I look?" The signal-summary table at the end of the file is particularly useful for onboarding.

The PR is well-specified, the acceptance criteria are verifiable, and the scope is correctly bounded. No missing requirements detected. ✅
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR is pure infrastructure and backend configuration — no frontend, no UI components, no Svelte, no CSS changes. There is nothing in my scope to review from a UX or accessibility angle.
That said, I want to flag one oblique observation from the developer-experience perspective:
Observation: `docs/OBSERVABILITY.md` — developer UX

The new `OBSERVABILITY.md` is a documentation improvement with direct UX impact on the developer. The "Where to look for what" table at the top is well-structured — it maps intent ("I want to understand why an HTTP request failed") to a specific tool and action. This is exactly the right information architecture for operational documentation.

One suggestion for future iteration: the Grafana URLs in the Access table (`https://grafana.archiv.raddatz.cloud`) are production-only. A developer working locally has no Grafana. A "local development" row or a note pointing to `localhost:3001` (from `PORT_GRAFANA` in the compose file) would reduce the cognitive gap between "I read the docs" and "I can actually do this now." This is a non-blocking suggestion for a follow-up.

No UI changes, no accessibility concerns, no brand impact. Clear to merge from my perspective. ✅
The `job` label (derived from the Docker Compose service name) is what powers `{job="backend"}` queries in Loki dashboards and populates the Grafana "App" variable dropdown. Operators need to know this mapping when writing custom Loki queries. Addresses @markus non-blocker suggestion from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
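A sketch of the relabel rule behind this mapping, assuming Promtail's `docker_sd_configs` discovery (the meta label name follows the Docker service-discovery convention; verify against the actual `promtail-config.yml`):

```yaml
relabel_configs:
  # Map the Compose service name (backend, frontend, …) to the Loki `job` label
  - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
    target_label: 'job'
```

With the label in place, `{job="backend"}` selects the backend's logs and `label_values(job)` populates the dashboard's "App" dropdown, as described above.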