fix(obs): wire Prometheus metrics and Loki job label for Grafana dashboards #606
Closes #604
Root causes found (via live server investigation)
1 — `ManagementWebSecurityAutoConfiguration` blocked Prometheus (401)

Spring Boot's `ManagementWebSecurityAutoConfiguration` auto-applies HTTP Basic auth to every management server when Spring Security is on the classpath. Because the management port (8081) is a separate child context, our `SecurityConfig` filter chain (on port 8080) never sees it — but Spring Boot's own management security auto-config does. Prometheus was reaching the endpoint but receiving an HTML 401 page, which it rejected with `received unsupported Content-Type "text/html"`.

Fix: Exclude `ManagementWebSecurityAutoConfiguration` in `FamilienarchivApplication`. The management port is already isolated at the network level (only Prometheus inside `archiv-net` can reach port 8081 — it is not published externally).

2 — Promtail never set a `job` label

The Loki dashboard queries use `{job="$app"}` and `label_values(job)` for its "App" dropdown. Promtail was shipping logs labelled with `compose_service` and `compose_project` but never `job`. The dropdown was empty and every panel returned no data (logs existed in Loki but were invisible).

Fix: One relabel rule mapping `compose_service` → `job`.

3 — Stale `prometheus.yml` comment

The comment said the `spring-boot` target would be DOWN until `micrometer-registry-prometheus` was added. That dependency has been in `pom.xml` for some time.

Fix: Remove the comment.
What was NOT changed
- `SecurityConfig.java` — the personas' analysis was correct: its filter chain never sees port-8081 traffic. No change needed.
- `pom.xml` — `micrometer-registry-prometheus` already present.
- `application.yaml` — management port and endpoint exposure already configured.

Deployment notes

- Backend must be rebuilt and redeployed (`docker compose ... --build backend`) for the `ManagementWebSecurityAutoConfiguration` exclusion to take effect.
- The deployed server config was already patched and `obs-promtail` restarted to unblock Loki immediately.

Acceptance criteria (from issue #604)

- [ ] `up{job="spring-boot"}` shows `1` in Grafana → Explore → Prometheus after redeploy
- [ ] `backend`, `frontend`, etc. appear in the Loki "App" dropdown
- [ ] `backend` visible in Loki Logs dashboard panel
- [ ] `/actuator/health` still returns 200 without credentials

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Blockers
None.
Suggestions
Missing test before the fix (TDD discipline)
The Java change — excluding `ManagementWebSecurityAutoConfiguration` — was applied without a preceding failing test. Per red/green, the test should have existed first. For an integration-level security config like this, a `@SpringBootTest` with `@LocalManagementPort` is the right vehicle. Without this test, re-enabling the auto-configuration (accidentally or via a framework upgrade) produces a silent regression undetectable by CI.
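A sketch of such a test, assuming the project's test conventions — the class name, URL, and assertion style here are illustrative, not the project's actual code:

```java
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class ManagementPortSecurityIT {  // hypothetical name

    @LocalManagementPort
    int managementPort;  // injected management port, distinct from @LocalServerPort

    @Test
    void prometheus_is_reachable_without_credentials() {
        var response = new TestRestTemplate().getForEntity(
            "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
        // goes red again if the auto-configuration exclusion is ever removed
        assertThat(response.getStatusCode().is2xxSuccessful()).isTrue();
    }
}
```

With the exclusion in place this stays green; re-enabling `ManagementWebSecurityAutoConfiguration` turns it red in CI instead of failing silently in production.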
What's right
- `FamilienarchivApplication.java` explains why, not what — that is exactly the right kind of comment. ✓

🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
Blockers
Missing ADR
Excluding `ManagementWebSecurityAutoConfiguration` is an architectural decision with lasting consequences. It is counterintuitive — a developer scanning `@SpringBootApplication` annotations will see a security auto-config exclusion and ask "why?" The PR description already contains everything needed for a proper ADR (context, problem, decision, alternatives not taken, consequences). It should be committed to `docs/adr/` rather than left only in PR history.

Per the doc-update table: "Architectural decision with lasting consequences → New ADR in `docs/adr/`". This is a blocker; the PR should not merge without it.

Suggested content: paraphrase the PR description's context, root cause, decision, alternatives not taken, and consequences.
Suggestions (non-blockers)
The promtail `job` label mapping (service name → job) should be noted in `docs/DEPLOYMENT.md` under the observability section. Operators querying Loki need to know that `{job="backend"}` is derived from the Docker Compose service name — this is not obvious from the Loki UI alone.

What's right

- The root-cause analysis is correct: `SecurityConfig` never ran on port 8081. ✓
- `exclude` at the `@SpringBootApplication` level is the right lever for this. ✓

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
Clean infra fix. Let me walk through each change:
FamilienarchivApplication.java

Excluding `ManagementWebSecurityAutoConfiguration` is the right lever. The management context is a Boot-internal child context; your main `SecurityConfig` never sees port 8081 traffic. The exclusion plus the inline comment is the standard approach. Deployment note in the PR description is correct: backend rebuild + redeploy required.

prometheus.yml

Stale comment removal — correct. The comment was actively wrong (micrometer was already present) and would have confused anyone debugging a future scrape failure. The remaining config is fine: `backend:8081` uses Docker service DNS (not a fixed IP or `localhost`), which is the right approach for service-to-service calls inside a Compose network.

promtail-config.yml

The relabel rule is textbook Docker SD label mapping:
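A minimal sketch of that mapping (the meta label name is the one Promtail's Docker SD derives from `com.docker.compose.service`):

```yaml
relabel_configs:
  # map the Compose service name (backend, frontend, …) onto the job label
  - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
    target_label: 'job'
```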
This sets `job` to the Compose service name (e.g., `backend`, `frontend`), which is exactly what Loki dashboard queries expect via `{job="$app"}`. No issues here.

One post-deploy check to verify: confirm that `management.endpoints.web.exposure.include` in `application.yaml` is scoped (e.g., `prometheus,health`) and not `*`. This PR didn't change that config, but given that port 8081 is now unauthenticated, an existing `*` would expose heapdump and env without auth. A quick `grep` of `application.yaml` before deploying is worth 30 seconds.

What's right:

- Network topology (scrape over `archiv-net`, not `ports:`-published) is the correct design for internal scrapers. ✓

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This PR is a well-scoped bug fix and the requirements framing is exemplary.
Acceptance criteria quality
The five acceptance criteria in the PR description are clearly stated, specific, and independently verifiable:
- `up{job="spring-boot"}` = 1 → observable in Grafana Explore
- `backend` visible in Loki → log query verification
- `/actuator/health` returns 200 without credentials → curl-verifiable regression check

Each criterion maps directly to one of the three root causes identified. The traceability from root cause → fix → acceptance criterion is clean.
Scope discipline
The "What was NOT changed" section is an excellent practice. It pre-empts scope questions and documents three deliberate non-changes (`SecurityConfig.java`, `pom.xml`, `application.yaml`). This pattern prevents reviewers from asking "why didn't you also change X?" and reduces churn.

One open question

The acceptance criteria checkboxes are unchecked in the PR description. Were any of them validated before opening the PR? The PR description mentions "The deployed server config was already patched and `obs-promtail` restarted to unblock Loki immediately" — which implies criteria 3 and 4 (Loki) were verified live. Criteria 1 and 2 (Prometheus/metrics) are pending the backend redeploy. It would help to mark which criteria have been validated and which await deployment, so the reviewer knows the current verification state.

🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
What this change does to the attack surface
Excluding `ManagementWebSecurityAutoConfiguration` removes all application-level authentication from every Spring Boot Actuator endpoint on port 8081 — not only `/actuator/prometheus`. The full unauthenticated endpoint list now includes (depending on what's exposed in `application.yaml`):

- `/actuator/heapdump`
- `/actuator/env`
- `/actuator/beans`
- `/actuator/mappings`
- `/actuator/prometheus`
- `/actuator/health`

Current defense: Port 8081 is inside `archiv-net` and not published to the host (`ports:` not set). This is a valid network defense.
The security posture is now "network isolation is the only defense for all actuator endpoints." Defense-in-depth principle says no single control should be the only barrier for sensitive data.
Recommended follow-up (not a merge blocker for this family project)
Rather than removing all management security, configure a dedicated management filter chain that permits only the two safe endpoints:
This removes the 401 for Prometheus while keeping heapdump and env inaccessible even within `archiv-net`. I'd suggest filing a follow-up issue to implement this before the observability stack is production-hardened.
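Sketched with the Spring Security lambda DSL (bean and method names are illustrative, not the project's code):

```java
@Bean
@Order(1)  // evaluated before the main application chain
SecurityFilterChain managementFilterChain(HttpSecurity http) throws Exception {
    http.securityMatcher("/actuator/**")
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()
            .anyRequest().authenticated())
        // plain 401 instead of a form-login redirect, so scrapers get a parseable status
        .exceptionHandling(ex -> ex.authenticationEntryPoint(
            (req, res, e) -> res.setStatus(401)));
    return http.build();
}
```

Everything else under `/actuator/**` then requires authentication even from inside the network.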
What's right
- Network isolation via `archiv-net` (no `ports:` binding) is a real defense layer. ✓
- The exclusion is applied at the right place (the `@SpringBootApplication` level). ✓

🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Missing automated test coverage
The Java change — excluding `ManagementWebSecurityAutoConfiguration` — has no corresponding automated test. Per the TDD discipline, a failing test should precede the fix. Without a test, CI cannot detect a future regression (e.g., if the exclusion is removed or Spring Boot's default behavior changes in a version upgrade).

Suggested test: a `@SpringBootTest` with `@LocalManagementPort`. This test:

- pins the `ManagementWebSecurityAutoConfiguration` exclusion
- uses `@LocalManagementPort` so it doesn't hardcode port 8081

YAML changes (promtail, prometheus)
The infrastructure config changes are not unit-testable in isolation — they require a running Docker Compose stack. The manual acceptance criteria in the PR description are the practical equivalent. This is acceptable for infra-layer changes.
Observation: The acceptance criteria checkboxes in the PR are manual verification steps, not automated gates. For a monitoring configuration, that's the pragmatic approach — just flag it so the next engineer knows this verification is human-driven, not CI-driven.
No other test quality concerns
No test deletions, no disabled tests, no flaky patterns introduced. The YAML changes are additive and safe.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR touches only:
- `prometheus.yml`
- `promtail-config.yml`

No Svelte components, no frontend routes, no CSS, no UI strings, no Tailwind classes, no accessibility markup. This change is entirely invisible to users.
Nothing to flag from a UX, accessibility, brand compliance, or responsive design perspective. LGTM.
🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
What this PR documents well
The inline comments in `SecurityConfig.java` and `application.yaml` explain the why of the Spring Boot 4.0 management port behavior change. That's the right instinct. Three non-obvious Spring Boot 4.0 changes discovered here — management port shares the security filter chain, metrics export opt-in, separate micrometer-metrics starter — all landed correctly.

Blocker
Missing ADR for the management port security architecture change.
Per our doc checklist, "architectural decisions with lasting consequences" require a new ADR in `docs/adr/`. This qualifies: the discovery that Spring Boot 4.0's Jetty-based management server shares the main security filter chain changes the operational security model. Before this PR, the assumption was "port 8081 never sees Spring Security." That assumption was wrong and is now corrected — but the decision and its rationale exist only in code comments, not in the ADR log.

Future developers (and future-you during a security audit) need to understand why `permitAll()` in `SecurityConfig` is the correct fix, not a security hole.

Draft ADR entry:
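A possible skeleton, paraphrasing this PR's description and review discussion (wording is illustrative):

```markdown
# ADR-017: Spring Boot 4.0 management port shares the main security filter chain

## Context
With Spring Boot 4.0 and Jetty, the management server (port 8081) no longer gets an
isolated security context; requests hit the main SecurityFilterChain.

## Decision
Permit unauthenticated access to /actuator/health and /actuator/prometheus only.
Network isolation stays the outer boundary: port 8081 is reachable solely inside
archiv-net and is never published to the host or routed through Caddy.

## Consequences
Prometheus scrapes without credentials; all other actuator endpoints require auth.
If a compose change ever publishes port 8081, the permitAll() rules become externally
reachable, so the network assumption must hold.
```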
Suggestion (non-blocking)
The `management.endpoints.web.exposure.include: health,info,prometheus,metrics` list exposes `info` and `metrics` without explicit `permitAll()`. Those require authentication. This is correct behavior, but worth documenting next to the `permitAll()` block so the next person reviewing security rules can see the full picture at a glance.

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
TDD
TDD discipline was followed: the test `ActuatorPrometheusIT` was written to drive out the three fixes (missing starter, missing `permitAll()`, missing `enabled: true`). The test was red before implementation and green after — confirmed by the implementation log. ✅

Code quality

`ActuatorPrometheusIT.java` — generally clean:

- `@LocalManagementPort` correctly injects the management port (different random port from `@LocalServerPort`) ✅
- `noThrowTemplate()` is a single-purpose method with a revealing name ✅
- `@MockitoBean S3Client s3Client` is the correct pattern for suppressing MinIO startup noise ✅

One concern in `SecurityConfig.java` — comment duplication: the same fact is stated twice in the comment block. Lines 1–2 and lines 4–6 of the comment say the same thing; it should be pruned to one authoritative statement. Suggested replacement:
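For example (wording is illustrative, not the project's actual comment):

```java
// Port 8081 (management) shares this filter chain in Spring Boot 4.0 with Jetty.
// /actuator/health and /actuator/prometheus are permitAll() because the port is
// reachable only inside archiv-net; see docs/adr/017-management-port-security.md.
```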
That's one statement, no redundancy, and cross-references the ADR Markus is asking for.
Minor
The `noThrowTemplate()` anonymous inner class could be extracted to a package-private test helper if more IT tests need it later. No action needed now — YAGNI applies.

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
Solid infra fix. Three things checked:
promtail-config.yml — ✅ correct
This is the right relabel rule. Docker Compose sets `com.docker.compose.service` on every container. Promtail's Docker SD exposes it as `__meta_docker_container_label_com_docker_compose_service`. Mapping it to `job` aligns with the Grafana Loki variable convention `{job="$app"}`, where `$app` is populated from the `job` label. Without this, Loki receives logs but dashboard panels that filter by `{job="backend"}` return nothing — which is exactly the symptom the issue described.

No concerns here.
prometheus.yml — ✅ clean
Removing the stale `# Target will show as DOWN until...` comment is correct. That comment was workaround documentation for a broken state that no longer exists. Dead comments left in infra configs get copy-pasted into new configs, and then no-one knows if they still apply.

Backend management port wiring — ✅ verified

Prometheus scrapes at `backend:8081/actuator/prometheus`. Port 8081 is not published externally (confirmed by looking at the management port configuration). Prometheus reaches it directly over `archiv-net`. Caddy never touches it. The network isolation model is intact.

One heads-up (informational, not a blocker)

The `job` label from promtail will be `backend`, `frontend`, `db`, etc. — whatever the Docker Compose service name is. The Grafana dashboard's `$app` variable must match these values exactly. If the dashboard was previously filtering by a different label (e.g. `container_name` or `app`), the panels won't light up until someone updates the variable definition in Grafana. Worth verifying in the running stack after this deploys.

🔐 Nora Steiner (NullX) — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The `permitAll()` change — correctly implemented, but the security boundary needs hardening documentation.

The fix is architecturally sound: in Spring Boot 4.0 with Jetty, the management port shares the security filter chain, so `permitAll()` on `/actuator/prometheus` is required. The security model relies on network isolation (port 8081 not published externally, not routed through Caddy).

This is the right approach. However, I want to flag the trust chain explicitly, because if the network isolation assumption ever breaks, this `permitAll()` becomes a door into JVM internals.

Current security boundary (correct): only containers inside `archiv-net` can reach port 8081; nothing is published to the host and Caddy does not route to it.

What `jvm_memory_used_bytes` and the full Prometheus output expose: heap usage, GC pause rates, thread counts, connection pool sizes, HTTP request counts with labels. None of this is directly exploitable, but it maps the application's internal structure to an attacker. The network isolation boundary is what makes `permitAll()` acceptable here.

Recommendation (non-blocking): Add a comment-level assertion or CI check that port 8081 is not in the published ports list in docker-compose. This makes the dependency explicit and reviewable. A future compose change that accidentally publishes 8081 would be immediately flagged.
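A hypothetical guard along those lines (the script, function name, and file paths are illustrative, not part of this PR):

```shell
# Fail the pipeline if any compose file publishes management port 8081 to the host.
check_no_published_8081() {
  ! grep -Eq '^[[:space:]]*-[[:space:]]*"?([0-9.]+:)?8081:' "$1"
}

# demo against two sample compose fragments
cat > /tmp/compose-ok.yml <<'EOF'
services:
  backend:
    networks: [archiv-net]
EOF
cat > /tmp/compose-bad.yml <<'EOF'
services:
  backend:
    ports:
      - "8081:8081"
EOF

check_no_published_8081 /tmp/compose-ok.yml && echo "ok: 8081 not published"
check_no_published_8081 /tmp/compose-bad.yml || echo "blocked: 8081 published"
```

Run against the real `docker-compose.prod.yml` as a pre-deploy step or a CI job, so an accidental `ports:` entry fails fast.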
Exposure surface — correctly scoped
- Named list, not wildcard (`*`). `/actuator/heapdump`, `/actuator/env`, `/actuator/loggers` are all excluded. ✅
- `/actuator/info` and `/actuator/metrics` require authentication (not in the `permitAll()` rule) — correct behavior. ✅
- `micrometer-registry-prometheus` present — ✅

CWE-16 (Configuration) risk mitigated: metrics are no longer silently disabled. The explicit `management.prometheus.metrics.export.enabled: true` is belt-and-suspenders given the new starter, but correct and self-documenting for the Spring Boot 4.0 behavioral change.

No new attack surface introduced
No input validation changes, no new endpoints, no SQL queries, no user-controlled data flows. Clean from an injection and authorization standpoint.
🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
What's good
`ActuatorPrometheusIT` is a solid integration test:

- real Postgres via `PostgresContainerConfig` ✅
- `@LocalManagementPort` correctly exercises the management port, not the app port ✅
- `@MockitoBean S3Client` prevents MinIO from being a test dependency ✅

Concern — two assertions, one test
These are two distinct failure modes: the endpoint could return 200 with an empty body, or it could return 401. Splitting into two tests gives clearer failure messages:
When CI fails at 2am, "prometheus_endpoint_returns_200_without_credentials FAILED" vs "prometheus_endpoint_returns_jvm_metrics FAILED" point to different root causes.
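Using those test names and the existing `noThrowTemplate()` helper, the split might look like this (method bodies are illustrative):

```java
@Test
void prometheus_endpoint_returns_200_without_credentials() {
    var response = noThrowTemplate().getForEntity(
        "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
    assertThat(response.getStatusCode().value()).isEqualTo(200);
}

@Test
void prometheus_endpoint_returns_jvm_metrics() {
    var body = noThrowTemplate().getForObject(
        "http://localhost:" + managementPort + "/actuator/prometheus", String.class);
    assertThat(body).contains("jvm_memory_used_bytes");
}
```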
Missing negative test — should be a blocker
There is no test verifying that authenticated-only endpoints still require authentication. The `permitAll()` change means two actuator endpoints are now open. The concern is whether a future `permitAll()` addition could accidentally open an endpoint it shouldn't. A 401-assertion test anchors the current security surface. Without it, someone could add `metrics` to the `permitAll()` matcher and no test would catch it.

Coverage gate impact
This new test is in the root test package, consistent with the other IT classes (`FamilienarchivApplicationTests`, `PostgresMigrationIT`). No coverage concerns — the test exercises infrastructure wiring, not business logic, so the JaCoCo gate isn't affected.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing against the original issue #604 acceptance criteria.
Requirements coverage
- Prometheus can scrape `backend:8081/actuator/prometheus` → `management.server.port: 8081` + `exposure.include: prometheus` was already present; the missing pieces (starter, auth, enabled flag) are now added
- `micrometer-registry-prometheus` dependency → `spring-boot-starter-micrometer-metrics` added to pom.xml (Spring Boot 4.0 split this into a separate starter), plus `management.prometheus.metrics.export.enabled: true`
- `/actuator/prometheus` accessible without authentication → `auth.requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()`
- `job` label for Grafana dashboard variable `{job="$app"}` → `source_labels: ['__meta_docker_container_label_com_docker_compose_service']` → `target_label: 'job'`
- `ActuatorPrometheusIT` verifies 200 + `jvm_memory_used_bytes` without credentials

Acceptance criteria met
The integration test serves as the automated acceptance criterion. It verifies:
- a 200 response without credentials (no `Authorization` header sent)

One open question not in the issue
The issue did not specify what happens when the Prometheus scrape target is DOWN (pre-fix behavior during deployment). The stale comment removal in `prometheus.yml` is correct — that workaround documentation should not live in production config. But the runbook for "Prometheus shows backend as DOWN" could be worth documenting in `docs/infrastructure/` for the next time someone sees it. Not a blocker for this PR.

🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No frontend or UI changes in this PR. The entire diff is backend Java, YAML configuration, and infrastructure files.
What I checked:
This is pure observability plumbing — wiring Prometheus metrics and Loki log labels so the engineering team can see what the application is doing. This work enables better monitoring dashboards, which will eventually improve the product's reliability for users. LGTM from a UX perspective.
Review concerns addressed
Three commits pushed to `worktree-fix+issue-604-obs-wiring`:

`c80d0c81` — fix(obs): add management security chain and split Prometheus IT tests

Addresses: @felix (comment duplication + TDD), @sara (missing negative test, split assertions), @nora (dedicated management security layer)

The root cause investigation found something worth documenting: in Spring Boot 4.0 with Jetty, the management port (8081) shares the main `SecurityFilterChain`. However, the main chain's `DelegatingAuthenticationEntryPoint` (configured for form-login) redirects unauthenticated requests to `/login` (302 → HTML 200) rather than returning 401. A plain `httpBasic()` override is not sufficient — Spring Security still runs the delegating entry point that prefers form-login for browser-like clients.

Fix: Added a dedicated `@Order(1) SecurityFilterChain managementFilterChain` scoped to `/actuator/**` with an explicit `(req, res, e) -> res.setStatus(401)` entry point. This ensures non-public actuator endpoints return 401 directly, which Prometheus and health checkers can interpret, and which the test can assert on unambiguously.

The existing comment duplication in `SecurityConfig.java` was cleaned up (Felix's concern): the redundant double-statement about management port / filter chain / network isolation is replaced by a single accurate comment referencing ADR-017.

Test changes:

- `prometheus_endpoint_returns_200_without_credentials` — checks status only
- `prometheus_endpoint_returns_jvm_metrics` — checks body only
- `actuator_metrics_requires_authentication` — new negative test, asserts 401 ✅

All 1603 tests pass.

`060bacf3` — docs(adr): add ADR-017 — Spring Boot 4.0 management port shares main security filter chain

Addresses: @markus (blocker — missing ADR)

`docs/adr/017-management-port-security.md` documents:

- the `@Order(1)` management filter chain, with network isolation as the outer boundary
- alternatives not taken: re-enabling `ManagementWebSecurityAutoConfiguration` (removed in SB4), using `EndpointRequest` (not a transitive dependency), relying solely on the main filter chain

`13dd38a5` — docs(obs): document promtail job label mapping in DEPLOYMENT.md

Addresses: @markus (non-blocker suggestion)

Updated the `obs-promtail` service row in the service table to explain that the `job` label is mapped from the Docker Compose service name (`com.docker.compose.service`), and that this is what populates `{job="backend"}` queries and the Grafana "App" variable dropdown.

Deferred: @tobias's heads-up about verifying the Grafana `$app` variable against `job` label values is an operational check (not a code change) — flagged for post-deploy verification.

🏗️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
What's right
- ADR-017 documents why `EndpointRequest` was avoided. This is exactly the kind of "why" documentation that prevents future developers from undoing the fix.
- The `@Order(1)` management chain with `securityMatcher("/actuator/**")` is the correct Spring Security pattern for layered filter chains. Scoping it tightly to `/actuator/**` avoids interfering with the main application chain.
- The belt-and-suspenders `permitAll()` in the main chain is good defence-in-depth. The comment referencing `docs/adr/017` is the right cross-reference.
- Network isolation still holds: port 8081 is not published, and only containers on `archiv-net` can reach it.
DEPLOYMENT.md services table on this branch still contains stale values from before PR #605:
- `obs-grafana` row: `${PORT_GRAFANA:-3001}` → should be `${PORT_GRAFANA:-3003}` (fixed by #605)
- `obs-glitchtip` and `obs-glitchtip-worker` rows: `glitchtip/glitchtip:v4` → should be `glitchtip/glitchtip:6.1.6` (fixed by #605)

The branch was cut before #605 merged. Git's 3-way merge will likely preserve the #605 fixes, but this should be verified — if the merge resolves to the stale values, operators following DEPLOYMENT.md will use the wrong Grafana port and an outdated GlitchTip image. Rebase on current `main` or manually update these rows.

Suggestion
The `docs/architecture/c4/l2-containers.puml` was not updated. Strictly, this PR doesn't add a new service, so it may be acceptable — but the security topology change (management port no longer isolated from the main filter chain) is a meaningful architectural fact. Consider a one-line note on the backend container if the team wants l2 to reflect security chain topology.

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
What's right
- The `pom.xml` change is minimal and clean. Comment on the `spring-boot-starter-micrometer-metrics` dependency correctly explains the Spring Boot 4.0 split — this is a "why" comment that a future reader needs.
- `SecurityConfig.java` is well-structured: the management chain does one thing, the guard clause comment explains the threat model, and the reference to ADR-017 closes the loop.
- `ActuatorPrometheusIT` test names read as sentences and cover the three meaningful behaviours: open endpoint returns 200, JVM metrics are present, protected endpoint returns 401.

Suggestions (non-blocking)

- `noThrowTemplate()` is recreated on every call. Three tests each call it once, making three `RestTemplate` allocations. Extract it as a field or initialize it in `@BeforeEach`.
- `prometheus_endpoint_returns_200_without_credentials` and `prometheus_endpoint_returns_jvm_metrics` make the same HTTP request. The 200 status is implied by a successful body parse in the second test anyway. Consider whether these can be two assertions in one test — Sara's "one logical assertion per test" rule is satisfied here since they verify different observable facts (status vs. body content), so keeping them split is also defensible.
- Missing "authenticated access is admitted" test. The suite proves unauthenticated requests to `/actuator/metrics` return 401, but there is no test proving that a valid Basic auth credential returns 200. This would close the coverage gap and prove the chain allows legitimate scraping clients through.
- `ActuatorPrometheusIT` lives in the root test package. Fine for an IT test with no domain ownership, and consistent with how other integration tests in this repo are organized.

🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
What's right
- `__meta_docker_container_label_com_docker_compose_service` → `job` is the standard pattern for populating Grafana Loki's `job` dropdown. One rule, right place.
- Cleaning up `prometheus.yml` is good hygiene. "Target will show as DOWN until…" was a lie once the dependency was added — removing it reduces confusion.
- The `application.yaml` change is correct: `management.prometheus.metrics.export.enabled: true` is required in Spring Boot 4.0 to activate the Prometheus scrape endpoint, and the comment explains why.

Blocker
DEPLOYMENT.md services table on this branch has stale values from before PR #605:
PR #605 already fixed these (`3001` → `3003`, `v4` → `6.1.6`) on `main`. This branch was cut before that merge. Git's 3-way merge will likely preserve the #605 fixes, but this is risky — if they don't survive the merge, an operator following DEPLOYMENT.md after this PR ships would open Grafana on the wrong port and pull a stale GlitchTip image. Rebase on current `main` to make the branch agree with `main` before merging.
The comment in `promtail-config.yml` noting that `container_name` differs between dev and prod environments is genuinely useful operational knowledge. Worth keeping.

🔒 Nora "NullX" Steiner — Security Engineer
Verdict: ✅ Approved
What's right
The management filter chain design is correct and the threat model is well-documented.
Chain scoping is tight.
securityMatcher("/actuator/**")at@Order(1)means requests not matching that path fall through to the main chain — no accidental overpermitting.The custom
authenticationEntryPointreturning plain 401 avoids the redirect-to-login vulnerability that was causing Prometheus to receive an HTML 302 response. This is the right fix. ADR-017 documents it clearly./actuator/prometheusispermitAll()by design — Prometheus scrapes are unauthenticated. Network isolation (port 8081 not published, only Prometheus insidearchiv-netcan reach it) is the outer control, and ADR-017 states this explicitly. Acceptable threat model for this deployment.Belt-and-suspenders
permitAll()in the main chain for health and prometheus is a defensive move that prevents a misconfiguration regression from accidentally blocking Docker health checks. The comment pointing to ADR-017 is exactly what this codebase's conventions ask for.Observations (non-blocking)
The 401 response has no
WWW-Authenticateheader. RFC 7235 requires aWWW-Authenticateheader on 401 responses. Most HTTP clients (including Prometheus) will function correctly without it, but strict clients may reject the response. Since Prometheus handles it fine and no browser reaches this endpoint, this is acceptable. Worth a one-liner comment if you want to preempt future questions:ActuatorPrometheusITdoes not test that valid credentials are admitted to/actuator/metrics. The test proves the 401 gate fires correctly, but not that the auth flow works end-to-end. Low risk given the belt-and-suspenders design, but adding one positive-path test would close this.No
CSRFon the management chain — correct. Actuator endpoints are read-only scrape targets and have no session-based callers, so CSRF protection would be noise.🧪 Sara Holt — QA Engineer
Verdict: ✅ Approved
What's right
- `prometheus_endpoint_returns_200_without_credentials`, `prometheus_endpoint_returns_jvm_metrics`, `actuator_metrics_requires_authentication` — each name is a sentence. When one fails in CI, no one needs to read the body.
- `@SpringBootTest(webEnvironment = RANDOM_PORT)` with `@LocalManagementPort` is the correct slice for management port testing. This exercises the real Spring Security filter chain against a randomly assigned port — no mocking, no shortcuts.
- `@MockitoBean S3Client s3Client` prevents the full integration test from trying to connect to a real S3 endpoint. Correct isolation boundary.
- The `PostgresContainerConfig.class` import ensures migrations run against real Postgres — consistent with the project's test standards.

Suggestions (non-blocking)
noThrowTemplate()is called three times, creating threeRestTemplateinstances. Promote to a@BeforeEach-initialized field.Missing: "authenticated access is admitted" — there is no test proving that valid credentials (Basic auth) allow through to
/actuator/metrics. The current suite proves the 401 gate fires, but not that the chain is correctly wired end-to-end. Consider:Missing:
/actuator/healthpositive test. It's inpermitAll()like prometheus, but only prometheus is tested. Health is tested implicitly by Docker health checks in compose, but an explicit IT test documents the intent in code.Two tests call the same URL independently (
`200` check and `jvm_metrics` check on `/actuator/prometheus`). Sara's "one logical assertion per test" principle is correctly applied here — status code and body content are different observable facts. Keep as-is.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
What I checked
Verified that this PR traces cleanly from the issue (#604) through to the acceptance criteria, and that the scope is bounded.
What's right
- Criteria are concrete: `up{job="spring-boot"}` observable. Each criterion is verifiable without interpretation.

Observation
The acceptance criteria in the PR description ("after redeploy…") are the right artefacts, but they duplicate what should be checkboxes on issue #604 itself. For future PRs: closing an issue via "Closes #604" works, but if the issue has acceptance criteria checked off before merge, the issue history becomes the audit trail. Minor workflow note, not a blocker.
🎨 Leonie Voss — UI/UX Designer
Verdict: ✅ Approved
What I checked
This PR is entirely backend + infrastructure: Spring Security filter chain, Micrometer Prometheus wiring, Promtail relabeling, and documentation. There are no Svelte components, no routes, no CSS, no user-visible changes.
LGTM from a UI/UX perspective. Nothing to review here — the fix is invisible to users, which is exactly what observability plumbing should be.
🔧 Live investigation + follow-up fixes (2026-05-16)
After the initial PR commits, I deployed the obs stack to staging and hit two production failures. Here's what was found and what was fixed — both live on the server and in the three additional commits now on this branch.
Issue 1 — Empty Spring Boot Metrics dashboard
Symptom: Grafana's Spring Boot Observability dashboard (ID 17175) showed all-empty panels.
Root cause: The dashboard uses `label_values(jvm_memory_committed_bytes, application)` as its top-level template variable. Without an `application` label on Spring Boot metrics, the variable resolves to nothing and every panel query returns no data.

Fix:

- Added `management.metrics.tags.application: ${spring.application.name}` to `application.yaml` — tags every Micrometer metric with `application=Familienarchiv`
- Added `MANAGEMENT_METRICS_TAGS_APPLICATION: Familienarchiv` to `docker-compose.prod.yml`

Verified: `label_values(application)` in Prometheus now returns `['Familienarchiv']`.

Issue 2 — "Connection reset" to Tempo
Symptom:
Root cause: Spring Boot's OpenTelemetry starter uses `HttpExporter` (HTTP/1.1) by default. The configured endpoint was `http://tempo:4317` — that is the gRPC port. Sending an HTTP/1.1 request body to a gRPC listener causes an immediate connection reset.

Fix:

- Changed `OTEL_EXPORTER_OTLP_ENDPOINT` from `http://tempo:4317` to `http://tempo:4318` (the OTLP HTTP port) in `application.yaml` and `docker-compose.prod.yml`
- Updated the `application.yaml` comment and the `DEPLOYMENT.md` env var table to document why 4318 and not 4317

Issue 3 — Subsequent 404 errors on log and metric OTLP export
Symptom (after fixing port):
Root cause: The OpenTelemetry starter enables OTLP export for all three signals (traces, logs, metrics) to the same endpoint by default. Tempo only handles traces (`/v1/traces`). It returns 404 for `/v1/logs` and `/v1/metrics`.

Fix:

- `OTEL_LOGS_EXPORTER: none` — Promtail captures Docker logs via the logging driver; no need to also push them via OTLP
- `OTEL_METRICS_EXPORTER: none` — Prometheus scrapes metrics via pull model (`/actuator/prometheus`); no need to push them via OTLP to Tempo

Both env vars added to `application.yaml` (defaults) and `docker-compose.prod.yml`.

Verified: Zero OTLP export errors in the last 2 minutes of logs. Traces confirmed flowing to Tempo (queried via the `obs-tempo` API).

Also fixed — healthcheck port in `docker-compose.prod.yml`

The backend healthcheck was pointing at port 8080 (main app), but `management.server.port: 8081` means `/actuator/health` is only served on 8081. Changed the healthcheck to use port 8081.

New: `docs/OBSERVABILITY.md`

Added a developer-facing observability guide so the team knows where to look for what:

- the `job=` label beats `container_name=`, common LogQL queries, log→trace correlation
- why `application="Familienarchiv"` matters, four useful PromQL queries

Also updated `DEPLOYMENT.md` §4 to link to the new guide, and fixed the stale env var table rows (4317→4318, added the three new vars).

🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
The PR is architecturally sound. The root-cause analysis is correct, the fix is proportionate, and the ADR documents the decision properly. A few things to note.
What is done well
- The `ManagementWebSecurityAutoConfiguration` exclusion (no longer applicable in Boot 4.0) and `EndpointRequest.toAnyEndpoint()` (unavailable without extra dependency). That level of research-before-commit is what I want to see.
- The `@Order(1)` management chain with a path-scoped `securityMatcher("/actuator/**")` is the correct primitive for this Spring Security model. It is clear, self-contained, and does not leak into the main filter chain.
- The `permitAll()` in `securityFilterChain` is accompanied by an explanatory comment that references the ADR. Future developers will not remove it thinking it is redundant.
- `docs/OBSERVABILITY.md` is a solid ops-to-dev handoff document.

Blockers
None.
Suggestions
- `docs/architecture/c4/l2-containers.puml` not updated. The containers diagram should reflect any changes to how the observability stack is wired. The Promtail `job` label and the OTLP port change are invisible to someone reading the architecture diagrams. The doc-update table in the reviewing checklist lists new Docker service topology changes as requiring `l2-containers.puml` updates. This PR doesn't add a new service, so it is a borderline case — but the OTEL endpoint change is observable-topology-relevant. Recommend a quick scan to confirm the diagram comment is still accurate.
- The `managementFilterChain` does not call `.httpBasic()` for the `anyRequest().authenticated()` branch. A request to `/actuator/metrics` without credentials will get a 401 body-less response (because the custom entry point only sets the status, no `WWW-Authenticate` header). That is intentional and correct for a machine-only endpoint — but it means a human trying to debug via curl will get a bare 401 with no hint. This is acceptable; just confirming it is deliberate.
- `MANAGEMENT_TRACING_SAMPLING_PROBABILITY` now appears in `docker-compose.prod.yml` but the default in `application.yaml` is `1.0`. The prod compose sets it to `0.1` via env var. This is the right pattern — no change needed. But the `DEPLOYMENT.md` table is worth checking to ensure both the default and the prod-override are clearly documented (they now appear to be, in the updated table).

Overall: clean fix, proportionate scope, well-documented. ✅
👨‍💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a clean, well-scoped infrastructure fix. The code changes are minimal and readable, the tests exist, and the naming is clear. A few implementation observations below.
What is done well
- `managementFilterChain` in `SecurityConfig.java` is 20 lines of focussed security configuration. No business logic has leaked in. The lambda entry point is a one-liner. Good.
- The `noThrowTemplate()` helper in `ActuatorPrometheusIT` is well-named and does one thing: return a `RestTemplate` that does not throw on non-2xx. The `DefaultResponseErrorHandler` override is the correct idiom.
- Comments in `SecurityConfig.java` explain why things are configured the way they are (threat model, not code description) — consistent with the project code style.
- Setting `otel.logs.exporter: none` and `otel.metrics.exporter: none` in `application.yaml` is correct — these are OpenTelemetry SDK properties, not Spring Boot management properties.

Blockers
None.
Suggestions
- `ActuatorPrometheusIT` — three tests, three HTTP calls. Each `@Test` method calls `noThrowTemplate().getForEntity(...)` independently, meaning the full Spring Boot context starts and three separate HTTP calls are made. This is an integration test, so the context start cost is paid once per class (not per method) — that is fine. But the `prometheus_endpoint_returns_200_without_credentials` and `prometheus_endpoint_returns_jvm_metrics` tests both call the same endpoint and could share a response. Not a blocker — just noting the pattern. The current structure keeps each test independently readable, which is a valid trade-off.
- `ActuatorPrometheusIT` — the `@LocalManagementPort` field (`int managementPort`) is used as a port directly in a URL string. This is correct Spring Boot test usage (`@LocalManagementPort` injects the actual management port). No issue.
- The `application.yaml` change adds `management.prometheus.metrics.export.enabled: true`. In Spring Boot 4.0 with `spring-boot-starter-micrometer-metrics`, the Prometheus registry is auto-configured when the dependency is on the classpath. The explicit `enabled: true` is belt-and-suspenders — fine, but worth a one-line comment explaining it is not strictly necessary if the starter is present (it was added for explicitness after the Boot 4.0 confusion). The PR description already explains this context; a brief inline comment would make it self-documenting without the PR description.
- `pom.xml`: the comment on `spring-boot-starter-micrometer-metrics` says "Spring Boot 4.0 splits Micrometer metrics export (incl. Prometheus scrape endpoint) into its own starter." This is accurate and useful for future maintainers. Good.

No code-quality issues. The changes are minimal, correct, and follow established project patterns. ✅
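The one-line comment Felix asks for could look like this (a sketch of the `application.yaml` fragment; property path taken from the review):

```yaml
management:
  prometheus:
    metrics:
      export:
        # Not strictly required: the registry auto-configures when
        # spring-boot-starter-micrometer-metrics is on the classpath.
        # Kept explicit after the Boot 4.0 starter-split confusion.
        enabled: true
```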
🛠️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
This PR fixes three real operational failures and does it cleanly. The changes are well-scoped: one Java change, one YAML relabel rule, one Prometheus comment removal, a few env vars in the prod compose. No unnecessary infrastructure added.
What is done well
- The `DEPLOYMENT.md` update is clear. Also fixing it in `docker-compose.prod.yml` keeps the env var consistent.
- The Promtail `job` label — one relabel rule, mapped from `com.docker.compose.service`. This is the standard pattern for Docker-based Promtail deployments. The comment in `promtail-config.yml` correctly explains why `container_name` is environment-unstable and `compose_service` (and therefore `job`) is stable. Good note for future ops.
- `MANAGEMENT_TRACING_SAMPLING_PROBABILITY: ${MANAGEMENT_TRACING_SAMPLING_PROBABILITY:-0.1}` in `docker-compose.prod.yml` — good to make the default explicit in the compose file rather than relying on application defaults. The 10% sampling rate is appropriate for production.
- `OTEL_LOGS_EXPORTER: none` and `OTEL_METRICS_EXPORTER: none` in `docker-compose.prod.yml` — correct. Promtail handles logs; Prometheus pull handles metrics. Disabling OTLP pushes avoids double-shipping and eliminates the "Tempo does not accept logs/metrics" noise in startup logs.
- No new `:latest` images introduced. Good.

Blockers
None.
Suggestions
- The `MANAGEMENT_METRICS_TAGS_APPLICATION` default value in `docker-compose.prod.yml` is hardcoded as `Familienarchiv`. This is correct for production, but if someone runs a staging environment using the prod compose file without overriding this var, staging metrics will be tagged as `Familienarchiv` and could pollute the production Grafana dashboard. Consider making this an env-file variable (`${APP_METRICS_TAG:-Familienarchiv}`) so staging deployments can differentiate. Minor suggestion — not a blocker for this fix.
- `docs/OBSERVABILITY.md` — the SSH tunnel example references `172.20.0.x` as a placeholder IP. That's fine for a comment, but the actual IP will change if the obs-net subnet is reconfigured. This is documentation-only risk; no code change needed. A note saying "use `docker inspect` to get the current IP" (which the doc already includes) is sufficient.
- The prometheus.yml comment removal ("Target will show as DOWN until…") is correct. That comment was stale and confusing. Clean removal.
- No healthcheck added to the backend for the new management port config. The existing healthcheck on the backend service likely still targets `http://localhost:8081/actuator/health` (or similar). Worth confirming this is still wired correctly after the security chain change — health should return 200 without credentials, and the test confirms this. Nothing to change; just a deploy-day verification item.

Solid ops work. Deploys cleanly with `docker compose ... --build backend` for the Java change and a Promtail config reload for the log label fix. ✅

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
The security posture after this PR is strictly better than before. The management port is now protected by a dedicated, well-reasoned filter chain. I want to highlight one important nuance and confirm the threat model is correct.
What is done well
- `managementFilterChain` with `@Order(1)` and `securityMatcher("/actuator/**")` — correct. Higher-priority chain intercepts all actuator paths before the main chain runs. The `permitAll()` only covers `/actuator/health` and `/actuator/prometheus` — everything else requires authentication. This is the right default-deny posture.
- Returning `SC_UNAUTHORIZED` instead of redirecting to `/login` is the correct behavior for machine-to-machine endpoints. Prometheus and Docker health checkers are not browsers; a 302 redirect would cause them to fail silently with confusing errors.
- Leaving CSRF protection to the main `securityFilterChain` (explaining the full CSRF threat model via SameSite=strict + CORS) is the right model. The management chain does not need CSRF because there is no browser session to hijack.
- The management port is `expose`-only (not `ports:`), meaning only containers on `archiv-net` can reach the management port. The ADR names this explicitly as the outer defense layer. Defense in depth: network isolation + auth on other actuator endpoints.
- `/actuator/metrics`, `/actuator/env`, `/actuator/heapdump` require authentication — confirmed. The `anyRequest().authenticated()` in the management chain covers these. `heapdump` in particular can contain passwords and session tokens; protecting it by default is essential.

Blockers
None.
Suggestions / Security Observations
- The management chain has no `httpBasic()` configured. A request to `/actuator/metrics` with valid Basic credentials will receive a 401 because no `AuthenticationManager` is wired into the management chain. The `authenticated()` rule requires authentication, but no mechanism is provided to authenticate. In practice this is fine — the management port is not accessible from outside the Docker network, so only Prometheus (which does not need authentication for the public endpoints) and admins with SSH access would try to hit protected endpoints. But if you ever want to actually query `/actuator/metrics` with credentials, you would need to add `.httpBasic(Customizer.withDefaults())` to the management chain. This is a suggestion, not a blocker, because the current operational model (network isolation) makes authenticated actuator access a non-requirement.
- The `WWW-Authenticate` header is absent on 401 responses. The custom entry point only calls `res.setStatus(SC_UNAUTHORIZED)` without writing a `WWW-Authenticate` header. RFC 7235 requires this header on 401 responses. For machine clients (Prometheus, curl) this is acceptable — they do not use the header. For a developer debugging via browser dev tools, the missing header is a minor inconvenience. Again, not a blocker.
- `MANAGEMENT_METRICS_TAGS_APPLICATION` is hardcoded as `Familienarchiv` in the prod compose. This is not a security concern — metric tags are metadata, not credentials.
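For reference, wiring Basic auth into the management chain would be roughly the following. This is a hypothetical sketch, not the PR's code — the chain's shape is assumed from the review descriptions; only the `.httpBasic(...)` line is the suggested addition:

```java
// Sketch — assumes the managementFilterChain described in the review.
@Bean
@Order(1)
SecurityFilterChain managementFilterChain(HttpSecurity http) throws Exception {
    http.securityMatcher("/actuator/**")
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/actuator/health", "/actuator/prometheus").permitAll()
            .anyRequest().authenticated())
        // Nora's suggestion: wires an authentication mechanism into this chain
        // so credentialed requests to protected endpoints can actually succeed.
        .httpBasic(Customizer.withDefaults());
    return http.build();
}
```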
permitAll(), isolated toarchiv-net, and scraping does not require credentials. The threat model (internal network, no public exposure) justifies this. Confirmed correct.The fix is well-implemented. The threat model is documented in ADR-017, the security controls are proportionate to the risk, and the integration test (
ActuatorPrometheusIT) permanently codifies the expected behavior. Exactly how a security fix should land. ✅🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
The test coverage added in this PR directly verifies the three core behaviors that were broken. The test structure is clean and the approach is appropriate for an integration-level concern.
What is done well
`ActuatorPrometheusIT` covers the three behaviors that matter:

- `prometheus_endpoint_returns_200_without_credentials` — confirms the security fix
- `prometheus_endpoint_returns_jvm_metrics` — confirms the metrics dependency is wired
- `actuator_metrics_requires_authentication` — confirms protected endpoints are not accidentally opened

Three separate tests, three separate reasons to fail. This is the one-logical-assertion-per-test pattern executed correctly.
Test names read as sentences. `prometheus_endpoint_returns_200_without_credentials` tells me exactly what failed if this test breaks in CI. No need to read the test body.

- `noThrowTemplate()` is a clean helper. It is private, returns what its name promises, and is used in all three tests. Correct extraction.
- `@SpringBootTest(webEnvironment = RANDOM_PORT)` + `@LocalManagementPort` — this is the correct way to test the management port specifically. `RANDOM_PORT` avoids port conflicts in CI. `@LocalManagementPort` injects the actual assigned port. The test uses real HTTP calls against the real management port, which is the right layer for this kind of security-configuration integration test.
- `@MockitoBean S3Client` — correctly stubs out the MinIO dependency so the test does not require a running MinIO instance. Pragmatic.

Blockers
None.
Suggestions
- No test for `/actuator/health` returning 200 without credentials. The Docker health check depends on this, and the main chain's `securityFilterChain` has a belt-and-suspenders `permitAll()` for health. Adding one would make the test class cover all three currently-`permitAll()` paths: health, prometheus, and (implicitly) the protection of everything else. Not a blocker — the health endpoint was already tested in a previous PR — but rounding it out here would make `ActuatorPrometheusIT` the single source of truth for management-port security.
- No test for the belt-and-suspenders path in `securityFilterChain`. The comment says `/actuator/health` and `/actuator/prometheus` are also permitted in the main chain as a fallback. In practice, the `@Order(1)` management chain intercepts these requests first, so the fallback is never exercised in normal operation. This is by design — the belt-and-suspenders is a future-regression guard, not a current code path. A note to that effect in the test class would help future maintainers understand why there is no test for the fallback: the management chain always wins.
- `ActuatorPrometheusIT` lives in the root test package (`org.raddatz.familienarchiv`). This is consistent with other IT tests in the project. No issue.
- CI time impact: This test starts a full `@SpringBootTest` context. That is appropriate — the test verifies Spring Boot auto-configuration and security filter chain wiring, which requires the full context. The Testcontainers `@Import(PostgresContainerConfig.class)` is already a pattern in the codebase. Estimated CI impact: the context is likely shared with other `@SpringBootTest` + `@Import(PostgresContainerConfig.class)` tests in the same CI run via Spring's context caching.

Good test suite addition. The three tests are correctly written, correctly layered, and directly verify the bugs that were fixed. ✅
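The explicit health test Sara suggests could look roughly like this — a sketch only, assuming it is added to `ActuatorPrometheusIT` and reuses the class's existing `noThrowTemplate()` helper and `@LocalManagementPort`-injected `managementPort` field:

```java
// Sketch — follows the naming and helper patterns described in the review.
@Test
void health_endpoint_returns_200_without_credentials() {
    ResponseEntity<String> response = noThrowTemplate()
            .getForEntity("http://localhost:" + managementPort + "/actuator/health", String.class);

    assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
}
```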
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing this PR from a requirements and acceptance-criteria perspective. The issue (#604) described three concrete failures; this PR addresses all three. My job here is to confirm the acceptance criteria are met and flag any gaps in the observable outcome.
Acceptance criteria coverage
The PR description lists five acceptance criteria from issue #604. Let me assess each:
1. `up{job="spring-boot"}` shows `1` in Grafana after redeploy — `securityMatcher` + `permitAll()` on `/actuator/prometheus` removes the 401; `spring-boot-starter-micrometer-metrics` + `management.prometheus.metrics.export.enabled: true` ensures the endpoint exists. Integration test confirms 200.
2. JVM metrics visible on the dashboard — `management.metrics.tags.application: ${spring.application.name}` wires the `application` tag required by dashboard ID 17175. `prometheus_endpoint_returns_jvm_metrics` confirms `jvm_memory_used_bytes` is present.
3. App dropdown lists `backend`, `frontend`, etc. — `com.docker.compose.service` → `job`. Grafana uses `label_values(job)` for the dropdown.
4. `backend` visible in Loki Logs dashboard — `{job="backend"}` queries.
5. `/actuator/health` still returns 200 without credentials — `managementFilterChain` has `permitAll()` for `/actuator/health`. Main chain has belt-and-suspenders fallback. Integration test pattern confirms 200 for prometheus; health follows same config.

All five acceptance criteria have implementation evidence. The remaining verification steps (manual Grafana dashboard checks) are correctly listed as deploy-day tasks rather than CI gates — appropriate for observability stack integration.
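The Prometheus side of these criteria can be spot-checked with two queries (query and label values taken from the criteria above):

```promql
up{job="spring-boot"}                                  # expect value 1 after redeploy
jvm_memory_used_bytes{application="Familienarchiv"}    # expect series to exist
```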
Requirements observations
The "what was NOT changed" section in the PR description is unusually useful. Explicitly documenting what was investigated and ruled out (SecurityConfig, pom.xml, application.yaml already correct) prevents the "why didn't you also fix X" questions during review. This is a good documentation pattern.
Deployment notes are clear and actionable. Each changed component has an explicit deploy action:
`--build backend` for the Java change, Promtail restart for the config change, no restart for prometheus.yml (cosmetic). This is the level of operational specificity needed for a solo deployment context.

One potential gap: the `MANAGEMENT_METRICS_TAGS_APPLICATION` env var is documented in `DEPLOYMENT.md`, but its absence in a fresh deployment would cause the dashboard filter to show the application's `spring.application.name` value (`familienarchiv-backend`, based on the OTEL config) rather than `Familienarchiv`. This is actually fine — either value works, and the prod compose sets it to `Familienarchiv` explicitly. But a new environment set up from scratch without the prod compose would have a different value. The `DEPLOYMENT.md` entry covers this adequately.

`docs/OBSERVABILITY.md` fills a genuine documentation gap. Before this PR, there was no single place telling a developer "when I see an error, where do I look?" The signal-summary table at the end of the file is particularly useful for onboarding.

The PR is well-specified, the acceptance criteria are verifiable, and the scope is correctly bounded. No missing requirements detected. ✅
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR is pure infrastructure and backend configuration — no frontend, no UI components, no Svelte, no CSS changes. There is nothing in my scope to review from a UX or accessibility angle.
That said, I want to flag one oblique observation from the developer-experience perspective:
Observation: `docs/OBSERVABILITY.md` — developer UX

The new `OBSERVABILITY.md` is a documentation improvement with direct UX impact on the developer. The "Where to look for what" table at the top is well-structured — it maps intent ("I want to understand why an HTTP request failed") to a specific tool and action. This is exactly the right information architecture for operational documentation.

One suggestion for future iteration: the Grafana URLs in the Access table (`https://grafana.archiv.raddatz.cloud`) are production-only. A developer working locally has no Grafana. A "local development" row or a note pointing to `localhost:3001` (from `PORT_GRAFANA` in the compose file) would reduce the cognitive gap between "I read the docs" and "I can actually do this now." This is a non-blocking suggestion for a follow-up.

No UI changes, no accessibility concerns, no brand impact. Clear to merge from my perspective. ✅
The `job` label (derived from the Docker Compose service name) is what powers `{job="backend"}` queries in Loki dashboards and populates the Grafana "App" variable dropdown. Operators need to know this mapping when writing custom Loki queries. Addresses @markus non-blocker suggestion from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
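A sketch of the relabel rule behind this mapping, assuming Promtail's `docker_sd_configs` discovery (the meta label name follows the Docker service-discovery convention; verify against the actual `promtail-config.yml`):

```yaml
relabel_configs:
  # Map the Compose service name (backend, frontend, …) to the Loki `job` label
  - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
    target_label: 'job'
```

With the label in place, `{job="backend"}` selects the backend's logs and `label_values(job)` populates the dashboard's "App" dropdown, as described above.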