devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics #585
Summary
Closes #573
- Adds `prometheus` (`prom/prometheus:v3.4.0`), `node-exporter` (`prom/node-exporter:v1.9.0`), and `cadvisor` (`gcr.io/cadvisor/cadvisor:v0.52.1`) services to `docker-compose.observability.yml` — all image tags pinned
- `infra/observability/prometheus/prometheus.yml` with 4 scrape jobs: node, cadvisor, spring-boot, ocr-service
- Docker socket mounted read-only (`/var/run/docker.sock:ro`); cAdvisor limited to `obs-net` (uses the socket for discovery, not network membership)

Test plan

- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor` starts all three containers
- Prometheus UI reachable at `http://localhost:9090`

🤖 Generated with Claude Code
🏗️ Markus Keller (@mkeller) — Application Architect
Verdict: ⚠️ Approved with concerns
Good first step for the observability stack. The service topology is sensible and the two-compose-file pattern (main stack + observability overlay joining `archiv-net`) is the right approach. However, the required documentation update for a new Docker service is missing, and there is one technical discrepancy to resolve.

Blockers
1. Wrong scrape port for `spring-boot` job (`infra/observability/prometheus/prometheus.yml`)

`docs/DEPLOYMENT.md` §4 states that the backend serves Prometheus metrics on the management port. The prometheus.yml targets `backend:8080`, not `backend:8081`. If the backend exposes Prometheus metrics on the management port (8081) — which is the standard Spring Boot actuator pattern — this target will never return data even after the micrometer issue ships. Either correct the target or update DEPLOYMENT.md if the intent is to expose metrics on the main port. One of the two must be the source of truth.
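For concreteness, a sketch of the corrected job — `metrics_path` assumes the standard actuator endpoint named in DEPLOYMENT.md; labels and intervals from the PR's actual file are omitted:

```yaml
scrape_configs:
  - job_name: spring-boot
    # Spring Boot exposes Prometheus metrics via the actuator on the
    # management port (8081), not the main application port (8080).
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['backend:8081']
```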
2. `docs/architecture/c4/l2-containers.puml` already updated — but `docs/DEPLOYMENT.md` needs a topology entry

Per my review rule: "New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`"

The C4 diagram is already updated (Prometheus, Loki, Grafana appear in the observability boundary). However, `docs/DEPLOYMENT.md` §1 (Deployment topology) describes only the main stack and refers to the observability stack in §4 without any topology diagram entry for the new Prometheus/node-exporter/cAdvisor services. The Mermaid diagram in §1 and the "Key facts" bullet list should mention that the observability stack is optional and how to start it. This is a documentation contract — DEPLOYMENT.md declares "Update this file in any PR that changes the container topology" — so this is a blocker per the project's own definition.
Suggestions
S1. `prometheus_data` volume declared; `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data` pre-declared in the volumes block

These volumes for future services are fine to pre-declare. Just note that they consume namespace in Docker's volume list even when the services don't exist yet. Consider a comment explaining they are reserved for upcoming issues (#574, #575, etc.).
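A sketch of the suggested comment (volume names are those listed in the review; issue numbers as cited above):

```yaml
volumes:
  prometheus_data:
  # Reserved for upcoming observability services — declared early so the
  # names are stable, but unused until the corresponding issues ship
  # (#574, #575, etc.).
  loki_data:
  tempo_data:
  grafana_data:
  glitchtip_data:
```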
S2. No `depends_on` for node-exporter or cAdvisor on Prometheus

Prometheus will start and immediately begin scraping before node-exporter and cAdvisor are fully up. For a small local stack this is harmless (Prometheus retries), but adding `depends_on` with `service_started` would make the startup ordering explicit and self-documenting.
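A minimal sketch of that explicit ordering (service names as used in this PR's compose file):

```yaml
services:
  prometheus:
    depends_on:
      # service_started only orders startup; it does not wait for health.
      node-exporter:
        condition: service_started
      cadvisor:
        condition: service_started
```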
S3. `--web.enable-lifecycle` opens the hot-reload API without auth

The `/-/reload` endpoint allows unauthenticated config reloads. Since Prometheus is on `archiv-net` + `obs-net`, any container on either network can trigger a reload. This is a low-severity risk for a single-tenant family project, but worth documenting the accepted risk in a comment (similar to the existing cAdvisor `privileged: true` comment).

Overall the architecture is coherent. Fix the port mismatch and the DEPLOYMENT.md gap and this is ready to merge.
👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
This is a pure infrastructure PR — no application code, no frontend, no backend Java or Python changes. My TDD and clean code lens doesn't have much traction here, but I did check the config file quality and naming.
What I checked
- Service names (`obs-prometheus`, `obs-node-exporter`, `obs-cadvisor`) are semantic and follow the existing `obs-*` container naming convention. Volume name `prometheus_data` is consistent with `postgres_data`, `grafana_data` etc. in the codebase.
- The `$$` escaping comment, the `pid: host` comment, the `privileged: true` comment, and the read-only socket comment all explain intent rather than restating what the YAML key does. This is correct.
- Removing the dead `services: {}` stub was clearly the right call — dead stubs are gone, future services are documented as comments with issue references.
- The `$$` Compose escaping for `$` in the node-exporter regex is correctly applied and explained.
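For readers less familiar with the convention: Compose interpolates `${VAR}`, so a literal `$` in a command value must be written `$$`. An illustrative sketch (not the PR's exact regex):

```yaml
services:
  node-exporter:
    command:
      # Compose turns $$ into a single literal $ before the flag reaches
      # node-exporter, so the exporter sees the regex anchor ($|/).
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|run)($$|/)'
```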
Minor note

The `job_name: spring-boot` and `job_name: ocr-service` targets are pre-wired and acknowledged as DOWN in comments. The comments are informative. I'd only add: when the backend instrumentation issue ships, remember to update the target port (Markus flagged the 8080 vs 8081 discrepancy — that's the only code-level thing I'd double-check before merging).

No blockers from my side.
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Solid foundation. Image tags are pinned, named volumes are used, the Docker socket is mounted read-only, and the `$$` YAML escaping is handled correctly. A few things need attention before this goes to production.

Blockers
1. `backend:8080` scrape target conflicts with documented management port

`docs/DEPLOYMENT.md` is explicit: Prometheus scrapes the management port 8081 at `/actuator/prometheus`. The Spring Boot application is configured to serve metrics there, not on the main port 8080. This target will never successfully scrape even after the backend is instrumented.

2. Prometheus port 9090 exposed on all interfaces
On a production VPS this binds to `0.0.0.0:9090`, making the Prometheus UI (and its `/api/v1/query` endpoint with full metric data) reachable from the internet if the host firewall allows it. The Grafana, Loki, Tempo services in the canonical stack (`docs/infrastructure/`) scrape Prometheus internally via `obs-net` — they don't need a published port.

Fix: bind to loopback for external access control, or use `expose` instead of `ports` if the UI is only needed locally via SSH tunnel.
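Two sketches of the fix; the first matches what commit `0c66f629` later applied:

```yaml
services:
  prometheus:
    # Option A: publish on loopback only (reach the UI via SSH tunnel)
    ports:
      - "127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"
    # Option B: no published port at all; containers on obs-net still
    # reach prometheus:9090 internally
    # expose:
    #   - "9090"
```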
Suggestions

S1. Missing healthchecks on node-exporter and cAdvisor
Prometheus has a healthcheck — good. node-exporter and cAdvisor do not. Without them, `depends_on: condition: service_healthy` cannot be used, and Docker has no visibility into whether these services are actually serving. A minimal check:
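A hedged sketch of what such checks could look like. Caveat: both images are minimal and may not ship a shell or `wget`; verify the images' contents before adopting, or run the check externally.

```yaml
services:
  node-exporter:
    healthcheck:
      # Assumes a shell + wget exist in the image; verify first.
      test: ["CMD-SHELL", "wget -qO- http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
  cadvisor:
    healthcheck:
      # cAdvisor serves a /healthz endpoint on its web port.
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080/healthz > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
```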
S2. `--collector.filesystem.ignored-mount-points` is deprecated

In node-exporter v1.x the flag was renamed to `--collector.filesystem.mount-points-exclude`. Using the deprecated flag still works in v1.9.0 but will be removed in a future version. Renovate will bump node-exporter and this will break silently. Prefer the current flag name now:
S3. `PORT_PROMETHEUS` env var not documented in `.env.example` or DEPLOYMENT.md

The compose file uses `${PORT_PROMETHEUS:-9090}` with a sensible default, but the project convention (per DEPLOYMENT.md §2) is that every variable in a compose file must appear in the env var table. A variable missing from the table is a blocker per the doc's own rule ("Any var found in docker-compose.yml that is not in this table is a blocking review comment").
S4. `--web.enable-lifecycle` opens unauthenticated config reload

The `/-/reload` POST endpoint has no authentication. Any container on `archiv-net` or `obs-net` can hot-reload the Prometheus config. Accept the risk explicitly with a comment, or restrict the lifecycle API to the management interface once Prometheus 3.x exposes that option.
S5. No `depends_on` on Prometheus for its scrapers

Not strictly necessary (Prometheus retries failed scrapes), but adding `depends_on` with `condition: service_started` for node-exporter and cAdvisor makes startup ordering explicit.

Overall: Fix the port binding (#2) and the backend scrape port (#1) before merging to production. The rest are improvements.
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The PR demonstrates good security awareness — read-only mounts, explicit privilege documentation, network isolation via `obs-net`. However, there are two issues worth flagging: one medium-severity (Prometheus UI exposed) and one informational (privileged container threat model).

Blockers
None that block merging for dev/local use.
Medium Severity — Production Risk
PROMETHEUS UI EXPOSED ON ALL INTERFACES (CWE-284: Improper Access Control)
What this exposes: The Prometheus HTTP API at `http://<VPS_IP>:9090/api/v1/query` allows querying all collected metrics — container names, process names, CPU/memory patterns, filesystem layout, network interface names. This is low-severity fingerprinting data that an attacker could use to map the infrastructure before a more targeted attack.

More critically, the `/-/reload` endpoint (enabled via `--web.enable-lifecycle`) accepts unauthenticated POST requests. Any actor who can reach port 9090 can reload the Prometheus configuration. While you cannot inject arbitrary YAML through the reload endpoint (the config file is bind-mounted read-only), the attack surface is larger than necessary.

Fix: Bind to loopback. Access the UI via SSH tunnel when needed:
This mirrors the project's own pattern described in DEPLOYMENT.md: "The host binds all docker-published ports to `127.0.0.1` only."

Informational — Accepted Risk Documentation
CADVISOR PRIVILEGED MODE (CWE-250: Execution with Unnecessary Privileges)
The accepted-risk comment in the compose file states the correct rationale. I'd add one more dimension to it: `privileged: true` combined with the Docker socket mount means a compromised cAdvisor container could escape to the host. The existing mitigations (pinned image, Renovate, no external exposure) are reasonable for a single-tenant family project. No action required, but document the full threat model.

Informational — `--web.enable-lifecycle`

The hot-reload endpoint (`POST /-/reload`) has no authentication in Prometheus 3.x. Combined with the `archiv-net` membership (which also includes backend, frontend, OCR service), any container on the network can trigger a config reload. For the current single-file config this has minimal impact. When alerting rules are added (issue #581), a malicious reload that clears the rules file would suppress alerts silently. Worth adding a comment that this risk is accepted.

What's done well
- cAdvisor is limited to `obs-net` — it does not need `archiv-net` membership, which is correctly absent
- Host mounts are `:ro` — /proc, /sys, /rootfs, /var/lib/docker

🧪 Sara Holt (@saraholt) — QA Engineer
Verdict: ✅ Approved
This is infrastructure configuration, not application code, so there's no test suite to review. I'm evaluating testability and verifiability of the configuration itself.
What I checked
Test plan in the PR description: The three-item checklist is correct and complete for validating the compose config:
- `docker compose config` — validates YAML structure
- `docker compose up -d prometheus node-exporter cadvisor` — validates container startup
- `http://localhost:9090` — validates Prometheus UI accessibility

These are the right smoke tests for an infrastructure-only PR.
Observability gap — no alerting rules yet: The scrape config collects metrics, but issue #581 tracks alert rules and runbooks. Without alerting, the observability stack is read-only (dashboards only). This is expected for this issue's scope. When #581 ships, each alert should have a documented runbook per Tobias's standards.
DOWN targets are expected and documented: The `spring-boot` and `ocr-service` scrape jobs are pre-wired but will show as DOWN until backend instrumentation ships. The comments in `prometheus.yml` correctly call this out. When the backend instrumentation PR lands, there should be a verification step that these targets move to UP.

Retention is set: `--storage.tsdb.retention.time=30d` is explicit. For a CX32 with ~8GB RAM and the current metric volume, 30d is reasonable. No concern.

Minor observation
The `prometheus_data` named volume will persist across `docker compose down`. If someone runs `docker compose down -v` (which removes volumes), all historical metrics are lost. This is expected behavior for Docker volumes — just worth noting for incident runbooks when #581 is written.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing against issue #573's implied requirements: add Prometheus + node-exporter + cAdvisor with pinned tags, scrape config, and security baseline. The PR delivers on all stated requirements.
Requirements coverage
- Pinned image tags: `prom/prometheus:v3.4.0`, `prom/node-exporter:v1.9.0`, `gcr.io/cadvisor/cadvisor:v0.52.1`
- Docker socket mounted read-only: `:ro` flag present
- Prometheus healthcheck against `/-/healthy`
- Metric retention configured: `--storage.tsdb.retention.time=30d`
NFR gap — the `PORT_PROMETHEUS` environment variable is undocumented

The project standard (DEPLOYMENT.md §2) states every env var in a compose file must appear in the env var table. `PORT_PROMETHEUS` is new and is not in the table. This is a small gap, but the project explicitly calls it a "blocking review comment." Add a row to the DEPLOYMENT.md env var table.

Forward-looking observations (not blocking)
- The `spring-boot` and `ocr-service` scrape targets create an implicit dependency: when #581 ships, the scrape port for spring-boot must resolve the 8080 vs 8081 ambiguity.
- Grafana should reach Prometheus at `http://prometheus:9090` (internal, via obs-net), not the published host port. This should be a requirement on the Grafana issue.

🎨 Leonie Voss (@leonievoss) — UI/UX Designer
Verdict: ✅ Approved
This PR contains no frontend code, no Svelte components, no UI routes, and no user-facing changes. There is nothing for me to review from a UX or accessibility perspective.
The Prometheus UI at `http://localhost:9090` is a developer/operator tool — not part of the Familienarchiv user-facing product. When Grafana ships (tracked in the future Grafana issue), I'll review the dashboard accessibility and embedding approach if it's surfaced inside the application UI.

LGTM from a UX standpoint.
All 4 review blockers fixed in commit `0c66f629`:

1. Wrong scrape port (`infra/observability/prometheus/prometheus.yml`): changed spring-boot target from `backend:8080` to `backend:8081` (the actuator/management port).
2. Prometheus port binding too broad (`docker-compose.observability.yml`): changed `"${PORT_PROMETHEUS:-9090}:9090"` to `"127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"` — the Prometheus UI is now localhost-only, consistent with the rest of the stack.
3. Topology section updated (`docs/DEPLOYMENT.md`): added a bullet under §1 noting the optional observability stack (Prometheus, Node Exporter, cAdvisor) runs via `docker-compose.observability.yml`, joins `archiv-net`, and scrapes the backend management port `:8081`.
4. PORT_PROMETHEUS added to env var table (`docs/DEPLOYMENT.md`): new "Observability stack" subsection in §2 with a row for `PORT_PROMETHEUS` (default `9090`, bound to `127.0.0.1` only).