devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics #573
Context
This issue adds the three metric-collection services to `docker-compose.observability.yml` and their Prometheus scrape configuration. After this issue, Prometheus scrapes host metrics (Node Exporter) and container metrics (cAdvisor) on the local stack.

Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Services to Add

- prometheus
- node-exporter
- cadvisor

Also declare an internal `obs-net` bridge network for observability-only traffic that does not need to reach the application stack.

Config File to Create

`infra/observability/prometheus/prometheus.yml`:
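A minimal sketch of what that scrape config might look like. The job names, ports, and the actuator path follow this issue and the review below; the target hostnames and interval are assumptions that are settled later in the thread.

```yaml
# Sketch only — illustrative shape of infra/observability/prometheus/prometheus.yml.
# Target hostnames are assumptions; reviewers below recommend the service name
# backend:8080 for the spring-boot job instead of the container name archive-backend:8080.
global:
  scrape_interval: 15s            # 15s interval referenced in the review discussion

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']   # assumed service name
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']        # assumed service name
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['archive-backend:8080'] # as written in the issue spec; see review below
  # ocr-service: only if the Python service exposes Prometheus metrics; skip if not
```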
Acceptance Criteria

1. `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor` starts all three containers without error
2. The Prometheus UI is reachable at `http://localhost:9090`
3. `curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job'` lists `node`, `cadvisor`, and `spring-boot`
4. `curl -s http://localhost:9090/api/v1/query?query=node_cpu_seconds_total` returns data
5. `curl -s http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total` returns data
6. The `spring-boot` target shows as `DOWN` (expected until backend instrumentation), not as a config error

Definition of Done

- All acceptance criteria checked
🔧 Tobias Wendt — DevOps & Platform Engineer
Observations
- Image pinning is inconsistent. `prometheus:v3.4.0` is pinned, which is good; `node-exporter:latest` and `cadvisor:latest` are not. The explicit versioning philosophy from the existing stack (`postgres:16-alpine`, `mailpit:v1.29.7`) applies here too. Pick the current stable tags for both: `prom/node-exporter:v1.9.0` and `gcr.io/cadvisor/cadvisor:v0.52.1` (or whatever is latest at the time of implementation).
- The `obs-net` network architecture is sound. Splitting observability traffic onto a separate bridge network is the right call. It keeps Prometheus's scrape requests off `archiv-net` and makes the dependency topology visible in the Compose file.
- cAdvisor needs `archiv-net` to see application containers; that is correct, but the issue should state why explicitly in a comment so future reviewers don't remove it thinking it's unnecessary.
- No `healthcheck` on Prometheus. The existing stack has healthchecks on every service. Prometheus exposes `/-/healthy`. Add it (a sketch follows this list).
- `node-exporter` and cAdvisor use `expose:` (not `ports:`), which is correct. These should never be reachable from outside the obs-net. The issue spec correctly keeps them internal.
- The `prometheus_data` named volume is correct, but it needs to be declared in the top-level `volumes:` block of the compose file (the issue YAML snippet shows it referenced as `prometheus_data:/prometheus` without the declaration block).
- `PORT_PROMETHEUS` env var defaults to 9090. In production, Prometheus should NOT be exposed to the host at all; scraping happens on the internal network. The Caddy Caddyfile already blocks actuator externally; Prometheus should be handled similarly. Bind the port to `127.0.0.1` if local access is needed: `"127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"` (see the sketch below).
- `--web.enable-lifecycle` is a mild risk. This flag allows POST `/-/reload` to hot-reload config without a restart. It's useful operationally but represents a management endpoint. If Prometheus is bound to `127.0.0.1` only (see above), this is fine. If exposed more broadly, it's a target.
- The `ocr-service` scrape job in prometheus.yml is speculative. The Python OCR service does not currently expose Prometheus metrics; there's no `prometheus-client` dependency in the service. The comment in the issue ("Only if the Python service exposes Prometheus metrics; skip if not") should be reflected in `prometheus.yml` with an inline comment, not left as dead config.
- The `spring-boot` scrape target will show as DOWN until the backend instrumentation issue lands. The issue acknowledges this. Confirm the service name in the Prometheus config: the Docker service is named `backend` (both in `docker-compose.yml` and `docker-compose.prod.yml`). The `archive-backend:8080` hostname shown in the issue spec is the container name from the dev compose, not the service DNS name. Inside the obs-net / archiv-net the service DNS is `backend`. Use `targets: ['backend:8080']`.
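A hedged sketch of the Prometheus service with the healthcheck and the loopback binding applied. The `prom/prometheus` image name and the wget-based check parameters are taken from the implementation notes later in this thread; the config mount path and command flags are assumptions and may differ from the actual compose file.

```yaml
# Sketch only — not the issue's compose file verbatim.
services:
  prometheus:
    image: prom/prometheus:v3.4.0            # assumed full image name for the pinned v3.4.0 tag
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'  # 30d retention discussed in this thread
    ports:
      - "127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"   # loopback-only binding
    volumes:
      - ./infra/observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    healthcheck:
      # /-/healthy liveness endpoint; wget-based check (30s/5s/3 retries) per the
      # implementation comment further down
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  prometheus_data:                            # must be declared at the top level
```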
Recommendations

- Add healthchecks to all three services (Prometheus: `/-/healthy`; node-exporter: `:9100/metrics` returns 200; cAdvisor: `:8080/healthz`).
- Change the scrape target from `archive-backend:8080` to `backend:8080`.
- Add a `# TODO: remove or add prometheus-client to ocr-service` comment to the `ocr-service` job in `prometheus.yml` rather than a conditional comment.
- Declare `prometheus_data:` in the top-level volumes block of the compose file.

Open Decisions
Where does `docker-compose.observability.yml` live in the compose invocation pattern? The existing pattern is `docker compose -f docker-compose.yml` (dev) or `-f docker-compose.prod.yml` (prod, standalone). Should observability be an overlay (`-f docker-compose.yml -f docker-compose.observability.yml`) or a standalone file started independently? Standalone is simpler but means no `depends_on` to the app services. Overlay integrates cleanly but extends the start command further. The current issue spec treats it as standalone; confirm this is intentional.

🏛️ Markus Keller — Application Architect
Observations
- Architecture doc update is required. Adding three new infrastructure components (Prometheus, node-exporter, cAdvisor) triggers a mandatory update to `docs/architecture/c4/l2-containers.puml`. Per the documentation table in CLAUDE.md: "New Docker service / infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`". Both must be updated in this PR.
- An ADR is warranted here. This issue introduces the first observability infrastructure for the project, a lasting architectural decision. A short ADR in `docs/adr/` should capture: why Prometheus + node-exporter + cAdvisor was chosen over alternatives (Netdata, Datadog agent, etc.), the obs-net isolation rationale, the retention policy choice (30d), and the "spring-boot target is intentionally DOWN" note. This is exactly the kind of "why the codebase is the way it is" context ADRs exist for.
- Service topology is well-scoped. Keeping Prometheus in its own compose file with a dedicated `obs-net` bridge is the right isolation strategy. It means the observability stack can be started or stopped without touching the application stack.
- The `archiv-net` dual-attachment for cAdvisor needs clarification. cAdvisor is attached to both `archiv-net` and `obs-net`. The stated reason is to see application containers, but cAdvisor discovers containers via the Docker socket (`/var/run/docker.sock`), not via network membership. Attaching to `archiv-net` is therefore not needed for container metrics; it would be needed only if cAdvisor had to make direct HTTP calls to application containers. Revisit whether this dual attachment is actually necessary.
- The `privileged: true` on cAdvisor is a known requirement. This is a documented cAdvisor requirement for full container metrics on Linux. It's acceptable for this use case, but it should carry a comment in the compose file explaining why it's privileged; reviewers will flag it otherwise.
- The `prometheus_data` volume retention of 30 days is reasonable for a single-VPS deployment. Given the ~23 EUR/month budget constraint, 30d of host+container metrics at 15s intervals will consume roughly 1–3 GB depending on the number of time series. This fits comfortably on the CX32.
- The issue depends on a scaffold issue ("compose file and `infra/observability/` directory must exist first"). This is appropriate sequencing. Confirm the scaffold issue is merged before this one starts, or make this issue's branch build on top of the scaffold branch.

Recommendations
- Write the ADR (e.g. `docs/adr/ADR-00X-observability-stack.md`) before implementing. Contents: decision, alternatives considered, consequences (VPS disk usage, privileged container), and the "intentionally DOWN scrape target" note.
- Update `docs/architecture/c4/l2-containers.puml` and `docs/DEPLOYMENT.md` in the same PR. These are mandatory per the documentation matrix.
- Drop `archiv-net` from cAdvisor unless there's a specific reason it needs network access to app containers (not just Docker socket access). Least-privilege network topology.
- Add a `# privileged: true — required for cgroup and namespace metrics, see cAdvisor docs` comment in the compose file (placement sketched below).
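For illustration, a sketch of where that comment would sit in the cAdvisor service definition (image tag taken from Tobias's recommendation; everything else omitted):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.52.1
  # privileged: true — required for cgroup and namespace metrics, see cAdvisor docs
  privileged: true
```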
Open Decisions

Should `infra/observability/` live inside the existing `infra/` directory (alongside `caddy/`, `gitea/`) or at the project root alongside the compose files? The `infra/` directory already holds service-specific config subdirectories. Putting `infra/observability/prometheus/prometheus.yml` there follows the established pattern. The issue spec already proposes this; just make sure the scaffold issue creates that directory structure first.

🔐 Nora "NullX" Steiner — Application Security Engineer
Observations
- cAdvisor runs `privileged: true` with root filesystem bind-mounts. This is the highest-risk element of this issue. A compromised cAdvisor container has read access to the entire host filesystem (`/:/rootfs:ro`) and write access to `/var/run` (Docker socket `rw`). The threat model: if cAdvisor has a known RCE vulnerability and the `obs-net` is reachable from an attacker-controlled vector, host compromise is the outcome. Mitigations to implement: keep `obs-net` an internal bridge with no external port exposure on cAdvisor, and pin the image by digest (`@sha256:...`) in production for the most sensitive containers.
- `/var/run:/var/run:rw` is overly broad. The intent is to give cAdvisor access to the Docker socket at `/var/run/docker.sock`. Mount the socket directly instead (a sketch follows this list). Mounting all of `/var/run` as `rw` exposes more than the socket and grants write access to other runtime files. CWE-732: Incorrect Permission Assignment.
- `--web.enable-lifecycle` on Prometheus. This flag enables unauthenticated POST to `/-/reload` and `/-/quit`. If Prometheus is accessible beyond localhost (e.g., if `PORT_PROMETHEUS` maps to `0.0.0.0`), any process on the host, or any container on a shared network, can trigger a config reload or shutdown. If it's only needed for config reloads during development, omit it from the production config or bind Prometheus to `127.0.0.1` only.
- `/actuator/prometheus` is not yet exposed by the backend. I checked: `pom.xml` has `spring-boot-starter-actuator` but no `io.micrometer:micrometer-registry-prometheus` dependency, and `application.yaml` has no `management.endpoints.web.exposure.include` setting, so only `health` is exposed by default. Adding `micrometer-registry-prometheus` is a separate issue, but note: when that dependency is added, the `management.endpoints.web.exposure.include` config in `application.yaml` must explicitly include `prometheus`. The `(block_actuator)` snippet in the Caddyfile already blocks `/actuator/*` externally, so Prometheus's scrape happens over the Docker network. This is the correct architecture: Prometheus hits `backend:8080` internally, not through Caddy.
- No authentication on Prometheus. For a single-VPS deployment where Prometheus is bound to `127.0.0.1`, unauthenticated access is acceptable (only local processes can reach it). If the port is ever opened externally (e.g., via a Caddy vhost), it would need auth. This is a follow-up concern, not a blocker, but worth noting in the ADR.
- node-exporter's `pid: host` mode gives the container access to the host's process namespace. This is standard and required for accurate CPU/memory metrics, but should carry a comment: `# pid: host — required for process-level CPU/memory metrics; cgroup isolation applies`.
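A minimal sketch of the socket-only mount, assuming the root-filesystem bind-mount named above stays as-is; other cAdvisor mounts are omitted.

```yaml
# Sketch only: cAdvisor volumes with the Docker socket mounted read-only
# instead of the whole /var/run directory.
cadvisor:
  volumes:
    - /:/rootfs:ro                                  # host filesystem, read-only (from the issue spec)
    - /var/run/docker.sock:/var/run/docker.sock:ro  # socket only, read-only; replaces /var/run:/var/run:rw
```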
Recommendations

- Replace `/var/run:/var/run:rw` with `/var/run/docker.sock:/var/run/docker.sock:ro` in the cAdvisor definition. Read-only on the socket is sufficient; cAdvisor only reads container metadata.
- Bind Prometheus to `127.0.0.1` in production to prevent network-level access even if firewall rules are misconfigured.
- Pin the node-exporter and cAdvisor images; `latest` tags are a vulnerability waiting to happen.
- Comment the `privileged: true` on cAdvisor so future reviewers understand it's an accepted risk, not an oversight.
- Don't add `--web.enable-lifecycle` to the production Compose config unless there's a specific operational need. Keep it dev-only if it's needed at all.

🧪 Sara Holt — QA Engineer & Test Strategist
Observations
- The acceptance criteria are well-formed and specific. The `curl` commands with `jq` are executable, unambiguous, and environment-independent. This is the right level of precision for infrastructure acceptance criteria.
- AC #4 and #5 require Prometheus to have scraped at least one interval. After `docker compose up`, Prometheus needs ~15 seconds before any metric data exists. The `curl` queries for `node_cpu_seconds_total` and `container_cpu_usage_seconds_total` will return empty datasets if run immediately. Add a brief timing note to the AC: "Wait at least 30 seconds after all containers start before running metric queries."
- AC #3 (`spring-boot` target listed as DOWN) is subtly tricky to verify. `curl | jq '.data.activeTargets[].labels.job'` lists only active targets. A DOWN target is still active (Prometheus is scraping it, it's just failing). Confirm this `jq` filter returns `"spring-boot"` even when the target is DOWN; it should, since `activeTargets` includes DOWN targets by design.
- No rollback or teardown test. What does `docker compose -f docker-compose.observability.yml down` produce? Does the `prometheus_data` volume persist correctly? Does re-starting after a down retain previous data? This isn't a blocker, but it's worth a manual test pass during implementation.
- The Definition of Done says "All acceptance criteria checked" but doesn't specify who checks them. For an infrastructure issue with `curl`-based ACs, a CI smoke test would be the right home. However, running the full observability stack in CI is expensive. Manual verification against the local dev stack before merge is the pragmatic choice; make this explicit in the DoD.
- The `ocr-service` scrape target has no acceptance criterion. The issue includes it in `prometheus.yml` but none of the ACs verify it. Either add an AC ("ocr-service target shows as UNKNOWN or DOWN — not a config error") or remove the job from the config for now. Dead targets with no verification path are noise in the acceptance checklist.
- Missing: verification that `obs-net` is actually isolated. The ACs should include: "Confirm node-exporter and cAdvisor are NOT reachable from a container on `archiv-net` that is not attached to `obs-net`." A one-liner: `docker exec archive-backend curl -f http://obs-node-exporter:9100/metrics 2>&1 | grep -q "Connection refused"`.
Recommendations

- Add an AC for the `ocr-service` target status or remove the job from the initial `prometheus.yml` to keep the acceptance checklist complete and clean.
- Add an isolation check confirming node-exporter and cAdvisor are not reachable from `archiv-net`-only containers.

👩💻 Felix Brandt — Senior Fullstack Developer
Observations
This issue is infrastructure-only — no application code changes, no new backend endpoints, no frontend work. From my angle, there are two things worth noting.
1. The `/actuator/prometheus` scrape target requires backend code changes not in scope here.

The `spring-boot` job in `prometheus.yml` points at `backend:8080/actuator/prometheus`. I checked `pom.xml`: there is no `micrometer-registry-prometheus` dependency. I checked `application.yaml`: there is no `management.endpoints.web.exposure.include` entry. Without both additions, the `/actuator/prometheus` endpoint returns 404 and the target stays DOWN, which the issue acknowledges. That's fine. But the next issue (backend instrumentation) will need both changes (see the sketch below).
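A hedged sketch of the `application.yaml` half of that follow-up change (the exact property layout in the project's config may differ); the other half is the `io.micrometer:micrometer-registry-prometheus` dependency in `pom.xml`.

```yaml
# Sketch for the follow-up backend-instrumentation issue, not this one.
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # 'health' is the Spring Boot default; 'prometheus' must be added explicitly
```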
The Caddyfile already blocks `/actuator/*` externally; Prometheus will scrape over the Docker network, which is correct. Just noting this for whoever picks up the instrumentation issue.

2. The container name `archive-backend` in the issue's Prometheus config is wrong.

In both compose files, the Docker service name is `backend`. The hostname Docker assigns for inter-container DNS is the service name, not the container name. `archive-backend` is the `container_name` value in the dev compose; that alias also resolves, but it's fragile (container names can be overridden; service names cannot). Use the service name: `targets: ['backend:8080']`. Same principle in production.

Recommendations
- Change the scrape target in `prometheus.yml` to `backend:8080` (not `archive-backend:8080`).

No concerns about code quality or patterns; this issue adds YAML config files, not application code.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Observations
This is a pure infrastructure issue with no frontend or UI changes. No UX concerns apply.
What I did check: the Prometheus UI (accessible at `localhost:9090` per the ACs) is a third-party interface that we do not own or style. No brand compliance, accessibility, or responsive design review applies to it.

The one forward-looking UX note: when the Grafana dashboards are built (the milestone is "Observability Stack — Grafana LGTM + GlitchTip"), the dashboard screens shown to users will need UX review, especially the node/container metrics views if they're surfaced to a non-technical audience. Tag me when those Grafana dashboard issues land.
No concerns or recommendations for this infrastructure issue.
📋 Elicit — Requirements Engineer
Observations
The issue is well-structured for a DevOps-class ticket. Context, services, config, and ACs are all present. A few precision gaps are worth closing before implementation.
- The "depends on scaffold issue" is an unresolved blocker. The issue states "scaffold issue (compose file and `infra/observability/` directory must exist first)" but does not link to the scaffold issue by number. If the scaffold issue doesn't exist yet, this issue can't start. If it does exist, link it with a "Blocked by #NNN" reference so the dependency is machine-readable in Gitea.
- The `docker-compose.observability.yml` invocation is ambiguous. The ACs use `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor`. But the compose file attaches to `archiv-net`, which is defined in `docker-compose.yml`. Starting `docker-compose.observability.yml` in isolation will fail to find `archiv-net` unless the main stack is already running. The issue should state: "Run after `docker compose up -d` (main stack) is already running." This is an implicit precondition that should be explicit in the ACs.
- The `ocr-service` scrape job has no acceptance criterion and no verification path. Either: (a) add an AC for it, or (b) explicitly mark it as out-of-scope for this issue with a TODO comment in `prometheus.yml`. Leaving it in without either creates a gap between the spec and the DoD.
- Retention policy of 30 days is stated but not justified. For a family archive project, 30 days might be too short for seasonal patterns or long-term capacity planning. This doesn't need to block the issue, but it should be documented as a decision (in the ADR that Markus recommends) so it can be revisited.
- The `PORT_PROMETHEUS` env var is introduced without updating `.env.example`. If there's an `.env.example` file in the repo (common for this stack pattern), it needs the new variable documented. Otherwise the next developer will not know to set it.
- The `$$` escaping in the `ignored-mount-points` command (`($$|/)`) is correct YAML Compose escaping for a `$` literal. Worth a comment so reviewers don't "fix" it (see the sketch after this list).
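A hedged sketch of that command entry with the comment in place. The flag name follows the issue text; the exclusion-regex prefix is a common default and an assumption, only the `($$|/)` tail is quoted from the issue.

```yaml
node-exporter:
  command:
    # $$ is Compose escaping for a literal $ in the regex; do not "fix" it to a single $
    - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
```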
Recommendations

- Make the precondition explicit in the ACs: "Run after the main stack (`docker compose up -d`) is already running."
- Resolve the `ocr-service` job ambiguity: add an AC or mark it as out-of-scope.
- Update `.env.example` with `PORT_PROMETHEUS=9090` if that file exists.
- Comment the `$$` escaping in the node-exporter command for future reviewers.

Open Decisions
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Architecture
1. Standalone vs. overlay compose pattern for observability. The issue treats `docker-compose.observability.yml` as a standalone file started after the main stack. The alternative is an overlay (`-f docker-compose.yml -f docker-compose.observability.yml`), which enables `depends_on` relationships but lengthens the start command. Standalone is simpler and fits the "opt-in observability" model; overlay integrates more tightly. The current spec already implies standalone: confirm this is intentional, then document the precondition ("main stack must be running first") explicitly in the ACs. (Raised by: Tobias, Elicit)

2. Is the `archiv-net` attachment on cAdvisor actually needed? cAdvisor discovers containers via the Docker socket (`/var/run/docker.sock`), not via network membership. If cAdvisor doesn't need to make direct HTTP calls to application containers, the `archiv-net` attachment is unnecessary and violates least-privilege network topology. Remove it unless there's a documented reason to keep it. (Raised by: Markus)

Observability Scope
Implementation complete on branch `feat/issue-573-prometheus-metrics`

What was implemented

Commit: `0c9973fd` ("devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics")

Files changed:

- `docker-compose.observability.yml`: three new services added
- `infra/observability/prometheus/prometheus.yml`: new Prometheus scrape config

Reviewer feedback addressed
Tobias (DevOps):
- Images pinned: `prom/node-exporter:v1.9.0` and `gcr.io/cadvisor/cadvisor:v0.52.1` (not `latest`)
- Prometheus healthcheck added (`/-/healthy` via wget, 30s/5s/3 retries)
- `prometheus_data` volume is declared in the top-level `volumes:` block (already present from scaffold issue #572)
- `$$` escaping in the node-exporter command explained with an inline comment

Nora (Security):
- Replaced `/var/run:/var/run:rw` with `/var/run/docker.sock:/var/run/docker.sock:ro`: a read-only socket mount, not the full `/var/run` directory
- `privileged: true` on cAdvisor carries an explaining comment noting it is an accepted risk with pinned image + Renovate
- `pid: host` on node-exporter carries an explaining comment

Markus (Architect):
- cAdvisor is attached to `obs-net` only; the `archiv-net` attachment has been removed. cAdvisor discovers containers via the Docker socket, not via network membership, so `archiv-net` was unnecessary. Least-privilege network topology applied.

Felix / Tobias:
- Scrape target uses `backend:8080` (the Docker service name, not the `archive-backend:8080` container name) for reliable DNS resolution

Elicit:
- `ocr-service` job retained with a `# TODO: remove or add prometheus-client to ocr-service` inline comment to make the speculative status explicit

Verification
Next steps
- Merge the branch into `main`
- Backend instrumentation issue (add `micrometer-registry-prometheus` + expose `/actuator/prometheus`); the spring-boot and ocr-service targets will show DOWN until those are done, which is expected per spec