devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics #573
Context
This issue adds the three metric-collection services to `docker-compose.observability.yml` and their Prometheus scrape configuration. After this issue, Prometheus scrapes host metrics (Node Exporter) and container metrics (cAdvisor) on the local stack.

Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Services to Add

- prometheus
- node-exporter
- cadvisor

Also declare an internal `obs-net` bridge network for observability-only traffic that does not need to reach the application stack.

Config File to Create

`infra/observability/prometheus/prometheus.yml`:
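A minimal sketch of what that scrape config might look like. The job names, ports, and the actuator path follow this issue and the review below; the target hostnames and interval are assumptions that are settled later in the thread.

```yaml
# Sketch only — illustrative shape of infra/observability/prometheus/prometheus.yml.
# Target hostnames are assumptions; reviewers below recommend the service name
# backend:8080 for the spring-boot job instead of the container name archive-backend:8080.
global:
  scrape_interval: 15s            # 15s interval referenced in the review discussion

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']   # assumed service name
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']        # assumed service name
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['archive-backend:8080'] # as written in the issue spec; see review below
  # ocr-service: only if the Python service exposes Prometheus metrics; skip if not
```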
Acceptance Criteria

1. `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor` starts all three containers without error
2. The Prometheus UI is reachable at `http://localhost:9090`
3. `curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job'` lists `node`, `cadvisor`, and `spring-boot`
4. `curl -s http://localhost:9090/api/v1/query?query=node_cpu_seconds_total` returns data
5. `curl -s http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total` returns data
6. The `spring-boot` target shows as `DOWN` (expected until backend instrumentation), not as a config error

Definition of Done

- All acceptance criteria checked
🔧 Tobias Wendt — DevOps & Platform Engineer
Observations
- Image pinning is inconsistent. `prometheus:v3.4.0` is pinned, which is good; `node-exporter:latest` and `cadvisor:latest` are not. The explicit versioning philosophy from the existing stack (`postgres:16-alpine`, `mailpit:v1.29.7`) applies here too. Pick the current stable tags for both: `prom/node-exporter:v1.9.0` and `gcr.io/cadvisor/cadvisor:v0.52.1` (or whatever is latest at the time of implementation).
- The `obs-net` network architecture is sound. Splitting observability traffic onto a separate bridge network is the right call. It keeps Prometheus's scrape requests off `archiv-net` and makes the dependency topology visible in the Compose file.
- cAdvisor needs `archiv-net` to see application containers; that is correct, but the issue should state why explicitly in a comment so future reviewers don't remove it thinking it's unnecessary.
- No `healthcheck` on Prometheus. The existing stack has healthchecks on every service. Prometheus exposes `/-/healthy`. Add it (a sketch follows this list).
- `node-exporter` and cAdvisor use `expose:` (not `ports:`), which is correct. These should never be reachable from outside the obs-net. The issue spec correctly keeps them internal.
- The `prometheus_data` named volume is correct, but it needs to be declared in the top-level `volumes:` block of the compose file (the issue YAML snippet shows it referenced as `prometheus_data:/prometheus` without the declaration block).
- `PORT_PROMETHEUS` env var defaults to 9090. In production, Prometheus should NOT be exposed to the host at all; scraping happens on the internal network. The Caddy Caddyfile already blocks actuator externally; Prometheus should be handled similarly. Bind the port to `127.0.0.1` if local access is needed: `"127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"` (see the sketch below).
- `--web.enable-lifecycle` is a mild risk. This flag allows POST `/-/reload` to hot-reload config without a restart. It's useful operationally but represents a management endpoint. If Prometheus is bound to `127.0.0.1` only (see above), this is fine. If exposed more broadly, it's a target.
- The `ocr-service` scrape job in prometheus.yml is speculative. The Python OCR service does not currently expose Prometheus metrics; there's no `prometheus-client` dependency in the service. The comment in the issue ("Only if the Python service exposes Prometheus metrics; skip if not") should be reflected in `prometheus.yml` with an inline comment, not left as dead config.
- The `spring-boot` scrape target will show as DOWN until the backend instrumentation issue lands. The issue acknowledges this. Confirm the service name in the Prometheus config: the Docker service is named `backend` (both in `docker-compose.yml` and `docker-compose.prod.yml`). The `archive-backend:8080` hostname shown in the issue spec is the container name from the dev compose, not the service DNS name. Inside the obs-net / archiv-net the service DNS is `backend`. Use `targets: ['backend:8080']`.
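A hedged sketch of the Prometheus service with the healthcheck and the loopback binding applied. The `prom/prometheus` image name and the wget-based check parameters are taken from the implementation notes later in this thread; the config mount path and command flags are assumptions and may differ from the actual compose file.

```yaml
# Sketch only — not the issue's compose file verbatim.
services:
  prometheus:
    image: prom/prometheus:v3.4.0            # assumed full image name for the pinned v3.4.0 tag
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'  # 30d retention discussed in this thread
    ports:
      - "127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"   # loopback-only binding
    volumes:
      - ./infra/observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    healthcheck:
      # /-/healthy liveness endpoint; wget-based check (30s/5s/3 retries) per the
      # implementation comment further down
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  prometheus_data:                            # must be declared at the top level
```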
Recommendations

- Add healthchecks to all three services (Prometheus: `/-/healthy`; node-exporter: `:9100/metrics` returns 200; cAdvisor: `:8080/healthz`).
- Change the scrape target from `archive-backend:8080` to `backend:8080`.
- Add a `# TODO: remove or add prometheus-client to ocr-service` comment to the `ocr-service` job in `prometheus.yml` rather than a conditional comment.
- Declare `prometheus_data:` in the top-level volumes block of the compose file.

Open Decisions
Where does `docker-compose.observability.yml` live in the compose invocation pattern? The existing pattern is `docker compose -f docker-compose.yml` (dev) or `-f docker-compose.prod.yml` (prod, standalone). Should observability be an overlay (`-f docker-compose.yml -f docker-compose.observability.yml`) or a standalone file started independently? Standalone is simpler but means no `depends_on` to the app services. Overlay integrates cleanly but extends the start command further. The current issue spec treats it as standalone; confirm this is intentional.

🏛️ Markus Keller — Application Architect
Observations
- Architecture doc update is required. Adding three new infrastructure components (Prometheus, node-exporter, cAdvisor) triggers a mandatory update to `docs/architecture/c4/l2-containers.puml`. Per the documentation table in CLAUDE.md: "New Docker service / infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`". Both must be updated in this PR.
- An ADR is warranted here. This issue introduces the first observability infrastructure for the project, a lasting architectural decision. A short ADR in `docs/adr/` should capture: why Prometheus + node-exporter + cAdvisor was chosen over alternatives (Netdata, Datadog agent, etc.), the obs-net isolation rationale, the retention policy choice (30d), and the "spring-boot target is intentionally DOWN" note. This is exactly the kind of "why the codebase is the way it is" context ADRs exist for.
- Service topology is well-scoped. Keeping Prometheus in its own compose file with a dedicated `obs-net` bridge is the right isolation strategy. It means the observability stack can be started or stopped without touching the application stack.
- The `archiv-net` dual-attachment for cAdvisor needs clarification. cAdvisor is attached to both `archiv-net` and `obs-net`. The stated reason is to see application containers, but cAdvisor discovers containers via the Docker socket (`/var/run/docker.sock`), not via network membership. Attaching to `archiv-net` is therefore not needed for container metrics; it would be needed only if cAdvisor had to make direct HTTP calls to application containers. Revisit whether this dual attachment is actually necessary.
- The `privileged: true` on cAdvisor is a known requirement. This is a documented cAdvisor requirement for full container metrics on Linux. It's acceptable for this use case, but it should carry a comment in the compose file explaining why it's privileged; reviewers will flag it otherwise.
- The `prometheus_data` volume retention of 30 days is reasonable for a single-VPS deployment. Given the ~23 EUR/month budget constraint, 30d of host+container metrics at 15s intervals will consume roughly 1–3 GB depending on the number of time series. This fits comfortably on the CX32.
- The issue depends on a scaffold issue ("compose file and `infra/observability/` directory must exist first"). This is appropriate sequencing. Confirm the scaffold issue is merged before this one starts, or make this issue's branch build on top of the scaffold branch.

Recommendations
- Write the ADR (e.g. `docs/adr/ADR-00X-observability-stack.md`) before implementing. Contents: decision, alternatives considered, consequences (VPS disk usage, privileged container), and the "intentionally DOWN scrape target" note.
- Update `docs/architecture/c4/l2-containers.puml` and `docs/DEPLOYMENT.md` in the same PR. These are mandatory per the documentation matrix.
- Drop `archiv-net` from cAdvisor unless there's a specific reason it needs network access to app containers (not just Docker socket access). Least-privilege network topology.
- Add a `# privileged: true — required for cgroup and namespace metrics, see cAdvisor docs` comment in the compose file (placement sketched below).
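For illustration, a sketch of where that comment would sit in the cAdvisor service definition (image tag taken from Tobias's recommendation; everything else omitted):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.52.1
  # privileged: true — required for cgroup and namespace metrics, see cAdvisor docs
  privileged: true
```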
Open Decisions

Should `infra/observability/` live inside the existing `infra/` directory (alongside `caddy/`, `gitea/`) or at the project root alongside the compose files? The `infra/` directory already holds service-specific config subdirectories. Putting `infra/observability/prometheus/prometheus.yml` there follows the established pattern. The issue spec already proposes this; just make sure the scaffold issue creates that directory structure first.

🔐 Nora "NullX" Steiner — Application Security Engineer
Observations
- cAdvisor runs `privileged: true` with root filesystem bind-mounts. This is the highest-risk element of this issue. A compromised cAdvisor container has read access to the entire host filesystem (`/:/rootfs:ro`) and write access to `/var/run` (Docker socket `rw`). The threat model: if cAdvisor has a known RCE vulnerability and the `obs-net` is reachable from an attacker-controlled vector, host compromise is the outcome. Mitigations to implement: keep `obs-net` an internal bridge with no external port exposure on cAdvisor, and pin the image by digest (`@sha256:...`) in production for the most sensitive containers.
- `/var/run:/var/run:rw` is overly broad. The intent is to give cAdvisor access to the Docker socket at `/var/run/docker.sock`. Mount the socket directly instead (a sketch follows this list). Mounting all of `/var/run` as `rw` exposes more than the socket and grants write access to other runtime files. CWE-732: Incorrect Permission Assignment.
- `--web.enable-lifecycle` on Prometheus. This flag enables unauthenticated POST to `/-/reload` and `/-/quit`. If Prometheus is accessible beyond localhost (e.g., if `PORT_PROMETHEUS` maps to `0.0.0.0`), any process on the host, or any container on a shared network, can trigger a config reload or shutdown. If it's only needed for config reloads during development, omit it from the production config or bind Prometheus to `127.0.0.1` only.
- `/actuator/prometheus` is not yet exposed by the backend. I checked: `pom.xml` has `spring-boot-starter-actuator` but no `io.micrometer:micrometer-registry-prometheus` dependency, and `application.yaml` has no `management.endpoints.web.exposure.include` setting, so only `health` is exposed by default. Adding `micrometer-registry-prometheus` is a separate issue, but note: when that dependency is added, the `management.endpoints.web.exposure.include` config in `application.yaml` must explicitly include `prometheus`. The `(block_actuator)` snippet in the Caddyfile already blocks `/actuator/*` externally, so Prometheus's scrape happens over the Docker network. This is the correct architecture: Prometheus hits `backend:8080` internally, not through Caddy.
- No authentication on Prometheus. For a single-VPS deployment where Prometheus is bound to `127.0.0.1`, unauthenticated access is acceptable (only local processes can reach it). If the port is ever opened externally (e.g., via a Caddy vhost), it would need auth. This is a follow-up concern, not a blocker, but worth noting in the ADR.
- node-exporter's `pid: host` mode gives the container access to the host's process namespace. This is standard and required for accurate CPU/memory metrics, but should carry a comment: `# pid: host — required for process-level CPU/memory metrics; cgroup isolation applies`.
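A minimal sketch of the socket-only mount, assuming the root-filesystem bind-mount named above stays as-is; other cAdvisor mounts are omitted.

```yaml
# Sketch only: cAdvisor volumes with the Docker socket mounted read-only
# instead of the whole /var/run directory.
cadvisor:
  volumes:
    - /:/rootfs:ro                                  # host filesystem, read-only (from the issue spec)
    - /var/run/docker.sock:/var/run/docker.sock:ro  # socket only, read-only; replaces /var/run:/var/run:rw
```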
Recommendations

- Replace `/var/run:/var/run:rw` with `/var/run/docker.sock:/var/run/docker.sock:ro` in the cAdvisor definition. Read-only on the socket is sufficient; cAdvisor only reads container metadata.
- Bind Prometheus to `127.0.0.1` in production to prevent network-level access even if firewall rules are misconfigured.
- Pin the node-exporter and cAdvisor images; `latest` tags are a vulnerability waiting to happen.
- Comment the `privileged: true` on cAdvisor so future reviewers understand it's an accepted risk, not an oversight.
- Don't add `--web.enable-lifecycle` to the production Compose config unless there's a specific operational need. Keep it dev-only if it's needed at all.

🧪 Sara Holt — QA Engineer & Test Strategist
Observations
- The acceptance criteria are well-formed and specific. The `curl` commands with `jq` are executable, unambiguous, and environment-independent. This is the right level of precision for infrastructure acceptance criteria.
- AC #4 and #5 require Prometheus to have scraped at least one interval. After `docker compose up`, Prometheus needs ~15 seconds before any metric data exists. The `curl` queries for `node_cpu_seconds_total` and `container_cpu_usage_seconds_total` will return empty datasets if run immediately. Add a brief timing note to the AC: "Wait at least 30 seconds after all containers start before running metric queries."
- AC #3 (`spring-boot` target listed as DOWN) is subtly tricky to verify. `curl | jq '.data.activeTargets[].labels.job'` lists only active targets. A DOWN target is still active (Prometheus is scraping it, it's just failing). Confirm this `jq` filter returns `"spring-boot"` even when the target is DOWN; it should, since `activeTargets` includes DOWN targets by design.
- No rollback or teardown test. What does `docker compose -f docker-compose.observability.yml down` produce? Does the `prometheus_data` volume persist correctly? Does re-starting after a down retain previous data? This isn't a blocker, but it's worth a manual test pass during implementation.
- The Definition of Done says "All acceptance criteria checked" but doesn't specify who checks them. For an infrastructure issue with `curl`-based ACs, a CI smoke test would be the right home. However, running the full observability stack in CI is expensive. Manual verification against the local dev stack before merge is the pragmatic choice; make this explicit in the DoD.
- The `ocr-service` scrape target has no acceptance criterion. The issue includes it in `prometheus.yml` but none of the ACs verify it. Either add an AC ("ocr-service target shows as UNKNOWN or DOWN — not a config error") or remove the job from the config for now. Dead targets with no verification path are noise in the acceptance checklist.
- Missing: verification that `obs-net` is actually isolated. The ACs should include: "Confirm node-exporter and cAdvisor are NOT reachable from a container on `archiv-net` that is not attached to `obs-net`." A one-liner: `docker exec archive-backend curl -f http://obs-node-exporter:9100/metrics 2>&1 | grep -q "Connection refused"`.
Recommendations

- Add an AC for the `ocr-service` target status or remove the job from the initial `prometheus.yml` to keep the acceptance checklist complete and clean.
- Add an isolation check confirming node-exporter and cAdvisor are not reachable from `archiv-net`-only containers.

👩💻 Felix Brandt — Senior Fullstack Developer
Observations
This issue is infrastructure-only — no application code changes, no new backend endpoints, no frontend work. From my angle, there are two things worth noting.
1. The `/actuator/prometheus` scrape target requires backend code changes not in scope here.

The `spring-boot` job in `prometheus.yml` points at `backend:8080/actuator/prometheus`. I checked `pom.xml`: there is no `micrometer-registry-prometheus` dependency. I checked `application.yaml`: there is no `management.endpoints.web.exposure.include` entry. Without both additions, the `/actuator/prometheus` endpoint returns 404 and the target stays DOWN, which the issue acknowledges. That's fine. But the next issue (backend instrumentation) will need both changes (see the sketch below).
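A hedged sketch of the `application.yaml` half of that follow-up change (the exact property layout in the project's config may differ); the other half is the `io.micrometer:micrometer-registry-prometheus` dependency in `pom.xml`.

```yaml
# Sketch for the follow-up backend-instrumentation issue, not this one.
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # 'health' is the Spring Boot default; 'prometheus' must be added explicitly
```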
The Caddyfile already blocks `/actuator/*` externally; Prometheus will scrape over the Docker network, which is correct. Just noting this for whoever picks up the instrumentation issue.

2. The container name `archive-backend` in the issue's Prometheus config is wrong.

In both compose files, the Docker service name is `backend`. The hostname Docker assigns for inter-container DNS is the service name, not the container name. `archive-backend` is the `container_name` value in the dev compose; that alias also resolves, but it's fragile (container names can be overridden; service names cannot). Use the service name: `targets: ['backend:8080']`. Same principle in production.

Recommendations
- Change the scrape target in `prometheus.yml` to `backend:8080` (not `archive-backend:8080`).

No concerns about code quality or patterns; this issue adds YAML config files, not application code.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Observations
This is a pure infrastructure issue with no frontend or UI changes. No UX concerns apply.
What I did check: the Prometheus UI (accessible at `localhost:9090` per the ACs) is a third-party interface that we do not own or style. No brand compliance, accessibility, or responsive design review applies to it.

The one forward-looking UX note: when the Grafana dashboards are built (the milestone is "Observability Stack — Grafana LGTM + GlitchTip"), the dashboard screens shown to users will need UX review, especially the node/container metrics views if they're surfaced to a non-technical audience. Tag me when those Grafana dashboard issues land.
No concerns or recommendations for this infrastructure issue.
📋 Elicit — Requirements Engineer
Observations
The issue is well-structured for a DevOps-class ticket. Context, services, config, and ACs are all present. A few precision gaps are worth closing before implementation.
- The "depends on scaffold issue" is an unresolved blocker. The issue states "scaffold issue (compose file and `infra/observability/` directory must exist first)" but does not link to the scaffold issue by number. If the scaffold issue doesn't exist yet, this issue can't start. If it does exist, link it with a "Blocked by #NNN" reference so the dependency is machine-readable in Gitea.
- The `docker-compose.observability.yml` invocation is ambiguous. The ACs use `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor`. But the compose file attaches to `archiv-net`, which is defined in `docker-compose.yml`. Starting `docker-compose.observability.yml` in isolation will fail to find `archiv-net` unless the main stack is already running. The issue should state: "Run after `docker compose up -d` (main stack) is already running." This is an implicit precondition that should be explicit in the ACs.
- The `ocr-service` scrape job has no acceptance criterion and no verification path. Either: (a) add an AC for it, or (b) explicitly mark it as out-of-scope for this issue with a TODO comment in `prometheus.yml`. Leaving it in without either creates a gap between the spec and the DoD.
- Retention policy of 30 days is stated but not justified. For a family archive project, 30 days might be too short for seasonal patterns or long-term capacity planning. This doesn't need to block the issue, but it should be documented as a decision (in the ADR that Markus recommends) so it can be revisited.
- The `PORT_PROMETHEUS` env var is introduced without updating `.env.example`. If there's an `.env.example` file in the repo (common for this stack pattern), it needs the new variable documented. Otherwise the next developer will not know to set it.
- The `$$` escaping in the `ignored-mount-points` command (`($$|/)`) is correct YAML Compose escaping for a `$` literal. Worth a comment so reviewers don't "fix" it (see the sketch after this list).
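A hedged sketch of that command entry with the comment in place. The flag name follows the issue text; the exclusion-regex prefix is a common default and an assumption, only the `($$|/)` tail is quoted from the issue.

```yaml
node-exporter:
  command:
    # $$ is Compose escaping for a literal $ in the regex; do not "fix" it to a single $
    - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
```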
Recommendations

- Make the precondition explicit in the ACs: "Run after the main stack (`docker compose up -d`) is already running."
- Resolve the `ocr-service` job ambiguity: add an AC or mark it as out-of-scope.
- Update `.env.example` with `PORT_PROMETHEUS=9090` if that file exists.
- Comment the `$$` escaping in the node-exporter command for future reviewers.

Open Decisions
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Architecture
1. Standalone vs. overlay compose pattern for observability. The issue treats `docker-compose.observability.yml` as a standalone file started after the main stack. The alternative is an overlay (`-f docker-compose.yml -f docker-compose.observability.yml`), which enables `depends_on` relationships but lengthens the start command. Standalone is simpler and fits the "opt-in observability" model; overlay integrates more tightly. The current spec already implies standalone: confirm this is intentional, then document the precondition ("main stack must be running first") explicitly in the ACs. (Raised by: Tobias, Elicit)

2. Is the `archiv-net` attachment on cAdvisor actually needed? cAdvisor discovers containers via the Docker socket (`/var/run/docker.sock`), not via network membership. If cAdvisor doesn't need to make direct HTTP calls to application containers, the `archiv-net` attachment is unnecessary and violates least-privilege network topology. Remove it unless there's a documented reason to keep it. (Raised by: Markus)

Observability Scope
Implementation complete on branch `feat/issue-573-prometheus-metrics`

What was implemented

Commit: `0c9973fd` ("devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics")

Files changed:

- `docker-compose.observability.yml`: three new services added
- `infra/observability/prometheus/prometheus.yml`: new Prometheus scrape config

Reviewer feedback addressed
Tobias (DevOps):
- Images pinned: `prom/node-exporter:v1.9.0` and `gcr.io/cadvisor/cadvisor:v0.52.1` (not `latest`)
- Prometheus healthcheck added (`/-/healthy` via wget, 30s/5s/3 retries)
- `prometheus_data` volume is declared in the top-level `volumes:` block (already present from scaffold issue #572)
- `$$` escaping in the node-exporter command explained with an inline comment

Nora (Security):
- Replaced `/var/run:/var/run:rw` with `/var/run/docker.sock:/var/run/docker.sock:ro`: a read-only socket mount, not the full `/var/run` directory
- `privileged: true` on cAdvisor carries an explaining comment noting it is an accepted risk with pinned image + Renovate
- `pid: host` on node-exporter carries an explaining comment

Markus (Architect):
- cAdvisor is attached to `obs-net` only; the `archiv-net` attachment has been removed. cAdvisor discovers containers via the Docker socket, not via network membership, so `archiv-net` was unnecessary. Least-privilege network topology applied.

Felix / Tobias:
- Scrape target uses `backend:8080` (the Docker service name, not the `archive-backend:8080` container name) for reliable DNS resolution

Elicit:
- `ocr-service` job retained with a `# TODO: remove or add prometheus-client to ocr-service` inline comment to make the speculative status explicit

Verification
Next steps
- Merge the branch into `main`
- Backend instrumentation issue (add `micrometer-registry-prometheus` + expose `/actuator/prometheus`); the spring-boot and ocr-service targets will show DOWN until those are done, which is expected per spec