devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics #585
Summary
Closes #573
- Adds `prometheus` (`prom/prometheus:v3.4.0`), `node-exporter` (`prom/node-exporter:v1.9.0`), and `cadvisor` (`gcr.io/cadvisor/cadvisor:v0.52.1`) services to `docker-compose.observability.yml` — all image tags pinned
- `infra/observability/prometheus/prometheus.yml` with 4 scrape jobs: node, cadvisor, spring-boot, ocr-service
- Docker socket mounted read-only (`/var/run/docker.sock:ro`); cAdvisor limited to `obs-net` (uses the socket for discovery, not network membership)

Test plan

- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d prometheus node-exporter cadvisor` starts all three containers
- Prometheus UI reachable at `http://localhost:9090`

🤖 Generated with Claude Code
🏗️ Markus Keller (@mkeller) — Application Architect
Verdict: ⚠️ Approved with concerns
Good first step for the observability stack. The service topology is sensible and the two-compose-file pattern (main stack + observability overlay joining `archiv-net`) is the right approach. However, the required documentation update for a new Docker service is missing, and there is one technical discrepancy to resolve.

Blockers
1. Wrong scrape port for `spring-boot` job (`infra/observability/prometheus/prometheus.yml`)

`docs/DEPLOYMENT.md` §4 states that the backend serves Prometheus metrics on the management port. The prometheus.yml targets `backend:8080`, not `backend:8081`. If the backend exposes Prometheus metrics on the management port (8081) — which is the standard Spring Boot actuator pattern — this target will never return data even after the micrometer issue ships. Either correct the target or update DEPLOYMENT.md if the intent is to expose metrics on the main port. One of the two must be the source of truth.
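For concreteness, a sketch of the corrected job — `metrics_path` assumes the standard actuator endpoint named in DEPLOYMENT.md; labels and intervals from the PR's actual file are omitted:

```yaml
scrape_configs:
  - job_name: spring-boot
    # Spring Boot exposes Prometheus metrics via the actuator on the
    # management port (8081), not the main application port (8080).
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['backend:8081']
```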
2. `docs/architecture/c4/l2-containers.puml` already updated — but `docs/DEPLOYMENT.md` needs a topology entry

Per my review rule: "New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`"

The C4 diagram is already updated (Prometheus, Loki, Grafana appear in the observability boundary). However, `docs/DEPLOYMENT.md` §1 (Deployment topology) describes only the main stack and refers to the observability stack in §4 without any topology diagram entry for the new Prometheus/node-exporter/cAdvisor services. The Mermaid diagram in §1 and the "Key facts" bullet list should mention that the observability stack is optional and how to start it. This is a documentation contract — DEPLOYMENT.md declares "Update this file in any PR that changes the container topology" — so this is a blocker per the project's own definition.
Suggestions
S1. `prometheus_data` volume declared; `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data` pre-declared in the volumes block

These volumes for future services are fine to pre-declare. Just note that they consume namespace in Docker's volume list even when the services don't exist yet. Consider a comment explaining they are reserved for upcoming issues (#574, #575, etc.).
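A sketch of the suggested comment (volume names are those listed in the review; issue numbers as cited above):

```yaml
volumes:
  prometheus_data:
  # Reserved for upcoming observability services — declared early so the
  # names are stable, but unused until the corresponding issues ship
  # (#574, #575, etc.).
  loki_data:
  tempo_data:
  grafana_data:
  glitchtip_data:
```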
S2. No `depends_on` for node-exporter or cAdvisor on Prometheus

Prometheus will start and immediately begin scraping before node-exporter and cAdvisor are fully up. For a small local stack this is harmless (Prometheus retries), but adding `depends_on` with `service_started` would make the startup ordering explicit and self-documenting.
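A minimal sketch of that explicit ordering (service names as used in this PR's compose file):

```yaml
services:
  prometheus:
    depends_on:
      # service_started only orders startup; it does not wait for health.
      node-exporter:
        condition: service_started
      cadvisor:
        condition: service_started
```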
S3. `--web.enable-lifecycle` opens the hot-reload API without auth

The `/-/reload` endpoint allows unauthenticated config reloads. Since Prometheus is on `archiv-net` + `obs-net`, any container on either network can trigger a reload. This is a low-severity risk for a single-tenant family project, but worth documenting the accepted risk in a comment (similar to the existing cAdvisor `privileged: true` comment).

Overall the architecture is coherent. Fix the port mismatch and the DEPLOYMENT.md gap and this is ready to merge.
👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
This is a pure infrastructure PR — no application code, no frontend, no backend Java or Python changes. My TDD and clean code lens doesn't have much traction here, but I did check the config file quality and naming.
What I checked
- Service names (`obs-prometheus`, `obs-node-exporter`, `obs-cadvisor`) are semantic and follow the existing `obs-*` container naming convention. Volume name `prometheus_data` is consistent with `postgres_data`, `grafana_data` etc. in the codebase.
- The `$$` escaping comment, the `pid: host` comment, the `privileged: true` comment, and the read-only socket comment all explain intent rather than restating what the YAML key does. This is correct.
- Removing the dead `services: {}` stub was clearly the right call — dead stubs are gone, future services are documented as comments with issue references.
- The `$$` Compose escaping for `$` in the node-exporter regex is correctly applied and explained.
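For readers less familiar with the convention: Compose interpolates `${VAR}`, so a literal `$` in a command value must be written `$$`. An illustrative sketch (not the PR's exact regex):

```yaml
services:
  node-exporter:
    command:
      # Compose turns $$ into a single literal $ before the flag reaches
      # node-exporter, so the exporter sees the regex anchor ($|/).
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|run)($$|/)'
```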
Minor note

The `job_name: spring-boot` and `job_name: ocr-service` targets are pre-wired and acknowledged as DOWN in comments. The comments are informative. I'd only add: when the backend instrumentation issue ships, remember to update the target port (Markus flagged the 8080 vs 8081 discrepancy — that's the only code-level thing I'd double-check before merging).

No blockers from my side.
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Solid foundation. Image tags are pinned, named volumes are used, the Docker socket is mounted read-only, and the `$$` YAML escaping is handled correctly. A few things need attention before this goes to production.

Blockers
1. `backend:8080` scrape target conflicts with documented management port

`docs/DEPLOYMENT.md` is explicit: Prometheus scrapes the management port 8081 at `/actuator/prometheus`. The Spring Boot application is configured to serve metrics there, not on the main port 8080. This target will never successfully scrape even after the backend is instrumented.

2. Prometheus port 9090 exposed on all interfaces
On a production VPS this binds to `0.0.0.0:9090`, making the Prometheus UI (and its `/api/v1/query` endpoint with full metric data) reachable from the internet if the host firewall allows it. The Grafana, Loki, Tempo services in the canonical stack (`docs/infrastructure/`) scrape Prometheus internally via `obs-net` — they don't need a published port.

Fix: bind to loopback for external access control, or use `expose` instead of `ports` if the UI is only needed locally via SSH tunnel.
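Two sketches of the fix; the first matches what commit `0c66f629` later applied:

```yaml
services:
  prometheus:
    # Option A: publish on loopback only (reach the UI via SSH tunnel)
    ports:
      - "127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"
    # Option B: no published port at all; containers on obs-net still
    # reach prometheus:9090 internally
    # expose:
    #   - "9090"
```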
Suggestions

S1. Missing healthchecks on node-exporter and cAdvisor
Prometheus has a healthcheck — good. node-exporter and cAdvisor do not. Without them, `depends_on: condition: service_healthy` cannot be used, and Docker has no visibility into whether these services are actually serving. A minimal check:
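A hedged sketch of what such checks could look like. Caveat: both images are minimal and may not ship a shell or `wget`; verify the images' contents before adopting, or run the check externally.

```yaml
services:
  node-exporter:
    healthcheck:
      # Assumes a shell + wget exist in the image; verify first.
      test: ["CMD-SHELL", "wget -qO- http://localhost:9100/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
  cadvisor:
    healthcheck:
      # cAdvisor serves a /healthz endpoint on its web port.
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080/healthz > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
```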
S2. `--collector.filesystem.ignored-mount-points` is deprecated

In node-exporter v1.x the flag was renamed to `--collector.filesystem.mount-points-exclude`. Using the deprecated flag still works in v1.9.0 but will be removed in a future version. Renovate will bump node-exporter and this will break silently. Prefer the current flag name now:
S3. `PORT_PROMETHEUS` env var not documented in `.env.example` or DEPLOYMENT.md

The compose file uses `${PORT_PROMETHEUS:-9090}` with a sensible default, but the project convention (per DEPLOYMENT.md §2) is that every variable in a compose file must appear in the env var table. A variable missing from the table is a blocker per the doc's own rule ("Any var found in docker-compose.yml that is not in this table is a blocking review comment").
S4. `--web.enable-lifecycle` opens unauthenticated config reload

The `/-/reload` POST endpoint has no authentication. Any container on `archiv-net` or `obs-net` can hot-reload the Prometheus config. Accept the risk explicitly with a comment, or restrict the lifecycle API to the management interface once Prometheus 3.x exposes that option.
S5. No `depends_on` on Prometheus for its scrapers

Not strictly necessary (Prometheus retries failed scrapes), but adding `depends_on` with `condition: service_started` for node-exporter and cAdvisor makes startup ordering explicit.

Overall: Fix the port binding (#2) and the backend scrape port (#1) before merging to production. The rest are improvements.
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The PR demonstrates good security awareness — read-only mounts, explicit privilege documentation, network isolation via `obs-net`. However, there are two issues worth flagging: one medium-severity (Prometheus UI exposed) and one informational (privileged container threat model).

Blockers
None that block merging for dev/local use.
Medium Severity — Production Risk
PROMETHEUS UI EXPOSED ON ALL INTERFACES (CWE-284: Improper Access Control)
What this exposes: The Prometheus HTTP API at `http://<VPS_IP>:9090/api/v1/query` allows querying all collected metrics — container names, process names, CPU/memory patterns, filesystem layout, network interface names. This is low-severity fingerprinting data that an attacker could use to map the infrastructure before a more targeted attack.

More critically, the `/-/reload` endpoint (enabled via `--web.enable-lifecycle`) accepts unauthenticated POST requests. Any actor who can reach port 9090 can reload the Prometheus configuration. While you cannot inject arbitrary YAML through the reload endpoint (the config file is bind-mounted read-only), the attack surface is larger than necessary.

Fix: Bind to loopback. Access the UI via SSH tunnel when needed:
This mirrors the project's own pattern described in DEPLOYMENT.md: "The host binds all docker-published ports to `127.0.0.1` only."

Informational — Accepted Risk Documentation
CADVISOR PRIVILEGED MODE (CWE-250: Execution with Unnecessary Privileges)
The accepted-risk comment in the compose file states the correct rationale. I'd add one more dimension to it: `privileged: true` combined with the Docker socket mount means a compromised cAdvisor container could escape to the host. The existing mitigations (pinned image, Renovate, no external exposure) are reasonable for a single-tenant family project. No action required, but document the full threat model.

Informational — `--web.enable-lifecycle`

The hot-reload endpoint (`POST /-/reload`) has no authentication in Prometheus 3.x. Combined with the `archiv-net` membership (which also includes backend, frontend, OCR service), any container on the network can trigger a config reload. For the current single-file config this has minimal impact. When alerting rules are added (issue #581), a malicious reload that clears the rules file would suppress alerts silently. Worth adding a comment that this risk is accepted.

What's done well
- cAdvisor is limited to `obs-net` — it does not need `archiv-net` membership, which is correctly absent
- Host mounts are `:ro` — /proc, /sys, /rootfs, /var/lib/docker

🧪 Sara Holt (@saraholt) — QA Engineer
Verdict: ✅ Approved
This is infrastructure configuration, not application code, so there's no test suite to review. I'm evaluating testability and verifiability of the configuration itself.
What I checked
Test plan in the PR description: The three-item checklist is correct and complete for validating the compose config:
- `docker compose config` — validates YAML structure
- `docker compose up -d prometheus node-exporter cadvisor` — validates container startup
- `http://localhost:9090` — validates Prometheus UI accessibility

These are the right smoke tests for an infrastructure-only PR.
Observability gap — no alerting rules yet: The scrape config collects metrics, but issue #581 tracks alert rules and runbooks. Without alerting, the observability stack is read-only (dashboards only). This is expected for this issue's scope. When #581 ships, each alert should have a documented runbook per Tobias's standards.
DOWN targets are expected and documented: The `spring-boot` and `ocr-service` scrape jobs are pre-wired but will show as DOWN until backend instrumentation ships. The comments in `prometheus.yml` correctly call this out. When the backend instrumentation PR lands, there should be a verification step that these targets move to UP.

Retention is set: `--storage.tsdb.retention.time=30d` is explicit. For a CX32 with ~8GB RAM and the current metric volume, 30d is reasonable. No concern.

Minor observation
The `prometheus_data` named volume will persist across `docker compose down`. If someone runs `docker compose down -v` (which removes volumes), all historical metrics are lost. This is expected behavior for Docker volumes — just worth noting for incident runbooks when #581 is written.

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing against issue #573's implied requirements: add Prometheus + node-exporter + cAdvisor with pinned tags, scrape config, and security baseline. The PR delivers on all stated requirements.
Requirements coverage
- Pinned image tags: `prom/prometheus:v3.4.0`, `prom/node-exporter:v1.9.0`, `gcr.io/cadvisor/cadvisor:v0.52.1`
- Docker socket mounted read-only: `:ro` flag present
- Prometheus healthcheck against `/-/healthy`
- Metric retention configured: `--storage.tsdb.retention.time=30d`
NFR gap — the `PORT_PROMETHEUS` environment variable is undocumented

The project standard (DEPLOYMENT.md §2) states every env var in a compose file must appear in the env var table. `PORT_PROMETHEUS` is new and is not in the table. This is a small gap, but the project explicitly calls it a "blocking review comment." Add a row to the DEPLOYMENT.md env var table.

Forward-looking observations (not blocking)
- The `spring-boot` and `ocr-service` scrape targets create an implicit dependency: when #581 ships, the scrape port for spring-boot must resolve the 8080 vs 8081 ambiguity.
- Grafana should reach Prometheus at `http://prometheus:9090` (internal, via obs-net), not the published host port. This should be a requirement on the Grafana issue.

🎨 Leonie Voss (@leonievoss) — UI/UX Designer
Verdict: ✅ Approved
This PR contains no frontend code, no Svelte components, no UI routes, and no user-facing changes. There is nothing for me to review from a UX or accessibility perspective.
The Prometheus UI at `http://localhost:9090` is a developer/operator tool — not part of the Familienarchiv user-facing product. When Grafana ships (tracked in the future Grafana issue), I'll review the dashboard accessibility and embedding approach if it's surfaced inside the application UI.

LGTM from a UX standpoint.
All 4 review blockers fixed in commit `0c66f629`:

1. Wrong scrape port (`infra/observability/prometheus/prometheus.yml`): changed spring-boot target from `backend:8080` to `backend:8081` (the actuator/management port).
2. Prometheus port binding too broad (`docker-compose.observability.yml`): changed `"${PORT_PROMETHEUS:-9090}:9090"` to `"127.0.0.1:${PORT_PROMETHEUS:-9090}:9090"` — the Prometheus UI is now localhost-only, consistent with the rest of the stack.
3. Topology section updated (`docs/DEPLOYMENT.md`): added a bullet under §1 noting the optional observability stack (Prometheus, Node Exporter, cAdvisor) runs via `docker-compose.observability.yml`, joins `archiv-net`, and scrapes the backend management port `:8081`.
4. PORT_PROMETHEUS added to env var table (`docs/DEPLOYMENT.md`): new "Observability stack" subsection in §2 with a row for `PORT_PROMETHEUS` (default `9090`, bound to `127.0.0.1` only).