devops(observability): add Tempo for distributed trace storage (OTLP receiver) #587

Merged
marcel merged 1 commit from feat/issue-575-tempo into main 2026-05-15 03:21:12 +02:00

Summary

Closes #575

  • Adds obs-tempo (grafana/tempo:2.7.2) to docker-compose.observability.yml with expose-only ports (3200/4317/4318) — not bound to the host, reachable only on the Docker network (see the sketch after this list)
  • Creates infra/observability/tempo/tempo.yml — OTLP dual-protocol (gRPC 4317 + HTTP 4318), 30-day block retention, metrics_generator with service-graphs + span-metrics
  • Healthcheck added (wget /ready, 15s start_period)
  • docs/DEPLOYMENT.md updated with Tempo service row
  • docs/architecture/c4/l2-containers.puml updated with Tempo container + Rel from backend
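
For context, a minimal sketch of the compose service this PR likely adds, reconstructed from the summary and the reviews below. Service, image, port, volume, and network names come from the PR text; the config mount path, healthcheck interval, and network declarations are assumptions.

```yaml
services:
  obs-tempo:
    image: grafana/tempo:2.7.2
    command: ["-config.file=/etc/tempo/tempo.yml"]   # mount path assumed
    volumes:
      - ./infra/observability/tempo/tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo_data:/var/tempo
    expose:            # Docker-network only; no host ports: binding
      - "3200"         # Tempo HTTP API (Grafana queries via obs-net)
      - "4317"         # OTLP gRPC (backend pushes via archiv-net)
      - "4318"         # OTLP HTTP
    networks:
      - archiv-net
      - obs-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3200/ready | grep -q ready || exit 1"]
      interval: 30s    # assumed
      start_period: 15s

volumes:
  tempo_data:

networks:
  archiv-net:
    external: true     # assumed: shared with the main compose stack
  obs-net:
```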

Test plan

  • docker compose -f docker-compose.observability.yml config exits 0
  • docker compose -f docker-compose.observability.yml up -d tempo starts without error
  • curl -s http://localhost:3200/ready returns ready

🤖 Generated with [Claude Code](https://claude.com/claude-code)

marcel added 1 commit 2026-05-15 03:02:30 +02:00
devops(observability): add Tempo for distributed trace storage (OTLP receiver)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m32s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 56s
de08ffe989
Closes #575

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

🏗️ Markus Keller (@mkeller) — Application Architect

Verdict: ✅ Approved

What I checked

This is a pure infrastructure addition with no application code changes. My checklist from the persona matrix: new Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md. Both are updated. Good.

Positives

  • Network isolation is correctly designed: Tempo sits on both archiv-net (receives OTLP from backend) and obs-net (queried by Grafana). This matches the established pattern used by Promtail.
  • Local filesystem storage with a named volume (tempo_data) is the right call for a single-VPS deployment. No need for S3 backend here.
  • The overrides.defaults.metrics_generator.processors block mirrors the top-level metrics_generator.processors — this is Tempo's required pattern for activating processors per-tenant, not duplication. Fine.
  • Comment explaining why the API is unauthenticated is exactly the kind of threat-model comment that makes audits tractable.

Suggestions (non-blocking)

  • The C4 L2 diagram now shows Rel(backend, tempo, ...) but there is no corresponding Rel(grafana, tempo, ...). Grafana → Tempo is the query path that closes the observability loop. Since Grafana is still a placeholder (see future issue), this can stay deferred — but consider adding a TODO comment in the puml so it's not forgotten (a sketch follows this list).
  • No ADR is required here: adding Tempo is an incremental step in the already-decided LGTM stack (Loki + Grafana + Tempo + Mimir). The pattern is established. An ADR would be warranted if the stack choice itself were being decided for the first time.
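
A possible shape for that deferred TODO in the puml file. The Rel arguments are hypothetical placeholders, not the actual diagram text:

```plantuml
' TODO (future Grafana issue): close the query path once Grafana lands, e.g.
' Rel(grafana, tempo, "Queries traces", "HTTP :3200 via obs-net")
Rel(backend, tempo, "Sends traces", "OTLP gRPC :4317")   ' existing rel; arguments illustrative
```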

Doc compliance check

| Required | Present? |
|---|---|
| docs/architecture/c4/l2-containers.puml — new container + rel | ✅ |
| docs/DEPLOYMENT.md — service row | ✅ |
| New external system (none) | N/A |
| DB migration (none) | N/A |

No blockers. Doc compliance is met.


👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: ✅ Approved

What I checked

No application code in this PR — purely infra YAML and docs. My review lens here is: does the config read clearly, are there dead/commented-out sections, and are names intent-revealing?

Positives

  • The inline comments on the expose ports are exactly right: # Grafana queries Tempo on this port (obs-net only) tells me what the port does and which network controls access. No "what" comments — only "why" comments.
  • The network comments (# backend (archive-backend) reaches tempo:4317 over this network) document the actual traffic flow, not just the YAML key's meaning.
  • The WAL + blocks split on a single named volume is explained inline, avoiding the "why two paths?" confusion.

Suggestions (non-blocking)

  • tempo.yml has metrics_generator.processors defined at the top level and again under overrides.defaults.metrics_generator.processors. This is not dead code — Tempo requires the override block for per-tenant activation. A one-line comment clarifying this (# Required: overrides block activates processors globally across all tenants) would prevent a future reader from deleting it as apparent duplication (sketched after this list).
  • The comment on the compose network entry says # backend (archive-backend) reaches tempo:4317 over this network — the container name is archive-backend (per the main compose) but the service is named backend. Consistent with how Promtail's comment is written, so not a bug, just worth noting.
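
A sketch of the shape described above, with the suggested comment in place. Keys and processor names follow the PR summary; treat it as illustrative, not the verbatim file:

```yaml
metrics_generator:
  processors: [service-graphs, span-metrics]   # per the PR summary

overrides:
  defaults:
    metrics_generator:
      # Required: overrides block activates processors globally across all
      # tenants. Not duplication: deleting this disables the generator.
      processors: [service-graphs, span-metrics]
```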

No blockers.


🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ✅ Approved

Solid PR. This follows every pattern already established in the observability stack. Here's my full checklist pass:

Checklist

| Check | Result |
|---|---|
| Pinned image tag | ✅ grafana/tempo:2.7.2 — pinned, Renovate will bump |
| Named volume for persistent data | ✅ tempo_data: — no bind mounts |
| No host-exposed ports | ✅ expose: only, not ports: — consistent with Loki |
| No hardcoded secrets | ✅ No credentials needed for this service |
| Healthcheck present | ✅ wget /ready, same pattern as Loki |
| restart: unless-stopped | ✅ |
| Network isolation correct | ✅ dual-network setup mirrors Promtail's pattern |

Positives

  • The healthcheck uses wget -qO- http://localhost:3200/ready | grep -q ready || exit 1 — exactly the same pattern as Loki's healthcheck. Consistent and reliable.
  • start_period: 15s is appropriate for Tempo. Loki uses 30s because it has more startup work; Tempo is lighter. Good calibration.
  • tempo_data volume is already declared in the volumes: section at the bottom. Also note grafana_data: and glitchtiip_data: are pre-declared there too — the compose file is thinking ahead cleanly.
  • Config file mounted :ro — correct, the container should not write back to its config.

One observation (non-blocking)

The metrics_generator WAL lives at /var/tempo/generator/wal — also on the tempo_data volume. This means metrics generator state, ingester WAL, and blocks all share one volume. For this deployment scale that's fine. If trace volume ever grows significantly, splitting the generator WAL to a separate volume would help with I/O isolation. Not an action item now — just something to know when sizing.
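
The layout described above, sketched against Tempo's local-storage config keys. Paths come from this review and the retention value from the PR summary; the actual file may differ:

```yaml
storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal            # ingester WAL
    local:
      path: /var/tempo/blocks         # completed blocks

metrics_generator:
  storage:
    path: /var/tempo/generator/wal    # generator state, same tempo_data volume

compactor:
  compaction:
    block_retention: 720h             # 30 days
```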

Cost note

tempo_data on the CX32 VPS (local disk). Trace retention is 30 days. At family-archive traffic volumes (not high), disk pressure from Tempo will be negligible. No cost concern.

No blockers. Ready to merge from an infra standpoint.


🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

Pure infrastructure addition. No application code paths, no new authentication surfaces exposed to end users. Here's my security-focused pass:

Network isolation — LGTM

  • Port 3200 (Tempo HTTP API) is expose-only, reachable only via obs-net. The only service currently on obs-net that would query it is Grafana (not yet deployed). No host binding. No internet exposure.
  • Ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP) are expose-only, reachable only via archiv-net. The backend can push traces; nothing else on obs-net can reach the OTLP receivers.
  • The comment in tempo.yml explicitly acknowledges the unauthenticated API surface and explains why network isolation is the compensating control. This is correct reasoning for a single-operator self-hosted deployment.

No authentication on Tempo API — accepted, documented

Tempo's HTTP API (port 3200) is unauthenticated by design for single-tenant deployments. The threat model note in the config file is the right move: it records the conscious decision so a future reviewer doesn't wonder "why is this open?" The control is Docker network isolation — no path from the internet or the host to port 3200. Acceptable for this deployment context.

OTLP receivers on 0.0.0.0 — acceptable given expose-only

distributor.receivers.otlp.protocols.grpc.endpoint: 0.0.0.0:4317 means Tempo listens on all interfaces inside the container. This is normal and correct — it's how container networking works. Because the port is expose (not ports), Docker's host-level firewall rules prevent it from being reachable from the host or internet.
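
The receiver block in question, sketched from the endpoint quoted above (the HTTP counterpart is assumed symmetric):

```yaml
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317   # all interfaces *inside the container* only
        http:
          endpoint: 0.0.0.0:4318   # assumed symmetric
```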

metrics_generator — no secrets, no external calls

The metrics generator writes to a local WAL path. No remote write target configured here. Span-metrics and service-graphs are generated internally. No external telemetry leakage.

Supply chain

grafana/tempo:2.7.2 is a pinned, specific release from a reputable vendor. Not :latest. Renovate will handle future updates.

No blockers.


🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

No production code changed, so no unit/integration/E2E test regression risk. My review here is focused on: is the infrastructure itself verifiable, and are there testing gaps to track?

What's present

The PR's test plan is manual but concrete:

```
- docker compose ... config   exits 0
- docker compose ... up tempo   starts without error
- curl -s http://localhost:3200/ready   returns "ready"
```

These are the right smoke tests for a new infra service. The healthcheck in the compose file (wget /ready) effectively automates the third check on every restart.

Concern (non-blocking but worth tracking)

No CI validation of the compose config. The test plan step docker compose -f docker-compose.observability.yml config is manual. Currently, none of the CI workflows (based on what I can see) validate that the observability compose file is syntactically valid. If someone edits tempo.yml in the future and introduces a YAML error, it won't be caught until deploy time.

Recommendation for a follow-up issue: add a CI step that runs docker compose -f docker-compose.observability.yml config as a lint gate. This is a one-liner job and would prevent broken observability configs from slipping through.
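
A sketch of what that lint gate could look like. Job and step names are hypothetical; the syntax is the GitHub-/Gitea-Actions style the existing CI workflows appear to use:

```yaml
compose-lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Validate observability compose file
      run: docker compose -f docker-compose.observability.yml config --quiet
```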

Concern (non-blocking)

metrics_generator processors appear twice — at the top-level and under overrides.defaults. This is correct Tempo behavior (the override block is required to activate processors per-tenant), but there's no test that would catch an accidental divergence between the two lists. A comment explaining this would prevent a future editor from "fixing" the duplication by removing one and breaking the other.

Positive

The healthcheck start_period: 15s means Docker will wait before marking the container unhealthy on cold start. If Grafana or other dependents are eventually added with depends_on: tempo: condition: service_healthy, they'll correctly wait. Forward-compatible healthcheck design.
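
What that future consumer side could look like in compose. The grafana service is hypothetical here, not part of this PR:

```yaml
grafana:
  depends_on:
    obs-tempo:
      condition: service_healthy   # start only after Tempo reports healthy
```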

No blockers, but I'd file a follow-up issue for the CI compose-lint step.


🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead

Verdict: ✅ Approved

This PR contains no frontend code, no UI components, no routes, and no user-facing changes whatsoever. Tempo is a backend trace storage service with no direct user interface in this deployment.

What I verified

  • No Svelte components added or modified
  • No route changes
  • No CSS or design token changes
  • No new user-facing error messages or copy

Incidental note

When Grafana is eventually wired up (the future issue referenced in the compose file comments), the Tempo data source configuration in Grafana will have a UI. At that point I'd want to see the Grafana provisioning config reviewed for usability of the dashboard layout and trace query interface — but that's a concern for that PR, not this one.

Nothing to flag. LGTM from a UI/UX perspective.


📋 Elicit — Requirements Engineer

Verdict: Approved

Reviewing against issue #575 requirements (closing issue referenced in PR body). This is a pure infrastructure delivery PR — no user stories, no acceptance criteria for end-user behavior. My review focuses on: does the delivered scope match what was likely specified, and are there traceability gaps?

Scope completeness

The PR delivers:

  1. Tempo service in docker-compose.observability.yml
  2. tempo.yml config with OTLP dual-protocol
  3. 30-day block retention (matches Loki retention — consistent)
  4. metrics_generator with service-graphs + span-metrics
  5. DEPLOYMENT.md updated
  6. C4 L2 diagram updated

What's intentionally deferred (and correctly so)

  • Backend OTLP instrumentation (Spring Boot OpenTelemetry agent config) — no code added yet for the backend to actually send traces. This PR is "ready to receive"; the sending side is a separate concern.
  • Grafana data source wiring — also deferred to a future issue.

Both deferrals are clearly called out in the compose file comments. The PR closes the infrastructure side of the story cleanly.

One open question (non-blocking)

The PR description's test plan requires curl -s http://localhost:3200/ready — but in production, port 3200 is expose-only (not host-bound). This test plan step only works if Tempo has a host port or you exec into the container. The healthcheck inside the container uses wget correctly. Suggest clarifying the test plan: docker exec obs-tempo wget -qO- http://localhost:3200/ready is the portable version that works in both dev and production-like environments.

No blockers.

marcel merged commit 2eff1ab14c into main 2026-05-15 03:21:12 +02:00
marcel deleted branch feat/issue-575-tempo 2026-05-15 03:21:12 +02:00

Reference: marcel/familienarchiv#587