devops(observability): add Tempo for distributed trace storage (OTLP receiver) #587
Summary
Closes #575
- `obs-tempo` (`grafana/tempo:2.7.2`) added to `docker-compose.observability.yml` with `expose`-only ports (3200/4317/4318) — not exposed to the host, only reachable on the Docker network
- `infra/observability/tempo/tempo.yml` — OTLP dual-protocol (gRPC 4317 + HTTP 4318), 30-day block retention, metrics_generator with service-graphs + span-metrics
- `docs/DEPLOYMENT.md` updated with a Tempo service row
- `docs/architecture/c4/l2-containers.puml` updated with a Tempo container + Rel from the backend

Test plan
- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d tempo` starts without error
- `curl -s http://localhost:3200/ready` returns `ready`

🤖 Generated with Claude Code
🏗️ Markus Keller (@mkeller) — Application Architect
Verdict: ✅ Approved
What I checked
This is a pure infrastructure addition with no application code changes. My checklist from the persona matrix: a new Docker service or infrastructure component requires updates to `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`. Both are updated. Good.

Positives
- Tempo joins both `archiv-net` (receives OTLP from the backend) and `obs-net` (queried by Grafana). This matches the established pattern used by Promtail.
- A local volume (`tempo_data`) is the right call for a single-VPS deployment. No need for an S3 backend here.
- The `overrides.defaults.metrics_generator.processors` block mirrors the top-level `metrics_generator.processors` — this is Tempo's required pattern for activating processors per-tenant, not duplication. Fine.

Suggestions (non-blocking)
- The puml adds `Rel(backend, tempo, ...)` but there is no corresponding `Rel(grafana, tempo, ...)`. Grafana → Tempo is the query path that closes the observability loop. Since Grafana is still a placeholder (see the future issue), this can stay deferred — but consider adding a TODO comment in the puml so it's not forgotten.

Doc compliance check
- `docs/architecture/c4/l2-containers.puml` — new container + rel
- `docs/DEPLOYMENT.md` — service row

No blockers. Doc compliance is met.
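If it helps, the deferred Grafana → Tempo query path could be parked in the puml as a comment next to the existing backend relation — a sketch, assuming the usual C4-PlantUML `Rel` macro and that the element aliases follow the existing `backend`/`tempo` naming:

```plantuml
' TODO: uncomment once Grafana is actually deployed (see future issue)
' Rel(grafana, tempo, "Queries traces", "HTTP :3200")
```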
👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
What I checked
No application code in this PR — purely infra YAML and docs. My review lens here is: does the config read clearly, are there dead/commented-out sections, and are names intent-revealing?
Positives
- The comments on the `expose` ports are exactly right: `# Grafana queries Tempo on this port (obs-net only)` tells me what the port does and which network controls access. No "what" comments — only "why" comments.
- The network comments (`# backend (archive-backend) reaches tempo:4317 over this network`) document the actual traffic flow, not just the YAML key's meaning.

Suggestions (non-blocking)
- `tempo.yml` has `metrics_generator.processors` defined at the top level and again under `overrides.defaults.metrics_generator.processors`. This is not dead code — Tempo requires the override block for per-tenant activation. A one-line comment clarifying this (`# Required: overrides block activates processors globally across all tenants`) would prevent a future reader from deleting it as apparent duplication.
- `# backend (archive-backend) reaches tempo:4317 over this network` — the container name is `archive-backend` (per the main compose) but the service is named `backend`. Consistent with how Promtail's comment is written, so not a bug, just worth noting.

No blockers.
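For illustration, the dual declaration with the suggested comment might look like this — a sketch of the relevant `tempo.yml` fragment, assuming the processor list from this PR (service-graphs + span-metrics) and the WAL path quoted elsewhere in the review; surrounding keys are omitted:

```yaml
# Top-level generator configuration
metrics_generator:
  processors: [service-graphs, span-metrics]
  storage:
    path: /var/tempo/generator/wal

# Required: overrides block activates processors globally across all tenants.
# This intentionally mirrors the list above — do not delete it as "duplication".
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```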
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
Solid PR. This follows every pattern already established in the observability stack. Here's my full checklist pass:
Checklist
- `grafana/tempo:2.7.2` — pinned, Renovate will bump
- `tempo_data:` named volume — no bind mounts
- `expose:` only, not `ports:` — consistent with Loki
- Healthcheck hits `/ready` — same pattern as Loki
- `restart: unless-stopped`

Positives
- The healthcheck `wget -qO- http://localhost:3200/ready | grep -q ready || exit 1` is exactly the same pattern as Loki's healthcheck. Consistent and reliable.
- `start_period: 15s` is appropriate for Tempo. Loki uses 30s because it has more startup work; Tempo is lighter. Good calibration.
- The `tempo_data` volume is already declared in the `volumes:` section at the bottom. Also note that `grafana_data:` and `glitchtiip_data:` are pre-declared there too — the compose file is thinking ahead cleanly.
- The config is mounted `:ro` — correct, the container should not write back to its config.

One observation (non-blocking)
The `metrics_generator` WAL lives at `/var/tempo/generator/wal` — also on the `tempo_data` volume. This means metrics generator state, ingester WAL, and blocks all share one volume. For this deployment scale that's fine. If trace volume ever grows significantly, splitting the generator WAL onto a separate volume would help with I/O isolation. Not an action item now — just something to know when sizing.

Cost note
Traces live in `tempo_data` on the CX32 VPS (local disk). Trace retention is 30 days. At family-archive traffic volumes (not high), disk pressure from Tempo will be negligible. No cost concern.

No blockers. Ready to merge from an infra standpoint.
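Pulling the checklist items together, the service definition under review would look roughly like this — a sketch reconstructed from the values quoted in this thread; the config mount path and the absence of `interval`/`timeout` values are assumptions, not verified against the actual compose file:

```yaml
obs-tempo:
  image: grafana/tempo:2.7.2   # pinned; Renovate will bump
  restart: unless-stopped
  expose:
    - "3200"   # Grafana queries Tempo on this port (obs-net only)
    - "4317"   # OTLP gRPC receiver
    - "4318"   # OTLP HTTP receiver
  volumes:
    - ./infra/observability/tempo/tempo.yml:/etc/tempo/tempo.yml:ro  # read-only config
    - tempo_data:/var/tempo                                          # WAL + blocks + generator state
  networks:
    - archiv-net   # backend (archive-backend) reaches tempo:4317 over this network
    - obs-net      # Grafana query path
  healthcheck:
    test: ["CMD-SHELL", "wget -qO- http://localhost:3200/ready | grep -q ready || exit 1"]
    start_period: 15s
```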
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
Pure infrastructure addition. No application code paths, no new authentication surfaces exposed to end users. Here's my security-focused pass:
Network isolation — LGTM
- The HTTP API port (3200) is `expose`-only, reachable only via `obs-net`. The only service currently on `obs-net` that would query it is Grafana (not yet deployed). No host binding. No internet exposure. ✅
- The OTLP ports (4317/4318) are `expose`-only, reachable only via `archiv-net`. The backend can push traces; nothing else on `obs-net` can reach the OTLP receivers. ✅
- `tempo.yml` explicitly acknowledges the unauthenticated API surface and explains why network isolation is the compensating control. This is correct reasoning for a single-operator self-hosted deployment.

No authentication on Tempo API — accepted, documented
Tempo's HTTP API (port 3200) is unauthenticated by design for single-tenant deployments. The threat model note in the config file is the right move: it records the conscious decision so a future reviewer doesn't wonder "why is this open?" The control is Docker network isolation — no path from the internet or the host to port 3200. Acceptable for this deployment context.
OTLP receivers on 0.0.0.0 — acceptable given expose-only
`distributor.receivers.otlp.protocols.grpc.endpoint: 0.0.0.0:4317` means Tempo listens on all interfaces inside the container. This is normal and correct — it's how container networking works. Because the port is declared with `expose` (not `ports:`), Docker never publishes it to the host, so it is unreachable from the host network or the internet. ✅

metrics_generator — no secrets, no external calls
The metrics generator writes to a local WAL path. No remote write target configured here. Span-metrics and service-graphs are generated internally. No external telemetry leakage. ✅
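For reference, the receiver and generator settings discussed here amount to roughly this `tempo.yml` fragment — a sketch: the gRPC endpoint and WAL path are quoted in the review, while the HTTP endpoint line is an assumption by analogy with the gRPC one:

```yaml
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317   # all interfaces *inside* the container; expose-only outside
        http:
          endpoint: 0.0.0.0:4318

metrics_generator:
  storage:
    path: /var/tempo/generator/wal   # local WAL only — no remote_write target configured
```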
Supply chain
`grafana/tempo:2.7.2` is a pinned, specific release from a reputable vendor. Not `:latest`. Renovate will handle future updates. ✅

No blockers.
🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
No production code changed, so no unit/integration/E2E test regression risk. My review here is focused on: is the infrastructure itself verifiable, and are there testing gaps to track?
What's present
The PR's test plan is manual but concrete:

- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d tempo` starts without error
- `curl -s http://localhost:3200/ready` returns `ready`

These are the right smoke tests for a new infra service. The healthcheck in the compose file (`wget … /ready`) effectively automates the third check on every restart.

Concern (non-blocking but worth tracking)
No CI validation of the compose config. The test plan step `docker compose -f docker-compose.observability.yml config` is manual. Currently, none of the CI workflows (based on what I can see) validate that the observability compose file is syntactically valid. If someone edits `tempo.yml` in the future and introduces a YAML error, it won't be caught until deploy time.

Recommendation for a follow-up issue: add a CI step that runs `docker compose -f docker-compose.observability.yml config` as a lint gate. This is a one-liner job and would prevent broken observability configs from slipping through.

Concern (non-blocking)
`metrics_generator` processors appear twice — at the top level and under `overrides.defaults`. This is correct Tempo behavior (the override block is required to activate processors per-tenant), but there is no test that would catch an accidental divergence between the two lists. A comment explaining this would prevent a future editor from "fixing" the duplication by removing one and breaking the other.

Positive
The healthcheck's `start_period: 15s` means Docker will wait before marking the container unhealthy on a cold start. If Grafana or other dependents are eventually added with `depends_on: tempo: condition: service_healthy`, they'll correctly wait. Forward-compatible healthcheck design. ✅

No blockers, but I'd file a follow-up issue for the CI compose-lint step.
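A minimal version of that follow-up compose-lint job, sketched as a GitHub-Actions-style workflow (job name, runner, and action versions are assumptions — adapt to whatever CI this repo actually runs):

```yaml
compose-lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Validate observability compose file
      run: docker compose -f docker-compose.observability.yml config --quiet
```

With `--quiet`, `docker compose config` only validates and sets the exit code, so a broken `tempo.yml` include fails the job without dumping the resolved config into the logs.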
🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead
Verdict: ✅ Approved
This PR contains no frontend code, no UI components, no routes, and no user-facing changes whatsoever. Tempo is a backend trace storage service with no direct user interface in this deployment.
What I verified
Incidental note
When Grafana is eventually wired up (the future issue referenced in the compose file comments), the Tempo data source configuration in Grafana will have a UI. At that point I'd want to see the Grafana provisioning config reviewed for usability of the dashboard layout and trace query interface — but that's a concern for that PR, not this one.
Nothing to flag. LGTM from a UI/UX perspective.
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing against issue #575 requirements (closing issue referenced in PR body). This is a pure infrastructure delivery PR — no user stories, no acceptance criteria for end-user behavior. My review focuses on: does the delivered scope match what was likely specified, and are there traceability gaps?
Scope completeness
The PR delivers:

- the `obs-tempo` service in `docker-compose.observability.yml` (`expose`-only ports 3200/4317/4318)
- `infra/observability/tempo/tempo.yml` (OTLP gRPC + HTTP receivers, 30-day retention, metrics_generator)
- a `docs/DEPLOYMENT.md` service row and a `docs/architecture/c4/l2-containers.puml` container + Rel
What's intentionally deferred (and correctly so)
Both deferrals are clearly called out in the compose file comments. The PR closes the infrastructure side of the story cleanly.
One open question (non-blocking)
The PR description's test plan requires `curl -s http://localhost:3200/ready` — but in production, port 3200 is `expose`-only (not host-bound). That test plan step only works if Tempo has a host port or you exec into the container. The healthcheck inside the container uses `wget` correctly. Suggest clarifying the test plan: `docker exec obs-tempo wget -qO- http://localhost:3200/ready` is the portable version that works in both dev and production-like environments.

No blockers.