devops(observability): add Tempo for distributed trace storage (OTLP receiver) #575
Context
Tempo stores distributed traces sent by the Spring Boot backend via OTLP (OpenTelemetry Protocol). It exposes an HTTP API that Grafana queries to render trace waterfall views. Once Tempo is deployed and the backend instrumentation issue is complete, every HTTP request to the backend will produce a trace showing the time spent in each Spring component and downstream call.
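For orientation, a sketch of the backend half of this push model — out of scope here (it belongs to the instrumentation dependency issue) and shown only to make the flow concrete. Property names are Spring Boot 3.x Actuator/Micrometer; the endpoint value assumes OTLP over HTTP:

```yaml
# Hypothetical backend application.yml fragment (NOT part of this issue):
# points Spring Boot's OTLP span exporter at Tempo over archiv-net.
management:
  otlp:
    tracing:
      endpoint: http://tempo:4318/v1/traces   # OTLP/HTTP receiver; gRPC would target tempo:4317
  tracing:
    sampling:
      probability: 1.0                        # trace every request; lower if volume grows
```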
Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Service to Add
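A hedged sketch of the `tempo` service block, assembled from the details cited throughout this thread (image pin, expose-only ports, both networks, named volume); the config mount path and command line are assumptions:

```yaml
# Sketch of the tempo service for docker-compose.observability.yml.
# Image, ports, restart policy, networks, and volume follow the issue;
# container_name matches the obs-tempo name used in the AC commands.
services:
  tempo:
    image: grafana/tempo:2.7.2
    container_name: obs-tempo
    command: ["-config.file=/etc/tempo.yml"]   # assumed mount target
    restart: unless-stopped
    expose:
      - "3200"   # HTTP API: Grafana queries, /ready probe
      - "4317"   # OTLP gRPC receiver
      - "4318"   # OTLP HTTP receiver
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yml:ro
      - tempo_data:/var/tempo
    networks:
      - archiv-net   # backend pushes traces here
      - obs-net      # Grafana reads traces here
```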
Config File to Create
`infra/observability/tempo/tempo.yml`

Acceptance Criteria
1. `docker compose -f docker-compose.observability.yml up -d tempo` starts without error
2. `curl -s http://localhost:3200/ready` returns `ready`
3. Tempo joins the `archiv-net` network (verify by `docker exec archive-backend wget -qO- http://tempo:3200/ready` once backend instrumentation is complete — or use a temporary netcat check)
4. `curl -s http://localhost:3200/api/search` returns trace data

Definition of Done

- All AC checked
- Config file committed alongside compose changes
🏗️ Markus Keller — Senior Application Architect
Observations
- `docker-compose.observability.yml` as a separate file, consistent with ADR-009 (standalone compose files, not overlays). This is the right pattern for this repo.
- The `networks:` block in the issue snippet shows Tempo joining both `archiv-net` and `obs-net`. This is correct — the backend must reach `tempo:4317` over `archiv-net`, and Grafana must reach `tempo:3200` over `obs-net`. Both networks need to be declared in the compose file.
- The `metrics_generator` block (service-graphs + span-metrics) generates Prometheus metrics from traces. This requires a Prometheus scrape target for Tempo to be added eventually. The issue doesn't mention this dependency — worth documenting as a follow-up.
- `block_retention: 720h` (30 days) matches Loki retention per the issue. Good alignment.
- `docs/architecture/c4/l2-containers.puml` needs a new `Container` node for Tempo and a relationship line from Backend → Tempo (OTLP gRPC). Per Markus's doc-update table, a new Docker service always requires updating `l2-containers.puml` and `docs/DEPLOYMENT.md`. The issue's DoD doesn't mention either.
- The `infra/observability/` directory referenced in the issue doesn't exist yet (the scaffold issue is a dependency). The issue correctly calls this out. However, the PR should ensure `infra/observability/tempo/` is created with the config file.

Recommendations
- Update `docs/architecture/c4/l2-containers.puml` to include the Tempo container and its Backend → Tempo trace relationship.
- Update `docs/DEPLOYMENT.md` §4 (Logs + observability) to replace the "Future observability" placeholder with a reference to Tempo.
- The `metrics_generator` block is optional for an initial Tempo install. If Prometheus isn't part of this milestone yet, defer the `metrics_generator` config to the Prometheus issue to keep this issue narrowly scoped (see the sketch below for what the block involves).
- Add the `obs-net` network definition explicitly in the issue (name, driver). Right now the snippet declares `obs-net` as a network name but doesn't show the network declaration block — that must be in the compose file.
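For reference, a sketch of what the block typically looks like in Tempo 2.x — the `remote_write` target is hypothetical, since Prometheus doesn't exist in this stack yet:

```yaml
# Tempo 2.x metrics_generator sketch (tempo.yml). The Prometheus URL is
# hypothetical — there is no consumer for these metrics until the
# Prometheus issue lands.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write   # assumed future target
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]   # enable both processors
```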
Open Decisions

- Is `metrics_generator` in scope or deferred? Including it now means Tempo emits span-metrics that Prometheus can scrape — but Prometheus doesn't exist yet, so these metrics go nowhere until it does. Deferring keeps this issue minimal. Including it avoids a config change to a running Tempo instance later. This is a sequencing call, not a technical one.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations
- The `tempo.yml` config is clean and well-commented. The OTLP dual-protocol support (gRPC 4317 + HTTP 4318) is the right choice — Spring Boot's Micrometer OTLP exporter defaults to gRPC, and having HTTP as a fallback avoids reconfiguration if the transport needs to change.
- Using an `expose:` stanza (not `ports:`) for 3200/4317/4318 is correct — these ports should be internal only. No complaint there.
- The AC check `docker exec archive-backend wget -qO- http://tempo:3200/ready` relies on the backend container being on `archiv-net` and Tempo also being on `archiv-net`. This is satisfied by the compose snippet. Good.
- AC4 ("`curl -s http://localhost:3200/api/search` returns trace data") is only verifiable after the backend instrumentation issue is complete. It's correctly noted as conditional, but it means this issue can only be 3/4 green on its own — the DoD should make this explicit.

Recommendations
- `tempo.yml` uses local storage (`backend: local`). For a dev/prod single-node setup this is fine, but add a comment in the config file explaining why local storage is chosen (single VPS, no S3 backend needed) — future maintainers will otherwise wonder if this was an oversight. A sketch follows this list.
- The WAL path `/var/tempo/wal` and blocks path `/var/tempo/blocks` are correct for the `grafana/tempo:2.7.2` image. The named volume `tempo_data` covers `/var/tempo` — both subdirs will be on the same volume. This is correct.
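A sketch of the storage section with the suggested comment in place; the YAML structure follows Tempo 2.x conventions for the paths named above:

```yaml
# tempo.yml storage section (sketch). The explanatory comment is the
# point — it records that local storage is deliberate, not an oversight.
storage:
  trace:
    # Local filesystem backend: single-VPS deployment, no S3/MinIO object
    # store needed for traces. Revisit only if the archive outgrows one node.
    backend: local
    wal:
      path: /var/tempo/wal      # write-ahead log (on the tempo_data volume)
    local:
      path: /var/tempo/blocks   # completed blocks (same volume)
```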
Open Decisions

No open decisions from my angle. The config is well-specified and the implementation path is clear.
🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
Good issue overall — the config is production-quality in most areas. Here's what I checked:
What's done right:
- `grafana/tempo:2.7.2` is pinned. Excellent — matches the project standard.
- `expose:` not `ports:` for all three ports. Tempo has no business being reachable from the host directly.
- `restart: unless-stopped` is correct for a persistent service.
- Named volume `tempo_data` for persistence — named volumes are managed by Docker and survive `docker compose down`. Correct.
- `block_retention: 720h` aligns with the project's 30-day Loki retention. Consistent.
- `archiv-net` membership so the backend can reach `tempo:4317`. Required.

Issues to address:
1. **Missing healthcheck.** Every service in this repo has a healthcheck — `db`, `minio`, `ocr-service`, `backend`, `frontend`. Tempo exposes `/ready` on port 3200; probe that endpoint (see the sketch after this list). Without a healthcheck, `depends_on: condition: service_healthy` cannot be used for Grafana → Tempo dependency ordering.
2. **`obs-net` network not declared.** The snippet shows `obs-net` in the networks list but the `networks:` top-level block is not shown. It must be declared explicitly in the compose file (see the sketch after this list). Since `archiv-net` is defined in the main compose file, it should be referenced as `external: true` in the observability compose file.
3. **Missing `start_period` for Tempo.** Tempo initializes its WAL and block stores on first start. Add `start_period: 15s` to give it time before health probes begin.
4. **Volume declaration missing from snippet.** The compose file needs a top-level `tempo_data` volume declaration (see the sketch after this list). This is presumably in the scaffold issue's compose file, but should be explicit in this issue's snippet to avoid ambiguity.
5. **`minio/minio:latest` in `docker-compose.yml` (existing tech debt).** Unrelated to this issue but flagged for visibility — the dev compose still uses `:latest` for MinIO, which the prod file has already fixed. Not blocking here.
6. **`DEPLOYMENT.md` update missing from DoD.** The "Future observability" section currently says "Phase 7 adds Prometheus + Loki + Grafana" — Tempo should be added to that list. Update the DoD to include this.
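Pulling items 1–4 together, a sketch of the missing compose fragments; the healthcheck `interval`/`timeout`/`retries` values are assumptions, the rest follows the comments above:

```yaml
# docker-compose.observability.yml fragments (sketch)
services:
  tempo:
    # ... service definition as sketched earlier ...
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3200/ready"]
      interval: 10s      # assumed probe cadence
      timeout: 5s
      retries: 5
      start_period: 15s  # WAL/block-store init time on first start

networks:
  archiv-net:
    external: true       # defined by the main compose file
  obs-net:
    driver: bridge

volumes:
  tempo_data:
```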
Recommendations

- Add the `networks:` top-level block showing `archiv-net: external: true` and `obs-net: driver: bridge`.
- Add `volumes: tempo_data:` to the top-level volumes block.
- Update `docs/DEPLOYMENT.md` §4 to reference Tempo in the observability stack description.
- The AC check via `docker exec archive-backend wget` is good — but note that `archive-backend` is the dev container name. In prod the container name is project-namespaced (e.g. `archiv-production-backend-1`). Document this distinction so the runbook doesn't confuse environments.

Open Decisions
- Should `obs-net` be a new bridge network or should Grafana/Tempo/Loki all join `archiv-net`? Using a dedicated `obs-net` for observability services reduces blast radius (a compromised Tempo cannot initiate connections to the backend DB). But it requires the backend to join both networks (it already does for `archiv-net`; it would need `obs-net` if it ever pushes metrics directly to Prometheus). The current design (backend on `archiv-net`, Tempo on both) is correct for the OTLP push model — no change needed unless Prometheus pull is added later. Just wanted to make the reasoning explicit.

🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
This is a new infrastructure service with no authentication on its ingress ports. Here's the threat model:
Attack surface introduced:
- OTLP receivers on 4317/4318, reachable from any container on `archiv-net`
- Tempo HTTP API on 3200: `/api/search`, `/api/traces/{id}`, and admin endpoints

Findings:
1. **No authentication on OTLP receivers (expected, acceptable).** OTLP receivers on 4317/4318 are internal-network only (`expose:`, not `ports:`). On `archiv-net` the only processes that should be sending traces are the backend and (eventually) the OCR service. Since OTLP has no authentication built in, the only protection is network isolation — which `expose:` provides. This is standard practice for internal OTLP collectors. Acceptable as-is.
2. **Tempo HTTP API (3200) is unauthenticated — ensure Grafana is the only consumer.** The `/api/search` and `/api/traces/{id}` endpoints return trace data that may contain request parameters, user IDs, and internal service paths. Tempo 2.x has no built-in auth. Grafana authenticates users before they can query Tempo. The risk: if another service on `obs-net` or `archiv-net` can reach `tempo:3200`, it gets unauthenticated trace data. Verify that only Grafana can reach 3200.
3. **Trace data may contain sensitive values.** Spring Boot's default OTLP exporter includes HTTP request attributes in spans: `http.url`, `http.target`, query parameters, and potentially header values. If any endpoint accepts sensitive data in query params (e.g. `/api/auth/reset-password?token=...`), those values will appear in traces. The issue doesn't address span attribute sanitization — this should be handled in the backend instrumentation issue, but flag it now so it's not missed.
4. **30-day trace history on disk.** `block_retention: 720h` means 30 days of traces persist on the volume. If the VPS is compromised, trace history is available to an attacker. This is acceptable for the threat model of a private family archive — no regulatory obligations, no PII in traces beyond what's already in the backend — but worth documenting.

Recommendations
- Add a comment in `tempo.yml` explaining the auth model: "Tempo HTTP API is unauthenticated; access controlled by network isolation (obs-net) — only Grafana should reach port 3200." A sketch follows below.
- Span attribute sanitization (e.g. dropping `/api/auth/*` token params from trace attributes) should be addressed in the backend OTLP configuration issue.
- Grafana should reach Tempo at `http://tempo:3200` over the internal network only — never exposed externally. The `expose:` directive (not `ports:`) is the correct choice here. No concern.
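A sketch of how that comment could sit next to the listener/receiver config it describes; the receiver structure follows Tempo 2.x, and the explicit `0.0.0.0` binds are the safe choice in containers (recent Tempo releases default receivers to localhost):

```yaml
# tempo.yml (sketch) — security model, per the recommendation above:
# Tempo 2.x has NO built-in authentication.
#   - 3200 (HTTP API): only Grafana may reach this, via obs-net isolation.
#   - 4317/4318 (OTLP): only the backend (later: OCR) pushes here, via archiv-net.
#   - All ports are expose:-only; nothing is published to the host.
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"   # explicit bind for cross-container access
        http:
          endpoint: "0.0.0.0:4318"
```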
Open Decisions

No blocking security decisions — the network isolation model is correct for this deployment context.
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
This is an infrastructure issue, so my review focuses on the acceptance criteria quality, verifiability, and what's testable before vs after the dependency issue lands.
AC review:
- AC1: `docker compose ... up -d tempo` starts without error — verifiable standalone.
- AC2: `curl -s http://localhost:3200/ready` returns `ready` — port 3200 is not published via `ports:` in the spec, only via `expose:`. The curl must run inside the obs-net or use `docker exec`. Clarify in the AC.
- AC3: Tempo joins `archiv-net` — verifiable via `docker exec archive-backend nc -z tempo 4317 && echo ok`.
- AC4: `/api/search` returns trace data — only verifiable once the backend sends traces.

Issues:
1. **AC2 curl will fail from host.** Port 3200 is `expose:`d, not `ports:`d. `curl http://localhost:3200/ready` from the host machine will get connection refused. The test command needs to be either `docker exec obs-tempo wget -qO- http://localhost:3200/ready` or `docker run --network obs-net --rm curlimages/curl:8.5.0 curl -s http://tempo:3200/ready`.
2. **DoD says "all AC checked" but AC4 is blocked.** This will stall the PR review if taken literally. Recommend splitting the DoD: "AC1–3 verified before merge; AC4 verified as part of the backend instrumentation PR."
3. **No smoke test for volume persistence.** After a `docker compose restart tempo`, traces from before the restart should still be queryable. This verifies that the named volume is wired correctly. Easy to add: restart Tempo after AC4 is green, re-query `/api/search`, confirm traces persist. Add this as AC5.
4. **The `infra/observability/tempo/tempo.yml` path must be created.** The DoD mentions "config file committed alongside compose changes" — good. But the scaffold issue (which creates `infra/observability/`) is listed as a dependency. If the scaffold issue isn't merged first, the PR for this issue will have no parent directory to commit into. Consider either: (a) absorbing the directory creation into this PR, or (b) making the dependency explicit in Gitea's blocked-by relationship.

Recommendations
- Rewrite AC2 as `docker exec obs-tempo wget -qO- http://localhost:3200/ready` (the Tempo image ships `wget` via BusyBox).
- AC3: `docker exec archive-backend nc -z tempo 4317 && echo "OTLP gRPC reachable"` — add an `nc` availability check or fall back to `docker run --rm --network archiv-net busybox nc -z tempo 4317`.
- Add AC5: after `docker compose restart tempo`, traces from AC4 are still returned by `/api/search`.

📋 Elicit — Requirements Engineer
Observations
The issue is well-specified for an infrastructure ticket. I'm reviewing for gaps in requirement completeness and testability precision.

Strengths: context, service config, acceptance criteria, and DoD are all present.

Gaps:
"Scaffold issue" is unnamed and unlinked. The issue says "Depends on: scaffold issue (compose file and
infra/observability/directory must exist first)" — but doesn't link to the actual issue number. This means a developer picking this up cannot immediately navigate to the dependency. Add the issue number.AC2 verification command is ambiguous.
curl -s http://localhost:3200/readyassumes port 3200 is exposed to the host machine. The compose snippet usesexpose:(internal only), notports:. This AC will fail as written. (See also Sara's comment.)No NFR for startup time / resource usage. Tempo on a CX32 (8GB RAM, as referenced in Tobias's persona docs) with local storage should be lightweight at idle (~100MB RAM). No memory limit is specified. Given the OCR service already has a 12GB limit, Tempo should have an explicit
mem_limitto prevent resource contention. Suggestmem_limit: 512m— Tempo 2.x at idle uses ~80–150MB.No definition of "starts without error" in AC1. Does "without error" mean exit code 0 on the compose up command? Or does it mean the container status is
healthyafter the healthcheck passes? These are different — a container can berunningbut nothealthy. Recommend: "AC1:docker compose ... up -d tempocompletes, anddocker compose ps temposhows statusrunning (healthy)within 30 seconds."Data retention NFR is implicit. The config specifies
block_retention: 720h(30 days). The issue doesn't explain the rationale for 30 days. Is this a user requirement? A cost constraint? An operational decision? A brief comment ("matches Loki retention, see issue #NNN") would make this traceable.Missing: what happens when the volume fills up? Tempo's compactor handles retention automatically, but if writes exceed the retention policy (e.g. an unexpected trace volume spike), local storage could fill. No monitoring alert for disk usage is referenced. Not blocking for this issue, but should be in the Alertmanager backlog.
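Gaps 3 and 5 expressed as config, both as suggestions rather than committed values:

```yaml
# docker-compose.observability.yml (sketch) — cap Tempo's memory so it
# cannot contend with the 12GB OCR service on the same host
services:
  tempo:
    # ...
    mem_limit: 512m   # suggested; raise toward 768m if metrics_generator is enabled
```

```yaml
# tempo.yml (sketch) — make the retention decision traceable in place
compactor:
  compaction:
    block_retention: 720h   # 30 days — matches Loki retention, see issue #NNN
```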
Recommendations
- Tighten AC1: `docker compose ps tempo` must show status `healthy` within 30s.
- Rewrite AC2 to use `docker exec obs-tempo` (see Sara's comment).
- Add `mem_limit: 512m` to the Tempo service definition in the compose snippet.
- Add a comment in `tempo.yml` next to `block_retention` explaining the rationale.

Open Decisions
- The right `mem_limit` value for Tempo. 512MB is conservative and safe for a small family archive with low trace volume. If span-metrics generation is enabled (the `metrics_generator` block), memory usage increases. The right value depends on expected trace volume, which isn't specified. 512MB is a safe default to start with and can be increased if Grafana dashboards show memory pressure.

🎨 Leonie Voss — UX Designer & Accessibility Strategist
No UX or UI concerns from my angle. This issue adds a backend infrastructure service (Tempo) with no user-facing interface. There are no Svelte components, no design decisions, and no accessibility implications in scope.
One forward-looking note: when the Grafana dashboards for trace visualization are eventually added (as part of the Grafana issue in this milestone), I'll review the Grafana datasource and panel configuration from a user perspective — specifically whether trace waterfall views are navigable for the family members who may access Grafana for operational insight. But that's out of scope here.
This issue is clear to proceed from a UX standpoint.
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Sequencing
1. `metrics_generator` block: include now or defer to Prometheus issue? Including it now means Tempo is already emitting span-metrics when Prometheus is added — no config change to a running instance. Deferring keeps this issue minimal and avoids configuring something with no consumer yet. (Raised by: Markus)
2. `mem_limit` for Tempo: 512MB or unconstrained? The OCR service already consumes up to 12GB on this VPS. Setting `mem_limit: 512m` prevents Tempo from competing with OCR under load. 512MB is safe for a low-traffic family archive. If `metrics_generator` is enabled, bump to 768MB. (Raised by: Elicit)

Network Topology
3. `obs-net` isolation: the current two-network design is recommended, but make the reasoning explicit. The current spec (backend on `archiv-net`, Tempo on `archiv-net` + `obs-net`, Grafana on `obs-net`) is the right call for the OTLP push model. No action needed unless you want a comment in the compose file explaining why `archiv-net: external: true` is set. (Raised by: Tobias — recommends keeping current design, just documenting it)

Implementation complete — branch `feat/issue-575-tempo`
All tasks from the issue are done. Here's what was committed in `de08ffe9`:

Files changed
- `infra/observability/tempo/tempo.yml` — new config file, including `metrics_generator` (service-graphs + span-metrics)
- `docker-compose.observability.yml` — added the `tempo` service — pinned `grafana/tempo:2.7.2`, healthcheck via `/ready`, `expose:` only (no host ports), joined `archiv-net` + `obs-net`
- `docs/DEPLOYMENT.md` — added the `obs-tempo` row to the observability services table
- `docs/architecture/c4/l2-containers.puml` — added the `tempo` container inside the observability boundary + `Rel(backend, tempo, ...)` for OTLP gRPC

Reviewer feedback addressed
- Healthcheck added per Tobias (`wget -qO- http://localhost:3200/ready`, `start_period: 15s`).
- `archiv-net: external: true` and `obs-net: driver: bridge` were already present in the scaffold. Named volume `tempo_data` was already declared.
- Comment added in `tempo.yml` explaining the auth model per Nora — Tempo API is unauthenticated; access controlled by `obs-net` network isolation.
- Image pinned to `grafana/tempo:2.7.2`.
Next step
Open PR against
main→/review-pr