devops(observability): add Tempo for distributed trace storage (OTLP receiver) #575
Context
Tempo stores distributed traces sent by the Spring Boot backend via OTLP (OpenTelemetry Protocol). It exposes an HTTP API that Grafana queries to render trace waterfall views. Once Tempo is deployed and the backend instrumentation issue is complete, every HTTP request to the backend will produce a trace showing the time spent in each Spring component and downstream call.
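For orientation, a sketch of the backend half of this push model — out of scope here (it belongs to the instrumentation dependency issue) and shown only to make the flow concrete. Property names are Spring Boot 3.x Actuator/Micrometer; the endpoint value assumes OTLP over HTTP:

```yaml
# Hypothetical backend application.yml fragment (NOT part of this issue):
# points Spring Boot's OTLP span exporter at Tempo over archiv-net.
management:
  otlp:
    tracing:
      endpoint: http://tempo:4318/v1/traces   # OTLP/HTTP receiver; gRPC would target tempo:4317
  tracing:
    sampling:
      probability: 1.0                        # trace every request; lower if volume grows
```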
Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Service to Add
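A hedged sketch of the `tempo` service block, assembled from the details cited throughout this thread (image pin, expose-only ports, both networks, named volume); the config mount path and command line are assumptions:

```yaml
# Sketch of the tempo service for docker-compose.observability.yml.
# Image, ports, restart policy, networks, and volume follow the issue;
# container_name matches the obs-tempo name used in the AC commands.
services:
  tempo:
    image: grafana/tempo:2.7.2
    container_name: obs-tempo
    command: ["-config.file=/etc/tempo.yml"]   # assumed mount target
    restart: unless-stopped
    expose:
      - "3200"   # HTTP API: Grafana queries, /ready probe
      - "4317"   # OTLP gRPC receiver
      - "4318"   # OTLP HTTP receiver
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yml:ro
      - tempo_data:/var/tempo
    networks:
      - archiv-net   # backend pushes traces here
      - obs-net      # Grafana reads traces here
```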
Config File to Create
`infra/observability/tempo/tempo.yml`

Acceptance Criteria
1. `docker compose -f docker-compose.observability.yml up -d tempo` starts without error
2. `curl -s http://localhost:3200/ready` returns `ready`
3. Tempo joins the `archiv-net` network (verify by `docker exec archive-backend wget -qO- http://tempo:3200/ready` once backend instrumentation is complete — or use a temporary netcat check)
4. `curl -s http://localhost:3200/api/search` returns trace data

Definition of Done

- All AC checked
- Config file committed alongside compose changes
🏗️ Markus Keller — Senior Application Architect
Observations
- `docker-compose.observability.yml` as a separate file, consistent with ADR-009 (standalone compose files, not overlays). This is the right pattern for this repo.
- The `networks:` block in the issue snippet shows Tempo joining both `archiv-net` and `obs-net`. This is correct — the backend must reach `tempo:4317` over `archiv-net`, and Grafana must reach `tempo:3200` over `obs-net`. Both networks need to be declared in the compose file.
- The `metrics_generator` block (service-graphs + span-metrics) generates Prometheus metrics from traces. This requires a Prometheus scrape target for Tempo to be added eventually. The issue doesn't mention this dependency — worth documenting as a follow-up.
- `block_retention: 720h` (30 days) matches Loki retention per the issue. Good alignment.
- `docs/architecture/c4/l2-containers.puml` needs a new `Container` node for Tempo and a relationship line from Backend → Tempo (OTLP gRPC). Per Markus's doc-update table, a new Docker service always requires updating `l2-containers.puml` and `docs/DEPLOYMENT.md`. The issue's DoD doesn't mention either.
- The `infra/observability/` directory referenced in the issue doesn't exist yet (the scaffold issue is a dependency). The issue correctly calls this out. However, the PR should ensure `infra/observability/tempo/` is created with the config file.

Recommendations
- Update `docs/architecture/c4/l2-containers.puml` to include the Tempo container and its Backend → Tempo trace relationship.
- Update `docs/DEPLOYMENT.md` §4 (Logs + observability) to replace the "Future observability" placeholder with a reference to Tempo.
- The `metrics_generator` block is optional for an initial Tempo install. If Prometheus isn't part of this milestone yet, defer the `metrics_generator` config to the Prometheus issue to keep this issue narrowly scoped (see the sketch below for what the block involves).
- Add the `obs-net` network definition explicitly in the issue (name, driver). Right now the snippet declares `obs-net` as a network name but doesn't show the network declaration block — that must be in the compose file.
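For reference, a sketch of what the block typically looks like in Tempo 2.x — the `remote_write` target is hypothetical, since Prometheus doesn't exist in this stack yet:

```yaml
# Tempo 2.x metrics_generator sketch (tempo.yml). The Prometheus URL is
# hypothetical — there is no consumer for these metrics until the
# Prometheus issue lands.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write   # assumed future target
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]   # enable both processors
```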
Open Decisions

- Is `metrics_generator` in scope or deferred? Including it now means Tempo emits span-metrics that Prometheus can scrape — but Prometheus doesn't exist yet, so these metrics go nowhere until it does. Deferring keeps this issue minimal. Including it avoids a config change to a running Tempo instance later. This is a sequencing call, not a technical one.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations
- The `tempo.yml` config is clean and well-commented. The OTLP dual-protocol support (gRPC 4317 + HTTP 4318) is the right choice — Spring Boot's Micrometer OTLP exporter defaults to gRPC, and having HTTP as a fallback avoids reconfiguration if the transport needs to change.
- Using an `expose:` stanza (not `ports:`) for 3200/4317/4318 is correct — these ports should be internal only. No complaint there.
- The AC check `docker exec archive-backend wget -qO- http://tempo:3200/ready` relies on the backend container being on `archiv-net` and Tempo also being on `archiv-net`. This is satisfied by the compose snippet. Good.
- AC4 ("`curl -s http://localhost:3200/api/search` returns trace data") is only verifiable after the backend instrumentation issue is complete. It's correctly noted as conditional, but it means this issue can only be 3/4 green on its own — the DoD should make this explicit.

Recommendations
- `tempo.yml` uses local storage (`backend: local`). For a dev/prod single-node setup this is fine, but add a comment in the config file explaining why local storage is chosen (single VPS, no S3 backend needed) — future maintainers will otherwise wonder if this was an oversight. A sketch follows this list.
- The WAL path `/var/tempo/wal` and blocks path `/var/tempo/blocks` are correct for the `grafana/tempo:2.7.2` image. The named volume `tempo_data` covers `/var/tempo` — both subdirs will be on the same volume. This is correct.
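A sketch of the storage section with the suggested comment in place; the YAML structure follows Tempo 2.x conventions for the paths named above:

```yaml
# tempo.yml storage section (sketch). The explanatory comment is the
# point — it records that local storage is deliberate, not an oversight.
storage:
  trace:
    # Local filesystem backend: single-VPS deployment, no S3/MinIO object
    # store needed for traces. Revisit only if the archive outgrows one node.
    backend: local
    wal:
      path: /var/tempo/wal      # write-ahead log (on the tempo_data volume)
    local:
      path: /var/tempo/blocks   # completed blocks (same volume)
```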
Open Decisions

No open decisions from my angle. The config is well-specified and the implementation path is clear.
🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
Good issue overall — the config is production-quality in most areas. Here's what I checked:
What's done right:
- `grafana/tempo:2.7.2` is pinned. Excellent — matches the project standard.
- `expose:` not `ports:` for all three ports. Tempo has no business being reachable from the host directly.
- `restart: unless-stopped` is correct for a persistent service.
- Named volume `tempo_data` for persistence — named volumes are managed by Docker and survive `docker compose down`. Correct.
- `block_retention: 720h` aligns with the project's 30-day Loki retention. Consistent.
- `archiv-net` membership so the backend can reach `tempo:4317`. Required.

Issues to address:
1. **Missing healthcheck.** Every service in this repo has a healthcheck — `db`, `minio`, `ocr-service`, `backend`, `frontend`. Tempo exposes `/ready` on port 3200; probe that endpoint (see the sketch after this list). Without a healthcheck, `depends_on: condition: service_healthy` cannot be used for Grafana → Tempo dependency ordering.
2. **`obs-net` network not declared.** The snippet shows `obs-net` in the networks list but the `networks:` top-level block is not shown. It must be declared explicitly in the compose file (see the sketch after this list). Since `archiv-net` is defined in the main compose file, it should be referenced as `external: true` in the observability compose file.
3. **Missing `start_period` for Tempo.** Tempo initializes its WAL and block stores on first start. Add `start_period: 15s` to give it time before health probes begin.
4. **Volume declaration missing from snippet.** The compose file needs a top-level `tempo_data` volume declaration (see the sketch after this list). This is presumably in the scaffold issue's compose file, but should be explicit in this issue's snippet to avoid ambiguity.
5. **`minio/minio:latest` in `docker-compose.yml` (existing tech debt).** Unrelated to this issue but flagged for visibility — the dev compose still uses `:latest` for MinIO, which the prod file has already fixed. Not blocking here.
6. **`DEPLOYMENT.md` update missing from DoD.** The "Future observability" section currently says "Phase 7 adds Prometheus + Loki + Grafana" — Tempo should be added to that list. Update the DoD to include this.
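Pulling items 1–4 together, a sketch of the missing compose fragments; the healthcheck `interval`/`timeout`/`retries` values are assumptions, the rest follows the comments above:

```yaml
# docker-compose.observability.yml fragments (sketch)
services:
  tempo:
    # ... service definition as sketched earlier ...
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3200/ready"]
      interval: 10s      # assumed probe cadence
      timeout: 5s
      retries: 5
      start_period: 15s  # WAL/block-store init time on first start

networks:
  archiv-net:
    external: true       # defined by the main compose file
  obs-net:
    driver: bridge

volumes:
  tempo_data:
```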
Recommendations

- Add the `networks:` top-level block showing `archiv-net: external: true` and `obs-net: driver: bridge`.
- Add `volumes: tempo_data:` to the top-level volumes block.
- Update `docs/DEPLOYMENT.md` §4 to reference Tempo in the observability stack description.
- The AC check via `docker exec archive-backend wget` is good — but note that `archive-backend` is the dev container name. In prod the container name is project-namespaced (e.g. `archiv-production-backend-1`). Document this distinction so the runbook doesn't confuse environments.

Open Decisions
- Should `obs-net` be a new bridge network or should Grafana/Tempo/Loki all join `archiv-net`? Using a dedicated `obs-net` for observability services reduces blast radius (a compromised Tempo cannot initiate connections to the backend DB). But it requires the backend to join both networks (it already does for `archiv-net`; it would need `obs-net` if it ever pushes metrics directly to Prometheus). The current design (backend on `archiv-net`, Tempo on both) is correct for the OTLP push model — no change needed unless Prometheus pull is added later. Just wanted to make the reasoning explicit.

🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
This is a new infrastructure service with no authentication on its ingress ports. Here's the threat model:
Attack surface introduced:
- OTLP receivers on 4317/4318, reachable from any container on `archiv-net`
- Tempo HTTP API on 3200: `/api/search`, `/api/traces/{id}`, and admin endpoints

Findings:
1. **No authentication on OTLP receivers (expected, acceptable).** OTLP receivers on 4317/4318 are internal-network only (`expose:`, not `ports:`). On `archiv-net` the only processes that should be sending traces are the backend and (eventually) the OCR service. Since OTLP has no authentication built in, the only protection is network isolation — which `expose:` provides. This is standard practice for internal OTLP collectors. Acceptable as-is.
2. **Tempo HTTP API (3200) is unauthenticated — ensure Grafana is the only consumer.** The `/api/search` and `/api/traces/{id}` endpoints return trace data that may contain request parameters, user IDs, and internal service paths. Tempo 2.x has no built-in auth. Grafana authenticates users before they can query Tempo. The risk: if another service on `obs-net` or `archiv-net` can reach `tempo:3200`, it gets unauthenticated trace data. Verify that only Grafana can reach 3200.
3. **Trace data may contain sensitive values.** Spring Boot's default OTLP exporter includes HTTP request attributes in spans: `http.url`, `http.target`, query parameters, and potentially header values. If any endpoint accepts sensitive data in query params (e.g. `/api/auth/reset-password?token=...`), those values will appear in traces. The issue doesn't address span attribute sanitization — this should be handled in the backend instrumentation issue, but flag it now so it's not missed.
4. **30-day trace history on disk.** `block_retention: 720h` means 30 days of traces persist on the volume. If the VPS is compromised, trace history is available to an attacker. This is acceptable for the threat model of a private family archive — no regulatory obligations, no PII in traces beyond what's already in the backend — but worth documenting.

Recommendations
- Add a comment in `tempo.yml` explaining the auth model: "Tempo HTTP API is unauthenticated; access controlled by network isolation (obs-net) — only Grafana should reach port 3200." A sketch follows below.
- Span attribute sanitization (e.g. dropping `/api/auth/*` token params from trace attributes) should be addressed in the backend OTLP configuration issue.
- Grafana should reach Tempo at `http://tempo:3200` over the internal network only — never exposed externally. The `expose:` directive (not `ports:`) is the correct choice here. No concern.
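A sketch of how that comment could sit next to the listener/receiver config it describes; the receiver structure follows Tempo 2.x, and the explicit `0.0.0.0` binds are the safe choice in containers (recent Tempo releases default receivers to localhost):

```yaml
# tempo.yml (sketch) — security model, per the recommendation above:
# Tempo 2.x has NO built-in authentication.
#   - 3200 (HTTP API): only Grafana may reach this, via obs-net isolation.
#   - 4317/4318 (OTLP): only the backend (later: OCR) pushes here, via archiv-net.
#   - All ports are expose:-only; nothing is published to the host.
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"   # explicit bind for cross-container access
        http:
          endpoint: "0.0.0.0:4318"
```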
Open Decisions

No blocking security decisions — the network isolation model is correct for this deployment context.
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
This is an infrastructure issue, so my review focuses on the acceptance criteria quality, verifiability, and what's testable before vs after the dependency issue lands.
AC review:
- AC1: `docker compose ... up -d tempo` starts without error — verifiable standalone.
- AC2: `curl -s http://localhost:3200/ready` returns `ready` — port 3200 is not published via `ports:` in the spec, only via `expose:`. The curl must run inside the obs-net or use `docker exec`. Clarify in the AC.
- AC3: Tempo joins `archiv-net` — verifiable via `docker exec archive-backend nc -z tempo 4317 && echo ok`.
- AC4: `/api/search` returns trace data — only verifiable once the backend sends traces.

Issues:
1. **AC2 curl will fail from host.** Port 3200 is `expose:`d, not `ports:`d. `curl http://localhost:3200/ready` from the host machine will get connection refused. The test command needs to be either `docker exec obs-tempo wget -qO- http://localhost:3200/ready` or `docker run --network obs-net --rm curlimages/curl:8.5.0 curl -s http://tempo:3200/ready`.
2. **DoD says "all AC checked" but AC4 is blocked.** This will stall the PR review if taken literally. Recommend splitting the DoD: "AC1–3 verified before merge; AC4 verified as part of the backend instrumentation PR."
3. **No smoke test for volume persistence.** After a `docker compose restart tempo`, traces from before the restart should still be queryable. This verifies that the named volume is wired correctly. Easy to add: restart Tempo after AC4 is green, re-query `/api/search`, confirm traces persist. Add this as AC5.
4. **The `infra/observability/tempo/tempo.yml` path must be created.** The DoD mentions "config file committed alongside compose changes" — good. But the scaffold issue (which creates `infra/observability/`) is listed as a dependency. If the scaffold issue isn't merged first, the PR for this issue will have no parent directory to commit into. Consider either: (a) absorbing the directory creation into this PR, or (b) making the dependency explicit in Gitea's blocked-by relationship.

Recommendations
- Rewrite AC2 as `docker exec obs-tempo wget -qO- http://localhost:3200/ready` (the Tempo image ships `wget` via BusyBox).
- AC3: `docker exec archive-backend nc -z tempo 4317 && echo "OTLP gRPC reachable"` — add an `nc` availability check or fall back to `docker run --rm --network archiv-net busybox nc -z tempo 4317`.
- Add AC5: after `docker compose restart tempo`, traces from AC4 are still returned by `/api/search`.

📋 Elicit — Requirements Engineer
Observations
The issue is well-specified for an infrastructure ticket. I'm reviewing for gaps in requirement completeness and testability precision.

Strengths: context, service config, acceptance criteria, and DoD are all present.

Gaps:
"Scaffold issue" is unnamed and unlinked. The issue says "Depends on: scaffold issue (compose file and
infra/observability/directory must exist first)" — but doesn't link to the actual issue number. This means a developer picking this up cannot immediately navigate to the dependency. Add the issue number.AC2 verification command is ambiguous.
curl -s http://localhost:3200/readyassumes port 3200 is exposed to the host machine. The compose snippet usesexpose:(internal only), notports:. This AC will fail as written. (See also Sara's comment.)No NFR for startup time / resource usage. Tempo on a CX32 (8GB RAM, as referenced in Tobias's persona docs) with local storage should be lightweight at idle (~100MB RAM). No memory limit is specified. Given the OCR service already has a 12GB limit, Tempo should have an explicit
mem_limitto prevent resource contention. Suggestmem_limit: 512m— Tempo 2.x at idle uses ~80–150MB.No definition of "starts without error" in AC1. Does "without error" mean exit code 0 on the compose up command? Or does it mean the container status is
healthyafter the healthcheck passes? These are different — a container can berunningbut nothealthy. Recommend: "AC1:docker compose ... up -d tempocompletes, anddocker compose ps temposhows statusrunning (healthy)within 30 seconds."Data retention NFR is implicit. The config specifies
block_retention: 720h(30 days). The issue doesn't explain the rationale for 30 days. Is this a user requirement? A cost constraint? An operational decision? A brief comment ("matches Loki retention, see issue #NNN") would make this traceable.Missing: what happens when the volume fills up? Tempo's compactor handles retention automatically, but if writes exceed the retention policy (e.g. an unexpected trace volume spike), local storage could fill. No monitoring alert for disk usage is referenced. Not blocking for this issue, but should be in the Alertmanager backlog.
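Gaps 3 and 5 expressed as config, both as suggestions rather than committed values:

```yaml
# docker-compose.observability.yml (sketch) — cap Tempo's memory so it
# cannot contend with the 12GB OCR service on the same host
services:
  tempo:
    # ...
    mem_limit: 512m   # suggested; raise toward 768m if metrics_generator is enabled
```

```yaml
# tempo.yml (sketch) — make the retention decision traceable in place
compactor:
  compaction:
    block_retention: 720h   # 30 days — matches Loki retention, see issue #NNN
```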
Recommendations
- Tighten AC1: `docker compose ps tempo` must show status `healthy` within 30s.
- Rewrite AC2 to use `docker exec obs-tempo` (see Sara's comment).
- Add `mem_limit: 512m` to the Tempo service definition in the compose snippet.
- Add a comment in `tempo.yml` next to `block_retention` explaining the rationale.

Open Decisions
- The right `mem_limit` value for Tempo. 512MB is conservative and safe for a small family archive with low trace volume. If span-metrics generation is enabled (the `metrics_generator` block), memory usage increases. The right value depends on expected trace volume, which isn't specified. 512MB is a safe default to start with and can be increased if Grafana dashboards show memory pressure.

🎨 Leonie Voss — UX Designer & Accessibility Strategist
No UX or UI concerns from my angle. This issue adds a backend infrastructure service (Tempo) with no user-facing interface. There are no Svelte components, no design decisions, and no accessibility implications in scope.
One forward-looking note: when the Grafana dashboards for trace visualization are eventually added (as part of the Grafana issue in this milestone), I'll review the Grafana datasource and panel configuration from a user perspective — specifically whether trace waterfall views are navigable for the family members who may access Grafana for operational insight. But that's out of scope here.
This issue is clear to proceed from a UX standpoint.
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Infrastructure / Sequencing
1. `metrics_generator` block: include now or defer to Prometheus issue? Including it now means Tempo is already emitting span-metrics when Prometheus is added — no config change to a running instance. Deferring keeps this issue minimal and avoids configuring something with no consumer yet. (Raised by: Markus)
2. `mem_limit` for Tempo: 512MB or unconstrained? The OCR service already consumes up to 12GB on this VPS. Setting `mem_limit: 512m` prevents Tempo from competing with OCR under load. 512MB is safe for a low-traffic family archive. If `metrics_generator` is enabled, bump to 768MB. (Raised by: Elicit)

Network Topology
3. `obs-net` isolation: the current two-network design is recommended, but make the reasoning explicit. The current spec (backend on `archiv-net`, Tempo on `archiv-net` + `obs-net`, Grafana on `obs-net`) is the right call for the OTLP push model. No action needed unless you want a comment in the compose file explaining why `archiv-net: external: true` is set. (Raised by: Tobias — recommends keeping current design, just documenting it)

Implementation complete — branch `feat/issue-575-tempo`
All tasks from the issue are done. Here's what was committed in `de08ffe9`:

Files changed
- `infra/observability/tempo/tempo.yml` — new config file, including `metrics_generator` (service-graphs + span-metrics)
- `docker-compose.observability.yml` — added the `tempo` service — pinned `grafana/tempo:2.7.2`, healthcheck via `/ready`, `expose:` only (no host ports), joined `archiv-net` + `obs-net`
- `docs/DEPLOYMENT.md` — added the `obs-tempo` row to the observability services table
- `docs/architecture/c4/l2-containers.puml` — added the `tempo` container inside the observability boundary + `Rel(backend, tempo, ...)` for OTLP gRPC

Reviewer feedback addressed
- Healthcheck added per Tobias (`wget -qO- http://localhost:3200/ready`, `start_period: 15s`).
- `archiv-net: external: true` and `obs-net: driver: bridge` were already present in the scaffold. Named volume `tempo_data` was already declared.
- Comment added in `tempo.yml` explaining the auth model per Nora — Tempo API is unauthenticated; access controlled by `obs-net` network isolation.
- Image pinned to `grafana/tempo:2.7.2`.
Next step
Open PR against
main→/review-pr