devops(observability): scaffold docker-compose.observability.yml and infra/observability/ directory structure #572
Context
The application lives in docker-compose.yml. We need a separate docker-compose.observability.yml at the project root so the observability stack can be started and stopped without touching the application. The two stacks communicate via the existing archiv-net Docker bridge network, which the observability compose joins as an external network.
This issue creates the skeleton only: no running services yet, just the file/directory structure and network wiring that all subsequent observability issues depend on.
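As an illustration of the network wiring described above, the following shell sketch writes a minimal skeleton into a scratch directory. The YAML contents are an assumption derived from this description (an empty services map plus archiv-net marked as external), not the project's actual file.

```shell
# Sketch only: the YAML below is an assumed minimal skeleton, not the real file.
workdir="$(mktemp -d)"
compose_file="$workdir/docker-compose.observability.yml"

cat > "$compose_file" <<'EOF'
# Start the application stack first so that archiv-net exists:
#   docker compose up -d
services: {}

networks:
  archiv-net:
    external: true  # join the existing network; never create it here
EOF

# Show what was written.
cat "$compose_file"
```

Because archiv-net is marked external, Compose expects the network to exist already, which is why the application stack has to be started first.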
Acceptance Criteria
- docker-compose.observability.yml exists at the project root
- archiv-net is declared as external: true in the observability compose, so Prometheus can reach archive-backend by container name
- Named volumes declared: prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data
- Directories created (each with a .gitkeep): infra/observability/prometheus/, infra/observability/loki/, infra/observability/promtail/, infra/observability/tempo/, infra/observability/grafana/provisioning/datasources/, infra/observability/grafana/provisioning/dashboards/
- .env.example has a new # --- Observability --- section with the observability vars and sensible defaults
- docker compose -f docker-compose.observability.yml config exits 0 (all referenced env vars have defaults or are in .env.example)
Implementation Notes
- The archiv-net network is created by the main docker-compose.yml; the observability compose must not attempt to create it.
- [...] /infra/observability/prometheus/prometheus.yml
Files to Create / Modify
- docker-compose.observability.yml
- infra/observability/prometheus/.gitkeep
- infra/observability/loki/.gitkeep
- infra/observability/promtail/.gitkeep
- infra/observability/tempo/.gitkeep
- infra/observability/grafana/provisioning/datasources/.gitkeep
- infra/observability/grafana/provisioning/dashboards/.gitkeep
- .env.example: add # --- Observability --- block
Definition of Done
- docker compose -f docker-compose.observability.yml config exits 0
- [...] main
🏗️ Markus Keller — Senior Application Architect
Observations
- The archiv-net: external: true pattern is the correct way to join an existing network without attempting to own it. Good call: this is the main structural decision and it's made correctly.
- The named volumes (prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data) are declared upfront before any services exist. This creates orphaned volume declarations in docker compose config output, and Docker Compose will warn about volumes with no consumer. That's cosmetic for now but worth noting.
- Documentation affected: docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md. These are not listed in the "Files to Create / Modify" table in the issue.
- The docker compose … config exits 0 criterion forces the implementor to write at least a skeleton structure. This constraint is correct and will flush out the issue.
Recommendations
- Add a services: {} block (or a single placeholder comment under services:) so docker compose config validates without errors. Docker Compose requires at least a services key.
- Add docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md to the "Files to Create / Modify" table. Even for a skeleton issue, the container diagram should reflect that an observability stack is incoming. A comment block in the PlantUML with ' TODO: observability services (issue #572+) is sufficient for now and prevents doc drift.
Open Decisions
- Volume declaration timing: with the upfront approach, docker compose config will warn about unused volumes. Both approaches work; the upfront approach is slightly cleaner for the milestone arc. No objection to the spec as written; noting it as a conscious choice.
👨💻 Felix Brandt — Senior Fullstack Developer
Observations
- [...] .env.example additions.
- docker compose -f docker-compose.observability.yml config exits 0 is a clean, executable definition of done. Good: this can be verified in CI without running any containers.
- The .env.example additions include GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret. The word "changeme" in a secret key variable is the pattern Nora will flag on the security side, but from a DX perspective, the inline generation hint (python3 -c "import secrets; print(secrets.token_hex(32))") that already exists in .env.example for OCR_TRAINING_TOKEN should be applied here too. It's a much better developer hint than a freeform string.
- SENTRY_DSN= and VITE_SENTRY_DSN= are listed as empty-by-default vars. An empty SENTRY_DSN is fine: GlitchTip/Sentry SDKs treat an empty DSN as "disabled." No issue there.
- PORT_GLITCHTIP=3002 does not conflict with PORT_GRAFANA=3001. The main docker-compose.yml already uses ports 5173 (frontend), 8080 (backend), 9000/9001 (MinIO), 5432 (DB), 8025/1025 (Mailpit). The observability ports 3001/3002 are new, so there is no collision with the app stack. That's correct.
Recommendations
- Use the OCR_TRAINING_TOKEN comment pattern for GLITCHTIP_SECRET_KEY, matching what .env.example already does; stay consistent.
- The .gitkeep files in infra/observability/*/ should be committed as truly empty files, not as files containing a comment. Verify this in the implementation; some editors or templates add content.
- Add placeholder comments instead of leaving the services key empty. This makes the intended structure readable and acts as scaffolding for the next implementor.
🔧 Tobias Wendt — DevOps & Platform Engineer
Observations
- The main docker-compose.yml uses minio/minio:latest (flagged in my persona as an anti-pattern). The observability compose is a good opportunity to establish pinned tags as the pattern for new services. The issue doesn't mention image tags since no services are defined yet, but the follow-up issues must pin every image (e.g. prom/prometheus:v2.53.0, grafana/grafana:11.0.0, grafana/loki:3.0.0).
- The PORT_GRAFANA=3001 and PORT_GLITCHTIP=3002 defaults make sense for local dev. In production, these services should not be publicly exposed: Grafana sits behind Caddy (internal-only or auth-gated), and GlitchTip has its own auth. The issue is silent on this, which is fine for a skeleton issue, but the follow-up Grafana and GlitchTip issues should explicitly address Caddy routing and auth.
- archiv-net is already declared in the main compose and is created by docker compose up. The external: true declaration in the observability compose is correct: it means the observability stack will fail with a clear error if the app stack isn't up, rather than silently creating a disconnected network.
- The infra/observability/ directory structure mirrors the existing infra/caddy/ pattern. That's consistent and correct.
- The infra/ directory already has caddy/, gitea/, minio/, fail2ban/ subdirectories, so adding infra/observability/ is a natural extension.
Recommendations
- Add a header comment to docker-compose.observability.yml explaining the startup dependency.
- The docker compose -f docker-compose.observability.yml config validation in the acceptance criteria is excellent. Consider adding it as a CI job step in the existing CI pipeline (a simple docker compose config dry-run job). This prevents the observability compose from silently drifting into an invalid state as env vars and services are added in subsequent issues.
- Update docs/DEPLOYMENT.md §1 to mention the observability compose as a parallel optional stack. Even a single sentence ("An optional observability stack is available via docker-compose.observability.yml; see milestone 'Observability Stack'") prevents confusion when someone reads the deployment docs and wonders why there are two compose files.
Open Decisions
🔐 Nora Steiner (NullX) — Application Security Engineer
Observations
- GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret: This is a Django SECRET_KEY equivalent. GlitchTip uses it to sign sessions and cryptographic tokens. Shipping a known-string default means anyone who deploys without changing it has predictable session signing. The existing OCR_TRAINING_TOKEN already sets a better precedent with a generation hint; apply the same pattern here.
- SENTRY_DSN= and VITE_SENTRY_DSN=: Empty DSN is safe (SDK disables itself). No issue.
- Declaring PORT_GRAFANA=3001 and PORT_GLITCHTIP=3002 as published host ports means these services will be reachable on the VPS host on those ports. In the current dev setup, all ports are bound to all interfaces (0.0.0.0). The production setup binds to 127.0.0.1 only (per docs/DEPLOYMENT.md), so this is fine for production but a mild exposure risk in dev if the dev machine is accessible on a LAN. This issue is pre-existing in the app stack too; noting for completeness.
- archiv-net: external: true means observability containers join the same bridge network as the application containers. This gives Prometheus, Loki, etc. direct network access to archive-backend, archive-db, and archive-minio by container name. This is intentional and necessary for scraping, but it also means a compromised Grafana container has network-level access to the database port. Acceptable for a family project on a single VPS; worth documenting.
- The .env.example vars look clean.
- GLITCHTIP_DOMAIN=http://localhost:3002: Using HTTP (not HTTPS) for the GlitchTip domain means Sentry SDK error reports will be sent unencrypted in production unless this is overridden. The .env.example should comment that production requires https://your-domain.example.com/glitchtip (or similar).
Recommendations
- Change the GLITCHTIP_SECRET_KEY default to empty with a mandatory generation notice, following the OCR_TRAINING_TOKEN pattern.
- Add a production HTTPS note for GLITCHTIP_DOMAIN.
- Document that archiv-net is a shared trust boundary and observability services have full internal network access. This is a documented accepted risk for a single-VPS family project, not a blocker.
No hard blockers for this skeleton issue: the security concerns above are either documentation gaps or configuration defaults to improve before services are actually wired up.
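The two .env.example recommendations above could look roughly like the excerpt below, written in shell-compatible dotenv syntax. The comment wording is an assumption; the variable names, the localhost default, the example production URL, and the token_hex(50) length are taken from this thread.

```shell
# --- Observability --- (illustrative excerpt, not the project's actual file)

# Django-style secret used by GlitchTip to sign sessions and tokens.
# REQUIRED: leave no placeholder here. Generate a real value with:
#   python3 -c "import secrets; print(secrets.token_hex(50))"
GLITCHTIP_SECRET_KEY=

# Local development default. Production must override this with an https:// URL,
# for example https://your-domain.example.com/glitchtip
GLITCHTIP_DOMAIN=http://localhost:3002
```

An empty default makes a forgotten secret visible at first start instead of silently shipping a known string.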
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
- The acceptance gate is docker compose -f docker-compose.observability.yml config exits 0. This is the right gate for a skeleton issue: it's deterministic, fast, and verifiable in CI without running any containers.
- The .gitkeep files are the only deliverables that don't have an automated check. Their existence is implicitly verified by git: if they're not committed, the directories don't exist in the repo.
Recommendations
- Add a CI step that runs docker compose config on the observability compose file. This is a one-liner that fits into the existing CI workflow.
- Validate .env.example completeness in CI. The acceptance criterion says "all referenced env vars have defaults or are in .env.example". This is currently a manual check. A simple script (e.g. docker compose config with a populated .env derived from .env.example) would make this automatic. This is a broader CI improvement, not specific to this issue, but this issue is a good trigger to request it.
- Make the docker compose config exit code check explicit in the PR checklist: the implementor must run docker compose -f docker-compose.observability.yml config locally before opening the PR, and the CI job must run it on every push.
No blocking QA concerns for this skeleton issue. The risk profile is low: no production data, no running services, no network changes to the existing stack.
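One way to script the check recommended above is sketched below. The function name validate_observability_compose is hypothetical, and instead of copying .env.example to .env the sketch passes it with the --env-file option so an existing local .env is not overwritten; both choices are assumptions, not something the issue prescribes.

```shell
# Validate a compose file, resolving variables from an example env file.
# Usage: validate_observability_compose [compose-file] [env-file]
validate_observability_compose() {
  compose_file="${1:-docker-compose.observability.yml}"
  env_file="${2:-.env.example}"

  if [ ! -f "$compose_file" ]; then
    echo "missing compose file: $compose_file" >&2
    return 1
  fi
  if [ ! -f "$env_file" ]; then
    echo "missing env file: $env_file" >&2
    return 1
  fi

  # --quiet validates the configuration without printing the rendered output.
  docker compose --env-file "$env_file" -f "$compose_file" config --quiet
}
```

Called without arguments from the repository root, it checks docker-compose.observability.yml against .env.example and returns the exit code of docker compose config, which a CI step can use directly.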
📋 Elicit — Requirements Engineer
Observations
- The docker compose config criterion says "all referenced env vars have defaults or are in .env.example", but docker compose config only validates YAML structure and interpolates known vars. It does not fail if a var is referenced in the compose file but missing from .env.example. The real mechanism is: if a var has no default and is not in .env, the command fails. The acceptance criterion should clarify that the validator must copy .env.example to .env before running the check, otherwise the criterion is misleading.
- infra/observability/prometheus/.gitkeep being committed is implied but not explicit.
- The .env.example section lists SENTRY_DSN= and VITE_SENTRY_DSN= without explaining the distinction. A comment clarifying which is used server-side (backend Spring Boot / GlitchTip SDK) vs. client-side (Vite / frontend Sentry SDK) would improve the .env.example for operators who aren't familiar with the DSN split.
Recommendations
- Clarify the docker compose config acceptance criterion so it states that .env.example is copied to .env before the check.
- Add comments to the .env.example additions distinguishing SENTRY_DSN (backend) from VITE_SENTRY_DSN (frontend/Vite build).
These are minor clarifications; the issue is implementation-ready as written. The ambiguities are edge cases that a careful implementor will handle correctly, but making them explicit removes any guesswork.
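The DSN distinction recommended above might be documented along these lines. This is a sketch in shell-compatible dotenv syntax; the comment text is an assumption, and only the variable names and their empty defaults come from the issue.

```shell
# Backend DSN: read at runtime by the server-side Sentry SDK (Spring Boot backend).
# Empty means error reporting is disabled.
SENTRY_DSN=

# Frontend DSN: read by Vite at build time for the browser-side Sentry SDK.
# Empty means error reporting is disabled.
VITE_SENTRY_DSN=
```

Keeping one comment per variable lets an operator set only the side they actually want to report errors from.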
🎨 Leonie Voss — UX Designer & Accessibility Strategist
No UI or frontend changes in this issue: it is purely infrastructure scaffolding (Docker Compose file, config directories, .env.example additions).
From my angle, I checked:
- Any future links to the observability tools should follow the rel="noopener noreferrer" convention for external tool links and use the standard <BackButton> pattern for navigation.
No concerns from my angle on this skeleton issue. I'll re-engage when there are UI touchpoints in the observability milestone (e.g., if a status/health indicator is added to the admin panel).
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Architecture
Volume declaration timing: Should named volumes (prometheus_data, loki_data, etc.) be declared in this skeleton issue (as the spec says), or deferred to the issue that introduces each service? The upfront approach means docker compose config will warn about volumes with no consumer; the deferred approach keeps each issue self-contained. No strong objection to either, but pick one and stick to it across the milestone. (Raised by: Markus)
ADR for LGTM + GlitchTip: The Grafana LGTM stack + GlitchTip represents ~5 new services and ~1 GB RAM at runtime. Should an ADR be written in this milestone capturing "why LGTM + GlitchTip over alternatives (Datadog, Sentry SaaS, single Prometheus only, etc.)"? The ADR doesn't block this skeleton issue but should be filed before the stack goes live. (Raised by: Markus)
Infrastructure
Grafana access: Should Grafana be reachable via Caddy (archiv.example.com/grafana) or only via SSH tunnel (ssh -L 3001:localhost:3001)? The Caddy approach is more convenient but adds a public-facing auth surface; the SSH tunnel has zero exposure risk but requires operator access for every use. This belongs in the Grafana-specific follow-up issue, but the decision affects whether PORT_GRAFANA needs a Caddy vhost entry. (Raised by: Tobias)
Arch: Volume can stay here. Create an ADR.
Grafana should be accessible via subdomain: grafana.raddatz.cloud
Implementation complete — Felix Brandt
All acceptance criteria met. Branch feat/issue-572-observability-scaffold is pushed and ready for PR.
Commit
1d42be98: devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure
What was delivered
docker-compose.observability.yml (project root)
- archiv-net declared as external: true; joins the app stack's network without attempting to create it
- obs-net declared as driver: bridge; internal observability-to-observability traffic (Grafana → Prometheus/Loki/Tempo)
- Named volumes: prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data
- services: {} as a valid Docker Compose skeleton; placeholder comments document which issue adds each service group (Metrics #573, Logs #574/#575, Traces, Dashboards, Error Tracking)
- Header comment: docker compose up -d must run first to create archiv-net
infra/observability/ directory structure, with 6 empty .gitkeep files committed:
- prometheus/, loki/, promtail/, tempo/
- grafana/provisioning/datasources/, grafana/provisioning/dashboards/
.env.example, with a new # --- Observability --- section containing:
- PORT_GRAFANA=3001, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090
- GLITCHTIP_DOMAIN=http://localhost:3002 (with production HTTPS note)
- GLITCHTIP_SECRET_KEY with generation command (python3 -c "import secrets; print(secrets.token_hex(50))"), mirroring the OCR_TRAINING_TOKEN pattern per Felix + Nora's recommendations
- SENTRY_DSN= and VITE_SENTRY_DSN= with separate comment lines clarifying backend vs. frontend usage
Reviewer recommendations applied
- Felix: GLITCHTIP_SECRET_KEY generation hint added (mirrors OCR_TRAINING_TOKEN pattern)
- Felix: .gitkeep files committed as truly empty (verified with touch)
- Nora: GLITCHTIP_DOMAIN production HTTPS note added; GLITCHTIP_SECRET_KEY generation command uses token_hex(50) per her recommendation
- Tobias: startup-dependency header comment added (docker compose up -d must precede the observability stack)
- Elicit: SENTRY_DSN / VITE_SENTRY_DSN distinction clarified via separate comment lines
Verification
Next step
Open PR against main and run /review-pr. Then proceed to issue #573 (Prometheus service definition).