devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure #584
Summary
Closes #572
- `docker-compose.observability.yml` at project root with `archiv-net` as external network, `obs-net` bridge network, and five named volumes (`prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data`)
- `.gitkeep` placeholder directories under `infra/observability/` for all upcoming service configs
- `# --- Observability ---` section to `.env.example` with all required env vars
Test plan
- `docker compose -f docker-compose.observability.yml config` exits 0
- `infra/observability/` subdirectories exist with `.gitkeep`
- `.env.example` contains the new Observability section
🤖 Generated with Claude Code
Creates the skeleton observability stack (no running services yet) that all subsequent Grafana LGTM + GlitchTip issues depend on:
- docker-compose.observability.yml: external archiv-net join, obs-net bridge, named volumes for all five services, placeholder comments for each service group (Metrics/Logs/Traces/Dashboards/Error Tracking), startup-order note
- infra/observability/{prometheus,loki,promtail,tempo,grafana/provisioning/{datasources,dashboards}}/.gitkeep
- .env.example: new # --- Observability --- section with PORT_GRAFANA, PORT_GLITCHTIP, PORT_PROMETHEUS, GLITCHTIP_DOMAIN, GLITCHTIP_SECRET_KEY (with generation hint), SENTRY_DSN, VITE_SENTRY_DSN
Verified: docker compose -f docker-compose.observability.yml config exits 0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🏗️ Markus Keller — Application Architect
Verdict: 🚫 Changes requested
Blockers
1. `docs/architecture/c4/l2-containers.puml` is not updated.
The architecture persona rules are explicit: "New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`" and "A doc omission is a blocker, not a concern — the PR does not merge until the diagram or text matches the code."
`docker-compose.observability.yml` introduces a second Docker Compose stack with a new external network (`archiv-net`) dependency and five new named volumes. The `l2-containers.puml` C4 diagram shows only the main stack — the observability stack boundary, and its relationship to `archiv-net`, must be added. Even a placeholder `System_Boundary(obs, "Observability Stack (phase 2)")` block with the planned services annotated as `' -- added in subsequent issues'` would satisfy the constraint. The diagram must reflect the infrastructure that exists in the repository.
2. `docs/DEPLOYMENT.md` §4 is stale.
The current §4 "Logs + observability" says: "Phase 7 of the Production v1 milestone adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet."
This is no longer accurate — the scaffold Compose file and `infra/observability/` directories now exist. §4 must acknowledge the new file, describe how to start the optional stack (`docker compose -f docker-compose.observability.yml up -d`), and note that services are populated in subsequent issues. The ownership note at the top of `DEPLOYMENT.md` is unambiguous: "Update this file in any PR that changes the container topology."
Suggestions (not blockers)
3. ADR consideration for the two-stack topology.
Splitting the stack into `docker-compose.yml` + `docker-compose.observability.yml` is an architectural decision with lasting operational consequences: startup order dependency, separate `down` commands, and the `archiv-net: external: true` coupling. This is small enough that a note in the PR body or a comment block at the top of the compose file explaining why the split was chosen (vs. a `profiles:` approach in the main compose) would suffice — no full ADR required at the scaffold stage, but the reasoning should be recorded before the services are wired in.
4. `obs-net` bridge has no `name:` override.
Without an explicit `name:`, Docker Compose will prefix the network as `<project>_obs-net`, where `<project>` defaults to the directory name. If the directory is `familienarchiv`, the network becomes `familienarchiv_obs-net`. This is fine for now, but when services are added that need to reference the network by name from outside the Compose context (e.g. Prometheus scrape config), the auto-generated name is fragile. Worth adding `name: obs-net` alongside the `driver: bridge` declaration, matching the `archiv-net` external pattern.
What is done well
The file structure is clean: the comment block at the top explains the startup order, the `# No services defined yet` placeholder is honest, and the volume names match Tobias's semantic naming convention exactly. The `.env.example` additions are thorough and the security comments on `GLITCHTIP_SECRET_KEY` are precisely what the comment policy requires.
@mkeller
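Suggestion 4 in the review above could be sketched roughly as follows. This is a hypothetical fragment of `docker-compose.observability.yml`, not the file from this PR; it assumes the Compose Specification's `name:` attribute on a network and reuses only the network names mentioned in the review.

```yaml
# docker-compose.observability.yml (hypothetical fragment, for illustration only)
networks:
  archiv-net:
    external: true   # owned by the main stack; must already exist
  obs-net:
    driver: bridge
    name: obs-net    # fixed name instead of the auto-generated <project>_obs-net
```

With a fixed `name:`, the network would be called `obs-net` regardless of the checkout directory or the `-p` project name, so anything that refers to it from outside the Compose project keeps working if the directory is renamed.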
⚙️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Blockers
None. The compose file is structurally sound and would pass `docker compose config`.
Concerns (should fix before services are wired in)
1. Image tags will be needed in follow-up issues — flag the pattern now.
The file currently has `services: {}`, so there is nothing to pin yet. However, the comment block names six services (prometheus, loki, promtail, tempo, grafana, glitchtip). Every one of those follow-up PRs must pin to a specific version tag — no `:latest`. Noting this expectation in a comment alongside each placeholder would make it harder for a future PR to skip the discipline.
This is a suggestion for the scaffold, not a blocker here.
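The kind of placeholder comment described in concern 1 might look like the sketch below. The image repositories and the `X.Y.Z` tags are assumptions chosen for illustration and are not taken from the PR; each follow-up PR would substitute a real, tested version.

```yaml
# docker-compose.observability.yml (hypothetical sketch of the placeholder comments)
services: {}

# --- Metrics ---
# prometheus:
#   image: prom/prometheus:vX.Y.Z     # pin an explicit tag; never :latest
# --- Logs ---
# loki:
#   image: grafana/loki:X.Y.Z         # pin an explicit tag; never :latest
# promtail:
#   image: grafana/promtail:X.Y.Z     # pin an explicit tag; never :latest
# --- Traces ---
# tempo:
#   image: grafana/tempo:X.Y.Z        # pin an explicit tag; never :latest
# --- Dashboards ---
# grafana:
#   image: grafana/grafana:X.Y.Z      # pin an explicit tag; never :latest
# --- Error Tracking ---
# glitchtip:
#   image: glitchtip/glitchtip:vX.Y.Z # pin an explicit tag; never :latest
```

Because everything below `services: {}` is a comment, the file still parses as an empty service set; the comments only record the pinning expectation next to each future service.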
2. The five named volumes pre-exist the services that use them.
Docker Compose will create `prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, and `glitchtip_data` the moment `docker compose -f docker-compose.observability.yml up -d` is run — even with `services: {}`. On a clean host this is fine. On an existing production host, this creates five empty named volumes that accumulate alongside the main stack's volumes. Consider documenting this side-effect in `docs/DEPLOYMENT.md` so operators know to prune them if they abandon the observability stack before it is populated. Not a blocker at scaffold stage.
3. `GLITCHTIP_DOMAIN` default value is `http://localhost:3002`.
In `.env.example` the comment correctly warns about the production value. However, a developer who copies `.env.example` to `.env` and starts the stack will have a working GlitchTip at `localhost:3002` — but any error reports from the backend will use this domain in envelope headers, which will cause GlitchTip to reject reports from other machines. The comment should explicitly say: "For any shared dev/staging environment, set this to the actual hostname before starting the stack." Minor — the inline comment is better than most projects manage.
4. `PORT_PROMETHEUS=9090` — potential host collision.
Port 9090 is the canonical Prometheus port and is likely already used if the operator runs any other Prometheus instance on the host. The `.env.example` comment doesn't mention this. Suggest adding: "# Change if port 9090 is already in use on the host". Very minor.
What is done well
- `archiv-net: external: true` is exactly the right pattern. It makes the startup dependency explicit and gives a clear error message if the main stack is not running.
- `obs-net: driver: bridge` — correct isolation pattern. Observability-internal traffic stays on this network; only Prometheus leaves it via `archiv-net`.
- `services: {}` scaffold approach — deliberately empty with issue references is far cleaner than a half-wired compose file that doesn't actually run.
- `.env.example` additions — the `GLITCHTIP_SECRET_KEY` warning and `SENTRY_DSN` / `VITE_SENTRY_DSN` empty-default pattern (leave empty to disable) is well-thought-out. The "fail-closed" note on the secret key is excellent.
- `.gitkeep` directories — right approach for preserving directory structure without committing config files that don't exist yet.
@tobiwendt
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
Blockers
None. This is an infrastructure scaffold with no running services and no attack surface yet. The patterns established here are sound.
Concerns
1. `GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret` — placeholder value in `.env.example`.
The comment above it is exactly right: "REQUIRED in production — must not be empty or 'changeme'. Fail-closed: GlitchTip will refuse to start with an invalid key." This is the correct way to document a secret in `.env.example`. No issue here — documenting the intent is appropriate.
One gap to address before GlitchTip is wired in: the production secret must flow through Gitea Secrets → workflow `env:` → `.env.production` (matching the pattern already used for `PROD_POSTGRES_PASSWORD`, `PROD_MINIO_PASSWORD`, etc.). The `.env.example` comment should explicitly mention the Gitea secret name that should carry this value in CI/CD, so the operator knows where to put it. This prevents the operator from setting it only in `.env` and forgetting to wire it through the workflow.
2. `SENTRY_DSN` and `VITE_SENTRY_DSN` are empty by default — correct, but document the principle explicitly.
The comment "leave empty to disable the SDK (safe default)" is good. When these are eventually populated, they must come from Gitea Secrets as well (they are sensitive — a DSN allows anyone who knows it to inject error reports). A brief note in `.env.example` to that effect would close the loop.
3. `PORT_GLITCHTIP=3002` — host-bound port.
When GlitchTip is added, its container port should be bound to `127.0.0.1:3002` only (matching the existing pattern for other services), not `0.0.0.0:3002`. The `.env.example` does not configure this, but the follow-up PRs that add the actual service definition must use `"127.0.0.1:${PORT_GLITCHTIP}:8000"`, not `"${PORT_GLITCHTIP}:8000"`. The scaffold PR sets the precedent — a brief comment in `.env.example` to this effect would help.
What is done well
- `GLITCHTIP_SECRET_KEY` — exactly the right pattern. Too many projects document secrets as optional when they are mandatory in production.
- `.gitkeep` files contain nothing sensitive.
- `archiv-net: external: true` — the explicit external network dependency prevents the observability stack from starting without the main stack, which is a correct operational guard.
@NullX
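The loopback binding from concern 3 of this review could be sketched as below. The service name, image repository, and `X.Y.Z` tag are placeholders assumed for illustration; only the `ports` string comes from the review itself.

```yaml
# docker-compose.observability.yml (hypothetical fragment for a follow-up PR)
services:
  glitchtip:
    image: glitchtip/glitchtip:vX.Y.Z   # placeholder tag; pin a real version
    ports:
      # Host side bound to loopback only; container port 8000 as named in the review.
      - "127.0.0.1:${PORT_GLITCHTIP}:8000"
    networks:
      - obs-net

networks:
  obs-net:
    driver: bridge
```

In the short `ports` syntax, omitting the host IP publishes the port on all host interfaces, which is why the review asks for the explicit `127.0.0.1:` prefix.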
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
This is a pure scaffold PR — no services are running, no application code is changed, and no new behaviour is introduced. There is nothing to test at the unit or integration layer.
Checked
PR test plan is complete and correct.
The three checklist items in the PR description are the right ones for this change:
- `docker compose -f docker-compose.observability.yml config` exits 0 — this validates the YAML syntax and network/volume references without starting anything.
- `infra/observability/` subdirectories exist with `.gitkeep` — a simple `git ls-files` or `ls` check.
- `.env.example` contains the new Observability section — a `grep` check.
These are exactly the level of testing appropriate for a scaffold. They are fast, deterministic, and can run in CI without a Docker daemon.
Suggestions (not blockers)
Add the `docker compose config` check to CI.
The main compose file already passes through CI linting (implicitly via the deployment workflow). The observability compose file does not. It would be worth adding a step to an appropriate CI job — perhaps the existing `unit-tests` or `build` workflow — to run `docker compose -f docker-compose.observability.yml config`.
This catches syntax errors and undefined variable references before they reach production. Without this, a typo in a follow-up PR could go undetected until someone actually tries to start the stack.
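A minimal sketch of such a CI step, assuming a Gitea Actions workflow with GitHub-Actions-compatible syntax. The file path, job name, and runner label are hypothetical; only the `docker compose ... config` command comes from the PR's test plan.

```yaml
# .gitea/workflows/ci.yml (hypothetical path, job name, and runner label)
jobs:
  compose-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate observability compose file
        # --quiet only validates the file and prints nothing on success
        run: docker compose -f docker-compose.observability.yml config --quiet
```

This assumes the runner has the Docker CLI with the Compose plugin installed. Note that an unset variable normally produces only a warning during interpolation; to make a missing variable fail the step, the compose file would need the `${VAR:?message}` form for that variable.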
The test plan items are all manual. That is acceptable for a scaffold, but when the first real service (Prometheus, per issue #573) is added, a CI smoke test should be introduced that starts the stack and checks the health endpoint. That is a concern for the #573 PR, not this one.
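For the later smoke test, one possible shape is sketched below as workflow steps. It assumes Prometheus is published on `PORT_PROMETHEUS` (falling back to 9090), that the runner has `curl`, and that the external `archiv-net` network has to be created first because the main stack is not running in CI; none of this is defined by the PR.

```yaml
# Hypothetical steps for the #573 follow-up; names and ordering are assumptions.
steps:
  - uses: actions/checkout@v4
  - name: Start observability stack
    run: |
      # The compose file expects this external network to exist already.
      docker network create archiv-net || true
      # --wait blocks until the services are running (or healthy, if a healthcheck is defined).
      docker compose -f docker-compose.observability.yml up -d --wait
  - name: Check Prometheus health endpoint
    run: curl --fail --silent "http://localhost:${PORT_PROMETHEUS:-9090}/-/healthy"
  - name: Tear down
    if: always()
    run: docker compose -f docker-compose.observability.yml down
```

The `if: always()` teardown is there so a failed health check does not leave containers behind on a shared runner.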
What is done well
- `.gitkeep` files are committed correctly (empty, in the right directories), so the directory structure is verifiable via `git`.
- `services: {}` scaffold approach means there is no risk of an accidentally-running-but-misconfigured service in CI.
@saraholt
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This PR contains no application code — no Java, no TypeScript, no Svelte, no Python. My review is scoped to what I can meaningfully evaluate: YAML structure, naming discipline, and documentation hygiene.
Checked
YAML structure and naming.
`docker-compose.observability.yml` follows the project's conventions cleanly:
- `prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data` — each tells you what it stores.
- `services: {}` placeholder with issue references is clear and honest — no dead config disguised as real config.
`.env.example` additions.
The new section follows the existing pattern: grouped under a header comment, each var explained, sensitive values documented with generation instructions. The `GLITCHTIP_SECRET_KEY` comment is particularly good — it says what happens if you get it wrong, not just what the value is.
`.gitkeep` files.
Six empty files in six correct directories. Nothing to flag.
Suggestions (not blockers)
`glitchtip_data` volume — note the Django dependency.
GlitchTip is a Django application and its volume will hold Django's database (SQLite by default, or a connected PostgreSQL). The follow-up PR that adds the GlitchTip service should configure it to use the project's existing PostgreSQL instance rather than embedding SQLite — this keeps the data model consistent and backup-able via the existing `pg_dump` procedure. This is a concern for the GlitchTip service PR, not this one, but worth flagging now so the volume design decision (separate PostgreSQL volume vs. shared database) is made deliberately.
What is done well
The file is minimal, honest, and well-commented. It establishes the right directory structure for the follow-up work without over-building. The PR does exactly what it says it does.
@felixbrandt
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Requirements engineering review is largely out-of-scope for a pure infrastructure scaffold PR. My check focuses on whether the PR faithfully delivers what issue #572 specified, and whether the work introduced creates any requirements gaps downstream.
Checked
Traceability: PR → Issue #572.
The PR description maps cleanly to a scaffold deliverable:
- `docker-compose.observability.yml` created with the correct network topology
- `infra/observability/` subdirectories with `.gitkeep` placeholders
- `.env.example` updated with the observability section
The PR title, commit references, and "Closes #572" are all present. This is good issue-driven development discipline.
Test plan completeness.
The three acceptance criteria in the PR test plan are all verifiable and correspond to observable outcomes (not internal state). They match what a Requirements Engineer would write as Given-When-Then criteria for a scaffold story. No gaps.
Observations (not blockers)
`glitchtip_data` volume raises an unresolved data model question.
GlitchTip is an error tracking service. It captures exception data from the backend (Java SDK via `SENTRY_DSN`) and frontend (Vite build via `VITE_SENTRY_DSN`). This data includes stack traces, user context, and request metadata. Before GlitchTip is wired in, the following requirements questions should be answered in the relevant follow-up issue:
- `glitchtip_data` volume name is ambiguous on this point.
These are questions for the GlitchTip service PR (a future issue), not this one. But they should be included in the acceptance criteria of that issue.
DSN wiring creates a build-time dependency for the frontend.
`VITE_SENTRY_DSN` is injected at build time (per the comment). This means the frontend Docker image must be rebuilt whenever the DSN changes. This constraint should be documented in the issue that adds the frontend SDK integration — if a developer assumes it can be changed at runtime (like most env vars), they will be surprised.
What is done well
The PR description is unusually thorough for an infrastructure scaffold. The summary, test plan, and issue reference are all present and correct. This is what good issue-driven delivery looks like.
@Elicit
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR contains no UI changes — no Svelte components, no routes, no CSS, no i18n strings. There is nothing for me to evaluate from a UX, accessibility, or design system perspective.
I have checked that no frontend files were touched (confirmed: 8 changed files, all infrastructure — `.env.example`, `docker-compose.observability.yml`, and six `.gitkeep` files).
LGTM from a UI/UX standpoint. I will pick this up again when the Grafana and GlitchTip dashboards are wired in and exposed through the UI.
@leonievoss
Review blockers fixed in cf8d22d8:
Blocker 1 — `docs/DEPLOYMENT.md` §4 Logs + observability
Replaced the stale "No monitoring infrastructure is in place yet" paragraph with a brief note that `docker-compose.observability.yml` and `infra/observability/` exist, the stack joins `archiv-net`, and full documentation is tracked in issue #581.
Blocker 2 — `docs/architecture/c4/l2-containers.puml`
File exists. Added a `System_Boundary` block for the observability stack with placeholder containers for Prometheus, Loki, and Grafana, each noting they connect via `archiv-net` and referencing issue #581 for wiring details.