docs: document observability stack in DEPLOYMENT.md and CLAUDE.md #581
Context
The observability stack adds a new compose file, new service ports, new env vars, and a manual first-run step (creating the GlitchTip superuser and projects). Without documentation, a new developer (or you, six months from now) cannot get the stack running from scratch.
Depends on: all other observability issues (document the final state, not intermediate states)
Files to Update
- `docs/DEPLOYMENT.md` — add a new top-level section "Observability Stack" after the existing application stack section. It must cover:
  - Starting and stopping
  - Service URLs:

    | Service | URL | Credentials |
    |---|---|---|
    | Grafana | http://localhost:3001 | admin / `$GRAFANA_ADMIN_PASSWORD` |
    | GlitchTip | http://localhost:3002 | superuser created during first-run setup |
    | Prometheus | http://localhost:9090 | none |

  - First-run setup (required once per environment)
  - Generating a `GLITCHTIP_SECRET_KEY`
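The generation command presumably follows the `secrets.token_hex(32)` pattern the reviewers mention further down; a minimal sketch (the `GLITCHTIP_SECRET_KEY=` prefix is only an illustration of the `.env` line format):

```python
# Sketch: generate a value for GLITCHTIP_SECRET_KEY with Python's stdlib,
# mirroring the secrets.token_hex(32) pattern used for OCR_TRAINING_TOKEN.
import secrets

key = secrets.token_hex(32)  # 32 random bytes -> 64 lowercase hex characters
print(f"GLITCHTIP_SECRET_KEY={key}")
```

Appending the printed line to `.env` (e.g. via `>> .env`) matches the pattern described in the security review section.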
- `CLAUDE.md` — update the stack overview table (under "Infrastructure") to include:
  - `obs-grafana`, `obs-prometheus`, `obs-loki`, `obs-tempo`, `obs-glitchtip`

  Also add the new env vars to the `.env` reference section:
  - `PORT_GRAFANA`, `PORT_GLITCHTIP`, `PORT_PROMETHEUS`
  - `GRAFANA_ADMIN_PASSWORD`
  - `GLITCHTIP_SECRET_KEY`, `GLITCHTIP_DOMAIN`
  - `SENTRY_DSN`, `VITE_SENTRY_DSN`

Acceptance Criteria
- `docs/DEPLOYMENT.md` has an "Observability Stack" section covering all steps above
- A developer following only `docs/DEPLOYMENT.md` can get the full stack running from a fresh clone without consulting any other file
- The `CLAUDE.md` stack table includes all five new services with correct container names and ports
- The `CLAUDE.md` env var reference lists all new observability vars

Definition of Done
🏗️ Markus Keller — Senior Application Architect
Observations
- The issue names `docs/DEPLOYMENT.md` and `CLAUDE.md` as the two targets — those are exactly the right files per the architecture doc-update rules (new Docker service → update `docs/DEPLOYMENT.md` + `l2-containers.puml`).
- `docs/architecture/c4/l2-containers.puml` must be updated when new Docker services are added. The architect persona rule is explicit: "New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`". This file currently has no mention of Grafana, Prometheus, Loki, Tempo, or GlitchTip. That omission should be added to the acceptance criteria.
- `docs/infrastructure/production-compose.md` already contains a forward-reference placeholder: "Prometheus, Loki, Grafana, Alertmanager, Uptime Kuma, GlitchTip and ntfy are not part of the production deployment... tracked as follow-up issue #498." Once the observability compose lands, that stale placeholder needs to be updated too. The issue's scope only touches `DEPLOYMENT.md` and `CLAUDE.md` — the `production-compose.md` stale section will silently contradict the new docs unless cleared.
- The stack introduces a third compose file, `docker-compose.observability.yml`. DEPLOYMENT.md currently describes two compose files (`docker-compose.yml` and `docker-compose.prod.yml`) plus the overlay pattern. Adding a third standalone file means the §1 topology diagram (Mermaid) should be updated to show the observability cluster as a parallel deployment alongside the application stack — not currently mentioned in the issue scope.
- `docs/infrastructure/self-hosted-catalogue.md` already documents GlitchTip and references a compose snippet. The new section in `DEPLOYMENT.md` should cross-link to it rather than duplicating configuration details.

Recommendations
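One way the recommended Mermaid addition could look (a sketch only: the actual node names and edges must follow the real §1 diagram, and the application-stack nodes here are placeholders):

```mermaid
graph LR
  subgraph app["Application stack (placeholder nodes)"]
    backend
    frontend
  end
  subgraph obs["Observability stack (parallel sidecar cluster)"]
    grafana["obs-grafana"]
    prometheus["obs-prometheus"]
    loki["obs-loki"]
    tempo["obs-tempo"]
    glitchtip["obs-glitchtip"]
  end
  grafana --> prometheus
  grafana --> loki
  grafana --> tempo
  backend -. "metrics / logs / traces" .-> obs
```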
- Add an acceptance criterion: `docs/architecture/c4/l2-containers.puml` updated with the five new observability containers and their relationships.
- Update `docs/infrastructure/production-compose.md` to reference the new section in `DEPLOYMENT.md`.
- Update the §1 topology diagram in `DEPLOYMENT.md` (the Mermaid graph) to show `obs-grafana`, `obs-prometheus`, `obs-loki`, `obs-tempo`, `obs-glitchtip` as a parallel sidecar cluster connected to the application stack.
- The `GLITCHTIP_SECRET_KEY` generation command matches the existing pattern used for `OCR_TRAINING_TOKEN` — consistent.

Open Decisions
- `production-compose.md` stale section: should cleaning up `docs/infrastructure/production-compose.md` be in scope for this issue, or tracked as a separate follow-up? The contradiction it would create is minor but could confuse a developer who reads the catalogue before `DEPLOYMENT.md`. Given this is a documentation-only issue with no risk of scope creep, I'd include it in scope.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations
- The `## Infrastructure` section in CLAUDE.md currently reads only: "→ See docs/DEPLOYMENT.md". Adding the service table and env var reference there means the section grows from 1 line to a substantive reference block. That's fine — it's the right level of information for an LLM context file.
- The issue lists env var names but doesn't specify where in CLAUDE.md to put them. The file currently has no env var reference section — the issue implies adding one under Infrastructure. The implementer should add it as a subsection like `### Observability env vars` directly after the service table, for consistency with how `DEPLOYMENT.md §2` is structured.
- The `docker exec -it obs-glitchtip ./manage.py createsuperuser` step uses `-it` (interactive TTY). In a headless server environment, this is correct — but worth noting that automated provisioning scripts cannot use `-it`. The documentation should note this is a manual one-time step (the issue body already says "required once per environment", which is correct).
- The first-run steps select `"Django"` when creating the GlitchTip backend project. GlitchTip is Django-based, so `"Django"` is the right SDK type for the Spring Boot backend's Sentry SDK integration — this is correct.

Recommendations
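To make the `### Observability env vars` placement concrete, the CLAUDE.md section might end up looking roughly like this (a sketch: the table columns and ports are assumptions from the issue body, and Loki and Tempo have no host ports listed in the issue, so they are shown as internal-only):

```markdown
## Infrastructure

| Service    | Container      | Host port |
|------------|----------------|-----------|
| Grafana    | obs-grafana    | 3001      |
| GlitchTip  | obs-glitchtip  | 3002      |
| Prometheus | obs-prometheus | 9090      |
| Loki       | obs-loki       | internal  |
| Tempo      | obs-tempo      | internal  |

### Observability env vars

- `PORT_GRAFANA`, `PORT_GLITCHTIP`, `PORT_PROMETHEUS`
- `GRAFANA_ADMIN_PASSWORD`
- `GLITCHTIP_SECRET_KEY`, `GLITCHTIP_DOMAIN`
- `SENTRY_DSN`, `VITE_SENTRY_DSN`
```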
- Match the existing DEPLOYMENT.md conventions: numbered steps, `#` comments, and the env var table style already in §2.
- This is a docs-only change, so no `npm run generate:api` is needed. Confirm this explicitly in the PR description to save reviewers from wondering.

No open decisions — the scope is clear and the content is fully specified in the issue body.
🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
- The `docker-compose.observability.yml` file does not exist in the repo yet. This issue documents a compose file that is being added by sibling issues in this milestone. That's the right sequencing ("depends on all other observability issues"), but it means the documentation implementer cannot verify the container names, env var names, or port numbers against a real file. The acceptance criterion "A developer following only DEPLOYMENT.md can get the full stack running from a fresh clone" cannot be validated until the compose file lands. Track this dependency explicitly.
- The five container names (`obs-grafana`, `obs-prometheus`, `obs-loki`, `obs-tempo`, `obs-glitchtip`) use a clean `obs-` prefix namespace — good for distinguishing observability containers from application containers in `docker ps` output.
- The commands use `docker compose` (V2 plugin syntax), not `docker-compose` (V1). The existing `DEPLOYMENT.md` uses `docker compose` throughout — consistent.
- The first-run step runs `docker exec -it obs-glitchtip ./manage.py createsuperuser`. The container name `obs-glitchtip` must match the name in `docker-compose.observability.yml`. If the actual compose file uses `container_name: obs-glitchtip-web` (because GlitchTip typically has a `web` + `worker` split), the exec command will fail with "no such container". The self-hosted catalogue shows GlitchTip with `glitchtip-web` and `glitchtip-worker` services. The container name in the documentation needs to be verified against the actual compose file.
- `production-compose.md` already warns that "Loki + Grafana with >30 days retention" may push memory past CX32 limits. The new DEPLOYMENT.md section should include a note that the observability stack adds approximately 1-2 GB RAM, and on CX32 (8 GB) operators should monitor available memory.
- The `GLITCHTIP_SECRET_KEY` generation command uses `secrets.token_hex(32)` — identical to the existing `OCR_TRAINING_TOKEN` generation pattern. Consistent. Good.
- The service URL table lists `http://localhost:9090` with "none" credentials. In production the Prometheus port should not be published to the host at all — Grafana scrapes it on the internal Docker network. The documentation should clarify that `localhost:9090` is a dev-only convenience and the port should be removed (`expose:` not `ports:`) in the production compose overlay.

Recommendations
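To illustrate the `expose:` vs `ports:` distinction, a sketch of how the production compose could keep Prometheus internal (the service name is taken from this issue; the exact file and overlay placement are assumptions):

```yaml
# Sketch: keep Prometheus internal-only in production. Instead of publishing
# the port to the host:
#   ports:
#     - "9090:9090"   # dev-only convenience
# declare it internal, so only containers on the compose network reach it:
services:
  obs-prometheus:
    expose:
      - "9090"   # Grafana scrapes this over the internal Docker network
```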
- Document which ports are dev-only: `localhost:9090` (Prometheus) and potentially `localhost:3001` (Grafana) are dev-accessible. In production, Grafana should be proxied through Caddy or kept internal; Prometheus should not be port-published.
- Verify the `createsuperuser` exec step against the actual compose file before merging this PR.

Open Decisions
DEPLOYMENT.mddocument proxying Grafana behind Caddy (with a subdomain or subpath) as the recommended production setup, or is direct port access acceptable for a private family archive? The issue spec leaves Grafana atlocalhost:3001— which works for local dev but not for remote access from a VPS.📋 Elicit — Requirements Engineer
Observations
- The "fresh clone" acceptance criterion implicitly depends on `docker-compose.observability.yml` existing in the repo. The Definition of Done should explicitly state this criterion is validated after the compose file from the sibling issues has been merged.
- The issue's scope omits `docs/infrastructure/production-compose.md` (which says the observability stack is "not yet deployed" and tracked as #498). After this issue lands, that sentence becomes a lie. Add a criterion: "The stale 'not yet deployed' note in `docs/infrastructure/production-compose.md` is updated or removed."
- The scope excludes `.env.example` updates, Caddyfile changes, and Gitea secrets changes. That's intentional and correct — this issue documents what already exists, not what to configure.
- The placement of the new CLAUDE.md content should be pinned down: "under the `## Infrastructure` heading in CLAUDE.md, after the existing stack bullet list."

Recommendations
- Add an acceptance criterion: "`docs/infrastructure/production-compose.md` (noting observability as 'not yet deployed') is updated to reference the new DEPLOYMENT.md section."
- Add a criterion: "The service table sits under the `## Infrastructure` heading, and the env var list appears in a clearly labeled subsection beneath it."
- Add to the Definition of Done: "`docker-compose.observability.yml` exists in the repo, so the container names and port numbers in the documentation can be verified against the actual file."

No open decisions — the requirements are clear and the content is fully prescribed.
🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
- Secrets are appended to `.env` via `echo >> .env` (not hardcoded in any committed file), and `GLITCHTIP_SECRET_KEY` is generated with `secrets.token_hex(32)` — same pattern as the existing `OCR_TRAINING_TOKEN`. No secrets committed to git.
- `GRAFANA_ADMIN_PASSWORD`: the credential column says `admin / $GRAFANA_ADMIN_PASSWORD`. The documentation correctly references the env var, not a hardcoded value. The first-run step (step 7) instructs the operator to change the Grafana admin password at `http://localhost:3001`. This is a manual step — there is no enforcement. Consider adding a warning that Grafana's default admin password (`admin`) must be changed before exposing the port to any network. The doc says "Change Grafana admin password" but does not warn about the consequence of skipping it.
- The DSNs land in `.env` alongside `SENTRY_DSN` (backend) and `VITE_SENTRY_DSN` (frontend). This is appropriate — `.env` is gitignored and treated as sensitive. No concern here.
- `docker exec -it obs-glitchtip ./manage.py createsuperuser` creates a Django superuser on GlitchTip. The doc correctly instructs the operator to do this during first-run. No automation of this credential in committed files — correct.
- The `.env` file pattern: `echo "SENTRY_DSN=<backend-dsn>" >> .env` appends to `.env`. If `.env` were tracked by git in any way (it isn't — it's in `.gitignore`), this would be dangerous. The existing project correctly keeps `.env` out of git. No issue.

Recommendations
- Add a warning callout to the password-change step (e.g. "⚠ Do not skip this step — the default Grafana password is `admin`.") to make it explicit that the default is insecure, not just a suggestion.

No open decisions.
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
- The documentation should be tested end-to-end against `docker-compose.observability.yml` once it exists. Consider a checklist: follow each numbered step from a clean shell session with only the `DEPLOYMENT.md` open.
- A `markdown-link-check` run, or a manual pass through every `[§X]` and `(../*)` reference in the updated files, will confirm no rotten links. The issue doesn't mention running such a check, but it should be part of the PR review.
- The restart step uses `docker compose up -d backend frontend`. This is correct for picking up the `SENTRY_DSN` and `VITE_SENTRY_DSN` environment variables. However, if a developer runs `docker compose up -d` (the full stack) instead of targeting specific services, it is equally valid and less error-prone. The documentation could note both options.
- The `docker compose -f docker-compose.observability.yml down -v` command wipes all observability data. This is a destructive operation. Standard DEPLOYMENT.md practice (as seen in the existing rollback section) is to warn before destructive commands. A short "⚠ This destroys all metrics, logs, and trace data." note is consistent with the existing doc tone.

Recommendations
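In support of the clean-session checklist, a pre-flight check along these lines could accompany the docs (a sketch: the variable names are taken from this issue's `.env` list and must be re-verified against the final compose file):

```shell
# Sketch: confirm the observability env vars named in this issue are present
# in .env before starting the stack. Variable names are assumptions from the
# issue body, not a verified list.
check_obs_env() {
  env_file="${1:-.env}"
  [ -f "$env_file" ] || { echo "no $env_file found"; return 0; }
  rc=0
  for var in PORT_GRAFANA PORT_GLITCHTIP PORT_PROMETHEUS \
             GRAFANA_ADMIN_PASSWORD GLITCHTIP_SECRET_KEY GLITCHTIP_DOMAIN \
             SENTRY_DSN VITE_SENTRY_DSN; do
    # each var must appear as a KEY= line at the start of a line
    grep -q "^${var}=" "$env_file" || { echo "missing: $var"; rc=1; }
  done
  return "$rc"
}

check_obs_env "$@"
```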
- Add a ⚠ warning before the `down -v` command: "This destroys all metrics, logs, and trace data permanently."
- State which prerequisites (e.g. the application stack via `docker compose up -d`) must be running before starting the observability stack — and what happens if they aren't (the GlitchTip service will fail to connect to its database if it shares the main Postgres instance, for example).

No open decisions.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Observations
- The numbered steps and `#` comments follow the existing DEPLOYMENT.md style — easy to scan.
- The `GLITCHTIP_SECRET_KEY` generation command is buried at the bottom of the first-run section as a separate subsection. Consider moving it before the first-run steps, or making it step 0, since the operator needs this value when populating `.env` before starting the stack.

Recommendations
- Update the application stack section of `DEPLOYMENT.md` to remove the "Future observability" note and add a cross-reference: "→ See §[N] Observability Stack for setup, service URLs, and first-run configuration."
- Move the `GLITCHTIP_SECRET_KEY` generation snippet before step 1, or note at the top of the first-run section: "Before starting: generate your `GLITCHTIP_SECRET_KEY` using the command in §[N.N] and add it to `.env`."

No open decisions.
🗳️ Decision Queue — Action Required
2 decisions need your input before implementation starts.
Infrastructure / Documentation Scope
1. `production-compose.md` stale section — in scope or follow-up? The file currently contains a note stating the observability stack is "not yet deployed" and tracked as #498. Once this issue's docs land, that note becomes a contradiction. Options: (a) include fixing it in this issue's scope (low effort, no risk of scope creep since it's documentation-only), or (b) track it as a separate follow-up. Markus recommends option (a). (Raised by: Markus, Elicit)

Infrastructure / Production Access
2. Grafana remote access — the issue spec leaves Grafana at `localhost:3001`. This is correct for local dev but means Grafana is not reachable remotely on a VPS without SSH tunnelling or a Caddy proxy. Options: (a) document it as dev-only, with a note that production access requires a Caddy proxy or Tailscale tunnel, (b) add Caddy proxy configuration to the section as the recommended production setup, or (c) leave the current spec as-is and treat remote access as out of scope for this issue. (Raised by: Tobias)
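If option (b) is chosen, the Caddy side could be as small as this sketch (the hostname is a placeholder; it assumes Caddy and `obs-grafana` share a Docker network and that Grafana listens on its default internal port 3000):

```
# Caddyfile fragment (sketch) — grafana.example.org is a placeholder hostname
grafana.example.org {
    reverse_proxy obs-grafana:3000
}
```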