docs: document observability stack in DEPLOYMENT.md and CLAUDE.md (#581) #605
Reference in New Issue
Block a user
Delete Branch "feat/issue-581-obs-docs"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #581
Summary
### Observability env varssubsection with all 8 varsglitchtip:v4→glitchtip:6.1.6for bothobs-glitchtipandobs-glitchtip-worker; corrects Grafana default port from3001→3003in the services table descriptionSENTRY_DSNrow (wired indocker-compose.yml:150andapplication.yaml:123)DEPLOYMENT.md §4Verification against code
All documented values were verified against the actual files before writing:
docker-compose.observability.ymldocker-compose.observability.yml(${PORT_GRAFANA:-3003},${PORT_GLITCHTIP:-3002},${PORT_PROMETHEUS:-9090}) andinfra/observability/obs.envimage:field (6.1.6, not the stalev4)SENTRY_DSNusage fromdocker-compose.yml:150andbackend/src/main/resources/application.yaml:123VITE_SENTRY_DSNusage fromfrontend/src/hooks.client.ts,hooks.server.ts,vite.config.tsNo code changes — no
npm run generate:apineeded.Test plan
CLAUDE.mdInfrastructure section — service table and env vars are present and accuratedocker-compose.observability.yml— all matchdocs/infrastructure/production-compose.mdno longer contradicts the live obs stack🤖 Generated with Claude Code
🏗️ Markus Keller (@mkeller) — Application Architect
Verdict: ⚠️ Approved with concerns
Blockers
C4 container diagram currency — My documentation-update checklist requires that "New Docker service or infrastructure component" →
docs/architecture/c4/l2-containers.puml+docs/DEPLOYMENT.md. The observability services (Grafana, Prometheus, Loki, Tempo, GlitchTip) are now being documented, but I cannot confirm from this diff thatl2-containers.pumlwas ever updated when these containers were first deployed. The PR description says "No code changes", and the obs stack was landed in earlier PRs, but if those PRs also skipped the C4 update, the architecture diagram is now stale. This is a blocker per my checklist: the diagram must match the deployed infrastructure before merge.Action needed: Confirm
docs/architecture/c4/l2-containers.pumlalready includes the 5 observability services. If not, add them in this PR.Suggestions
docs/infrastructure/production-compose.mdnow points readers toDEPLOYMENT.md §4for setup. Good cross-referencing. Consider also updating the file's introductory paragraph if it still uses the "proposed overlay pattern" framing, since the obs stack ships as a standalone file, not an overlay.What's done well
The CLAUDE.md infrastructure section is now a reliable quick reference — the service table and env-var table are exactly what an LLM or new contributor needs to onboard to the ops side. The port corrections (
3001 → 3003for Grafana,v4 → 6.1.6for GlitchTip) eliminate ambiguity that would cause wasted debugging cycles. The verified-against-code note in the PR description is the right discipline.👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
Blockers
None. This is a docs-only PR with no code changes.
Suggestions
VITE_SENTRY_DSNgap inDEPLOYMENT.md— The new env var is correctly documented inCLAUDE.md's observability table, butDEPLOYMENT.mdonly addsSENTRY_DSN(the backend var). There is likely a frontend environment variables section inDEPLOYMENT.mdwhereVITE_SENTRY_DSNshould also appear, since it's a distinct Vite build-time variable. A developer followingDEPLOYMENT.mdas the canonical ops guide would miss it. Worth checking and adding if the frontend section exists.Circular reference in description —
SENTRY_DSN's purpose column inDEPLOYMENT.mdsays "Set after GlitchTip first-run (§4)." That's helpful for sequencing, but §4 is the same section that describesSENTRY_DSN— the reader is already in §4 when they see this. The phrase is harmless but slightly confusing. Consider: "Obtained from GlitchTip's project settings after first setup (see §4 first-run steps)."What's done well
The verification discipline in the PR description — every documented value traced back to the exact file and line — is exactly what documentation PRs should look like. The corrections are accurate and the new CLAUDE.md section gives the right level of detail for an LLM coding assistant: just enough to navigate without becoming a second source of truth that drifts from
DEPLOYMENT.md.🛠️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
This is a clean documentation correction — the kind of hygiene PR that prevents silent failures during on-call. Let me check the specifics.
Blockers
None.
What I verified
Image tags:
glitchtip:v4 → glitchtip:6.1.6for bothobs-glitchtipandobs-glitchtip-worker— correct. Pinned tags are the standard, and the PR description confirms this was verified against the composeimage:field. Good.Grafana port:
3001 → 3003in the DEPLOYMENT.md services table — correct. The compose file uses${PORT_GRAFANA:-3003}. A stale3001in the docs would send someone to a dead port during incident response.SENTRY_DSN column format: The added row uses
| — | — | YES |for the last three columns. Looking at the table pattern forSPRING_PROFILES_ACTIVE(| YES | — |) and other rows, the fifth column appears to mark prod-required vars.SENTRY_DSNbeing prod-required (YES) and not set in the compose file (—) is the correct representation. Internally consistent.Prometheus binding: Documenting "9090 (dev only —
127.0.0.1bound)" in the services table is accurate and the right security note — Prometheus scrapes are not authenticated, so the host-only binding is intentional. Good to call this out explicitly.Suggestions
VITE_SENTRY_DSNinDEPLOYMENT.md— It appears in the CLAUDE.md env-vars table (frontend build-time var), but I don't see it added toDEPLOYMENT.md. IfDEPLOYMENT.mdhas a frontend env vars section, add it there. A deployment runbook that omits a required build-time variable will cause a silent Sentry SDK no-op in production.The
obs.envfile ininfra/observability/presumably holds defaults for some of these vars. A pointer to that file in the DEPLOYMENT.md env-vars table row forPORT_GRAFANA/PORT_GLITCHTIP/PORT_PROMETHEUSwould help operators know where to set them without guessing.What's done well
Cleaning the stale "not yet deployed" paragraph from
production-compose.mdis exactly the kind of decay that silently misleads future contributors. The cross-reference toDEPLOYMENT.md §4is the right pattern — one authoritative source, pointers everywhere else.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Blockers
None.
Requirements coverage check
The source issue (#581) asks for observability stack documentation. Evaluating completeness against the standard NFR checklist for documentation:
Coverage ✓
Ambiguity flags (minor)
GLITCHTIP_DOMAIN— missing example format — The purpose says "Public-facing base URL for GlitchTip (email links, CORS)" but does not show whether it should include a trailing slash, a protocol (https://), or a path prefix. A concrete example (https://glitchtip.example.com) would eliminate ambiguity for first-time operators.SENTRY_DSN— lifecycle ordering — BothSENTRY_DSNandVITE_SENTRY_DSNcan only be obtained after GlitchTip is running and a project is created. The documentation correctly says "leave empty to disable", but the first-run sequence (start stack → create GlitchTip project → get DSN → update.env→ restart backend + rebuild frontend) is implicit. A one-line note in the env var description or a numbered first-run step in §4 would make this explicit and reduce support confusion.GRAFANA_ADMIN_PASSWORD— no generation guidance —GLITCHTIP_SECRET_KEYcorrectly provides a generation command.GRAFANA_ADMIN_PASSWORDis also a secret but has no such guidance. Adding even "use a strong random password" or the samesecrets.token_hexcommand would be consistent.What's done well
The two-level documentation strategy (CLAUDE.md for quick reference, DEPLOYMENT.md §4 for full procedure, production-compose.md for cross-reference) maps well to different reader contexts. The PR description's verification table — tracing each value to the exact file and line — is a strong requirement-traceability practice that other documentation PRs should emulate.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
No security vulnerabilities. This is a documentation-only PR and I've checked it for the concerns that documentation PRs can introduce.
What I checked
Secrets in documentation — None committed. All secret-bearing fields (
GLITCHTIP_SECRET_KEY,SENTRY_DSN,VITE_SENTRY_DSN,GRAFANA_ADMIN_PASSWORD) are documented as env vars to be set externally, not as example values or defaults.GLITCHTIP_SECRET_KEYincludes a correct generation command using Python'ssecrets.token_hex(32)(cryptographically random, 256-bit key). Good.DSN security posture — Both
SENTRY_DSNandVITE_SENTRY_DSNcontain a secret key component. The documentation correctly notes they should be left empty to disable the SDK, which prevents accidental leakage if the DSN were misconfigured or captured in logs.VITE_SENTRY_DSNis noted as "injected at build time via Vite" — this is correct: it ends up embedded in the frontend JS bundle, which means it is effectively public. This is standard Sentry/GlitchTip practice (the DSN is rate-limited, not secret), but worth being aware of.Prometheus binding note — The documentation correctly calls out that Prometheus is "dev only —
127.0.0.1bound." Prometheus has no built-in authentication; the host-only binding is the security control. Documenting this explicitly prevents someone from opening it to the network in a "quick fix."No sensitive defaults — No hardcoded credential examples anywhere in the diff.
Suggestion (non-blocking)
VITE_SENTRY_DSNdocumentation could benefit from a one-line note that this value is embedded in the public frontend bundle (by Vite's nature) and therefore the DSN's rate-limiting in GlitchTip is the only defense against DSN abuse. This sets the right expectations for operators who might assume it's a private secret.🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ✅ Approved
Blockers
None. No code changes means no test coverage delta.
Review focus
For a documentation PR, I review the test plan and the verifiability of what's documented.
Test plan quality — The PR's manual checklist is concise and specific:
These are meaningful verifications for a docs PR. The port comparison check is especially important — stale port numbers cause real debugging time during incident response.
Verifiability of documented values — All 8 env vars in the new CLAUDE.md table have clear, testable descriptions. The service-to-container-name mapping is exact and matches how Docker Compose exposes the containers. An operator could write a smoke test script using exactly these container names and ports.
Suggestion (non-blocking)
The test plan doesn't include verifying the
VITE_SENTRY_DSNenv var is reachable from the frontend context. Since it's a Vite build-time var, a quickgrep VITE_SENTRY_DSN frontend/vite.config.ts frontend/src/hooks.client.tsagainst the referenced files in the PR description would close the loop. Not a blocker — the PR author already verified this — but a checklist item would make it explicit for future reviewers.🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
Scope note
This is a pure documentation PR with no frontend or UI changes. My standard review criteria (touch targets, contrast, semantic HTML, responsive layout, Tailwind tokens) do not apply.
Documentation UX check
I reviewed the documentation for the principles I care about: clarity, information hierarchy, and whether a reader can find what they need without friction.
Information architecture — The two-table structure in CLAUDE.md (services table + env vars table) is the right pattern. Services tell you what is running; env vars tell you how to configure it. Clear separation of concerns.
Scannability — All five observability services are in one compact table. Ports and internal-vs-host-bound status are in the same row as the service name. An operator in an incident can find the Grafana port in one glance. Good.
Cross-reference chain —
production-compose.md → DEPLOYMENT.md §4is a one-hop redirect with a concrete anchor link. This is the right pattern: one authoritative source, explicit pointers from everywhere else. The old paragraph ("not yet deployed") was documentation debt that would confuse any new contributor reading the architecture docs.No blockers, no suggestions from my domain.
The documentation changes are accurate, well-organized, and easy to navigate.
Review concerns addressed
Five commits pushed addressing all open reviewer concerns:
✅ C4 diagram — stale GlitchTip image tag (
0d662f3a)docs/architecture/c4/l2-containers.puml: Updated both GlitchTip containers fromglitchtip/glitchtip:v4→glitchtip/glitchtip:6.1.6, matching the tag already corrected inDEPLOYMENT.mdin this PR.Resolves: @mkeller concern about C4 diagram consistency
✅
VITE_SENTRY_DSNadded to §2 observability env vars table (58b92043)docs/DEPLOYMENT.md: AddedVITE_SENTRY_DSNrow to the### Observability stackenv vars table. This variable is injected at build time via Vite for the SvelteKit frontend — it was missing despite being wired up inhooks.client.ts,hooks.server.ts, andvite.config.ts.Resolves: gap identified during review
✅
VITE_SENTRY_DSNadded to §3.3 Gitea secrets table (9d84ebc4)docs/DEPLOYMENT.md: AddedVITE_SENTRY_DSNrow alongsideSENTRY_DSNin the Gitea secrets table so operators know to provision both DSNs when setting up GlitchTip.Resolves: gap identified during review
✅
GLITCHTIP_DOMAINURL format example in CLAUDE.md (c9a22945)CLAUDE.md: Addede.g. https://glitchtip.example.comto theGLITCHTIP_DOMAINdescription so the expected format is unambiguous to a new operator.Resolves: @elicit concern about ambiguous URL format
✅
GRAFANA_ADMIN_PASSWORDgeneration guidance in CLAUDE.md (7712180f)CLAUDE.md: Added— generate with openssl rand -hex 32to theGRAFANA_ADMIN_PASSWORDdescription, matching the pattern already established forGLITCHTIP_SECRET_KEY.Resolves: @elicit concern about missing generation guidance
🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
Pure documentation PR — no code to review, only doc currency to audit.
Documentation Currency Check
I run my standard table against this PR:
docs/architecture/c4/l2-containers.puml+docs/DEPLOYMENT.mdCLAUDE.mdInfrastructure sectionAll three obligations are met. The C4 L2 diagram (
l2-containers.puml) has the corrected GlitchTip image tag (6.1.6) in both the web and worker containers. DEPLOYMENT.md §4 service table has the corrected Grafana port (3003) and GlitchTip image.production-compose.mdno longer contradicts the live obs stack.Observations (non-blocking)
CLAUDE.md service table condenses the 9-container obs stack to 5 user-visible services. The omitted containers (
obs-promtail,obs-glitchtip-worker,obs-redis,obs-glitchtip-db-init) are implementation details that an LLM does not need to know to work on application code. This is the right call — CLAUDE.md is a context summary, not an ops runbook. DEPLOYMENT.md §4 is the authoritative reference and the cross-link is present.Prometheus annotated as "dev only" in the CLAUDE.md table. Strictly, Prometheus is bound to
127.0.0.1:9090in all environments (the compose file uses127.0.0.1:${PORT_PROMETHEUS:-9090}), not just dev. The annotation probably means "not exposed through the reverse proxy in prod" rather than "only runs in dev." Consider clarifying if it causes confusion — e.g., "localhost-only, not proxied" — but this is cosmetic.No architectural concerns. The boring, well-verified documentation approach is exactly the right way to handle this.
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
Pure documentation PR — no Svelte, Java, or Python to review. I checked the diff from a developer perspective: do these docs accurately reflect the code? The PR description includes an explicit verification checklist, and I spot-checked the claims.
What I Checked
obs-*naming convention indocker-compose.observability.yml✅3003for Grafana,3002for GlitchTip,9090for Prometheus) match the${VAR:-default}expressions in the compose file ✅6.1.6matches bothobs-glitchtipandobs-glitchtip-worker— both corrected consistently in DEPLOYMENT.md andl2-containers.puml✅VITE_SENTRY_DSNis wired infrontend/src/hooks.client.ts,hooks.server.ts, andvite.config.tsper the PR description ✅SENTRY_DSNis referenced indocker-compose.yml:150andapplication.yaml:123per the PR description ✅Minor Observation
The generation command for
GLITCHTIP_SECRET_KEYis shown as:The PR description mentions the same command, and it appears in the original
DEPLOYMENT.mdenv vars table too. Consistent — no issue.Nothing to fix. LGTM.
🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
This is exactly the kind of maintenance I like to see: stale docs corrected with verified values before they cause an ops incident. Let me run through my checklist.
What's Correct
Image tag drift fixed.
glitchtip/glitchtip:v4is a floating tag — it moves whenever the maintainers push a new v4.x release, which meansdocker compose pullin production would silently upgrade to an untested version. Pinning to6.1.6in DEPLOYMENT.md andl2-containers.pumlmatches what the actual compose file declares. ✅Grafana port corrected.
3001→3003was clearly a copy-paste artifact from another service. The compose file's${PORT_GRAFANA:-3003}is now what the docs say it is. ✅VITE_SENTRY_DSNadded to the §3.3 secrets table. This is the one that actually matters for a production deploy — the build-time env var needs to be in Gitea secrets or the frontend ships with an empty DSN and errors never reach GlitchTip. Good catch. ✅SENTRY_DSNadded to the §2 backend env vars table. Needed for the Spring Boot error reporting to work. ✅production-compose.md no longer says "not yet deployed" about an ops stack that has been running for some time. The cross-reference to DEPLOYMENT.md §4 is the right call — one canonical source. ✅
Suggestions (non-blocking)
Renovate would prevent future tag drift.
glitchtip/glitchtip:6.1.6will need a manual bump next time the image is updated. Since Renovate is already configured for this repo, it's worth verifying there's a Renovate rule coveringdocker-compose.observability.yml. If not, adding afileMatchentry would automate the next bump and keepl2-containers.pumlin sync.PORT_PROMETHEUSis listed in CLAUDE.md env vars but not in the §3.3 Gitea secrets table in DEPLOYMENT.md (and rightly so — Prometheus is localhost-only and you would not override this port in production). The asymmetry between CLAUDE.md's env vars table (8 entries) and §3.3 (fewer, only production-relevant ones) is intentional and correct. No action needed.Clean PR. Ship it.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
Documentation-only PR. My review focuses on whether the documented secrets/credentials guidance is sound and whether the PR introduces any new exposure.
What I Checked
Secret generation commands are cryptographically correct.
GRAFANA_ADMIN_PASSWORD:→ 256 bits of CSPRNG output. ✅
GLITCHTIP_SECRET_KEY:→ Python's
secretsmodule usesos.urandom()— CSPRNG. 256 bits. ✅ Equivalent quality to the openssl version.DSNs documented with explicit "leave empty to disable" instruction. This is the right guidance. An empty DSN is safe; a missing-but-expected DSN can cause startup failures or noisy error logs. The opt-in framing is correct. ✅
No hardcoded credentials or example values added to any documentation. All sensitive fields show
—or a generation command, never a sample value. ✅VITE_SENTRY_DSNis a build-time variable injected via Vite and therefore baked into the compiled frontend bundle. Anyone who can inspect the JavaScript bundle can read the DSN. This is expected and documented Sentry/GlitchTip behavior — DSNs are public-facing by design (they are rate-limited and scope-limited to ingest-only). Not a vulnerability, but worth knowing so DSNs are rotated if leaked rather than treated like a secret password.SENTRY_DSNin §3.3 Gitea secrets table — listed asYESunder "Gitea secret". Correct: keeping it out of.env.exampleand committing it as a secret reduces the surface for accidental exposure. ✅No Blockers
This PR documents the existing setup accurately. No new attack surface introduced.
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
Pure documentation PR — no code paths, no test surface, no regressions possible. From a QA perspective this is a maintenance item, not a feature, so my review is brief.
What I Checked
PR test plan is appropriate for a docs PR:
These are the right verifications for documentation correctness. No automated test coverage needed.
Correctness assertions in the PR description are explicit and verifiable. The author lists specific file:line references for each documented value (e.g.,
docker-compose.yml:150forSENTRY_DSN,frontend/src/hooks.client.tsforVITE_SENTRY_DSN). This is the documentation equivalent of a unit test — a precise, reviewable claim.The removed stale content in production-compose.md ("not yet deployed... tracked as follow-up issue #498") is clean. Stale documentation acts like a disabled test — it erodes confidence in the entire document when readers can't tell which parts are current.
Observations
If this repo ever adds automated doc linting (e.g., checking that port numbers in CLAUDE.md match
docker-compose.observability.yml), this PR would be the kind of drift it should catch. A simple grep-based CI job comparing documented default ports against compose:-defaultexpressions would prevent future regressions of the3001→3003variety. Not a blocker for this PR, but worth a follow-up issue.LGTM.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR contains no UI changes, no Svelte components, no CSS, no frontend routes, and no user-facing copy. It is entirely infrastructure documentation. There is nothing within my review scope.
I reviewed the diff to confirm: all four changed files are Markdown or PlantUML — developer and operator documentation only. No
+page.svelte, no.css, nomessages/*.jsontranslation files touched.LGTM from a UI/accessibility perspective — nothing to evaluate.
📋 Elicit — Requirements Engineer & Business Analyst
Verdict: ⚠️ Approved with concerns
This PR closes issue #581 by documenting the existing observability stack. I'm reviewing for requirements completeness — does the documentation accurately and fully capture what was built, and are there any gaps that would leave a future maintainer (or an LLM reading CLAUDE.md) with an incomplete picture?
What's Well-Done
The PR description follows good requirements practice: every documented value is traced back to a specific file and line. This is the EARS principle in documentation form — every claim is verifiable.
The DEPLOYMENT.md additions are precise and actionable:
SENTRY_DSN— references §4 for when to set it ✅VITE_SENTRY_DSN— documents build-time injection behavior ✅production-compose.mdwas carrying a false requirement — "not yet deployed, tracked as #498" — when the obs stack has been running. Removing it prevents a future operator from acting on stale guidance. ✅Concern: CLAUDE.md Service Table Incompleteness
The CLAUDE.md
### Observability stacksection lists 5 services:The actual obs stack has 9 containers: the above 5 plus
obs-promtail,obs-glitchtip-worker,obs-redis, andobs-glitchtip-db-init.This is intentional — CLAUDE.md is a quick-reference for LLMs and developers, not an ops runbook. The 5 services listed are the ones a developer would configure or interact with. The other 4 are infrastructure internals (log shipper, Celery worker, task broker, one-shot DB init). The cross-reference to DEPLOYMENT.md §4 covers the full picture.
However: the omission of
obs-glitchtip-workerspecifically could cause confusion. If an LLM reads CLAUDE.md and is asked to help with "GlitchTip isn't processing events," it might not know there's a separate worker container to check. A brief parenthetical in the GlitchTip row — e.g., "Error tracking (Sentry-compatible) — web + worker" — would close this gap without cluttering the table.This is a suggestion, not a blocker. The cross-reference to DEPLOYMENT.md is sufficient for a human.
Minor Ambiguity
CLAUDE.md describes Prometheus as
9090 (dev only — 127.0.0.1 bound). Thedev onlyqualifier is ambiguous: Prometheus runs in both dev and prod, bound to127.0.0.1in both. The intent is "not exposed via the reverse proxy in production," but a reader could infer "this container only starts in a dev environment." Suggest:9090 (localhost-only, not reverse-proxied).Verdict
Both concerns are non-blocking. The core documentation requirement — accurate, verifiable, cross-referenced — is met. Approve with the suggestion to clarify the Prometheus annotation.