docs: document observability stack in DEPLOYMENT.md and CLAUDE.md #581

Open
opened 2026-05-14 15:05:59 +02:00 by marcel · 8 comments
Owner

Context

The observability stack adds a new compose file, new service ports, new env vars, and a manual first-run step (creating the GlitchTip superuser and projects). Without documentation, a new developer (or you, six months from now) cannot get the stack running from scratch.

Depends on: all other observability issues (document the final state, not intermediate states)

Files to Update

docs/DEPLOYMENT.md

Add a new top-level section "Observability Stack" after the existing application stack section. It must cover:

Starting and stopping

```bash
# Start observability stack (application stack must already be running)
docker compose -f docker-compose.observability.yml up -d

# Stop (data volumes are preserved)
docker compose -f docker-compose.observability.yml down

# Stop and wipe all observability data
docker compose -f docker-compose.observability.yml down -v
```

Service URLs

| Service | URL | Credentials |
|---------|-----|-------------|
| Grafana | `http://localhost:3001` | admin / `$GRAFANA_ADMIN_PASSWORD` |
| GlitchTip | `http://localhost:3002` | created during first-run |
| Prometheus | `http://localhost:9090` | none |

First-run setup (required once per environment)

```bash
# 1. Ensure main application stack is running
docker compose up -d

# 2. Start observability stack
docker compose -f docker-compose.observability.yml up -d

# 3. Create GlitchTip superuser
docker exec -it obs-glitchtip ./manage.py createsuperuser

# 4. Open http://localhost:3002
#    Create an Organization
#    Create a Project → type "Django" → name "familienarchiv-backend" → copy DSN
#    Create a Project → type "JavaScript" → name "familienarchiv-frontend" → copy DSN

# 5. Add DSNs to .env
echo "SENTRY_DSN=<backend-dsn>" >> .env
echo "VITE_SENTRY_DSN=<frontend-dsn>" >> .env

# 6. Restart application stack to pick up new env vars
docker compose up -d backend frontend

# 7. Change Grafana admin password at http://localhost:3001
```
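Step 5 appends blindly, so re-running the first-run procedure would leave duplicate lines in `.env`. A hypothetical helper (not part of the repo) makes the write idempotent:

```bash
# Hypothetical helper: write KEY=VALUE into .env exactly once, so re-running
# first-run setup replaces the existing line instead of appending a duplicate.
set_env() {
  key="$1"; value="$2"; file="${3:-.env}"
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    # key already present: replace the existing line in place
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}

# Usage mirrors step 5 above:
# set_env SENTRY_DSN "<backend-dsn>"
# set_env VITE_SENTRY_DSN "<frontend-dsn>"
```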

Generating a GLITCHTIP_SECRET_KEY

```bash
python3 -c "import secrets; print(secrets.token_hex(32))"
```
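The generated key can go straight into `.env` in one step; a sketch, assuming `.env` is gitignored as it is in this repo:

```bash
# Generate a 64-hex-char key and store it in .env in one step
# (the key never appears in shell history as a literal value).
echo "GLITCHTIP_SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')" >> .env
```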

CLAUDE.md

Update the stack overview table (under "Infrastructure") to include:

| Service | Container | Default Port | Purpose |
|---------|-----------|--------------|---------|
| Grafana | `obs-grafana` | 3001 | Metrics / logs / traces dashboard |
| Prometheus | `obs-prometheus` | 9090 | Metrics store |
| Loki | `obs-loki` | — (internal) | Log store |
| Tempo | `obs-tempo` | — (internal) | Trace store |
| GlitchTip | `obs-glitchtip` | 3002 | Error tracking |

Also add the new env vars to the .env reference section:

  • PORT_GRAFANA, PORT_GLITCHTIP, PORT_PROMETHEUS
  • GRAFANA_ADMIN_PASSWORD
  • GLITCHTIP_SECRET_KEY, GLITCHTIP_DOMAIN
  • SENTRY_DSN, VITE_SENTRY_DSN

Acceptance Criteria

  • docs/DEPLOYMENT.md has an "Observability Stack" section covering all steps above
  • A developer following only docs/DEPLOYMENT.md can get the full stack running from a fresh clone without consulting any other file
  • CLAUDE.md stack table includes all five new services with correct container names and ports
  • CLAUDE.md env var reference lists all new observability vars
  • No broken internal links (cross-references between DEPLOYMENT.md and CLAUDE.md remain valid)
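The broken-link criterion can be smoke-tested mechanically. A rough sketch with a hypothetical helper name; it only resolves relative file links, and skips external URLs and in-page anchors:

```bash
# Hypothetical helper: print relative markdown link targets that do not
# resolve to a file on disk. Links containing "#" anchors are not handled.
check_links() {
  for f in "$@"; do
    dir=$(dirname "$f")
    grep -oE '\]\([^)#]+\)' "$f" | sed -E 's/^\]\(//; s/\)$//' | while read -r target; do
      case "$target" in
        http://*|https://*|mailto:*) continue ;;   # skip external links
      esac
      [ -e "$dir/$target" ] || echo "$f: broken link -> $target"
    done
  done
}

# Usage: check_links docs/DEPLOYMENT.md CLAUDE.md
```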

Definition of Done

  • All acceptance criteria checked
  • Committed on a feature branch, PR opened against main
marcel added this to the Observability Stack — Grafana LGTM + GlitchTip milestone 2026-05-14 15:05:59 +02:00
marcel added the P2-medium and documentation labels 2026-05-14 15:06:14 +02:00
Author
Owner

🏗️ Markus Keller — Senior Application Architect

Observations

  • The issue correctly identifies docs/DEPLOYMENT.md and CLAUDE.md as the two targets — those are exactly the right files per the architecture doc-update rules (new Docker service → update docs/DEPLOYMENT.md + l2-containers.puml).
  • There is a gap the issue does not mention: docs/architecture/c4/l2-containers.puml must be updated when new Docker services are added. The architect persona rule is explicit: "New Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md". This file currently has no mention of Grafana, Prometheus, Loki, Tempo, or GlitchTip. That omission should be added to the acceptance criteria.
  • The docs/infrastructure/production-compose.md already contains a forward-reference placeholder: "Prometheus, Loki, Grafana, Alertmanager, Uptime Kuma, GlitchTip and ntfy are not part of the production deployment... tracked as follow-up issue #498." Once the observability compose lands, that stale placeholder needs to be updated too. The issue's scope only touches DEPLOYMENT.md and CLAUDE.md — the production-compose.md stale section will silently contradict the new docs unless cleared.
  • The issue says the observability compose file is a separate docker-compose.observability.yml. DEPLOYMENT.md currently describes two compose files (docker-compose.yml and docker-compose.prod.yml) plus the overlay pattern. Adding a third standalone file means the §1 topology diagram (Mermaid) should be updated to show the observability cluster as a parallel deployment alongside the application stack — not currently mentioned in the issue scope.
  • The docs/infrastructure/self-hosted-catalogue.md already documents GlitchTip and references a compose snippet. The new section in DEPLOYMENT.md should cross-link to it rather than duplicating configuration details.
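The topology observation above could be sketched as a Mermaid fragment (the wiring is assumed from the issue body, since the compose file does not exist yet):

```mermaid
graph LR
  subgraph app["Application stack (docker-compose.yml)"]
    backend
    frontend
  end
  subgraph obs["Observability stack (docker-compose.observability.yml)"]
    grafana["obs-grafana :3001"]
    prometheus["obs-prometheus :9090"]
    loki["obs-loki (internal)"]
    tempo["obs-tempo (internal)"]
    glitchtip["obs-glitchtip :3002"]
  end
  grafana --> prometheus
  grafana --> loki
  grafana --> tempo
  backend -->|"errors via SENTRY_DSN"| glitchtip
  frontend -->|"errors via VITE_SENTRY_DSN"| glitchtip
  prometheus -->|scrapes| backend
```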

Recommendations

  • Add to acceptance criteria: docs/architecture/c4/l2-containers.puml updated with the five new observability containers and their relationships.
  • Add to scope: Update the stale "not yet deployed" note in docs/infrastructure/production-compose.md to reference the new section in DEPLOYMENT.md.
  • Add to scope: Update the §1 deployment topology diagram in DEPLOYMENT.md (the Mermaid graph) to show obs-grafana, obs-prometheus, obs-loki, obs-tempo, obs-glitchtip as a parallel sidecar cluster connected to the application stack.
  • The first-run setup steps in the issue body are solid. The GLITCHTIP_SECRET_KEY generation command matches the existing pattern used for OCR_TRAINING_TOKEN — consistent.
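The first recommendation (adding the five containers to l2-containers.puml) could land roughly as follows, assuming the file uses the C4-PlantUML macros; aliases and relationships are illustrative, so match the file's existing style:

```plantuml
' Sketch for docs/architecture/c4/l2-containers.puml, assuming C4-PlantUML macros
Container_Boundary(observability, "Observability Stack") {
  Container(grafana, "Grafana", "Docker: obs-grafana", "Dashboards for metrics, logs, traces")
  Container(prometheus, "Prometheus", "Docker: obs-prometheus", "Metrics store")
  Container(loki, "Loki", "Docker: obs-loki", "Log store")
  Container(tempo, "Tempo", "Docker: obs-tempo", "Trace store")
  Container(glitchtip, "GlitchTip", "Docker: obs-glitchtip", "Error tracking")
}
Rel(grafana, prometheus, "Queries metrics", "PromQL")
Rel(grafana, loki, "Queries logs", "LogQL")
Rel(grafana, tempo, "Queries traces", "TraceQL")
```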

Open Decisions

  • production-compose.md stale section: Should cleaning up docs/infrastructure/production-compose.md be in scope for this issue, or tracked as a separate follow-up? The contradiction it would create is minor but could confuse a developer who reads the catalogue before DEPLOYMENT.md. Given this is a documentation-only issue with no risk of scope creep, I'd include it in scope.
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • This is a pure documentation issue — no production code changes. That's the right call for a "depends on all other observability issues" ticket; document the final state only.
  • The issue body already contains the exact prose to write, including bash code blocks, a service URL table, and env var lists. The implementer essentially just needs to transplant these blocks into the correct files — low ambiguity.
  • The CLAUDE.md ## Infrastructure section currently reads only: → See docs/DEPLOYMENT.md. Adding the service table and env var reference there means the section grows from 1 line to a substantive reference block. That's fine — it's the right level of information for an LLM context file.
  • Missing detail in the issue: the CLAUDE.md update lists env var names but doesn't specify where in CLAUDE.md to put them. The file currently has no env var reference section — the issue implies adding one under Infrastructure. The implementer should add it as a subsection like ### Observability env vars directly after the service table, for consistency with how DEPLOYMENT.md §2 is structured.
  • The docker exec -it obs-glitchtip ./manage.py createsuperuser step uses -it (interactive TTY). In a headless server environment, this is correct — but worth noting that automated provisioning scripts cannot use -it. The documentation should note this is a manual one-time step (the issue body already says "required once per environment", which is correct).
  • The issue specifies project type "Django" when creating the GlitchTip backend project. GlitchTip is Django-based, so "Django" is the right SDK type for the Spring Boot backend's Sentry SDK integration — this is correct.
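For the provisioning-script case, a non-interactive variant could be documented alongside the manual step. A sketch; the email and the GLITCHTIP_ADMIN_PASSWORD variable are assumptions, and Django 3.0+ reads the DJANGO_SUPERUSER_* env vars when createsuperuser runs with --noinput:

```bash
# Non-interactive variant for provisioning scripts (no TTY available).
# GlitchTip is Django-based and uses email as the username field.
docker exec \
  -e DJANGO_SUPERUSER_EMAIL="admin@example.com" \
  -e DJANGO_SUPERUSER_PASSWORD="$GLITCHTIP_ADMIN_PASSWORD" \
  obs-glitchtip ./manage.py createsuperuser --noinput
```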

Recommendations

  • The implementer should keep the new DEPLOYMENT.md section internally consistent with the existing formatting: numbered steps, code blocks with # comments, and the env var table style already in §2.
  • In the CLAUDE.md service table, the "Purpose" column should stay brief (1 sentence max) — consistent with how the existing stack table entries read in README.md.
  • No code changes needed → no npm run generate:api needed. Confirm this explicitly in the PR description to save reviewers from wondering.

No open decisions — the scope is clear and the content is fully specified in the issue body.

Author
Owner

🚀 Tobias Wendt — DevOps & Platform Engineer

Observations

  • The docker-compose.observability.yml file does not exist in the repo yet. This issue documents a compose file that is being added by sibling issues in this milestone. That's the right sequencing ("depends on all other observability issues"), but it means the documentation implementer cannot verify the container names, env var names, or port numbers against a real file. The acceptance criterion "A developer following only DEPLOYMENT.md can get the full stack running from a fresh clone" cannot be validated until the compose file lands. Track this dependency explicitly.
  • The container names in the issue (obs-grafana, obs-prometheus, obs-loki, obs-tempo, obs-glitchtip) use a clean obs- prefix namespace — good for distinguishing observability containers from application containers in docker ps output.
  • The issue specifies docker compose (V2 plugin syntax), not docker-compose (V1). The existing DEPLOYMENT.md uses docker compose throughout — consistent.
  • The first-run step uses docker exec -it obs-glitchtip ./manage.py createsuperuser. The container name obs-glitchtip must match the name in docker-compose.observability.yml. If the actual compose file uses container_name: obs-glitchtip-web (because GlitchTip typically has a web + worker split), the exec command will fail with "no such container". The self-hosted catalogue shows GlitchTip with glitchtip-web and glitchtip-worker services. The container name in the documentation needs to be verified against the actual compose file.
  • Memory budget note missing: production-compose.md already warns that "Loki + Grafana with >30 days retention" may push memory past CX32 limits. The new DEPLOYMENT.md section should include a note that the observability stack adds approximately 1-2 GB RAM, and on CX32 (8 GB) operators should monitor available memory.
  • The GLITCHTIP_SECRET_KEY generation command uses secrets.token_hex(32) — identical to the existing OCR_TRAINING_TOKEN generation pattern. Consistent. Good.
  • Port assignments: Grafana at 3001 and GlitchTip at 3002. These do not conflict with existing app ports (frontend 5173 dev/3000 prod, backend 8080, MinIO 9000/9001, Mailpit 8025/1025). Clean.
  • The issue says Prometheus is accessible at http://localhost:9090 with "none" credentials. In production the Prometheus port should not be published to the host at all — Grafana scrapes it on the internal Docker network. The documentation should clarify that localhost:9090 is a dev-only convenience and the port should be removed (expose: not ports:) in the production compose overlay.
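The ports-vs-expose point could be illustrated with a small sketch; file and service names are assumed, since the compose file has not landed yet. Note that compose override files append to list values rather than replacing them, so the cleanest split keeps the host publish out of the base file entirely:

```yaml
# docker-compose.observability.yml (base, hypothetical) - no host publish:
services:
  prometheus:
    expose:
      - "9090"   # reachable by Grafana on the internal Docker network only

# docker-compose.observability.dev.yml (hypothetical dev overlay) - host access for debugging:
#   services:
#     prometheus:
#       ports:
#         - "9090:9090"
```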

Recommendations

  • Add a note in the new DEPLOYMENT.md section: localhost:9090 (Prometheus) and potentially localhost:3001 (Grafana) are dev-accessible. In production, Grafana should be proxied through Caddy or kept internal; Prometheus should not be port-published.
  • Add a one-sentence memory warning: "The observability stack adds ~1-2 GB RAM. Monitor available memory on CX32 (8 GB) instances."
  • Verify the exact container name for the GlitchTip createsuperuser exec step against the actual compose file before merging this PR.

Open Decisions

  • Caddy proxy for Grafana in production: Should DEPLOYMENT.md document proxying Grafana behind Caddy (with a subdomain or subpath) as the recommended production setup, or is direct port access acceptable for a private family archive? The issue spec leaves Grafana at localhost:3001 — which works for local dev but not for remote access from a VPS.
Author
Owner

📋 Elicit — Requirements Engineer

Observations

  • The issue is well-specified for a documentation task: exact section title, exact content blocks (commands, tables, env vars), and a numbered first-run procedure. Acceptance criteria are concrete and verifiable — no ambiguous language.
  • AC gap — testability: The AC says "A developer following only DEPLOYMENT.md can get the full stack running from a fresh clone without consulting any other file." This is the right criterion, but it depends on docker-compose.observability.yml existing in the repo. The Definition of Done should explicitly state this criterion is validated after the compose file from the sibling issues has been merged.
  • AC gap — broken links: "No broken internal links (cross-references between DEPLOYMENT.md and CLAUDE.md remain valid)" is correct but incomplete. It does not cover the stale cross-reference in docs/infrastructure/production-compose.md (which says the observability stack is "not yet deployed" and tracked as #498). After this issue lands, that sentence becomes a lie. Add a criterion: "The stale 'not yet deployed' note in docs/infrastructure/production-compose.md is updated or removed."
  • Scope boundary is clear: The issue explicitly does not scope .env.example updates, Caddyfile changes, or Gitea secrets changes. That's intentional and correct — this issue documents what already exists, not what to configure.
  • Missing acceptance criterion for CLAUDE.md placement: The issue lists what to add to CLAUDE.md but does not specify where in the existing structure the service table and env vars go. The AC should say: "The new service table appears under the ## Infrastructure heading in CLAUDE.md, after the existing stack bullet list."
  • The first-run steps are numbered and sequential — correct for a procedure that has strict ordering (you cannot create a GlitchTip project before the service is running). The numbering style is consistent with existing DEPLOYMENT.md step patterns.

Recommendations

  • Add one AC: "The stale forward-reference in docs/infrastructure/production-compose.md (noting observability as 'not yet deployed') is updated to reference the new DEPLOYMENT.md section."
  • Add one AC: "The CLAUDE.md service table appears under the ## Infrastructure heading, and the env var list appears in a clearly labeled subsection beneath it."
  • Add a dependency note to the Definition of Done: "Merged only after docker-compose.observability.yml exists in the repo, so the container names and port numbers in the documentation can be verified against the actual file."
  • The issue body contains exact content to be written — implementation risk is low. The main risk is documentation drift if the compose file uses different container names than the ones in the issue spec.

No open decisions — the requirements are clear and the content is fully prescribed.

Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

  • This is a documentation issue — no application code changes. No injection, auth, or data-exposure risks introduced. Security review focuses on what the documentation itself discloses or obscures.
  • Credential handling in the first-run steps is correct: DSNs are added via echo >> .env (not hardcoded in any committed file), and GLITCHTIP_SECRET_KEY is generated with secrets.token_hex(32) — same pattern as the existing OCR_TRAINING_TOKEN. No secrets committed to git.
  • The GRAFANA_ADMIN_PASSWORD credential column says admin / $GRAFANA_ADMIN_PASSWORD: The documentation correctly references the env var, not a hardcoded value. The first-run step (step 7) instructs the operator to change the Grafana admin password at http://localhost:3001. This is a manual step — there is no enforcement. Consider adding a warning that Grafana's default admin password (admin) must be changed before exposing the port to any network. The doc says "Change Grafana admin password" but does not warn about the consequence of skipping it.
  • GlitchTip DSN sensitivity: A GlitchTip DSN contains the project URL and a public key. It is not a secret in the Sentry model (DSNs are designed to be used in browser-side JavaScript). However, the documentation adds it to .env alongside SENTRY_DSN (backend) and VITE_SENTRY_DSN (frontend). This is appropriate — .env is gitignored and treated as sensitive. No concern here.
  • Prometheus at localhost:9090 with no auth: The service URL table shows Prometheus with "none" credentials. If this port is published in production, any process on the host can query Prometheus metrics, which may include timing data, endpoint names, and error rates. This is a minor information disclosure in a private family archive context, but worth noting. The documentation should clarify this is dev-only access.
  • docker exec -it obs-glitchtip ./manage.py createsuperuser creates a Django superuser on GlitchTip. The doc correctly instructs the operator to do this during first-run. No automation of this credential in committed files — correct.
  • .env file pattern: echo "SENTRY_DSN=<backend-dsn>" >> .env appends to .env. If .env is tracked by git in any way (it isn't — it's in .gitignore), this would be dangerous. The existing project correctly keeps .env out of git. No issue.
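The key-generation step referenced above can be run as a one-liner — a sketch using Python's stdlib `secrets` module, the same pattern the project already uses for `OCR_TRAINING_TOKEN`:

```shell
# Generate a 256-bit secret (64 hex characters) for GLITCHTIP_SECRET_KEY.
python3 -c "import secrets; print(secrets.token_hex(32))"
```

The output is then appended to the gitignored .env as GLITCHTIP_SECRET_KEY=<value> — nothing committed to git.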

Recommendations

  • Strengthen the Grafana password step: add a warning note (e.g., ⚠ Do not skip this step — the default Grafana password is "admin".) to make it explicit that the default is insecure, not just a suggestion.
  • The Prometheus "none" credentials row in the service table should have a footnote or note: "Dev access only — do not expose port 9090 in production."
  • These are Minor security hygiene notes, not blockers.
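One way to make the "dev access only" note for Prometheus enforceable rather than advisory is to bind the port to loopback in the compose file — a sketch, assuming the service is named `prometheus` in `docker-compose.observability.yml`:

```yaml
# docker-compose.observability.yml fragment (sketch; service name assumed)
services:
  prometheus:
    ports:
      - "127.0.0.1:9090:9090"   # loopback only — unreachable from the network
```

With this binding, port 9090 cannot be reached from other hosts even if the firewall is misconfigured.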

No open decisions.


🧪 Sara Holt — QA Engineer & Test Strategist

Observations

  • This is a pure documentation issue — no test pyramid impact. No new test cases are required. QA review focuses on whether the acceptance criteria are testable and whether the documentation itself can be verified.
  • AC "A developer following only DEPLOYMENT.md can get the full stack running from a fresh clone without consulting any other file" — this is testable but requires a live environment with the compose file present. It should be validated by doing a dry-run against the actual docker-compose.observability.yml once it exists. Consider a checklist: follow each numbered step from a clean shell session with only DEPLOYMENT.md open.
  • AC "No broken internal links" — this is mechanically verifiable. A tool like markdown-link-check or a manual pass through every [§X] and (../*) reference in the updated files will confirm no rotten links. The issue doesn't mention running such a check, but it should be part of the PR review.
  • Missing negative test for the AC: The current ACs only describe positive conditions (docs are present, stack runs). There is no AC covering the failure mode: "What happens if a developer follows the first-run steps but the main application stack is not already running?" The documentation says "(application stack must already be running)" but doesn't describe the error the developer will see or how to recover. This is a UX-of-documentation gap, not a blocker.
  • First-run step 6 says "Restart application stack to pick up new env vars" with docker compose up -d backend frontend. This is correct for picking up SENTRY_DSN and VITE_SENTRY_DSN environment variables. However, running docker compose up -d (the full stack) instead of targeting specific services is equally valid and less error-prone. The documentation could note both options.
  • The docker compose -f docker-compose.observability.yml down -v command wipes all observability data. This is a destructive operation. Standard DEPLOYMENT.md practice (as seen in the existing rollback section) is to warn before destructive commands. A short ⚠ This destroys all metrics, logs, and trace data. note is consistent with the existing doc tone.

Recommendations

  • Add a ⚠ warning before the down -v command: "This destroys all metrics, logs, and trace data permanently."
  • Add a note under "Starting and stopping" that the main application stack (docker compose up -d) must be running before starting the observability stack — and what happens if it isn't (the GlitchTip service will fail to connect to its database if it shares the main Postgres instance, for example).
  • For PR review: do a manual link-check pass on all internal cross-references in the updated files before merging.
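The link-check pass can also be automated. A minimal self-contained sketch (check_links is a hypothetical helper, not a project script; a dedicated tool like markdown-link-check is more thorough) that flags relative links whose targets don't exist on disk:

```shell
# check_links FILE: print a BROKEN line for every relative markdown link in
# FILE whose target path does not exist (external URLs and #anchors skipped).
check_links() {
  file="$1"
  dir=$(dirname "$file")
  grep -oE '\]\([^)]+\)' "$file" \
    | sed -E 's/^\]\(//; s/\)$//; s/#.*$//' \
    | grep -vE '^(https?:|mailto:)' \
    | while read -r target; do
        [ -z "$target" ] && continue   # pure-anchor links like (#section)
        [ -e "$dir/$target" ] || echo "BROKEN: $file -> $target"
      done
}
```

Usage: `check_links docs/DEPLOYMENT.md` — an empty output means every relative link resolved.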

No open decisions.


🎨 Leonie Voss — UX Designer & Accessibility Strategist

Observations

  • This is a developer documentation issue — no UI, no components, no frontend changes. Accessibility and visual design are not directly in scope.
  • From a documentation UX perspective (clarity, scannability, progressive disclosure), I reviewed the proposed content structure:
    • The bash code blocks with # comments follow the existing DEPLOYMENT.md style — easy to scan.
    • The service URL table uses a clear 3-column layout (Service | URL | Credentials) — consistent with the existing env var tables.
    • The first-run procedure is numbered 1–7, which is the right format for a sequential, order-dependent setup flow. No issues.
  • Documentation progressive disclosure gap: A new developer opening DEPLOYMENT.md for the first time will hit the existing §4 "Logs + observability" section, which says "Phase 7 of the Production v1 milestone adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet." After this issue lands, that sentence becomes stale. The new "Observability Stack" section should be cross-referenced from §4, and the "Future observability" subsection should be removed or rewritten.
  • The step numbering in the first-run procedure goes from 1 to 7 with inline sub-steps (e.g., step 4 has three sub-bullets for creating projects). Consider using sub-numbering (4a, 4b, 4c) or splitting step 4 into explicit steps, to make the procedure clearer when following it in a terminal session.
  • The GlitchTip GLITCHTIP_SECRET_KEY generation command is buried at the bottom of the first-run section as a separate subsection. Consider moving it before the first-run steps, or making it step 0, since the operator needs this value when populating .env before starting the stack.

Recommendations

  • Update §4 "Logs + observability" in DEPLOYMENT.md to remove the "Future observability" note and add a cross-reference: "→ See §[N] Observability Stack for setup, service URLs, and first-run configuration."
  • Consider renumbering or sub-numbering step 4 in the first-run procedure to make the three GlitchTip sub-steps explicit.
  • Move the GLITCHTIP_SECRET_KEY generation snippet before step 1, or note at the top of the first-run section: "Before starting: generate your GLITCHTIP_SECRET_KEY using the command in §[N.N] and add it to .env."

No open decisions.


🗳️ Decision Queue — Action Required

2 decisions need your input before implementation starts.

Infrastructure / Documentation Scope

  • production-compose.md stale section — in scope or follow-up? The file currently contains a note stating the observability stack is "not yet deployed" and tracked as #498. Once this issue's docs land, that note becomes a contradiction. Options: (a) include fixing it in this issue's scope (low effort, no risk of scope creep since it's documentation-only), or (b) track it as a separate follow-up. Markus recommends option (a). (Raised by: Markus, Elicit)

Infrastructure / Production Access

  • Caddy proxy for Grafana in production — document or leave as localhost? The issue spec leaves Grafana accessible at localhost:3001. This is correct for local dev but means Grafana is not reachable remotely on a VPS without SSH tunnelling or a Caddy proxy. Options: (a) document it as dev-only, with a note that production access requires a Caddy proxy or Tailscale tunnel, (b) add Caddy proxy configuration to the section as the recommended production setup, or (c) leave the current spec as-is and treat remote access as out of scope for this issue. (Raised by: Tobias)
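If option (b) is chosen, the Caddy side is small — a sketch, assuming a hypothetical hostname grafana.example.com and Grafana still bound to localhost:3001:

```
# Caddyfile fragment (sketch — hostname is a placeholder)
grafana.example.com {
    reverse_proxy localhost:3001
}
```

Caddy terminates TLS automatically for the hostname; Grafana's own login (with the changed admin password) remains the only auth layer behind it.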

Reference: marcel/familienarchiv#581