feat(observability): add GlitchTip error tracking infrastructure #590

Merged
marcel merged 3 commits from feat/issue-578-glitchtip into main 2026-05-15 06:12:29 +02:00
Owner

Summary

  • Adds 4 services to docker-compose.observability.yml: obs-glitchtip-db-init, obs-redis, obs-glitchtip, obs-glitchtip-worker
  • GlitchTip uses existing archive-db PostgreSQL (dedicated glitchtip database, created automatically by init container)
  • Redis (redis:7-alpine) used as Celery task queue, internal to obs-net only
  • GlitchTip web UI accessible at http://localhost:${PORT_GLITCHTIP:-3002}
  • All images pinned, all host ports bound to 127.0.0.1
  • Updates docs/DEPLOYMENT.md with first-run steps and env var table
  • Updates docs/architecture/c4/l2-containers.puml with GlitchTip + Redis containers
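
For orientation, the web service described above might look roughly like the fragment below. This is a sketch assembled from details quoted in the reviews (image tag, port mapping, networks), not a copy of the file in this PR; the worker, Redis, and environment settings are left out.

```yaml
# Sketch only; not the actual contents of docker-compose.observability.yml.
services:
  obs-glitchtip:
    image: glitchtip/glitchtip:v4
    ports:
      # Host side is bound to loopback; the host port defaults to 3002.
      - "127.0.0.1:${PORT_GLITCHTIP:-3002}:8080"
    networks:
      - obs-net      # reaches obs-redis
      - archiv-net   # reaches archive-db and mailpit
```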

Closes #578

🤖 Generated with Claude Code

marcel added 2 commits 2026-05-15 04:40:08 +02:00
Adds obs-glitchtip, obs-glitchtip-worker, obs-redis, and obs-glitchtip-db-init
services to docker-compose.observability.yml. The one-shot db-init container
creates the dedicated glitchtip database on the existing archive-db PostgreSQL
instance automatically on first stack start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(observability): document GlitchTip services in DEPLOYMENT.md and C4 diagram
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m53s
CI / OCR Service Tests (pull_request) Successful in 32s
CI / Backend Unit Tests (pull_request) Failing after 23m39s
CI / fail2ban Regex (pull_request) Successful in 2m13s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m55s
3ced565aa2
Adds GlitchTip env vars to the observability env var table, extends the
services table, and adds a first-run section with superuser creation and
project setup steps. Updates the C4 L2 container diagram with GlitchTip
and Redis containers and their relationships.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

🏗️ Markus Keller — Senior Application Architect

Verdict: 🚫 Changes requested

Blocker — C4 diagram references undefined PlantUML alias

In docs/architecture/c4/l2-containers.puml, the diff adds:

Rel(obs_glitchtip_worker, obs_redis, "Processes Celery tasks", "Redis / obs-net")

But obs_glitchtip_worker is never defined as a Container() element. PlantUML will either silently drop the relationship or render a dangling node. The C4 diagram is broken.

Fix — choose one of:

  1. Add a second container for the worker:

    Container(obs_glitchtip_worker, "GlitchTip Worker", "glitchtip/glitchtip:v4", "Celery + beat worker — async event ingestion, notifications, cleanup.")
    

    And update the existing glitchtip container description to clarify it's the web process only.

  2. Or collapse web + worker into one logical container and remove the obs_glitchtip_worker Rel (documenting the internal relationship in the description instead). This is defensible since both run the same Docker image.

What's Done Right

  • Architecture decision to share archive-db instead of adding a new PostgreSQL container is correct — one DB per operator, not one per service.
  • obs-glitchtip-db-init as a one-shot idempotent init container is the right pattern (not a startup script, not a migration hook in the app).
  • condition: service_completed_successfully on the init container is the correct Compose depends_on condition for one-shot jobs (it gates on the exit status, not on a healthcheck).
  • DEPLOYMENT.md updated with first-run steps, env var table, and service table. ✓

Doc Table Pass

| PR trigger | Required update | Status |
|---|---|---|
| New Docker services (4) | l2-containers.puml + DEPLOYMENT.md | ✅ Both updated (C4 has a bug to fix) |
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

Concerns (non-blocking, but worth addressing before prod)

1. No healthcheck on obs-glitchtip

GlitchTip's REST API at /api/0/ returns HTTP 200 when ready. Without a healthcheck, operators can't distinguish "container started" from "Django app is accepting requests," and docker compose ps shows no health status. The worker (obs-glitchtip-worker) also has no healthcheck — though Celery workers are harder to health-check, the web process definitely should have one.

Suggested fix:

obs-glitchtip:
  healthcheck:
    test: ["CMD-SHELL", "wget -qO- http://localhost:8080/api/0/ | grep -q '\"version\"' || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 60s  # Django startup + DB migration can take 30-50s

2. glitchtip/glitchtip:v4 is a major version tag, not a patch-pinned version

v4 will advance as v4.x.y releases ship. This is better than :latest but still moves. In production I'd prefer glitchtip/glitchtip:v4.1.4 (or whatever the current patch). For now it's acceptable given this is the same version constraint the issue spec requested, and Renovate can track it. Flagging for awareness.

What's Done Right

  • redis:7-alpine pinned ✓
  • postgres:16-alpine pinned (matches the main stack's postgres version) ✓
  • Redis healthcheck with redis-cli ping ✓
  • condition: service_healthy on Redis dependency ✓
  • condition: service_completed_successfully on db-init dependency ✓
  • restart: "no" on the init container — correct ✓
  • Port bound to 127.0.0.1 ✓
  • glitchtip_data named volume for Redis persistence ✓
  • Init container is idempotent (checks before creating) ✓
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

Security Observations

1. GlitchTip on archiv-net expands attack surface — informational

obs-glitchtip and obs-glitchtip-worker are both on archiv-net, which is necessary to reach archive-db and mailpit. However, this also means a compromised GlitchTip container can reach every other container on archiv-net (backend, frontend, MinIO, etc.). This is an acceptable trade-off for a single-operator self-hosted stack, but worth documenting as a known risk.

If this becomes a concern later: the DB access could be isolated by creating a dedicated glitchtip-net that has only GlitchTip ↔ archive-db connectivity, instead of the full archiv-net.
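
A minimal sketch of what that isolation could look like, assuming a pre-created network named glitchtip-net (hypothetical, not part of this PR) that archive-db also joins from the main compose file:

```yaml
# Hypothetical follow-up; glitchtip-net does not exist in this PR.
# Assumes the network was created once with: docker network create glitchtip-net
networks:
  glitchtip-net:
    external: true

services:
  obs-glitchtip:
    networks: [obs-net, glitchtip-net]   # obs-net is declared elsewhere in the file
  obs-glitchtip-worker:
    networks: [obs-net, glitchtip-net]
```

archive-db would need a matching glitchtip-net entry in the main stack. mailpit currently sits on archiv-net, so mail delivery would need the same treatment or a separate path.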

2. POSTGRES_USER shell substitution in db-init command — very low risk, informational

The command uses:

command: >
  sh -c "psql -h archive-db -U ${POSTGRES_USER} -tc ..."

If POSTGRES_USER contained shell metacharacters, this could be a command injection. In practice, POSTGRES_USER is operator-controlled and contains only safe identifiers. The actual risk is negligible — noted for completeness.
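
If the interpolation ever needs hardening, one option is to hand the connection settings to psql through the standard libpq environment variables, so the value never becomes part of the sh -c string. The fragment below is a sketch under assumptions: the POSTGRES_PASSWORD variable name and the postgres maintenance database are guesses, not taken from the PR.

```yaml
# Sketch only; variable names marked "assumed" are not confirmed by the PR.
services:
  obs-glitchtip-db-init:
    environment:
      PGHOST: archive-db
      PGUSER: ${POSTGRES_USER}
      PGPASSWORD: ${POSTGRES_PASSWORD}  # assumed variable name
      PGDATABASE: postgres              # assumed maintenance database
    command: >
      sh -c "psql -tc \"SELECT 1 FROM pg_database WHERE datname = 'glitchtip'\" |
      grep -q 1 ||
      psql -c \"CREATE DATABASE glitchtip;\""
```

Compose still substitutes ${POSTGRES_USER}, but only into an environment value that psql reads directly, not into text the shell parses.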

3. No GlitchTip registration restriction set

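GlitchTip (Django) allows open user registration by default unless it is explicitly disabled. The GlitchTip installation docs name the relevant setting ENABLE_USER_REGISTRATION (REGISTRATION_OPEN does not appear there, so verify the name against the deployed version). On a family archive where only the admin should access the error tracker, consider adding:

environment:
  ENABLE_USER_REGISTRATION: "False"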

This prevents anyone who discovers the port from creating an account.

What's Done Right

  • SECRET_KEY: ${GLITCHTIP_SECRET_KEY} — no default value, fail-closed ✓
  • Port binding 127.0.0.1 — not internet-reachable ✓
  • DATABASE_URL uses env vars for credentials — no hardcoded secrets ✓
  • EMAIL_URL: smtp://mailpit:1025 — dev mail catcher, no credential exposure ✓
  • GLITCHTIP_MAX_EVENT_LIFE_DAYS: 90 — data retention limit set, good privacy practice ✓
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

What I Checked

  • YAML structure and naming conventions
  • No backend or frontend code changes in this PR

Observations

Pure infrastructure — no Java, Svelte, or Python code changed. YAML is clean and readable.

The obs-* naming prefix is consistent with the rest of the observability stack. The four service names (obs-redis, obs-glitchtip, obs-glitchtip-worker, obs-glitchtip-db-init) are self-documenting.

The init container command is a clean idiom:

command: >
  sh -c "psql ... -tc \"SELECT 1 FROM pg_database WHERE datname = 'glitchtip'\" |
  grep -q 1 ||
  psql ... -c \"CREATE DATABASE glitchtip;\""

The idempotency guard (grep -q 1 || ...) means re-running the stack doesn't fail on an existing database.

No TDD concerns — infrastructure configuration, not application logic.

Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ✅ Approved

What I Checked

  • Testability of the init container pattern
  • Startup ordering correctness
  • Idempotency of the db-init approach

Observations

The obs-glitchtip-db-init init container is the correct pattern for one-time database setup:

  • restart: "no" — won't loop ✓
  • condition: service_completed_successfully — web container only starts after init exits 0 ✓
  • Idempotent check (SELECT 1 FROM pg_database WHERE datname = 'glitchtip') — safe to re-run ✓

This is much better than a startup script in the application container or a manual step — it's reproducible and testable.

Startup ordering chain is correct:

archive-db (main stack, healthy) →
obs-glitchtip-db-init (exits 0) →
obs-redis (healthy) →
obs-glitchtip (starts)
obs-glitchtip-worker (starts)
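
In Compose terms, the chain above would be expressed roughly as follows. This is a sketch: the two conditions are the ones named in the reviews, while applying the same depends_on block to the worker is an assumption. archive-db lives in the main stack, so its readiness cannot be expressed as a depends_on entry here.

```yaml
# Sketch of the ordering only; other service settings are omitted.
services:
  obs-glitchtip-db-init:
    restart: "no"   # one-shot job: run once, exit, do not loop

  obs-glitchtip:
    depends_on:
      obs-glitchtip-db-init:
        condition: service_completed_successfully   # init exited 0
      obs-redis:
        condition: service_healthy                   # redis-cli ping passes

  obs-glitchtip-worker:
    depends_on:   # assumed to mirror the web service
      obs-glitchtip-db-init:
        condition: service_completed_successfully
      obs-redis:
        condition: service_healthy
```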

Missing: no healthcheck on obs-glitchtip. This means docker compose ps can't tell you if GlitchTip's web process is actually serving requests. Non-blocking for merge but worth adding (Tobias has the suggested test command).

First-run documentation in DEPLOYMENT.md is complete — the createsuperuser command and project creation steps are clearly documented. ✓

Author
Owner

🎨 Leonie Voss — UX Design Lead

Verdict: ✅ Approved

What I Checked

  • Frontend/UI changes in this PR

Result

This PR modifies only docker-compose.observability.yml, docs/DEPLOYMENT.md, and docs/architecture/c4/l2-containers.puml. No Svelte components, frontend routes, or UI code were changed.

GlitchTip's own UI is a third-party application — out of scope for this review. The DEPLOYMENT.md documents the first-run setup steps clearly so operators know what to do.

No UX concerns.

Author
Owner

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

Requirements Alignment

Issue #578 called for GlitchTip + worker + Redis, using the existing archive-db PostgreSQL, with a one-shot db-init container. This PR delivers all of it:

| Requirement | Status |
|---|---|
| Redis service (Celery broker) | ✅ obs-redis with healthcheck |
| GlitchTip web service | ✅ obs-glitchtip on port 3002 |
| GlitchTip worker (Celery+beat) | ✅ obs-glitchtip-worker |
| DB init one-shot (idempotent) | ✅ obs-glitchtip-db-init |
| Uses existing archive-db (no new DB container) | ✅ DATABASE_URL points to archive-db:5432 |
| Port 127.0.0.1:${PORT_GLITCHTIP:-3002}:8080 | ✅ |
| GLITCHTIP_SECRET_KEY and GLITCHTIP_DOMAIN from .env.example | ✅ Already present from scaffold issue |
| DEPLOYMENT.md first-run steps | ✅ createsuperuser, org + 2 project creation |
| C4 diagram updated | ✅ (minor bug to fix; see Markus) |

Observation

The acceptance criteria in issue #578 include verifying curl -s http://localhost:3002/api/0/ returns HTTP 200. This is a manual verification step that can't be automated in CI (requires a running stack). The DEPLOYMENT.md documents the first-run steps but doesn't explicitly list this verification — acceptable since it's operator-facing infrastructure.

The two-project setup (familienarchiv-frontend + familienarchiv-backend) maps directly to the subsequent issues #579 and #580. The handoff is clean.

marcel added 1 commit 2026-05-15 04:43:45 +02:00
fix(observability): define obs_glitchtip_worker Container in C4 diagram
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m45s
CI / OCR Service Tests (pull_request) Successful in 36s
CI / Backend Unit Tests (pull_request) Failing after 23m49s
CI / fail2ban Regex (pull_request) Successful in 2m13s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m46s
67004737f6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit 427c3ea537 into main 2026-05-15 06:12:29 +02:00
marcel deleted branch feat/issue-578-glitchtip 2026-05-15 06:12:30 +02:00
Reference: marcel/familienarchiv#590