devops(observability): add GlitchTip error tracking infrastructure (GlitchTip + worker + Redis) #578
Context
GlitchTip is a lightweight, actively maintained Sentry-compatible error tracker. It receives error events from the SvelteKit frontend (browser exceptions) and the Spring Boot backend (unhandled exceptions), groups them by fingerprint, and provides an issue-list UI with stack traces.
GlitchTip connects to the existing `archive-db` PostgreSQL instance using a dedicated `glitchtip` database, so no new database container is needed. It needs Redis only for its Celery task queue.
Depends on: scaffold issue (compose file must exist); `archive-db` must be running (main stack must be up)
Services to Add
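One of the services to add is the one-shot init container that creates the dedicated `glitchtip` database. The following is a minimal sketch of how it could look, not the issue's original snippet. The names `glitchtip-db-init`, `archive-db`, `archiv-net`, `postgres:16-alpine`, `PGPASSWORD`, and `restart: "no"` are the ones quoted in the review comments; the libpq variables `PGHOST`, `PGUSER`, and `PGDATABASE`, the exact SQL, and `external: true` on the network are assumptions of this sketch.

```yaml
services:
  glitchtip-db-init:
    image: postgres:16-alpine
    container_name: obs-glitchtip-db-init
    restart: "no"                 # one-shot container: never restart on failure
    environment:
      # Standard libpq variables, so psql needs no connection flags.
      PGHOST: archive-db          # requires the main stack (docker-compose.yml) to be running
      PGUSER: ${POSTGRES_USER}
      PGPASSWORD: ${POSTGRES_PASSWORD}
      PGDATABASE: postgres        # assumption: default maintenance database exists
    # List form: the script is one unambiguous string, no YAML folding involved.
    command:
      - sh
      - "-c"
      - |
        psql -tAc "SELECT 1 FROM pg_database WHERE datname = 'glitchtip'" | grep -q 1 ||
          psql -c "CREATE DATABASE glitchtip"
    networks:
      - archiv-net

networks:
  archiv-net:
    external: true                # assumption: created by the main stack; the real name may carry a project prefix
```

The `grep -q 1 ||` guard keeps the `CREATE DATABASE` idempotent: on a second run the lookup finds the database and the container exits 0 without touching it.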
Acceptance Criteria
- `docker compose -f docker-compose.observability.yml up -d redis glitchtip glitchtip-worker` starts all containers without error
- `docker exec archive-db psql -U $POSTGRES_USER -c '\l'` shows a `glitchtip` database
- `http://localhost:3002`
- `curl -s http://localhost:3002/api/0/` returns HTTP 200
- A superuser can be created with `docker exec obs-glitchtip ./manage.py createsuperuser`
First-Run Steps (document in commit message or PR body)
Definition of Done
🔧 Tobias Wendt, DevOps & Platform Engineer
Observations
- `:latest` tag on GlitchTip: `image: glitchtip/glitchtip:latest` appears three times (web, worker, db-init). This is a production-bound service in a milestone called "Observability Stack". The ToBI persona rule is clear: `:latest` is not a version. GlitchTip tags releases as `v4.x.y`, so pin to a specific version now.
- `glitchtip-db-init` uses `postgres:16-alpine`: the version is pinned here (good), but the container runs a `psql` command against `archive-db`, and that command must succeed before GlitchTip can start. The `depends_on: glitchtip-db-init` chain only works if the init container exits 0, but if `archive-db` isn't on `obs-net` it will fail silently. The init container is on `archiv-net` only, which is correct since `archive-db` is also on `archiv-net`. That works, but the container name `archive-db` relies on the main stack being up, a runtime dependency that the compose file cannot enforce.
- `glitchtip` and `glitchtip-worker` lack healthchecks. GlitchTip exposes `/api/0/`, which returns 200, so a simple `curl -sf http://localhost:8080/api/0/` covers it. The worker has no HTTP interface: a `celery inspect ping` would work but adds celery-cli overhead; `pg_isready` against the DB is acceptable as a proxy.
- A `redis-cli ping` check is trivial and prevents `glitchtip` from starting before Redis is ready. `depends_on: redis` without a condition is a startup race.
- The issue says GlitchTip connects to the existing "`archive-db` PostgreSQL": verify that GlitchTip is configured with `DISABLE_COLLECTSTATIC=1` or has an explicit `MEDIA_ROOT` volume if attachments need to survive.
- `obs-net` isn't declared: the compose snippet defines two networks (`archiv-net` and `obs-net`) and places services on both, but it doesn't include a top-level `networks:` declaration. If `obs-net` is a new network it must be declared. If it's an external network shared across stacks, it needs `external: true`.
- Binding to `0.0.0.0`: `"${PORT_GLITCHTIP:-3002}:8080"` binds to all interfaces. In production this should be `127.0.0.1:${PORT_GLITCHTIP:-3002}:8080` (behind Caddy). The issue is a dev-first issue, so `0.0.0.0` is acceptable for now, but a note in the commit message or PR body to restrict this before the prod compose file is written would prevent it being copy-pasted as-is.
- `docker-compose.observability.yml` is a standalone file, consistent with the pattern. Good.
- `restart: "no"` on db-init: the issue shows `restart: "no"`, which is correct. One-shot containers should not restart on failure.
Recommendations
- Pin `image: glitchtip/glitchtip:v4.0.4` (or whatever the current stable release is). This is not negotiable for a service that stores error telemetry.
- Add a healthcheck to the `redis` service: `test: ["CMD", "redis-cli", "ping"]`, `interval: 5s`, `timeout: 3s`, `retries: 5`. Then update `glitchtip.depends_on.redis` to `condition: service_healthy`.
- Add a healthcheck to the `glitchtip` service: `test: ["CMD-SHELL", "curl -sf http://localhost:8080/api/0/ || exit 1"]`. Update `glitchtip-worker.depends_on.glitchtip` to `condition: service_healthy`.
- Declare `obs-net` at the top level of the compose file. If it's meant to be the new observability-stack-internal network, add `obs-net: driver: bridge`. If GlitchTip only needs to reach `archive-db` and `mailpit`, it only needs `archiv-net` and `obs-net` is redundant.
- Add a media volume such as `glitchtip_media:/code/uploads` (check GlitchTip docs for the exact mount path).
- `PORT_GLITCHTIP` should be bound to `127.0.0.1` in the production compose overlay.
- Update `docs/architecture/c4/l2-containers.puml` and `docs/DEPLOYMENT.md`: a new container was added (GlitchTip). Per the CLAUDE.md doc update table, this is required.
🏛️ Markus Keller, Application Architect
Observations
- The services sit on two networks, `archiv-net` and `obs-net`. Looking at the actual dependencies: GlitchTip needs to reach `archive-db` (on `archiv-net`) and `mailpit` (on `archiv-net`). Redis is internal to the observability stack. The worker needs Redis and the DB. The clearest design is to put everything on `archiv-net`. `obs-net` adds isolation that serves no present purpose and adds mental overhead when reading the file.
- Reusing the existing `archive-db` PostgreSQL container for the `glitchtip` database is the architecturally sound choice for this scale. The `self-hosted-catalogue.md` explicitly endorses this pattern. The `glitchtip-db-init` container handles the `CREATE DATABASE` idempotently. This is consistent with the project's "boring technology wins" principle.
- Redis as the Celery broker deserves an ADR. The alternatives are: (a) an in-memory broker (fast but not durable), (b) `django-celery-results` + PostgreSQL (no Redis needed, but GlitchTip doesn't support this out of the box), (c) Redis as chosen (durable, standard, operationally simple). The next ADR number is 015.
- `glitchtip-db-init` external dependency: the init container depends on `archive-db` by DNS name, but `archive-db` is in the main stack (`docker-compose.yml`). This creates a cross-stack runtime dependency with no compose-level enforcement. The acceptance criteria require "main stack must be up", which is documented as a precondition and therefore acceptable. But it means `docker compose -f docker-compose.observability.yml up` will fail silently if the main stack is down, rather than failing with a clear error. Worth noting in the PR.
- `l2-containers.puml` update required: per CLAUDE.md's doc update table, adding a new Docker service requires updating `docs/architecture/c4/l2-containers.puml`. GlitchTip is a new external-facing container.
Recommendations
- Drop `obs-net` and use only `archiv-net`. There's no isolation benefit to a second network here. Redis should only be reachable by `glitchtip` and `glitchtip-worker`, which `expose` (not `ports`) already handles.
- Check whether the worker (the `run-celery-with-beat.sh` command) needs a persistent volume for its beat schedule. If the worker is recreated, the schedule resets. For error tracking this is usually acceptable, but it's worth checking.
- Update `docs/architecture/c4/l2-containers.puml` to add GlitchTip as a new container in the system boundary, and update `docs/DEPLOYMENT.md` with the startup sequence.
🔒 Nora "NullX" Steiner, Application Security Engineer
Observations
- `GLITCHTIP_SECRET_KEY`, hardcoded fallback risk: the compose snippet uses `SECRET_KEY: ${GLITCHTIP_SECRET_KEY}` with no default. This is correct; do not add a default. If the env var is missing, Django should fail to start rather than use a weak key. Verify that GlitchTip does indeed fail loudly when `SECRET_KEY` is empty (most Django apps do; confirm by checking GlitchTip's startup behavior). If it silently uses an empty string, that's a critical signing-key vulnerability: session cookies and CSRF tokens would be forgeable.
- Port exposure: `"${PORT_GLITCHTIP:-3002}:8080"` binds to `0.0.0.0:3002` by default. GlitchTip's admin panel is at `/admin/` and gives full access to all error events, which include stack traces, request parameters, and potentially session tokens or credentials that appear in error reports. In a dev environment this is acceptable; in production it must be `127.0.0.1:${PORT_GLITCHTIP}:8080` behind Caddy with authentication. The issue's ACs don't mention access control, so this should be an explicit AC for the production phase.
- Stored error events call for a retention limit, `GLITCHTIP_MAX_EVENT_LIFE_DAYS: 90` (already in the issue, good), and data scrubbing rules. The official Sentry/GlitchTip SDK supports `before_send` hooks to strip PII before transmission. This is a follow-up concern, but it's worth noting now: the DSN issue (the next issue that uses these DSNs) should include PII scrubbing in its ACs.
- Redis is addressed as `redis://redis:6379/0` with no password. Since Redis is only on the internal Docker network (`expose`, not `ports`), this is acceptable for a dev setup. In production, Redis should either have a password (`redis://:password@redis:6379/0`) or be network-isolated. The current `expose`-only approach is the right minimal fix, but document it explicitly in the ADR or PR so a future operator doesn't accidentally add `ports:` to Redis.
- `glitchtip-db-init` runs psql with `PGPASSWORD`: the environment variable `PGPASSWORD: ${POSTGRES_PASSWORD}` is the standard way to pass PostgreSQL passwords to `psql`. This is fine. The alternative (`-W` interactive prompt) doesn't work in Docker. No issue here.
- `DATABASE_URL` contains credentials in plaintext: `postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/glitchtip` is a URL with embedded credentials. This is the standard GlitchTip configuration. The risk is that if these values appear in container inspect output or logs, credentials leak. Mitigation: use Docker secrets in production. For the scope of this issue (dev stack), using env vars is acceptable.
- The first-run step runs `./manage.py createsuperuser`. This creates a Django admin account separate from the GlitchTip UI account. The Django admin at `/admin/` exposes raw database access. The PR body should note that the Django admin path should be blocked at the Caddy level in production (respond 404 to `glitchtip.example.com/admin/` from external access); only the GlitchTip UI at `/` needs to be publicly accessible.
Recommendations
- Add a production-phase AC: "Bind `PORT_GLITCHTIP` to `127.0.0.1` and front with Caddy; block `/admin/` at the proxy level."
- Verify `SECRET_KEY`: check the Docker startup logs when `GLITCHTIP_SECRET_KEY` is unset. If GlitchTip starts with an empty key, the `GLITCHTIP_SECRET_KEY` env var needs a mandatory check (same pattern as `IMPORT_HOST_DIR` in the prod compose: `${GLITCHTIP_SECRET_KEY:?Set GLITCHTIP_SECRET_KEY}`).
- Add `before_send` PII scrubbing in both the Sentry-Java SDK (Spring Boot) and `@sentry/sveltekit` to strip email addresses, user IDs, and request bodies from error events before they're stored in GlitchTip.
- Add `requirepass` to the Redis config or keep it `expose`-only (not `ports`-exposed) behind the Docker network boundary.
👨‍💻 Felix Brandt, Senior Fullstack Developer
Observations
This issue is pure infrastructure — no application code changes. From a developer ergonomics perspective, the main concern is: does this compose file integrate cleanly into the dev workflow, and does it set up the right plumbing for the backend and frontend error tracking issues that will follow?
- `glitchtip-db-init` command is fragile: the multi-line `command:` block in the compose snippet has a YAML quoting issue. The `>` block scalar folds newlines into spaces, which means the `grep -q 1 ||` check and the final `psql` call become a single space-separated string. This may work in some shells but is brittle. A cleaner approach: use a dedicated `sh -c` entrypoint with explicit semicolons, or use the `command: ["sh", "-c", "..."]` list form, which is unambiguous.
- No `.env.example` addition shown: the issue introduces three new required env vars, `GLITCHTIP_SECRET_KEY`, `GLITCHTIP_DOMAIN`, and `PORT_GLITCHTIP`. The project almost certainly has a `.env.example` file (or equivalent). These vars need to be added there, otherwise the next developer who clones the repo and runs the observability stack will get a confusing failure with no guidance.
- First-run steps belong in `docs/DEPLOYMENT.md`: the issue says "document in commit message or PR body." The `createsuperuser` step is a persistent operational procedure, not a one-time commit note. It should live in `docs/DEPLOYMENT.md` under an "Observability Stack" section, not buried in git history.
- `docker exec -it obs-glitchtip` requires an interactive terminal: the first-run step uses `-it`. In automated contexts (CI, scripted deploys) this fails. For the `createsuperuser` step, this is acceptable since it's a one-time interactive procedure. The acceptance criterion says "a superuser can be created", not that it happens automatically. That's fine for this scope.
- The `curl -s http://localhost:3002/api/0/` check is a good smoke test. The DSN copy steps are documented as manual procedures. The ACs are well-structured for a DevOps issue.
Recommendations
- Rewrite the `glitchtip-db-init` command using the list form to avoid YAML folding ambiguity.
- Add `GLITCHTIP_SECRET_KEY`, `GLITCHTIP_DOMAIN`, and `PORT_GLITCHTIP` to `.env.example` with documented values (e.g. `GLITCHTIP_SECRET_KEY=change-me-in-production`, `PORT_GLITCHTIP=3002`).
- Move the first-run steps to `docs/DEPLOYMENT.md`. Commit messages are not searchable operational runbooks.
🧪 Sara Holt, QA Engineer & Test Strategist
Observations
This is an infrastructure-only issue with no application code to test at the unit or integration layer. The relevant test layer is smoke/E2E verification of the running stack.
- `curl -s http://localhost:3002/api/0/` returning HTTP 200 is the right smoke test: it is testable and unambiguous. However, the existing E2E test suite (`npm run test:e2e`) does not start the observability stack. If GlitchTip integration (DSN wiring) is added to the backend/frontend in follow-up issues, those integration tests must not depend on GlitchTip being up; the DSN env vars should be optional/no-op when absent. Otherwise CI breaks when `docker-compose.observability.yml` isn't running.
- Nothing verifies that `glitchtip-db-init` succeeds when `archive-db` is running and fails gracefully when it's not. This is a runtime risk: if someone runs `docker compose -f docker-compose.observability.yml up` without the main stack, the failure message from `psql` will be a connection-refused error with no indication of why. A comment in the compose file explaining the dependency is the minimum mitigation.
- The `curl -s http://localhost:3002/api/0/` AC can be checked while GlitchTip is still running database migrations, and GlitchTip returns 500s during migration. The AC should specify "after `docker compose` reports all containers healthy", but without a healthcheck on the `glitchtip` service, `docker compose up --wait` never waits for it.
Recommendations
- Reword the curl AC so it applies only once startup has finished (`docker compose -f docker-compose.observability.yml ps` shows no containers in Starting state).
- Add a comment to the compose file: `# Requires the main stack (docker-compose.yml) to be running: archive-db must be reachable on archiv-net.` This makes the precondition explicit at the point where it matters.
- The Sentry SDK can be given `SENTRY_DSN=""` to disable itself; verify this works for GlitchTip's DSN format.
📋 Elicit, Requirements Engineer
Observations
The issue is well-structured for a DevOps infrastructure ticket. The body contains a Docker Compose snippet, acceptance criteria, and first-run steps. From a requirements completeness standpoint:
- The backend uses `sentry-spring-boot-starter`, which connects to GlitchTip via DSN regardless of the project type set in the UI. The AC says "type: Django", which will cause confusion when the developer sees the project type list and wonders whether to choose "Java" or "Spring." Recommend changing the AC to: "Create a new Project (any type, e.g. Django). The project type is metadata only; the DSN works with any Sentry-compatible SDK."
- The first-run steps belong in `docs/DEPLOYMENT.md` under a new "Observability Stack" section.
- `GLITCHTIP_MAX_EVENT_LIFE_DAYS: 90` is set. This is a good default, but the database growth implications of 90 days of error events are not discussed. For a family archive with low traffic this is negligible, but it should be acknowledged.
Recommendations
- Add an AC: "The main stack (`docker-compose.yml`) is confirmed to be running before starting the observability stack."
- Document the first-run steps in `docs/DEPLOYMENT.md` as a persistent section rather than relying on PR/commit body.
🎨 Leonie Voss, UX Designer & Accessibility Strategist
Observations
This is a pure infrastructure issue — no frontend components, no UI changes, no user-facing flows. From my angle, the relevant concern is downstream: what gets built when these DSNs are wired into the frontend and backend.
- One thing to watch is what the `@sentry/sveltekit` SDK captures. By default, Sentry captures full page URLs including query parameters. For the Familienarchiv, document URLs contain UUIDs (`/documents/{id}`), which are non-sensitive, but user search queries (`?q=...`) may contain personal names. The frontend DSN integration issue should include PII filtering as an AC.
No blocking concerns from the UX/accessibility perspective.
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Architecture
- ADR-015: Redis as Celery broker. GlitchTip requires Celery, and Celery requires a message broker. Redis 7-alpine is the proposed choice. The alternatives are: (a) an in-memory broker, fast but not durable, with events lost on restart; (b) a PostgreSQL-backed broker via `django-celery-results`, which needs no new service, but GlitchTip doesn't support this out of the box. Write ADR-015 before implementing. (Raised by: Markus)
- `obs-net` vs `archiv-net`, collapse to one network? The compose snippet puts all observability services on both `archiv-net` and a new `obs-net`. GlitchTip only needs to reach `archive-db` and `mailpit`, which are already on `archiv-net`. Redis can be `expose`-only and still reachable by GlitchTip on the same network. `obs-net` adds no isolation benefit at this scale. Decision: use only `archiv-net` for everything, or keep `obs-net` for future Prometheus/Loki/Alertmanager services in this milestone? (Raised by: Markus, Tobias)
Infrastructure / Security
- `PORT_GLITCHTIP` binding: the dev compose binds to `0.0.0.0:3002`. GlitchTip's Django admin at `/admin/` gives raw database access to error events containing stack traces and potentially session data. Two sub-decisions: (1) When the prod compose is written, confirm `127.0.0.1:${PORT_GLITCHTIP}:8080` binding and Caddy fronting. (2) Should the Caddy config for production block the `/admin/` path entirely, or is it acceptable behind basic auth? (Raised by: Nora, Tobias)
Implementation complete on branch `feat/issue-578-glitchtip`.
Two commits:
- `feat(observability): add GlitchTip error tracking to observability stack`: adds four services to `docker-compose.observability.yml`:
  - `obs-glitchtip-db-init`: one-shot `postgres:16-alpine` container that creates the `glitchtip` database on `archive-db` if it doesn't already exist
  - `obs-redis`: Redis 7 Alpine broker for Celery, internal to `obs-net` only
  - `obs-glitchtip`: main web process, bound to `127.0.0.1:${PORT_GLITCHTIP:-3002}:8080`, on both `archiv-net` (reaches `archive-db` and `mailpit`) and `obs-net`
  - `obs-glitchtip-worker`: Celery + beat worker (`./bin/run-celery-with-beat.sh`)
- `docs(observability): document GlitchTip services in DEPLOYMENT.md and C4 diagram`: extends the observability env var table with `PORT_GLITCHTIP`, `GLITCHTIP_DOMAIN`, `GLITCHTIP_SECRET_KEY`; adds service rows to the services table; adds a GlitchTip first-run subsection (superuser creation, org + two projects); updates `docs/architecture/c4/l2-containers.puml` with the GlitchTip and Redis containers and their relationships.
Validation:
`docker compose -f docker-compose.observability.yml config` exits cleanly (expected warning for `GLITCHTIP_SECRET_KEY` having no default; it's a required secret by design).
PR-ready: no breaking changes, additive-only.
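For orientation, here is a rough sketch of how the Redis, web, and worker services described in this thread could fit together once the reviewers' healthcheck and pinning suggestions are applied. It is not the contents of the branch. The image tag `v4.0.4` is only the example from the review, the `REDIS_URL` variable name, the healthcheck timings, the presence of `curl` in the image, and `external: true` on `archiv-net` are assumptions, and the stricter `${GLITCHTIP_SECRET_KEY:?...}` form turns the warning mentioned above into a hard error. The db-init service is omitted here.

```yaml
# Sketch only: assumptions are marked inline.
x-glitchtip-env: &glitchtip-env
  # Compose refuses to run if the secret is unset (pattern suggested in the review).
  SECRET_KEY: ${GLITCHTIP_SECRET_KEY:?Set GLITCHTIP_SECRET_KEY}
  # URL as quoted in the review, including the `db` host name.
  DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/glitchtip
  REDIS_URL: redis://redis:6379/0       # variable name is an assumption
  GLITCHTIP_DOMAIN: ${GLITCHTIP_DOMAIN}
  GLITCHTIP_MAX_EVENT_LIFE_DAYS: 90

services:
  redis:
    image: redis:7-alpine
    container_name: obs-redis
    expose:
      - "6379"                          # internal only, never published to the host
    networks: [obs-net]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  glitchtip:
    image: glitchtip/glitchtip:v4.0.4   # example tag; pin to the actual current release
    container_name: obs-glitchtip
    environment: *glitchtip-env
    ports:
      - "127.0.0.1:${PORT_GLITCHTIP:-3002}:8080"
    networks: [archiv-net, obs-net]
    depends_on:
      redis:
        condition: service_healthy      # avoids the startup race with Redis
    healthcheck:
      # Assumes curl exists inside the image; timings are illustrative.
      test: ["CMD-SHELL", "curl -sf http://localhost:8080/api/0/ || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 12

  glitchtip-worker:
    image: glitchtip/glitchtip:v4.0.4
    container_name: obs-glitchtip-worker
    command: ./bin/run-celery-with-beat.sh
    environment: *glitchtip-env
    networks: [archiv-net, obs-net]
    depends_on:
      glitchtip:
        condition: service_healthy

networks:
  archiv-net:
    external: true                      # assumption: created by the main stack
  obs-net:
    driver: bridge
```

With healthchecks in place, `docker compose -f docker-compose.observability.yml up -d --wait` has something to wait on, which addresses the concern that the curl smoke test could otherwise run while migrations are still in progress.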