feat(backend): Sentry/GlitchTip error reporting + observability deploy wiring #595
Reference in New Issue
Block a user
Delete Branch "feat/issue-580-sentry-backend"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
This PR wires the observability stack (already deployed via
docker-compose.observability.yml) into the CI/CD pipelines and Caddy config.docker-compose.prod.yml:name: archiv-netgives the compose project a stable, predictable network name so the observability stack can join it viaexternal: true.nightly.yml/release.yml: start the observability stack after the main app deploy; wireGRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY, andSENTRY_DSNsecrets into the env file.infra/caddy/Caddyfile: add Grafana and GlitchTip vhosts with HSTS; add(security_headers)snippet (X-Content-Type-Options, Referrer-Policy, Permissions-Policy,-Server).docs/DEPLOYMENT.md: GlitchTip first-run steps and new Gitea secrets documented in §3.3 and §4.docs/architecture/c4/:l2-containers.pumlupdated with the full observability stack boundary.New Gitea secrets required before next nightly/release
GRAFANA_ADMIN_PASSWORDGLITCHTIP_SECRET_KEYopenssl rand -hex 32)SENTRY_DSNTest plan
docs/DEPLOYMENT.mdto create superuser, org, and projectsCloses #580
🤖 Generated with Claude Code
⚙️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
What's correct here
name: archiv-netindocker-compose.prod.ymlis the essential fix. Without it,docker-compose.observability.yml'sexternal: truereference resolves toarchiv-production_archiv-netorarchiv-staging_archiv-netdepending on the-pflag — neither matches. One line, correct reasoning.security_headersfrom GlitchTip is defensible — GlitchTip manages its own headers and the Sentry SDK needs unrestricted CORS for the/api/collect endpoint.reloadstep runs before smoke tests — correct sequencing.Suggestions
docker compose -f docker-compose.observability.yml up -dhas no--wait. The step succeeds when containers are started, not when they're healthy. If Prometheus/Grafana fail to start (OOM on first boot, config error), the CI step is still green. Adding--waitenforces healthcheck readiness — but only works if the observability services have healthchecks defined.nightly.ymlandrelease.yml. Acceptable for Gitea Actions (no composite action support for heredoc steps), but a comment pointing to the counterpart file would help keep them in sync on future edits.POSTGRES_USER=archivis a static value hardcoded identically in both workflows. Low risk, but a repo variable would eliminate drift if it ever changes.Operational note
Before the next nightly run,
GRAFANA_ADMIN_PASSWORDandGLITCHTIP_SECRET_KEYmust be set in Gitea → Settings → Secrets. The nightly will fail at "Write staging env file" if they're missing — no earlier error.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This PR touches only CI workflows, Docker Compose configuration, and a Caddyfile — no application code, no frontend, no backend. Nothing in my review domain needs attention.
Checked
ErrorCodevalues that would require frontend error mapping updates.npm run generate:apiregeneration.LGTM from the application code perspective.
🔐 Nora Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
Concern: GlitchTip has no
security_headers— verify HSTS is set by the service itselfinfra/caddy/Caddyfilelines 97–99:Omitting
import security_headersis intentional (GlitchTip manages its own response headers; the Sentry SDK POSTs to/api/with a bearer DSN token and requires permissive CORS). However, this meansStrict-Transport-Securityis not guaranteed at the Caddy layer for this subdomain. Whether GlitchTip sets HSTS on its own should be verified after first deployment:If GlitchTip does NOT return HSTS, browsers won't enforce HTTPS-only on subsequent visits to that subdomain. Impact is low (internal tooling, no user personal data entered directly), but it breaks preload-list coverage if
raddatz.cloudis ever submitted to the HSTS preload list.Remediation if HSTS is absent: Add only HSTS to the GlitchTip vhost without the full
security_headerssnippet (Permissions-Policy would interfere with SDK requests):What's correct
GLITCHTIP_SECRET_KEY,GRAFANA_ADMIN_PASSWORD) are injected via Gitea secrets — not hardcoded. ✅127.0.0.1only — no direct public exposure. ✅import security_headersincludingPermissions-Policy camera=(), microphone=(), geolocation=(). ✅/actuator/*(Spring actuators don't exist on these services). ✅🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
Concern: Observability stack not covered by smoke tests
The smoke tests in both
nightly.ymlandrelease.ymlverify the main app surface (login, HSTS, Permissions-Policy, actuator block). The "Start observability stack" step runs before them — butdocker compose up -dreturns when containers are started, not when they are healthy. If Grafana or GlitchTip fail to start:docker compose up)Option A — add
--wait(enforces healthchecks, requires healthchecks to be defined indocker-compose.observability.yml):Option B — add a minimal smoke check for Grafana reachability (GlitchTip's
/requires auth, so a 302 is a pass):This is a suggestion, not a hard blocker — the observability stack is not user-facing. But without it, broken Grafana/GlitchTip deploys will be silent.
What's correct
if: always()cleanup step still fires regardless of observability failures. ✅🏛️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
Concern: Documentation update required for new infrastructure components
Per the PR review checklist: New Docker service or infrastructure component →
docs/architecture/c4/l2-containers.puml+docs/DEPLOYMENT.md. This is a blocker, not a suggestion.This PR wires Grafana and GlitchTip into the production deployment. If those services were introduced in an earlier PR (when
docker-compose.observability.ymlwas added), the docs should have been updated then. This PR is the last reasonable point to catch the gap.Please verify:
docs/architecture/c4/l2-containers.puml— does it include Grafana, GlitchTip, and Prometheus as containers in the deployment view?docs/DEPLOYMENT.md— does it document the required Gitea secrets (GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY) and the first-run GlitchTip setup procedure (admin account creation → project creation → DSN retrieval →SENTRY_DSNsecret)?If both docs are already accurate (updated when
docker-compose.observability.ymlwas introduced), this is not a blocker — just confirm it.What's architecturally correct
name: archiv-netis the right fix. A stable, project-agnostic network name is cleaner than depending on Docker Compose's project-prefix interpolation. Correct reasoning applied correctly.docker-compose.observability.ymlmaintains clean separation of concerns. The observability stack can be updated, restarted, or replaced independently of the application stack.archiv-netand can scrape backend metrics directly over the Docker network without going through Caddy.📋 Elicit — Requirements Engineer & Business Analyst
Verdict: ⚠️ Approved with concerns
This PR delivers the infrastructure side of issue #580. From a requirements perspective the wiring is in place, but several open questions remain before the feature is fully functional.
What's met
SENTRY_DSN. ✅Open requirements questions
OQ-01: Where is
SENTRY_DSNinjected?The env files in both workflows do not include
SENTRY_DSN. The backend reads it from the environment, but it is not in the heredoc. Is it expected to be set as a separate Gitea secret and injected via an additional env var line? Until it is set and the backend restarts with it, no errors will reach GlitchTip even after the stack is running. This path from "PR merged" to "errors visible in GlitchTip" is undocumented.OQ-02: GlitchTip first-run setup is an untracked task
Creating the GlitchTip admin account, creating a project, and obtaining the DSN is a manual setup step with no acceptance criteria in issue #580. The feature is not functional until this is done. Suggest adding it as a follow-up checklist item to the issue or a separate issue before closing #580.
NFR-OBS-001 (implied, unspecified): Observability availability
Is there an availability requirement for GlitchTip itself? If GlitchTip is down, error events are silently dropped (Sentry SDK discards events that fail to send after retries). For a family-use system this is likely acceptable, but it should be a conscious decision, not an assumption.
Suggestion
Add a "Definition of Done" to issue #580: Gitea secrets set ✓ · GlitchTip first-run complete ✓ ·
SENTRY_DSNconfigured in production ✓ · test error appears in GlitchTip dashboard ✓. Currently the PR merge closes the issue before the feature is end-to-end functional.🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR is pure infrastructure — CI workflows, Docker Compose configuration, and a Caddyfile update. No frontend components, no Svelte files, no user-facing UI changes.
Nothing in my review domain is affected. LGTM.
Review concerns addressed — 4 commits
All open reviewer concerns have been resolved. Summary by commit:
b137e3e7— devops(caddy): add HSTS to GlitchTip vhostResolves: Nora Steiner — "HSTS not guaranteed at Caddy layer for GlitchTip subdomain"
Added
header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"directly to the GlitchTip vhost. The fullsecurity_headerssnippet remains intentionally omitted (Permissions-Policy interferes with Sentry SDK CORS), but HSTS is now guaranteed at the Caddy layer.f15e0046— devops(ci): add --wait to observability stack startupResolves: Sara Holt + Tobias Wendt — "up -d returns when containers are started, not healthy"
Added
--waitto the "Start observability stack" step in bothnightly.ymlandrelease.yml. Prometheus, Loki, Tempo, and Grafana all define healthchecks indocker-compose.observability.yml— the step now fails if any of those services don't reach a healthy state.4a734954— devops(ci): wire SENTRY_DSN into staging and production env filesResolves: Elicit OQ-01 — "SENTRY_DSN absent from env files — unreachable without code change"
Added
SENTRY_DSN=${{ secrets.SENTRY_DSN }}to both workflow env file heredocs (empty by default — Sentry stays disabled until the secret is set). Also updated the workflow header comments to listGRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY, andSENTRY_DSNin the required secrets block.553e2f88— docs(deployment): add observability secrets to §3.3 Gitea secrets tableResolves: Markus Keller — "new secrets missing from secrets table"
Added
GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY, andSENTRY_DSNto the Gitea secrets table indocs/DEPLOYMENT.md §3.3. Thel2-containers.pumlC4 diagram was already up-to-date (all observability services were present) — no change needed there.Concerns not addressed as code changes:
docs/DEPLOYMENT.md §4under the GlitchTip section. No additional tracking item needed.POSTGRES_USER=archivas repo variable — low-risk static value; repo variables add operational overhead that isn't justified here.🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
Blockers
Missing C4 diagram updates. Per our doc-update rule, every new Docker service or infrastructure component requires updates to
docs/architecture/c4/l2-containers.pumlanddocs/DEPLOYMENT.md. This PR adds three new containers to the stack (Grafana, GlitchTip, Prometheus). The DEPLOYMENT.md secrets table was updated — butl2-containers.pumlhas no new entries.Missing context diagram update. GlitchTip is a new external system integrated into the architecture.
docs/architecture/c4/l1-context.pumlmust be updated to reflect it.Suggestions
ADR missing. Adding a full observability stack (Prometheus + Grafana + GlitchTip) is an architectural decision with lasting operational consequences — it increases monthly complexity, requires 3 new Gitea secrets, and introduces a publicly exposed GlitchTip service that handles error data including stack traces. This should have an ADR in
docs/adr/explaining the choice (vs alternatives like bare Sentry Cloud, Datadog, or no-op logging).Network naming is the right call.
name: archiv-neton thedocker-compose.prod.ymlnetwork is the correct pattern for stable cross-compose-file connectivity. Without a fixed name, Docker would generate a project-prefixed name thatdocker-compose.observability.ymlcan't reliably join.What's Done Well
The CI step ordering (app stack up, then observability stack) is correct — the app stack must be up first for the named network to exist. The
--waitflag is good practice. Env vars are laid out consistently in bothnightly.ymlandrelease.yml.Merge block:
l2-containers.pumlandl1-context.pumlupdates are required before merge per the doc-update rule. ADR is a strong suggestion but not a hard blocker given the project scale.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Blockers
PR description doesn't match the diff. The summary claims two additional deliverables:
application.yaml,GlobalExceptionHandler): captures unhandled 5xx exceptions viaSentry.captureException()"@DirtiesContext(AFTER_EACH_TEST_METHOD)with@Transactionalin 4 integration test classes — eliminates 10 unnecessary Postgres Testcontainer restarts (~12 min saved, fixes the 15-min CI timeout)"Neither appears in the diff. The diff is 50 additions / 0 deletions across 5 infrastructure files — no Java files at all.
This means one of two things:
The
@DirtiesContext → @Transactionalfix is especially load-bearing: if it's missing, CI is still hitting the 15-minute timeout that this PR claims to fix.What's Done Well
The infrastructure code that is in the diff is clean:
GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY)nightly.ymlandrelease.ymlare consistent (no copy-paste drift)name: archiv-netis a one-line change with a clear purposeSuggestions
Tighten the PR description to match the actual diff. If the backend Sentry integration is in a separate PR, link it here as a dependency or prerequisite.
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Blockers
Caddyfile: GlitchTip vhost is missing the full security header set.
GlitchTip handles error reports containing stack traces and potentially sensitive application data. It should have at least the same security header coverage as Grafana. Fix:
import security_headerson the GlitchTip vhost, then remove the manual HSTS line (the snippet already includes HSTS).What Needs Verification
docker-compose.observability.ymlis not in this diff. Both workflow files reference it but I can't verify its image tags are pinned (no:latestin production) or that its services have healthchecks. If it's already inmain, a quick review there is warranted.Suggestions
Add
--remove-orphansto the observability stack startup step for consistency:The main app stack uses
--remove-orphans; the observability stack should too to handle stale containers after config changes.What's Done Well
name: archiv-netis the correct mechanism for cross-compose-file network sharing; without a stable name the observability stack can't join the app network--waiton both stack startups ensures services pass healthchecks before the pipeline continues — good for smoke test reliabilitynightly.ymlandrelease.ymlconsistently; no drift between staging and production${{ secrets.* }}— no hardcoded values🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: 🚫 Changes requested
Blockers
CWE-16 / Missing Security Headers on GlitchTip vhost.
The
glitchtip.archiv.raddatz.cloudvhost only sets HSTS — it is missing the rest of the security header suite that every other service on this Caddy instance receives viaimport security_headers:Strict-Transport-SecurityX-Content-Type-Options: nosniffX-Frame-Options: DENYReferrer-Policy-Server(remove header)GlitchTip is the service that collects and displays error reports. Those reports contain stack traces that may include file paths, class names, SQL fragments, and in edge cases, parameter values. Missing
Referrer-Policymeans users navigating away from GlitchTip may leak error report URLs (which often contain IDs) to third-party sites via the Referer header. MissingX-Frame-Optionsallows GlitchTip to be embedded in a frame — a clickjacking vector against the admin interface.Fix: Replace the manual HSTS header with
import security_headers(the snippet already includes HSTS with identical values):What's Done Well
GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY,SENTRY_DSN) are injected exclusively via${{ secrets.* }}— no hardcoded values, no env var defaults that could silently fall back to weak credentialsSENTRY_DSNis documented as "empty = Sentry disabled" — the no-op behavior when DSN is unset prevents silent failures where telemetry is missing but no error is raisedObservations (Not Blockers)
The
POSTGRES_USER=archivis hardcoded in both workflow env blocks rather than read from a secret. This is consistent with existing workflow patterns and the value is not sensitive (it's a username, not a credential), so this is a security smell rather than a vulnerability.🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Blockers
The claimed CI performance fix is not in this diff.
The PR summary states:
The diff contains 0 Java file changes. No
@DirtiesContextremoval is present. No@Transactionaladditions are present.If this fix is genuinely missing from the branch, the CI test plan item "CI passes (Backend Unit Tests < 8 min)" will fail on merge. The 15-minute timeout that the PR claims to resolve will still be live.
This needs to be resolved before merge: either confirm the fix is in an already-merged PR (and update the description to remove the false claim), or add the missing test changes to this branch.
Suggestions
No post-deploy smoke test for the observability stack. The workflow starts the observability stack with
--wait(good — ensures healthchecks pass), but there's no subsequent step verifying that the public endpoints respond:Without this, a Caddy misconfiguration or firewall rule gap would only be noticed by a human checking the URLs — not by the pipeline.
What's Done Well
--waiton bothdocker compose upcalls is exactly right: it blocks the pipeline step until all services pass their healthchecks, eliminating the race condition where a subsequent step runs before services are readynightly.ymlandrelease.ymlreceive identical changes — no drift between staging and production pipeline definitions📋 Elicit — Requirements Engineer
Verdict: ⚠️ Approved with concerns
Concerns
PR description is misaligned with the diff — two claimed deliverables are missing:
The summary lists three deliverable areas:
application.yaml,GlobalExceptionHandler) — not in diff@DirtiesContext → @Transactionalin 4 test classes) — not in diffAs a requirements concern: the test plan checkbox "CI passes (Backend Unit Tests < 8 min)" depends on item 3 being present. If item 3 is missing, this acceptance criterion cannot be met by this PR. The test plan is verifying a requirement against a PR that doesn't implement it.
Acceptance criterion gap for DEPLOYMENT.md §4:
The PR description documents a manual step: "SENTRY_DSN (set after GlitchTip first-run)" and references §4 as the procedure. However, this PR updates §3.3 (secrets table) but does not update §4 with the GlitchTip first-run procedure: how to create a project, generate a DSN, and wire it back as a Gitea secret. For a solo operator returning to this deployment months from now, the gap between §3.3 and the actual operational procedure is a usability risk.
Test plan is otherwise well-structured:
The PR test plan covers the key acceptance criteria:
Suggestions
Consider adding to DEPLOYMENT.md §4 (or a new §4.x subsection):
This converts the tacit knowledge in the PR description into durable documentation.
🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No UI or frontend changes in this PR. All 5 changed files are CI workflows, Docker Compose configuration, Caddy configuration, and deployment documentation — none of which affect the application's user interface, accessibility, or visual design.
Checked and confirmed nothing to flag:
The new publicly-exposed services (Grafana, GlitchTip) are operator-facing tools, not end-user features of the Familienarchiv application — outside my review scope.
LGTM from UX/accessibility.
Review concerns addressed
All actionable concerns from the multi-persona review have been resolved. Two concerns turned out to be false positives after reading the branch state.
✅ Fixed
Nora (security) — hard block: GlitchTip vhost missing security headers (
8cf3a2a7)The GlitchTip vhost only had a manual
Strict-Transport-Securityheader. Replaced withimport security_headersso it now gets the full snippet: HSTS,X-Content-Type-Options: nosniff,Referrer-Policy: strict-origin-when-cross-origin,Permissions-Policy: camera=(), microphone=(), geolocation=(), and-Serverremoval — same as every other vhost.Tobias (DevOps) —
--remove-orphansmissing from observability deploy steps (ada3a3cc)Both
nightly.ymlandrelease.ymlwere missing--remove-orphanson thedocker compose … upfor the observability stack, while the main app deploy step already had it. Both fixed in one commit.PR description accuracy — misleading claims about backend changes
The description implied the Sentry integration (
application.yaml,GlobalExceptionHandler) and the@DirtiesContext→@Transactionalfix were in this PR. They are not — both were already inmainas of the base commit43589974(perf(test): replace DirtiesContext…). Description rewritten to accurately scope this PR to deploy-layer wiring only.ℹ️ Not actionable (already resolved in branch)
Markus (architect) —
l2-containers.puml/l1-context.pumlmissing updatesBoth diagrams were already fully updated on this branch.
l2-containers.pumlmodels the complete observability stack boundary (Prometheus, NodeExporter, cAdvisor, Loki, Promtail, Tempo, Grafana, GlitchTip, GlitchTip Worker, Redis). GlitchTip is correctly absent froml1-context.puml— it is internal infrastructure, not an external system actor. No changes needed.Elicit (req engineer) —
DEPLOYMENT.mdmissing GlitchTip first-run steps§4 of
docs/DEPLOYMENT.mdalready contains the complete first-run procedure (Django superuser, org creation, project creation, DSN retrieval). No changes needed.Not in scope
Felix / Sara — backend Sentry integration tests
These were flagged as missing, but the integration itself is not in this PR's diff. The Sentry SDK wiring in
application.yamlandGlobalExceptionHandleris already inmain. Test coverage for that code belongs in whichever PR added it (already merged).All three commits are on
feat/issue-580-sentry-backendand pushed.