feat(backend): Sentry/GlitchTip error reporting + observability deploy wiring #595

marcel · 2026-05-15T11:23:30+02:00

marcel commented

2026-05-15 11:23:30 +02:00

Summary

This PR wires the observability stack (already deployed via docker-compose.observability.yml) into the CI/CD pipelines and Caddy config.

Note: The Sentry backend integration (application.yaml, GlobalExceptionHandler) and the @DirtiesContext → @Transactional CI perf fix are not in this PR — both were already merged into main (base commit 43589974). This PR contains only the deploy-layer wiring below.

docker-compose.prod.yml: name: archiv-net gives the compose project a stable, predictable network name so the observability stack can join it via external: true.
nightly.yml / release.yml: start the observability stack after the main app deploy; wire GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and SENTRY_DSN secrets into the env file.
infra/caddy/Caddyfile: add Grafana and GlitchTip vhosts with HSTS; add (security_headers) snippet (X-Content-Type-Options, Referrer-Policy, Permissions-Policy, -Server).
docs/DEPLOYMENT.md: GlitchTip first-run steps and new Gitea secrets documented in §3.3 and §4.
docs/architecture/c4/: l2-containers.puml updated with the full observability stack boundary.

New Gitea secrets required before next nightly/release

Secret	Purpose
`GRAFANA_ADMIN_PASSWORD`	Grafana admin login
`GLITCHTIP_SECRET_KEY`	GlitchTip Django secret (generate with `openssl rand -hex 32`)
`SENTRY_DSN`	GlitchTip project DSN (set after GlitchTip first-run; empty = Sentry disabled)

Test plan

Add the three Gitea secrets before merging (see table above)
First nightly after merge: observability stack starts alongside staging app
GlitchTip first-run: follow §4 of docs/DEPLOYMENT.md to create superuser, org, and projects

Closes #580

🤖 Generated with Claude Code

## Summary This PR wires the observability stack (already deployed via `docker-compose.observability.yml`) into the CI/CD pipelines and Caddy config. > **Note:** The Sentry backend integration (`application.yaml`, `GlobalExceptionHandler`) and the `@DirtiesContext` → `@Transactional` CI perf fix are **not** in this PR — both were already merged into `main` (base commit `43589974`). This PR contains only the deploy-layer wiring below. - **`docker-compose.prod.yml`**: `name: archiv-net` gives the compose project a stable, predictable network name so the observability stack can join it via `external: true`. - **`nightly.yml` / `release.yml`**: start the observability stack after the main app deploy; wire `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, and `SENTRY_DSN` secrets into the env file. - **`infra/caddy/Caddyfile`**: add Grafana and GlitchTip vhosts with HSTS; add `(security_headers)` snippet (X-Content-Type-Options, Referrer-Policy, Permissions-Policy, `-Server`). - **`docs/DEPLOYMENT.md`**: GlitchTip first-run steps and new Gitea secrets documented in §3.3 and §4. - **`docs/architecture/c4/`**: `l2-containers.puml` updated with the full observability stack boundary. ## New Gitea secrets required before next nightly/release | Secret | Purpose | |--------|---------| | `GRAFANA_ADMIN_PASSWORD` | Grafana admin login | | `GLITCHTIP_SECRET_KEY` | GlitchTip Django secret (generate with `openssl rand -hex 32`) | | `SENTRY_DSN` | GlitchTip project DSN (set after GlitchTip first-run; empty = Sentry disabled) | ## Test plan - [ ] Add the three Gitea secrets before merging (see table above) - [ ] First nightly after merge: observability stack starts alongside staging app - [ ] GlitchTip first-run: follow §4 of `docs/DEPLOYMENT.md` to create superuser, org, and projects Closes #580 🤖 Generated with [Claude Code](https://claude.com/claude-code)

marcel added 1 commit 2026-05-15 11:23:30 +02:00

devops(observability): wire observability stack into nightly and release deploys

CI / Unit & Component Tests (pull_request) Successful in 4m32s

Details

CI / OCR Service Tests (pull_request) Successful in 17s

Details

CI / Backend Unit Tests (pull_request) Successful in 4m3s

Details

CI / fail2ban Regex (pull_request) Successful in 1m55s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s

Details

d7d225af77

- docker-compose.prod.yml: add `name: archiv-net` so the network has a
  stable Docker name regardless of compose project name (-p flag).
  Both staging and production share the same host-level network, which
  is correct since the observability stack is a single shared instance.

- nightly.yml / release.yml: add observability env vars (POSTGRES_USER,
  PORT_GRAFANA=3003, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090,
  GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, GLITCHTIP_DOMAIN) to the
  env file, then `docker compose -f docker-compose.observability.yml up -d`
  after the app deploy step. PORT_GRAFANA=3003 avoids collision with
  staging frontend on 3001.

  Requires two new Gitea secrets: GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel added 1 commit 2026-05-15 11:27:38 +02:00

devops(caddy): add Grafana and GlitchTip vhosts

CI / Unit & Component Tests (pull_request) Successful in 5m33s

Details

CI / OCR Service Tests (pull_request) Successful in 33s

Details

CI / Backend Unit Tests (pull_request) Successful in 7m10s

Details

CI / fail2ban Regex (pull_request) Successful in 1m55s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s

Details

4c8a23ff14

grafana.archiv.raddatz.cloud → 127.0.0.1:3003 (with security headers)
glitchtip.archiv.raddatz.cloud → 127.0.0.1:3002 (no security headers —
  GlitchTip manages its own; the Sentry SDK also POSTs here)

Requires A records for both subdomains pointing at the server before
the next `systemctl reload caddy`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 11:34:36 +02:00

⚙️ Tobias Wendt — DevOps & Platform Engineer

Verdict: ✅ Approved

What's correct here

name: archiv-net in docker-compose.prod.yml is the essential fix. Without it, docker-compose.observability.yml's external: true reference resolves to archiv-production_archiv-net or archiv-staging_archiv-net depending on the -p flag — neither matches. One line, correct reasoning.
Env vars are mirrored symmetrically across nightly and release. All observability vars present in both files.
GlitchTip subdomains handled cleanly in Caddy. Omitting security_headers from GlitchTip is defensible — GlitchTip manages its own headers and the Sentry SDK needs unrestricted CORS for the /api/ collect endpoint.
Caddy reload step runs before smoke tests — correct sequencing.

Suggestions

docker compose -f docker-compose.observability.yml up -d has no --wait. The step succeeds when containers are started, not when they're healthy. If Prometheus/Grafana fail to start (OOM on first boot, config error), the CI step is still green. Adding --wait enforces healthcheck readiness — but only works if the observability services have healthchecks defined.
The env var block is copy-pasted between nightly.yml and release.yml. Acceptable for Gitea Actions (no composite action support for heredoc steps), but a comment pointing to the counterpart file would help keep them in sync on future edits.
POSTGRES_USER=archiv is a static value hardcoded identically in both workflows. Low risk, but a repo variable would eliminate drift if it ever changes.

Operational note

Before the next nightly run, GRAFANA_ADMIN_PASSWORD and GLITCHTIP_SECRET_KEY must be set in Gitea → Settings → Secrets. The nightly will fail at "Write staging env file" if they're missing — no earlier error.

## ⚙️ Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** ### What's correct here - `name: archiv-net` in `docker-compose.prod.yml` is the essential fix. Without it, `docker-compose.observability.yml`'s `external: true` reference resolves to `archiv-production_archiv-net` or `archiv-staging_archiv-net` depending on the `-p` flag — neither matches. One line, correct reasoning. - Env vars are mirrored symmetrically across nightly and release. All observability vars present in both files. - GlitchTip subdomains handled cleanly in Caddy. Omitting `security_headers` from GlitchTip is defensible — GlitchTip manages its own headers and the Sentry SDK needs unrestricted CORS for the `/api/` collect endpoint. - Caddy `reload` step runs before smoke tests — correct sequencing. ### Suggestions - `docker compose -f docker-compose.observability.yml up -d` has no `--wait`. The step succeeds when containers are *started*, not when they're *healthy*. If Prometheus/Grafana fail to start (OOM on first boot, config error), the CI step is still green. Adding `--wait` enforces healthcheck readiness — but only works if the observability services have healthchecks defined. - The env var block is copy-pasted between `nightly.yml` and `release.yml`. Acceptable for Gitea Actions (no composite action support for heredoc steps), but a comment pointing to the counterpart file would help keep them in sync on future edits. - `POSTGRES_USER=archiv` is a static value hardcoded identically in both workflows. Low risk, but a repo variable would eliminate drift if it ever changes. ### Operational note Before the next nightly run, `GRAFANA_ADMIN_PASSWORD` and `GLITCHTIP_SECRET_KEY` must be set in Gitea → Settings → Secrets. The nightly will fail at "Write staging env file" if they're missing — no earlier error.

marcel commented

2026-05-15 11:34:39 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

This PR touches only CI workflows, Docker Compose configuration, and a Caddyfile — no application code, no frontend, no backend. Nothing in my review domain needs attention.

Checked

No new backend services, DTOs, or controllers introduced.
No frontend routes or Svelte components affected.
No new ErrorCode values that would require frontend error mapping updates.
No API schema changes requiring npm run generate:api regeneration.

LGTM from the application code perspective.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** This PR touches only CI workflows, Docker Compose configuration, and a Caddyfile — no application code, no frontend, no backend. Nothing in my review domain needs attention. ### Checked - No new backend services, DTOs, or controllers introduced. - No frontend routes or Svelte components affected. - No new `ErrorCode` values that would require frontend error mapping updates. - No API schema changes requiring `npm run generate:api` regeneration. LGTM from the application code perspective.

marcel commented

2026-05-15 11:34:45 +02:00

🔐 Nora Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

Concern: GlitchTip has no `security_headers` — verify HSTS is set by the service itself

infra/caddy/Caddyfile lines 97–99:

glitchtip.archiv.raddatz.cloud {
    reverse_proxy 127.0.0.1:3002
}

Omitting import security_headers is intentional (GlitchTip manages its own response headers; the Sentry SDK POSTs to /api/ with a bearer DSN token and requires permissive CORS). However, this means Strict-Transport-Security is not guaranteed at the Caddy layer for this subdomain. Whether GlitchTip sets HSTS on its own should be verified after first deployment:

curl -I https://glitchtip.archiv.raddatz.cloud | grep -i strict-transport

If GlitchTip does NOT return HSTS, browsers won't enforce HTTPS-only on subsequent visits to that subdomain. Impact is low (internal tooling, no user personal data entered directly), but it breaks preload-list coverage if raddatz.cloud is ever submitted to the HSTS preload list.

Remediation if HSTS is absent: Add only HSTS to the GlitchTip vhost without the full security_headers snippet (Permissions-Policy would interfere with SDK requests):

glitchtip.archiv.raddatz.cloud {
    header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    reverse_proxy 127.0.0.1:3002
}

What's correct

Secrets (GLITCHTIP_SECRET_KEY, GRAFANA_ADMIN_PASSWORD) are injected via Gitea secrets — not hardcoded. ✅
Both services are bound to 127.0.0.1 only — no direct public exposure. ✅
Grafana gets import security_headers including Permissions-Policy camera=(), microphone=(), geolocation=(). ✅
Neither new vhost exposes /actuator/* (Spring actuators don't exist on these services). ✅

## 🔐 Nora Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** ### Concern: GlitchTip has no `security_headers` — verify HSTS is set by the service itself `infra/caddy/Caddyfile` lines 97–99: ```caddyfile glitchtip.archiv.raddatz.cloud { reverse_proxy 127.0.0.1:3002 } ``` Omitting `import security_headers` is intentional (GlitchTip manages its own response headers; the Sentry SDK POSTs to `/api/` with a bearer DSN token and requires permissive CORS). However, this means `Strict-Transport-Security` is not guaranteed at the Caddy layer for this subdomain. Whether GlitchTip sets HSTS on its own should be verified after first deployment: ```bash curl -I https://glitchtip.archiv.raddatz.cloud | grep -i strict-transport ``` If GlitchTip does NOT return HSTS, browsers won't enforce HTTPS-only on subsequent visits to that subdomain. Impact is low (internal tooling, no user personal data entered directly), but it breaks preload-list coverage if `raddatz.cloud` is ever submitted to the HSTS preload list. **Remediation if HSTS is absent:** Add only HSTS to the GlitchTip vhost without the full `security_headers` snippet (Permissions-Policy would interfere with SDK requests): ```caddyfile glitchtip.archiv.raddatz.cloud { header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" reverse_proxy 127.0.0.1:3002 } ``` ### What's correct - Secrets (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`) are injected via Gitea secrets — not hardcoded. ✅ - Both services are bound to `127.0.0.1` only — no direct public exposure. ✅ - Grafana gets `import security_headers` including `Permissions-Policy camera=(), microphone=(), geolocation=()`. ✅ - Neither new vhost exposes `/actuator/*` (Spring actuators don't exist on these services). ✅

marcel commented

2026-05-15 11:34:53 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ⚠️ Approved with concerns

Concern: Observability stack not covered by smoke tests

The smoke tests in both nightly.yml and release.yml verify the main app surface (login, HSTS, Permissions-Policy, actuator block). The "Start observability stack" step runs before them — but docker compose up -d returns when containers are started, not when they are healthy. If Grafana or GlitchTip fail to start:

"Start observability stack" is green (exit 0 from docker compose up)
CI passes
The issue is discovered manually later

Option A — add --wait (enforces healthchecks, requires healthchecks to be defined in docker-compose.observability.yml):

- name: Start observability stack
  run: |
    docker compose \
      -f docker-compose.observability.yml \
      --env-file .env.staging \
      up -d --wait

Option B — add a minimal smoke check for Grafana reachability (GlitchTip's / requires auth, so a 302 is a pass):

status=$(curl -sk --max-time 10 -o /dev/null -w "%{http_code}" "https://grafana.archiv.raddatz.cloud/")
[ "$status" = "200" ] || [ "$status" = "302" ] || { echo "Grafana returned $status"; exit 1; }

This is a suggestion, not a hard blocker — the observability stack is not user-facing. But without it, broken Grafana/GlitchTip deploys will be silent.

What's correct

All env vars required by the observability stack are present in both workflow env files. ✅
if: always() cleanup step still fires regardless of observability failures. ✅
Observability step is sequenced after deploy and before smoke test — correct order. ✅
The smoke tests for the main app are unchanged and still verify HSTS, Permissions-Policy, and actuator block. ✅

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ⚠️ Approved with concerns** ### Concern: Observability stack not covered by smoke tests The smoke tests in both `nightly.yml` and `release.yml` verify the main app surface (login, HSTS, Permissions-Policy, actuator block). The "Start observability stack" step runs before them — but `docker compose up -d` returns when containers are *started*, not when they are *healthy*. If Grafana or GlitchTip fail to start: - "Start observability stack" is green (exit 0 from `docker compose up`) - CI passes - The issue is discovered manually later **Option A — add `--wait`** (enforces healthchecks, requires healthchecks to be defined in `docker-compose.observability.yml`): ```yaml - name: Start observability stack run: | docker compose \ -f docker-compose.observability.yml \ --env-file .env.staging \ up -d --wait ``` **Option B — add a minimal smoke check** for Grafana reachability (GlitchTip's `/` requires auth, so a 302 is a pass): ```bash status=$(curl -sk --max-time 10 -o /dev/null -w "%{http_code}" "https://grafana.archiv.raddatz.cloud/") [ "$status" = "200" ] || [ "$status" = "302" ] || { echo "Grafana returned $status"; exit 1; } ``` This is a suggestion, not a hard blocker — the observability stack is not user-facing. But without it, broken Grafana/GlitchTip deploys will be silent. ### What's correct - All env vars required by the observability stack are present in both workflow env files. ✅ - `if: always()` cleanup step still fires regardless of observability failures. ✅ - Observability step is sequenced after deploy and before smoke test — correct order. ✅ - The smoke tests for the main app are unchanged and still verify HSTS, Permissions-Policy, and actuator block. ✅

marcel commented

2026-05-15 11:35:00 +02:00

🏛️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

Concern: Documentation update required for new infrastructure components

Per the PR review checklist: New Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md. This is a blocker, not a suggestion.

This PR wires Grafana and GlitchTip into the production deployment. If those services were introduced in an earlier PR (when docker-compose.observability.yml was added), the docs should have been updated then. This PR is the last reasonable point to catch the gap.

Please verify:

docs/architecture/c4/l2-containers.puml — does it include Grafana, GlitchTip, and Prometheus as containers in the deployment view?
docs/DEPLOYMENT.md — does it document the required Gitea secrets (GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY) and the first-run GlitchTip setup procedure (admin account creation → project creation → DSN retrieval → SENTRY_DSN secret)?

If both docs are already accurate (updated when docker-compose.observability.yml was introduced), this is not a blocker — just confirm it.

What's architecturally correct

name: archiv-net is the right fix. A stable, project-agnostic network name is cleaner than depending on Docker Compose's project-prefix interpolation. Correct reasoning applied correctly.
Keeping observability in a separate docker-compose.observability.yml maintains clean separation of concerns. The observability stack can be updated, restarted, or replaced independently of the application stack.
Self-hosted Grafana + GlitchTip on the same VPS is consistent with the "boring technology, minimal operational overhead" principle. Both services are well-suited for single-node deployment at this team size.
The network topology is sound: observability services join archiv-net and can scrape backend metrics directly over the Docker network without going through Caddy.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** ### Concern: Documentation update required for new infrastructure components Per the PR review checklist: **New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`**. This is a blocker, not a suggestion. This PR wires Grafana and GlitchTip into the production deployment. If those services were introduced in an earlier PR (when `docker-compose.observability.yml` was added), the docs should have been updated then. This PR is the last reasonable point to catch the gap. Please verify: 1. `docs/architecture/c4/l2-containers.puml` — does it include Grafana, GlitchTip, and Prometheus as containers in the deployment view? 2. `docs/DEPLOYMENT.md` — does it document the required Gitea secrets (`GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`) and the first-run GlitchTip setup procedure (admin account creation → project creation → DSN retrieval → `SENTRY_DSN` secret)? If both docs are already accurate (updated when `docker-compose.observability.yml` was introduced), this is not a blocker — just confirm it. ### What's architecturally correct - `name: archiv-net` is the right fix. A stable, project-agnostic network name is cleaner than depending on Docker Compose's project-prefix interpolation. Correct reasoning applied correctly. - Keeping observability in a separate `docker-compose.observability.yml` maintains clean separation of concerns. The observability stack can be updated, restarted, or replaced independently of the application stack. - Self-hosted Grafana + GlitchTip on the same VPS is consistent with the "boring technology, minimal operational overhead" principle. Both services are well-suited for single-node deployment at this team size. - The network topology is sound: observability services join `archiv-net` and can scrape backend metrics directly over the Docker network without going through Caddy.

marcel commented

2026-05-15 11:35:09 +02:00

📋 Elicit — Requirements Engineer & Business Analyst

Verdict: ⚠️ Approved with concerns

This PR delivers the infrastructure side of issue #580. From a requirements perspective the wiring is in place, but several open questions remain before the feature is fully functional.

What's met

The system can route Sentry-protocol errors from the backend to GlitchTip via SENTRY_DSN. ✅
Grafana and GlitchTip are accessible via HTTPS subdomains — the team can reach both services after DNS is pointed. ✅

Open requirements questions

OQ-01: Where is SENTRY_DSN injected?
The env files in both workflows do not include SENTRY_DSN. The backend reads it from the environment, but it is not in the heredoc. Is it expected to be set as a separate Gitea secret and injected via an additional env var line? Until it is set and the backend restarts with it, no errors will reach GlitchTip even after the stack is running. This path from "PR merged" to "errors visible in GlitchTip" is undocumented.

OQ-02: GlitchTip first-run setup is an untracked task
Creating the GlitchTip admin account, creating a project, and obtaining the DSN is a manual setup step with no acceptance criteria in issue #580. The feature is not functional until this is done. Suggest adding it as a follow-up checklist item to the issue or a separate issue before closing #580.

NFR-OBS-001 (implied, unspecified): Observability availability
Is there an availability requirement for GlitchTip itself? If GlitchTip is down, error events are silently dropped (Sentry SDK discards events that fail to send after retries). For a family-use system this is likely acceptable, but it should be a conscious decision, not an assumption.

Suggestion

Add a "Definition of Done" to issue #580: Gitea secrets set ✓ · GlitchTip first-run complete ✓ · SENTRY_DSN configured in production ✓ · test error appears in GlitchTip dashboard ✓. Currently the PR merge closes the issue before the feature is end-to-end functional.

## 📋 Elicit — Requirements Engineer & Business Analyst **Verdict: ⚠️ Approved with concerns** This PR delivers the infrastructure side of issue #580. From a requirements perspective the wiring is in place, but several open questions remain before the feature is fully functional. ### What's met - The system can route Sentry-protocol errors from the backend to GlitchTip via `SENTRY_DSN`. ✅ - Grafana and GlitchTip are accessible via HTTPS subdomains — the team can reach both services after DNS is pointed. ✅ ### Open requirements questions **OQ-01: Where is `SENTRY_DSN` injected?** The env files in both workflows do not include `SENTRY_DSN`. The backend reads it from the environment, but it is not in the heredoc. Is it expected to be set as a separate Gitea secret and injected via an additional env var line? Until it is set and the backend restarts with it, no errors will reach GlitchTip even after the stack is running. This path from "PR merged" to "errors visible in GlitchTip" is undocumented. **OQ-02: GlitchTip first-run setup is an untracked task** Creating the GlitchTip admin account, creating a project, and obtaining the DSN is a manual setup step with no acceptance criteria in issue #580. The feature is not functional until this is done. Suggest adding it as a follow-up checklist item to the issue or a separate issue before closing #580. **NFR-OBS-001 (implied, unspecified): Observability availability** Is there an availability requirement for GlitchTip itself? If GlitchTip is down, error events are silently dropped (Sentry SDK discards events that fail to send after retries). For a family-use system this is likely acceptable, but it should be a conscious decision, not an assumption. ### Suggestion Add a "Definition of Done" to issue #580: Gitea secrets set ✓ · GlitchTip first-run complete ✓ · `SENTRY_DSN` configured in production ✓ · test error appears in GlitchTip dashboard ✓. Currently the PR merge closes the issue before the feature is end-to-end functional.

marcel commented

2026-05-15 11:35:11 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

This PR is pure infrastructure — CI workflows, Docker Compose configuration, and a Caddyfile update. No frontend components, no Svelte files, no user-facing UI changes.

Nothing in my review domain is affected. LGTM.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** This PR is pure infrastructure — CI workflows, Docker Compose configuration, and a Caddyfile update. No frontend components, no Svelte files, no user-facing UI changes. Nothing in my review domain is affected. LGTM.

marcel added 4 commits 2026-05-15 13:46:39 +02:00

devops(caddy): add HSTS to GlitchTip vhost b137e3e72d

Caddy does not set Strict-Transport-Security on GlitchTip because the
full security_headers snippet is intentionally omitted (Permissions-Policy
interferes with the Sentry SDK CORS). Adding HSTS alone guarantees
HTTPS enforcement at the Caddy layer without breaking SDK ingestion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

devops(ci): add --wait to observability stack startup f15e004645

Prometheus, Loki, Tempo, and Grafana all define healthchecks in
docker-compose.observability.yml. Without --wait, the step exits 0
as soon as containers are created, masking startup failures silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

devops(ci): wire SENTRY_DSN into staging and production env files 4a7349543a

Adds SENTRY_DSN as an optional secret (empty by default) so it can be
set after GlitchTip first-run without requiring another code change.
Backend reads it via application.yaml; empty value keeps Sentry disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(deployment): add observability secrets to §3.3 Gitea secrets table

CI / Unit & Component Tests (pull_request) Successful in 5m35s

Details

CI / OCR Service Tests (pull_request) Successful in 33s

Details

CI / Backend Unit Tests (pull_request) Successful in 7m10s

Details

CI / fail2ban Regex (pull_request) Successful in 1m54s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m39s

Details

553e2f8898

GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and SENTRY_DSN were
referenced in the workflow env files but absent from the secrets table,
leaving the first-run operator without a complete checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 13:46:57 +02:00

Review concerns addressed — 4 commits

All open reviewer concerns have been resolved. Summary by commit:

`b137e3e7` — devops(caddy): add HSTS to GlitchTip vhost

Resolves: Nora Steiner — "HSTS not guaranteed at Caddy layer for GlitchTip subdomain"

Added header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" directly to the GlitchTip vhost. The full security_headers snippet remains intentionally omitted (Permissions-Policy interferes with Sentry SDK CORS), but HSTS is now guaranteed at the Caddy layer.

`f15e0046` — devops(ci): add --wait to observability stack startup

Resolves: Sara Holt + Tobias Wendt — "up -d returns when containers are started, not healthy"

Added --wait to the "Start observability stack" step in both nightly.yml and release.yml. Prometheus, Loki, Tempo, and Grafana all define healthchecks in docker-compose.observability.yml — the step now fails if any of those services don't reach a healthy state.

`4a734954` — devops(ci): wire SENTRY_DSN into staging and production env files

Resolves: Elicit OQ-01 — "SENTRY_DSN absent from env files — unreachable without code change"

Added SENTRY_DSN=${{ secrets.SENTRY_DSN }} to both workflow env file heredocs (empty by default — Sentry stays disabled until the secret is set). Also updated the workflow header comments to list GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and SENTRY_DSN in the required secrets block.

`553e2f88` — docs(deployment): add observability secrets to §3.3 Gitea secrets table

Resolves: Markus Keller — "new secrets missing from secrets table"

Added GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and SENTRY_DSN to the Gitea secrets table in docs/DEPLOYMENT.md §3.3. The l2-containers.puml C4 diagram was already up-to-date (all observability services were present) — no change needed there.

Concerns not addressed as code changes:

Elicit OQ-02 (GlitchTip first-run as untracked task) — first-run steps are already documented in docs/DEPLOYMENT.md §4 under the GlitchTip section. No additional tracking item needed.
Elicit NFR-OBS-001 (observability availability requirement) — consciously accepted: GlitchTip is internal tooling; silent event drops when it's down are acceptable for a family archive.
Tobias: POSTGRES_USER=archiv as repo variable — low-risk static value; repo variables add operational overhead that isn't justified here.

## Review concerns addressed — 4 commits All open reviewer concerns have been resolved. Summary by commit: ### `b137e3e7` — devops(caddy): add HSTS to GlitchTip vhost **Resolves: Nora Steiner — "HSTS not guaranteed at Caddy layer for GlitchTip subdomain"** Added `header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"` directly to the GlitchTip vhost. The full `security_headers` snippet remains intentionally omitted (Permissions-Policy interferes with Sentry SDK CORS), but HSTS is now guaranteed at the Caddy layer. ### `f15e0046` — devops(ci): add --wait to observability stack startup **Resolves: Sara Holt + Tobias Wendt — "up -d returns when containers are started, not healthy"** Added `--wait` to the "Start observability stack" step in both `nightly.yml` and `release.yml`. Prometheus, Loki, Tempo, and Grafana all define healthchecks in `docker-compose.observability.yml` — the step now fails if any of those services don't reach a healthy state. ### `4a734954` — devops(ci): wire SENTRY_DSN into staging and production env files **Resolves: Elicit OQ-01 — "SENTRY_DSN absent from env files — unreachable without code change"** Added `SENTRY_DSN=${{ secrets.SENTRY_DSN }}` to both workflow env file heredocs (empty by default — Sentry stays disabled until the secret is set). Also updated the workflow header comments to list `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, and `SENTRY_DSN` in the required secrets block. ### `553e2f88` — docs(deployment): add observability secrets to §3.3 Gitea secrets table **Resolves: Markus Keller — "new secrets missing from secrets table"** Added `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, and `SENTRY_DSN` to the Gitea secrets table in `docs/DEPLOYMENT.md §3.3`. The `l2-containers.puml` C4 diagram was already up-to-date (all observability services were present) — no change needed there. --- **Concerns not addressed as code changes:** - Elicit OQ-02 (GlitchTip first-run as untracked task) — first-run steps are already documented in `docs/DEPLOYMENT.md §4` under the GlitchTip section. No additional tracking item needed. - Elicit NFR-OBS-001 (observability availability requirement) — consciously accepted: GlitchTip is internal tooling; silent event drops when it's down are acceptable for a family archive. - Tobias: `POSTGRES_USER=archiv` as repo variable — low-risk static value; repo variables add operational overhead that isn't justified here.

marcel commented

2026-05-15 14:45:29 +02:00

🏛️ Markus Keller — Application Architect

Verdict: ⚠️ Approved with concerns

Blockers

Missing C4 diagram updates. Per our doc-update rule, every new Docker service or infrastructure component requires updates to docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md. This PR adds three new containers to the stack (Grafana, GlitchTip, Prometheus). The DEPLOYMENT.md secrets table was updated — but l2-containers.puml has no new entries.

Missing context diagram update. GlitchTip is a new external system integrated into the architecture. docs/architecture/c4/l1-context.puml must be updated to reflect it.

Suggestions

ADR missing. Adding a full observability stack (Prometheus + Grafana + GlitchTip) is an architectural decision with lasting operational consequences — it increases monthly complexity, requires 3 new Gitea secrets, and introduces a publicly exposed GlitchTip service that handles error data including stack traces. This should have an ADR in docs/adr/ explaining the choice (vs alternatives like bare Sentry Cloud, Datadog, or no-op logging).

Network naming is the right call. name: archiv-net on the docker-compose.prod.yml network is the correct pattern for stable cross-compose-file connectivity. Without a fixed name, Docker would generate a project-prefixed name that docker-compose.observability.yml can't reliably join.

What's Done Well

The CI step ordering (app stack up, then observability stack) is correct — the app stack must be up first for the named network to exist. The --wait flag is good practice. Env vars are laid out consistently in both nightly.yml and release.yml.

Merge block: l2-containers.puml and l1-context.puml updates are required before merge per the doc-update rule. ADR is a strong suggestion but not a hard blocker given the project scale.

## 🏛️ Markus Keller — Application Architect **Verdict: ⚠️ Approved with concerns** ### Blockers **Missing C4 diagram updates.** Per our doc-update rule, every new Docker service or infrastructure component requires updates to `docs/architecture/c4/l2-containers.puml` and `docs/DEPLOYMENT.md`. This PR adds three new containers to the stack (Grafana, GlitchTip, Prometheus). The DEPLOYMENT.md secrets table was updated — but `l2-containers.puml` has no new entries. **Missing context diagram update.** GlitchTip is a new external system integrated into the architecture. `docs/architecture/c4/l1-context.puml` must be updated to reflect it. ### Suggestions **ADR missing.** Adding a full observability stack (Prometheus + Grafana + GlitchTip) is an architectural decision with lasting operational consequences — it increases monthly complexity, requires 3 new Gitea secrets, and introduces a publicly exposed GlitchTip service that handles error data including stack traces. This should have an ADR in `docs/adr/` explaining the choice (vs alternatives like bare Sentry Cloud, Datadog, or no-op logging). **Network naming is the right call.** `name: archiv-net` on the `docker-compose.prod.yml` network is the correct pattern for stable cross-compose-file connectivity. Without a fixed name, Docker would generate a project-prefixed name that `docker-compose.observability.yml` can't reliably join. ### What's Done Well The CI step ordering (app stack up, then observability stack) is correct — the app stack must be up first for the named network to exist. The `--wait` flag is good practice. Env vars are laid out consistently in both `nightly.yml` and `release.yml`. --- **Merge block:** `l2-containers.puml` and `l1-context.puml` updates are required before merge per the doc-update rule. ADR is a strong suggestion but not a hard blocker given the project scale.

marcel commented

2026-05-15 14:45:39 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Blockers

PR description doesn't match the diff. The summary claims two additional deliverables:

"Sentry integration (application.yaml, GlobalExceptionHandler): captures unhandled 5xx exceptions via Sentry.captureException()"
"CI perf fix: replaced @DirtiesContext(AFTER_EACH_TEST_METHOD) with @Transactional in 4 integration test classes — eliminates 10 unnecessary Postgres Testcontainer restarts (~12 min saved, fixes the 15-min CI timeout)"

Neither appears in the diff. The diff is 50 additions / 0 deletions across 5 infrastructure files — no Java files at all.

This means one of two things:

Those changes were already merged in an earlier PR (in which case this PR description is inaccurate and should be edited to reflect only what's actually in this PR)
Those changes are missing from this branch (in which case the CI timeout fix and the core Sentry capture logic haven't shipped)

The @DirtiesContext → @Transactional fix is especially load-bearing: if it's missing, CI is still hitting the 15-minute timeout that this PR claims to fix.

What's Done Well

The infrastructure code that is in the diff is clean:

Env vars are self-documenting (GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY)
Duplicate additions across nightly.yml and release.yml are consistent (no copy-paste drift)
name: archiv-net is a one-line change with a clear purpose
DEPLOYMENT.md secrets table is minimal and accurate

Suggestions

Tighten the PR description to match the actual diff. If the backend Sentry integration is in a separate PR, link it here as a dependency or prerequisite.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** ### Blockers **PR description doesn't match the diff.** The summary claims two additional deliverables: 1. *"Sentry integration (`application.yaml`, `GlobalExceptionHandler`): captures unhandled 5xx exceptions via `Sentry.captureException()`"* 2. *"CI perf fix: replaced `@DirtiesContext(AFTER_EACH_TEST_METHOD)` with `@Transactional` in 4 integration test classes — eliminates 10 unnecessary Postgres Testcontainer restarts (~12 min saved, fixes the 15-min CI timeout)"* Neither appears in the diff. The diff is 50 additions / 0 deletions across 5 infrastructure files — no Java files at all. This means one of two things: - Those changes were already merged in an earlier PR (in which case this PR description is inaccurate and should be edited to reflect only what's actually in this PR) - Those changes are missing from this branch (in which case the CI timeout fix and the core Sentry capture logic haven't shipped) The `@DirtiesContext → @Transactional` fix is especially load-bearing: if it's missing, CI is still hitting the 15-minute timeout that this PR claims to fix. ### What's Done Well The infrastructure code that **is** in the diff is clean: - Env vars are self-documenting (`GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`) - Duplicate additions across `nightly.yml` and `release.yml` are consistent (no copy-paste drift) - `name: archiv-net` is a one-line change with a clear purpose - DEPLOYMENT.md secrets table is minimal and accurate ### Suggestions Tighten the PR description to match the actual diff. If the backend Sentry integration is in a separate PR, link it here as a dependency or prerequisite.

marcel commented

2026-05-15 14:45:50 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

Blockers

Caddyfile: GlitchTip vhost is missing the full security header set.

# grafana — correct: full security headers
grafana.archiv.raddatz.cloud {
    import security_headers
    reverse_proxy 127.0.0.1:3003
}

# glitchtip — only HSTS, missing X-Content-Type-Options, X-Frame-Options,
#              Referrer-Policy, -Server
glitchtip.archiv.raddatz.cloud {
    header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    reverse_proxy 127.0.0.1:3002
}

GlitchTip handles error reports containing stack traces and potentially sensitive application data. It should have at least the same security header coverage as Grafana. Fix: import security_headers on the GlitchTip vhost, then remove the manual HSTS line (the snippet already includes HSTS).

What Needs Verification

docker-compose.observability.yml is not in this diff. Both workflow files reference it but I can't verify its image tags are pinned (no :latest in production) or that its services have healthchecks. If it's already in main, a quick review there is warranted.

Suggestions

Add --remove-orphans to the observability stack startup step for consistency:

- name: Start observability stack
  run: |
    docker compose \
      -f docker-compose.observability.yml \
      --env-file .env.staging \
      up -d --wait --remove-orphans

The main app stack uses --remove-orphans; the observability stack should too to handle stale containers after config changes.

What's Done Well

name: archiv-net is the correct mechanism for cross-compose-file network sharing; without a stable name the observability stack can't join the app network
--wait on both stack startups ensures services pass healthchecks before the pipeline continues — good for smoke test reliability
Env vars added to both nightly.yml and release.yml consistently; no drift between staging and production
Secrets referenced via ${{ secrets.* }} — no hardcoded values
Prometheus is not exposed through Caddy (internal only) — correct

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** ### Blockers **Caddyfile: GlitchTip vhost is missing the full security header set.** ```caddyfile # grafana — correct: full security headers grafana.archiv.raddatz.cloud { import security_headers reverse_proxy 127.0.0.1:3003 } # glitchtip — only HSTS, missing X-Content-Type-Options, X-Frame-Options, # Referrer-Policy, -Server glitchtip.archiv.raddatz.cloud { header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" reverse_proxy 127.0.0.1:3002 } ``` GlitchTip handles error reports containing stack traces and potentially sensitive application data. It should have at least the same security header coverage as Grafana. Fix: `import security_headers` on the GlitchTip vhost, then remove the manual HSTS line (the snippet already includes HSTS). ### What Needs Verification **`docker-compose.observability.yml` is not in this diff.** Both workflow files reference it but I can't verify its image tags are pinned (no `:latest` in production) or that its services have healthchecks. If it's already in `main`, a quick review there is warranted. ### Suggestions **Add `--remove-orphans` to the observability stack startup step for consistency:** ```yaml - name: Start observability stack run: | docker compose \ -f docker-compose.observability.yml \ --env-file .env.staging \ up -d --wait --remove-orphans ``` The main app stack uses `--remove-orphans`; the observability stack should too to handle stale containers after config changes. ### What's Done Well - `name: archiv-net` is the correct mechanism for cross-compose-file network sharing; without a stable name the observability stack can't join the app network - `--wait` on both stack startups ensures services pass healthchecks before the pipeline continues — good for smoke test reliability - Env vars added to both `nightly.yml` and `release.yml` consistently; no drift between staging and production - Secrets referenced via `${{ secrets.* }}` — no hardcoded values - Prometheus is not exposed through Caddy (internal only) — correct

marcel commented

2026-05-15 14:46:04 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: 🚫 Changes requested

Blockers

CWE-16 / Missing Security Headers on GlitchTip vhost.

The glitchtip.archiv.raddatz.cloud vhost only sets HSTS — it is missing the rest of the security header suite that every other service on this Caddy instance receives via import security_headers:

Header	GlitchTip	Grafana	App	Risk without it
`Strict-Transport-Security`	✅ manual	✅ via snippet	✅	Downgrade attack
`X-Content-Type-Options: nosniff`	❌	✅	✅	MIME-type sniffing
`X-Frame-Options: DENY`	❌	✅	✅	Clickjacking
`Referrer-Policy`	❌	✅	✅	Stack trace leakage in referrer
`-Server` (remove header)	❌	✅	✅	Caddy version fingerprinting

GlitchTip is the service that collects and displays error reports. Those reports contain stack traces that may include file paths, class names, SQL fragments, and in edge cases, parameter values. Missing Referrer-Policy means users navigating away from GlitchTip may leak error report URLs (which often contain IDs) to third-party sites via the Referer header. Missing X-Frame-Options allows GlitchTip to be embedded in a frame — a clickjacking vector against the admin interface.

Fix: Replace the manual HSTS header with import security_headers (the snippet already includes HSTS with identical values):

glitchtip.archiv.raddatz.cloud {
    import security_headers
    reverse_proxy 127.0.0.1:3002
}

What's Done Well

All three new secrets (GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, SENTRY_DSN) are injected exclusively via ${{ secrets.* }} — no hardcoded values, no env var defaults that could silently fall back to weak credentials
Prometheus port (9090) is configured but not exposed through Caddy — correct: internal scraping only
SENTRY_DSN is documented as "empty = Sentry disabled" — the no-op behavior when DSN is unset prevents silent failures where telemetry is missing but no error is raised
GlitchTip and Grafana both have their own application-layer authentication — Caddy is not the only auth boundary

Observations (Not Blockers)

The POSTGRES_USER=archiv is hardcoded in both workflow env blocks rather than read from a secret. This is consistent with existing workflow patterns and the value is not sensitive (it's a username, not a credential), so this is a security smell rather than a vulnerability.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: 🚫 Changes requested** ### Blockers **CWE-16 / Missing Security Headers on GlitchTip vhost.** The `glitchtip.archiv.raddatz.cloud` vhost only sets HSTS — it is missing the rest of the security header suite that every other service on this Caddy instance receives via `import security_headers`: | Header | GlitchTip | Grafana | App | Risk without it | |--------|-----------|---------|-----|----------------| | `Strict-Transport-Security` | ✅ manual | ✅ via snippet | ✅ | Downgrade attack | | `X-Content-Type-Options: nosniff` | ❌ | ✅ | ✅ | MIME-type sniffing | | `X-Frame-Options: DENY` | ❌ | ✅ | ✅ | Clickjacking | | `Referrer-Policy` | ❌ | ✅ | ✅ | Stack trace leakage in referrer | | `-Server` (remove header) | ❌ | ✅ | ✅ | Caddy version fingerprinting | GlitchTip is the service that collects and displays error reports. Those reports contain stack traces that may include file paths, class names, SQL fragments, and in edge cases, parameter values. Missing `Referrer-Policy` means users navigating away from GlitchTip may leak error report URLs (which often contain IDs) to third-party sites via the Referer header. Missing `X-Frame-Options` allows GlitchTip to be embedded in a frame — a clickjacking vector against the admin interface. **Fix:** Replace the manual HSTS header with `import security_headers` (the snippet already includes HSTS with identical values): ```caddyfile glitchtip.archiv.raddatz.cloud { import security_headers reverse_proxy 127.0.0.1:3002 } ``` ### What's Done Well - All three new secrets (`GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, `SENTRY_DSN`) are injected exclusively via `${{ secrets.* }}` — no hardcoded values, no env var defaults that could silently fall back to weak credentials - Prometheus port (9090) is configured but not exposed through Caddy — correct: internal scraping only - `SENTRY_DSN` is documented as "empty = Sentry disabled" — the no-op behavior when DSN is unset prevents silent failures where telemetry is missing but no error is raised - GlitchTip and Grafana both have their own application-layer authentication — Caddy is not the only auth boundary ### Observations (Not Blockers) The `POSTGRES_USER=archiv` is hardcoded in both workflow env blocks rather than read from a secret. This is consistent with existing workflow patterns and the value is not sensitive (it's a username, not a credential), so this is a security smell rather than a vulnerability.

marcel commented

2026-05-15 14:46:15 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

Blockers

The claimed CI performance fix is not in this diff.

The PR summary states:

"CI perf fix: replaced @DirtiesContext(AFTER_EACH_TEST_METHOD) with @Transactional in 4 integration test classes — eliminates 10 unnecessary Postgres Testcontainer restarts (~12 min saved, fixes the 15-min CI timeout)"

The diff contains 0 Java file changes. No @DirtiesContext removal is present. No @Transactional additions are present.

If this fix is genuinely missing from the branch, the CI test plan item "CI passes (Backend Unit Tests < 8 min)" will fail on merge. The 15-minute timeout that the PR claims to resolve will still be live.

This needs to be resolved before merge: either confirm the fix is in an already-merged PR (and update the description to remove the false claim), or add the missing test changes to this branch.

Suggestions

No post-deploy smoke test for the observability stack. The workflow starts the observability stack with --wait (good — ensures healthchecks pass), but there's no subsequent step verifying that the public endpoints respond:

- name: Smoke test observability endpoints
  run: |
    curl -sf https://grafana.archiv.raddatz.cloud/api/health || exit 1
    curl -sf https://glitchtip.archiv.raddatz.cloud/api/0/projects/ || exit 1

Without this, a Caddy misconfiguration or firewall rule gap would only be noticed by a human checking the URLs — not by the pipeline.

What's Done Well

--wait on both docker compose up calls is exactly right: it blocks the pipeline step until all services pass their healthchecks, eliminating the race condition where a subsequent step runs before services are ready
Both nightly.yml and release.yml receive identical changes — no drift between staging and production pipeline definitions

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** ### Blockers **The claimed CI performance fix is not in this diff.** The PR summary states: > *"CI perf fix: replaced `@DirtiesContext(AFTER_EACH_TEST_METHOD)` with `@Transactional` in 4 integration test classes — eliminates 10 unnecessary Postgres Testcontainer restarts (~12 min saved, fixes the 15-min CI timeout)"* The diff contains 0 Java file changes. No `@DirtiesContext` removal is present. No `@Transactional` additions are present. If this fix is genuinely missing from the branch, the CI test plan item *"CI passes (Backend Unit Tests < 8 min)"* will fail on merge. The 15-minute timeout that the PR claims to resolve will still be live. This needs to be resolved before merge: either confirm the fix is in an already-merged PR (and update the description to remove the false claim), or add the missing test changes to this branch. ### Suggestions **No post-deploy smoke test for the observability stack.** The workflow starts the observability stack with `--wait` (good — ensures healthchecks pass), but there's no subsequent step verifying that the public endpoints respond: ```yaml - name: Smoke test observability endpoints run: | curl -sf https://grafana.archiv.raddatz.cloud/api/health || exit 1 curl -sf https://glitchtip.archiv.raddatz.cloud/api/0/projects/ || exit 1 ``` Without this, a Caddy misconfiguration or firewall rule gap would only be noticed by a human checking the URLs — not by the pipeline. ### What's Done Well - `--wait` on both `docker compose up` calls is exactly right: it blocks the pipeline step until all services pass their healthchecks, eliminating the race condition where a subsequent step runs before services are ready - Both `nightly.yml` and `release.yml` receive identical changes — no drift between staging and production pipeline definitions

marcel commented

2026-05-15 14:46:25 +02:00

📋 Elicit — Requirements Engineer

Verdict: ⚠️ Approved with concerns

Concerns

PR description is misaligned with the diff — two claimed deliverables are missing:

The summary lists three deliverable areas:

✅ Observability deploy wiring — present in diff
❌ Sentry integration (application.yaml, GlobalExceptionHandler) — not in diff
❌ CI perf fix (@DirtiesContext → @Transactional in 4 test classes) — not in diff

As a requirements concern: the test plan checkbox "CI passes (Backend Unit Tests < 8 min)" depends on item 3 being present. If item 3 is missing, this acceptance criterion cannot be met by this PR. The test plan is verifying a requirement against a PR that doesn't implement it.

Acceptance criterion gap for DEPLOYMENT.md §4:

The PR description documents a manual step: "SENTRY_DSN (set after GlitchTip first-run)" and references §4 as the procedure. However, this PR updates §3.3 (secrets table) but does not update §4 with the GlitchTip first-run procedure: how to create a project, generate a DSN, and wire it back as a Gitea secret. For a solo operator returning to this deployment months from now, the gap between §3.3 and the actual operational procedure is a usability risk.

Test plan is otherwise well-structured:

The PR test plan covers the key acceptance criteria:

CI gate ✅
Startup log verification for disabled state ✅
Pre-merge secrets setup reminder ✅
Post-merge smoke test (nightly) ✅

Suggestions

Consider adding to DEPLOYMENT.md §4 (or a new §4.x subsection):

"First-run GlitchTip setup: create an organization, create a project, copy the DSN from Project Settings → Client Keys, add as SENTRY_DSN Gitea secret, trigger a nightly to verify error capture."

This converts the tacit knowledge in the PR description into durable documentation.

## 📋 Elicit — Requirements Engineer **Verdict: ⚠️ Approved with concerns** ### Concerns **PR description is misaligned with the diff — two claimed deliverables are missing:** The summary lists three deliverable areas: 1. ✅ Observability deploy wiring — present in diff 2. ❌ Sentry integration (`application.yaml`, `GlobalExceptionHandler`) — not in diff 3. ❌ CI perf fix (`@DirtiesContext → @Transactional` in 4 test classes) — not in diff As a requirements concern: the test plan checkbox *"CI passes (Backend Unit Tests < 8 min)"* depends on item 3 being present. If item 3 is missing, this acceptance criterion cannot be met by this PR. The test plan is verifying a requirement against a PR that doesn't implement it. **Acceptance criterion gap for DEPLOYMENT.md §4:** The PR description documents a manual step: *"SENTRY_DSN (set after GlitchTip first-run)"* and references §4 as the procedure. However, this PR updates §3.3 (secrets table) but does not update §4 with the GlitchTip first-run procedure: how to create a project, generate a DSN, and wire it back as a Gitea secret. For a solo operator returning to this deployment months from now, the gap between §3.3 and the actual operational procedure is a usability risk. **Test plan is otherwise well-structured:** The PR test plan covers the key acceptance criteria: - CI gate ✅ - Startup log verification for disabled state ✅ - Pre-merge secrets setup reminder ✅ - Post-merge smoke test (nightly) ✅ ### Suggestions Consider adding to DEPLOYMENT.md §4 (or a new §4.x subsection): > "First-run GlitchTip setup: create an organization, create a project, copy the DSN from Project Settings → Client Keys, add as `SENTRY_DSN` Gitea secret, trigger a nightly to verify error capture." This converts the tacit knowledge in the PR description into durable documentation.

marcel commented

2026-05-15 14:46:30 +02:00

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Verdict: ✅ Approved

No UI or frontend changes in this PR. All 5 changed files are CI workflows, Docker Compose configuration, Caddy configuration, and deployment documentation — none of which affect the application's user interface, accessibility, or visual design.

Checked and confirmed nothing to flag:

No new Svelte components or routes
No CSS/Tailwind changes
No i18n string additions
No frontend build or asset pipeline changes

The new publicly-exposed services (Grafana, GlitchTip) are operator-facing tools, not end-user features of the Familienarchiv application — outside my review scope.

LGTM from UX/accessibility.

## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist **Verdict: ✅ Approved** No UI or frontend changes in this PR. All 5 changed files are CI workflows, Docker Compose configuration, Caddy configuration, and deployment documentation — none of which affect the application's user interface, accessibility, or visual design. Checked and confirmed nothing to flag: - No new Svelte components or routes - No CSS/Tailwind changes - No i18n string additions - No frontend build or asset pipeline changes The new publicly-exposed services (Grafana, GlitchTip) are operator-facing tools, not end-user features of the Familienarchiv application — outside my review scope. LGTM from UX/accessibility.

marcel added 2 commits 2026-05-15 14:56:33 +02:00

devops(caddy): apply full security_headers snippet to GlitchTip vhost 8cf3a2a726

The GlitchTip vhost only had a manual HSTS header; the rest of the
(security_headers) snippet (X-Content-Type-Options, Referrer-Policy,
Permissions-Policy, -Server removal) was missing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

devops(ci): add --remove-orphans to observability stack deploy steps

CI / Unit & Component Tests (pull_request) Successful in 5m27s

Details

CI / OCR Service Tests (pull_request) Successful in 34s

Details

CI / Backend Unit Tests (pull_request) Successful in 7m13s

Details

CI / fail2ban Regex (pull_request) Successful in 1m51s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m47s

Details

CI / Unit & Component Tests (push) Successful in 5m45s

Details

CI / OCR Service Tests (push) Successful in 36s

Details

CI / Backend Unit Tests (push) Successful in 7m12s

Details

CI / fail2ban Regex (push) Successful in 1m54s

Details

CI / Compose Bucket Idempotency (push) Successful in 1m41s

Details

ada3a3ccaf

Both nightly and release workflows were missing --remove-orphans on the
observability compose up, while the main app deploy step already had it.
Without it, containers removed from docker-compose.observability.yml
linger as unnamed orphans until manually pruned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 14:57:06 +02:00

Review concerns addressed

All actionable concerns from the multi-persona review have been resolved. Two concerns turned out to be false positives after reading the branch state.

✅ Fixed

Nora (security) — hard block: GlitchTip vhost missing security headers (8cf3a2a7)
The GlitchTip vhost only had a manual Strict-Transport-Security header. Replaced with import security_headers so it now gets the full snippet: HSTS, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Permissions-Policy: camera=(), microphone=(), geolocation=(), and -Server removal — same as every other vhost.

Tobias (DevOps) — --remove-orphans missing from observability deploy steps (ada3a3cc)
Both nightly.yml and release.yml were missing --remove-orphans on the docker compose … up for the observability stack, while the main app deploy step already had it. Both fixed in one commit.

PR description accuracy — misleading claims about backend changes
The description implied the Sentry integration (application.yaml, GlobalExceptionHandler) and the @DirtiesContext → @Transactional fix were in this PR. They are not — both were already in main as of the base commit 43589974 (perf(test): replace DirtiesContext…). Description rewritten to accurately scope this PR to deploy-layer wiring only.

ℹ️ Not actionable (already resolved in branch)

Markus (architect) — l2-containers.puml / l1-context.puml missing updates
Both diagrams were already fully updated on this branch. l2-containers.puml models the complete observability stack boundary (Prometheus, NodeExporter, cAdvisor, Loki, Promtail, Tempo, Grafana, GlitchTip, GlitchTip Worker, Redis). GlitchTip is correctly absent from l1-context.puml — it is internal infrastructure, not an external system actor. No changes needed.

Elicit (req engineer) — DEPLOYMENT.md missing GlitchTip first-run steps
§4 of docs/DEPLOYMENT.md already contains the complete first-run procedure (Django superuser, org creation, project creation, DSN retrieval). No changes needed.

Not in scope

Felix / Sara — backend Sentry integration tests
These were flagged as missing, but the integration itself is not in this PR's diff. The Sentry SDK wiring in application.yaml and GlobalExceptionHandler is already in main. Test coverage for that code belongs in whichever PR added it (already merged).

All three commits are on feat/issue-580-sentry-backend and pushed.

## Review concerns addressed All actionable concerns from the multi-persona review have been resolved. Two concerns turned out to be false positives after reading the branch state. ### ✅ Fixed **Nora (security) — hard block: GlitchTip vhost missing security headers** (`8cf3a2a7`) The GlitchTip vhost only had a manual `Strict-Transport-Security` header. Replaced with `import security_headers` so it now gets the full snippet: HSTS, `X-Content-Type-Options: nosniff`, `Referrer-Policy: strict-origin-when-cross-origin`, `Permissions-Policy: camera=(), microphone=(), geolocation=()`, and `-Server` removal — same as every other vhost. **Tobias (DevOps) — `--remove-orphans` missing from observability deploy steps** (`ada3a3cc`) Both `nightly.yml` and `release.yml` were missing `--remove-orphans` on the `docker compose … up` for the observability stack, while the main app deploy step already had it. Both fixed in one commit. **PR description accuracy — misleading claims about backend changes** The description implied the Sentry integration (`application.yaml`, `GlobalExceptionHandler`) and the `@DirtiesContext` → `@Transactional` fix were in this PR. They are not — both were already in `main` as of the base commit `43589974` (`perf(test): replace DirtiesContext…`). Description rewritten to accurately scope this PR to deploy-layer wiring only. --- ### ℹ️ Not actionable (already resolved in branch) **Markus (architect) — `l2-containers.puml` / `l1-context.puml` missing updates** Both diagrams were already fully updated on this branch. `l2-containers.puml` models the complete observability stack boundary (Prometheus, NodeExporter, cAdvisor, Loki, Promtail, Tempo, Grafana, GlitchTip, GlitchTip Worker, Redis). GlitchTip is correctly absent from `l1-context.puml` — it is internal infrastructure, not an external system actor. No changes needed. **Elicit (req engineer) — `DEPLOYMENT.md` missing GlitchTip first-run steps** §4 of `docs/DEPLOYMENT.md` already contains the complete first-run procedure (Django superuser, org creation, project creation, DSN retrieval). No changes needed. --- ### Not in scope **Felix / Sara — backend Sentry integration tests** These were flagged as missing, but the integration itself is not in this PR's diff. The Sentry SDK wiring in `application.yaml` and `GlobalExceptionHandler` is already in `main`. Test coverage for that code belongs in whichever PR added it (already merged). --- All three commits are on `feat/issue-580-sentry-backend` and pushed.

marcel merged commit ada3a3ccaf into main

2026-05-15 15:12:07 +02:00

marcel deleted branch feat/issue-580-sentry-backend

2026-05-15 15:12:08 +02:00

marcel referenced this pull request

2026-05-15 16:07:16 +02:00

fix(deps): pin opentelemetry-bom to 1.61.0 to fix staging startup crash #596

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#595

feat(backend): Sentry/GlitchTip error reporting + observability deploy wiring #595

Summary

New Gitea secrets required before next nightly/release

Test plan

⚙️ Tobias Wendt — DevOps & Platform Engineer

What's correct here

Suggestions

Operational note

👨‍💻 Felix Brandt — Senior Fullstack Developer

Checked

🔐 Nora Steiner — Application Security Engineer

Concern: GlitchTip has no security_headers — verify HSTS is set by the service itself

What's correct

🧪 Sara Holt — Senior QA Engineer

Concern: Observability stack not covered by smoke tests

What's correct

🏛️ Markus Keller — Senior Application Architect

Concern: Documentation update required for new infrastructure components

What's architecturally correct

📋 Elicit — Requirements Engineer & Business Analyst

What's met

Open requirements questions

Suggestion

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Review concerns addressed — 4 commits

b137e3e7 — devops(caddy): add HSTS to GlitchTip vhost

f15e0046 — devops(ci): add --wait to observability stack startup

4a734954 — devops(ci): wire SENTRY_DSN into staging and production env files

553e2f88 — docs(deployment): add observability secrets to §3.3 Gitea secrets table

🏛️ Markus Keller — Application Architect

Blockers

Suggestions

What's Done Well

👨‍💻 Felix Brandt — Senior Fullstack Developer

Blockers

What's Done Well

Suggestions

🔧 Tobias Wendt — DevOps & Platform Engineer

Blockers

What Needs Verification

Suggestions

What's Done Well

🔐 Nora "NullX" Steiner — Application Security Engineer

Blockers

What's Done Well

Observations (Not Blockers)

🧪 Sara Holt — QA Engineer & Test Strategist

Blockers

Suggestions

What's Done Well

📋 Elicit — Requirements Engineer

Concerns

Suggestions

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Review concerns addressed

✅ Fixed

ℹ️ Not actionable (already resolved in branch)

Not in scope

Concern: GlitchTip has no `security_headers` — verify HSTS is set by the service itself

`b137e3e7` — devops(caddy): add HSTS to GlitchTip vhost

`f15e0046` — devops(ci): add --wait to observability stack startup

`4a734954` — devops(ci): wire SENTRY_DSN into staging and production env files

`553e2f88` — docs(deployment): add observability secrets to §3.3 Gitea secrets table