fix(obs): Tempo processors schema fix + configurable Postgres host and Grafana port #602

marcel · 2026-05-15T21:46:47+02:00

marcel commented

2026-05-15 21:46:47 +02:00

Summary

Fixes discovered during the observability stack migration to /opt/familienarchiv/ (see #601).

Tempo 2.7.2 schema breaking change: metrics_generator.processors was removed from the top-level config and is only valid under overrides.defaults.metrics_generator.processors. The duplicate top-level block caused Tempo to refuse to start with field processors not found in type generator.Config. Removed the now-invalid block — the processors were already configured under overrides.
Configurable Postgres host: GlitchTip and its db-init job hardcoded archive-db as the Postgres hostname. When only the staging stack is running (container archiv-staging-db-1), DNS resolution fails. ${POSTGRES_HOST:-archive-db} makes the host overridable via .env without changing the production default.
PORT_GRAFANA default: Changed from 3001 to 3003 — port 3001 is occupied by the staging frontend (archiv-staging-frontend-1), causing Bind for 127.0.0.1:3001 failed: port is already allocated on compose up.

Test plan

docker compose -f docker-compose.observability.yml config validates without errors
Tempo starts healthy (no field processors not found error in logs)
GlitchTip connects to Postgres when POSTGRES_HOST is set in .env
GlitchTip connects to archive-db by default (no .env override needed in production)
Grafana binds to port 3003 by default; overridable via PORT_GRAFANA

Closes #601

🤖 Generated with Claude Code

## Summary Fixes discovered during the observability stack migration to `/opt/familienarchiv/` (see #601). - **Tempo 2.7.2 schema breaking change**: `metrics_generator.processors` was removed from the top-level config and is only valid under `overrides.defaults.metrics_generator.processors`. The duplicate top-level block caused Tempo to refuse to start with `field processors not found in type generator.Config`. Removed the now-invalid block — the processors were already configured under `overrides`. - **Configurable Postgres host**: GlitchTip and its db-init job hardcoded `archive-db` as the Postgres hostname. When only the staging stack is running (container `archiv-staging-db-1`), DNS resolution fails. `${POSTGRES_HOST:-archive-db}` makes the host overridable via `.env` without changing the production default. - **PORT_GRAFANA default**: Changed from `3001` to `3003` — port 3001 is occupied by the staging frontend (`archiv-staging-frontend-1`), causing `Bind for 127.0.0.1:3001 failed: port is already allocated` on compose up. ## Test plan - [ ] `docker compose -f docker-compose.observability.yml config` validates without errors - [ ] Tempo starts healthy (no `field processors not found` error in logs) - [ ] GlitchTip connects to Postgres when `POSTGRES_HOST` is set in `.env` - [ ] GlitchTip connects to `archive-db` by default (no `.env` override needed in production) - [ ] Grafana binds to port 3003 by default; overridable via `PORT_GRAFANA` Closes #601 🤖 Generated with [Claude Code](https://claude.com/claude-code)

marcel added 2 commits 2026-05-15 21:46:48 +02:00

fix(obs): remove invalid processors block from tempo metrics_generator 458968ded5

Tempo 2.7.2 removed `processors` from the top-level metrics_generator
config; the field is only valid under `overrides.defaults.metrics_generator`.
The setting was already present there, so this only removes the now-rejected
duplicate at the top level.

Closes part of #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(obs): make Postgres host configurable and fix PORT_GRAFANA default

CI / Unit & Component Tests (pull_request) Successful in 3m6s

Details

CI / OCR Service Tests (pull_request) Successful in 19s

Details

CI / Backend Unit Tests (pull_request) Successful in 2m43s

Details

CI / fail2ban Regex (pull_request) Successful in 39s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 59s

Details

1181b97f94

POSTGRES_HOST variable (default: archive-db) lets the observability stack
connect to a different Postgres container — needed when only the staging
stack is running (container name: archiv-staging-db-1).

PORT_GRAFANA default changed from 3001 to 3003 to avoid collision with
the staging frontend which occupies 3001.

Closes #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel added 1 commit 2026-05-15 21:59:44 +02:00

fix(ci): deploy obs configs to /opt/familienarchiv/ before starting stack

CI / Unit & Component Tests (pull_request) Successful in 3m4s

Details

CI / OCR Service Tests (pull_request) Successful in 18s

Details

CI / Backend Unit Tests (pull_request) Successful in 2m42s

Details

CI / fail2ban Regex (pull_request) Successful in 41s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s

Details

7e52494880

The observability stack's bind-mount sources pointed to workspace-relative
paths. When CI wiped the workspace between runs, containers kept running but
their config files disappeared — causing Docker to auto-create directories
at the missing paths and crash the services on next restart.

Fix: mount /opt/familienarchiv/ into CI job containers via runner-config.yaml,
then copy infra/observability/ and docker-compose.observability.yml there before
docker compose up. Compose runs from the permanent path, so bind mounts resolve
to stable host paths that survive workspace wipes.

Docker Compose reads /opt/familienarchiv/.env automatically (no --env-file flag),
which is managed on the server and persists between CI runs.

Closes #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 23:50:39 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The root cause fix is solid — stable bind mounts at /opt/familienarchiv/ survive workspace wipes by definition. The pattern mirrors how the app stack already operates. A few operational details need attention before this is bulletproof.

Blockers

GLITCHTIP_DOMAIN will revert to http://localhost:3002 after this merge. The nightly step previously passed --env-file .env.staging, which set GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud. That flag is removed in this PR. Docker Compose now reads /opt/familienarchiv/.env automatically — but that file doesn't currently include GLITCHTIP_DOMAIN. The compose default is ${GLITCHTIP_DOMAIN:-http://localhost:3002}, so GlitchTip will report the wrong domain in error report URLs. Fix: add GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud to /opt/familienarchiv/.env on the server before the next nightly run.
Runner must be restarted for runner-config.yaml changes to take effect. The new /opt/familienarchiv mount only applies to job containers spawned after the runner process has reloaded its config. Until then, the "Deploy observability configs" step will fail when it tries to write to /opt/familienarchiv/. Fix: restart the Gitea Act runner on the host after this PR is merged: systemctl restart gitea-runner (or equivalent).

Suggestions

cp -r doesn't remove deleted files. If a config file is removed from the repo, it persists in /opt/familienarchiv/infra/observability/ indefinitely. For this project's change frequency that's acceptable — but worth knowing. A rsync -a --delete would give a clean mirror.
docker compose up -d --wait is the right call. It waits for all healthchecks to pass before the step exits, so a broken Tempo config or Loki failure fails CI immediately. Good that it was preserved from the old step.
Named volumes are unaffected. loki_data, grafana_data, prometheus_data, tempo_data are Docker-managed and not touched by the copy step or the compose project name change. No data migration risk.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** The root cause fix is solid — stable bind mounts at `/opt/familienarchiv/` survive workspace wipes by definition. The pattern mirrors how the app stack already operates. A few operational details need attention before this is bulletproof. ### Blockers - **`GLITCHTIP_DOMAIN` will revert to `http://localhost:3002` after this merge.** The nightly step previously passed `--env-file .env.staging`, which set `GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud`. That flag is removed in this PR. Docker Compose now reads `/opt/familienarchiv/.env` automatically — but that file doesn't currently include `GLITCHTIP_DOMAIN`. The compose default is `${GLITCHTIP_DOMAIN:-http://localhost:3002}`, so GlitchTip will report the wrong domain in error report URLs. **Fix:** add `GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud` to `/opt/familienarchiv/.env` on the server before the next nightly run. - **Runner must be restarted for `runner-config.yaml` changes to take effect.** The new `/opt/familienarchiv` mount only applies to job containers spawned after the runner process has reloaded its config. Until then, the "Deploy observability configs" step will fail when it tries to write to `/opt/familienarchiv/`. **Fix:** restart the Gitea Act runner on the host after this PR is merged: `systemctl restart gitea-runner` (or equivalent). ### Suggestions - **`cp -r` doesn't remove deleted files.** If a config file is removed from the repo, it persists in `/opt/familienarchiv/infra/observability/` indefinitely. For this project's change frequency that's acceptable — but worth knowing. A `rsync -a --delete` would give a clean mirror. - **`docker compose up -d --wait` is the right call.** It waits for all healthchecks to pass before the step exits, so a broken Tempo config or Loki failure fails CI immediately. Good that it was preserved from the old step. - **Named volumes are unaffected.** `loki_data`, `grafana_data`, `prometheus_data`, `tempo_data` are Docker-managed and not touched by the copy step or the compose project name change. No data migration risk.

marcel commented

2026-05-15 23:50:53 +02:00

🏗️ Markus Keller — Application Architect

Verdict: 🚫 Changes requested

The architectural decision is correct: the observability stack is persistent infrastructure and must be colocated with the permanent app deployment in /opt/familienarchiv/. The CI-push model (copy on every nightly deploy) is the right fit for this single-operator setup — no git credentials on the server, no server-pull complexity.

However, the documentation obligations triggered by this change are not met.

Blockers

docs/DEPLOYMENT.md is missing the observability section. Per the CLAUDE.md docs table, any change to "New Docker service / infrastructure component" location requires a docs/DEPLOYMENT.md update. This PR moves the obs stack's operational location. The deployment docs should cover: (1) starting the obs stack from /opt/familienarchiv/, (2) required keys in /opt/familienarchiv/.env with a note on $$ escaping for passwords containing $, (3) the one-time GlitchTip migrate + createsuperuser steps, (4) the runner restart requirement after updating runner-config.yaml.
ADR-016 is missing. This PR makes a permanent architectural decision: the obs stack shares /opt/familienarchiv/ with the app stack, and config sync is CI-push (not server-pull, not a separate directory). The alternatives had different trade-offs worth recording. Without an ADR, the next person who touches this infrastructure will not know why it was designed this way. The Decision Queue comment on issue #601 explicitly flagged this choice as needing resolution — resolving it in code without documenting it is incomplete.

Concerns

Implicit dependency on /opt/familienarchiv/.env correctness with no validation. If GLITCHTIP_SECRET_KEY is missing or $$ escaping is wrong for a $-containing password, GlitchTip starts silently with a broken or default config. A docker compose -f /opt/familienarchiv/docker-compose.observability.yml config dry-run step before up would catch substitution errors in CI before containers start.

## 🏗️ Markus Keller — Application Architect **Verdict: 🚫 Changes requested** The architectural decision is correct: the observability stack is persistent infrastructure and must be colocated with the permanent app deployment in `/opt/familienarchiv/`. The CI-push model (copy on every nightly deploy) is the right fit for this single-operator setup — no git credentials on the server, no server-pull complexity. However, the documentation obligations triggered by this change are not met. ### Blockers - **`docs/DEPLOYMENT.md` is missing the observability section.** Per the CLAUDE.md docs table, any change to "New Docker service / infrastructure component" location requires a `docs/DEPLOYMENT.md` update. This PR moves the obs stack's operational location. The deployment docs should cover: (1) starting the obs stack from `/opt/familienarchiv/`, (2) required keys in `/opt/familienarchiv/.env` with a note on `$$` escaping for passwords containing `$`, (3) the one-time GlitchTip `migrate` + `createsuperuser` steps, (4) the runner restart requirement after updating `runner-config.yaml`. - **ADR-016 is missing.** This PR makes a permanent architectural decision: the obs stack shares `/opt/familienarchiv/` with the app stack, and config sync is CI-push (not server-pull, not a separate directory). The alternatives had different trade-offs worth recording. Without an ADR, the next person who touches this infrastructure will not know why it was designed this way. The Decision Queue comment on issue #601 explicitly flagged this choice as needing resolution — resolving it in code without documenting it is incomplete. ### Concerns - **Implicit dependency on `/opt/familienarchiv/.env` correctness with no validation.** If `GLITCHTIP_SECRET_KEY` is missing or `$$` escaping is wrong for a `$`-containing password, GlitchTip starts silently with a broken or default config. A `docker compose -f /opt/familienarchiv/docker-compose.observability.yml config` dry-run step before `up` would catch substitution errors in CI before containers start.

marcel commented

2026-05-15 23:51:01 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

No application code changes — pure infrastructure. The implementation is clean and the scope is well-contained. One documentation obligation is triggered that must be met before merge.

Blockers

Missing doc updates per the CLAUDE.md table. The docs table entry "New Docker service / infrastructure component" requires updating both docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md. The obs stack's operational location changed — that's a container topology change the C4 L2 diagram should reflect. DEPLOYMENT.md needs the ops runbook section.

LGTM

Tempo fix is correct. Removing the top-level processors block is the right fix for the Tempo 2.7.2 schema change — the processors are already correctly configured under overrides.defaults.metrics_generator.processors.
POSTGRES_HOST variable is well-defaulted. ${POSTGRES_HOST:-archive-db} is backward compatible — production with the full stack running needs zero .env changes.
PORT_GRAFANA default change is low-risk. 3003 avoids the collision with the staging frontend. Any environment without an explicit PORT_GRAFANA var gets the safer default.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** No application code changes — pure infrastructure. The implementation is clean and the scope is well-contained. One documentation obligation is triggered that must be met before merge. ### Blockers - **Missing doc updates per the CLAUDE.md table.** The docs table entry "New Docker service / infrastructure component" requires updating both `docs/architecture/c4/l2-containers.puml` **and** `docs/DEPLOYMENT.md`. The obs stack's operational location changed — that's a container topology change the C4 L2 diagram should reflect. DEPLOYMENT.md needs the ops runbook section. ### LGTM - **Tempo fix is correct.** Removing the top-level `processors` block is the right fix for the Tempo 2.7.2 schema change — the processors are already correctly configured under `overrides.defaults.metrics_generator.processors`. - **`POSTGRES_HOST` variable is well-defaulted.** `${POSTGRES_HOST:-archive-db}` is backward compatible — production with the full stack running needs zero `.env` changes. - **`PORT_GRAFANA` default change is low-risk.** `3003` avoids the collision with the staging frontend. Any environment without an explicit `PORT_GRAFANA` var gets the safer default.

marcel commented

2026-05-15 23:51:15 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

No new application-layer vulnerabilities introduced. The infrastructure changes have one expanded security surface worth explicitly acknowledging, and one net security improvement.

Concerns (not blockers for a private single-operator repo)

Mounting /opt/familienarchiv/ into CI job containers expands the blast radius of a compromised workflow step. Before this PR, a malicious job step could write to the workspace and use the Docker socket (which already grants root-equivalent host access). After this PR, it can also directly overwrite files in /opt/familienarchiv/ — including the main app's compose files and Caddy config. The existing comment in runner-config.yaml correctly notes "only trusted code from this private repo runs on this runner." Recommend hardening the comment to make the risk explicit: # WARNING: grants CI jobs write access to /opt/familienarchiv/ — do not add this runner to repos with external contributors.

Net Security Improvement

Removing --env-file .env.staging for the obs stack fixes a silent credential truncation bug. The .env.staging heredoc uses shell expansion. Passwords containing $ (like GRAFANA_ADMIN_PASSWORD=me30g@b$Nt$Z2g) were silently truncated: $Nt and $Z2g were expanded as empty variables, so Grafana was starting with me30g@b as the admin password. The /opt/familienarchiv/.env file uses $$ → $ escaping that Docker Compose handles correctly. The full intended password now reaches Grafana. This is a genuine security improvement as a side effect of this PR.

LGTM

Network isolation unchanged: Grafana and Prometheus still bound to 127.0.0.1. No new ports exposed externally.
All image tags remain pinned. No new :latest references introduced.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** No new application-layer vulnerabilities introduced. The infrastructure changes have one expanded security surface worth explicitly acknowledging, and one net security improvement. ### Concerns (not blockers for a private single-operator repo) - **Mounting `/opt/familienarchiv/` into CI job containers expands the blast radius of a compromised workflow step.** Before this PR, a malicious job step could write to the workspace and use the Docker socket (which already grants root-equivalent host access). After this PR, it can also directly overwrite files in `/opt/familienarchiv/` — including the main app's compose files and Caddy config. The existing comment in `runner-config.yaml` correctly notes "only trusted code from this private repo runs on this runner." Recommend hardening the comment to make the risk explicit: `# WARNING: grants CI jobs write access to /opt/familienarchiv/ — do not add this runner to repos with external contributors`. ### Net Security Improvement - **Removing `--env-file .env.staging` for the obs stack fixes a silent credential truncation bug.** The `.env.staging` heredoc uses shell expansion. Passwords containing `$` (like `GRAFANA_ADMIN_PASSWORD=me30g@b$Nt$Z2g`) were silently truncated: `$Nt` and `$Z2g` were expanded as empty variables, so Grafana was starting with `me30g@b` as the admin password. The `/opt/familienarchiv/.env` file uses `$$` → `$` escaping that Docker Compose handles correctly. The full intended password now reaches Grafana. This is a genuine security improvement as a side effect of this PR. ### LGTM - Network isolation unchanged: Grafana and Prometheus still bound to `127.0.0.1`. No new ports exposed externally. - All image tags remain pinned. No new `:latest` references introduced.

marcel commented

2026-05-15 23:51:26 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

The fix addresses the root failure mode. The --wait flag on docker compose up provides implicit health verification — if a healthchecked service fails, the CI step fails. That's better than nothing, but some issue #601 acceptance criteria remain unchecked.

Concerns

Services without healthcheck directives are not verified. docker compose up --wait waits for services with healthchecks to reach healthy. Services without them (obs-promtail, obs-cadvisor, obs-node-exporter, obs-glitchtip-worker) are considered "started" the moment the process runs. A follow-up docker inspect assertion step would give CI a definitive pass/fail for the critical services: docker inspect obs-loki obs-prometheus obs-grafana obs-tempo --format '{{.Name}} {{.State.Health.Status}}' and assert all are healthy.
Issue #601 acceptance criteria coverage gaps not in this PR:
- ❌ .env.example not updated with observability vars and $$ escaping note
- ❌ docs/DEPLOYMENT.md missing createsuperuser one-time step and required .env keys
- ❌ No regression guard to verify the /opt/familienarchiv/ path is used (a future workflow refactor could revert to workspace-relative paths silently)

LGTM

--wait is preserved from the old step. The meaningful health gate was not accidentally dropped.
Tempo fix has a clear verification signal. Tempo emits a specific error when processors is at the wrong config level. Removing it will eliminate the crash loop — no ambiguity about whether it works.
POSTGRES_HOST default is backward compatible. Production behavior is unchanged when the env var is absent.

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** The fix addresses the root failure mode. The `--wait` flag on `docker compose up` provides implicit health verification — if a healthchecked service fails, the CI step fails. That's better than nothing, but some issue #601 acceptance criteria remain unchecked. ### Concerns - **Services without `healthcheck` directives are not verified.** `docker compose up --wait` waits for services *with* healthchecks to reach `healthy`. Services without them (`obs-promtail`, `obs-cadvisor`, `obs-node-exporter`, `obs-glitchtip-worker`) are considered "started" the moment the process runs. A follow-up `docker inspect` assertion step would give CI a definitive pass/fail for the critical services: `docker inspect obs-loki obs-prometheus obs-grafana obs-tempo --format '{{.Name}} {{.State.Health.Status}}'` and assert all are `healthy`. - **Issue #601 acceptance criteria coverage gaps not in this PR:** - ❌ `.env.example` not updated with observability vars and `$$` escaping note - ❌ `docs/DEPLOYMENT.md` missing `createsuperuser` one-time step and required `.env` keys - ❌ No regression guard to verify the `/opt/familienarchiv/` path is used (a future workflow refactor could revert to workspace-relative paths silently) ### LGTM - **`--wait` is preserved from the old step.** The meaningful health gate was not accidentally dropped. - **Tempo fix has a clear verification signal.** Tempo emits a specific error when `processors` is at the wrong config level. Removing it will eliminate the crash loop — no ambiguity about whether it works. - **`POSTGRES_HOST` default is backward compatible.** Production behavior is unchanged when the env var is absent.

marcel commented

2026-05-15 23:51:31 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

No concerns from my angle.

This PR is entirely server-side infrastructure — CI workflow steps, runner configuration, Docker Compose adjustments, and a Tempo YAML fix. No user-facing routes, Svelte components, CSS changes, or UI states are involved.

The Grafana provisioning directory (infra/observability/grafana/provisioning/) is copied as-is — dashboards, datasources, and dashboard JSON files are unchanged. No Grafana UI modification is implied. Nora's observation that the Grafana admin password is now correct (fixed from the previous $ truncation) actually improves the operator experience.

Nothing to flag from a UX or accessibility perspective.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** No concerns from my angle. This PR is entirely server-side infrastructure — CI workflow steps, runner configuration, Docker Compose adjustments, and a Tempo YAML fix. No user-facing routes, Svelte components, CSS changes, or UI states are involved. The Grafana provisioning directory (`infra/observability/grafana/provisioning/`) is copied as-is — dashboards, datasources, and dashboard JSON files are unchanged. No Grafana UI modification is implied. Nora's observation that the Grafana admin password is now correct (fixed from the previous `$` truncation) actually improves the operator experience. Nothing to flag from a UX or accessibility perspective.

marcel commented

2026-05-15 23:51:46 +02:00

📋 Elicit — Requirements Engineer

Verdict: ⚠️ Approved with concerns

Reviewing against the acceptance criteria in issue #601, including additions from the two findings comments.

Acceptance Criteria Coverage

Criterion	Status
Deployed to and managed from `/opt/familienarchiv/`	✅ CI step copies files + runs compose from there
Bind-mount sources point to `/opt/familienarchiv/infra/observability/…`	✅ Compose uses absolute path; relative mounts resolve correctly
`project.config_files` label points to `/opt/familienarchiv/`	✅ Compose is launched with `-f /opt/familienarchiv/docker-compose.observability.yml`
CI workspace wipe has zero effect	✅ Containers bind-mount from stable host path
CI step handles syncing configs	✅ "Deploy observability configs" step in nightly.yml
No manual `docker rm` required after workspace wipe	✅ Containers restart from stable bind mounts
GlitchTip tag pinned to specific version	✅ Already `6.1.6` in main; unchanged here
`.env.example` updated with observability vars + `$$` escaping note	❌ Not in this PR
`docs/DEPLOYMENT.md` documents `createsuperuser` one-time step	❌ Not in this PR
All obs-* containers verified healthy after migration	⚠️ Partial — `--wait` covers services with healthchecks; services without are not explicitly checked

Concerns

Two ACs from #601 are unaddressed (.env.example and docs/DEPLOYMENT.md). These were added in the second findings comment. They are not blockers for the infrastructure fix, but #601 cannot be closed cleanly without them. Recommend either including them here or opening a follow-up issue before closing #601.
The Decision Queue open decision is resolved by this PR but not recorded. The issue's Decision Queue asked: CI-push vs server-pull for config sync? This PR answers CI-push. That resolution should appear in ADR-016 (which Markus flagged as missing) so future contributors understand why server-pull was not chosen.
/opt/familienarchiv/.env required-key inventory is undocumented. The PR description correctly states the file is "managed separately on the server — not written or deleted by CI." But there is no documented list of what keys must be present. A new operator following any runbook would not know. The DEPLOYMENT.md update should include a required-key checklist with example values and the $$ escaping rule for passwords containing $.

## 📋 Elicit — Requirements Engineer **Verdict: ⚠️ Approved with concerns** Reviewing against the acceptance criteria in issue #601, including additions from the two findings comments. ### Acceptance Criteria Coverage | Criterion | Status | |---|---| | Deployed to and managed from `/opt/familienarchiv/` | ✅ CI step copies files + runs compose from there | | Bind-mount sources point to `/opt/familienarchiv/infra/observability/…` | ✅ Compose uses absolute path; relative mounts resolve correctly | | `project.config_files` label points to `/opt/familienarchiv/` | ✅ Compose is launched with `-f /opt/familienarchiv/docker-compose.observability.yml` | | CI workspace wipe has zero effect | ✅ Containers bind-mount from stable host path | | CI step handles syncing configs | ✅ "Deploy observability configs" step in nightly.yml | | No manual `docker rm` required after workspace wipe | ✅ Containers restart from stable bind mounts | | GlitchTip tag pinned to specific version | ✅ Already `6.1.6` in main; unchanged here | | `.env.example` updated with observability vars + `$$` escaping note | ❌ Not in this PR | | `docs/DEPLOYMENT.md` documents `createsuperuser` one-time step | ❌ Not in this PR | | All obs-* containers verified healthy after migration | ⚠️ Partial — `--wait` covers services with healthchecks; services without are not explicitly checked | ### Concerns - **Two ACs from #601 are unaddressed** (`.env.example` and `docs/DEPLOYMENT.md`). These were added in the second findings comment. They are not blockers for the infrastructure fix, but #601 cannot be closed cleanly without them. Recommend either including them here or opening a follow-up issue before closing #601. - **The Decision Queue open decision is resolved by this PR but not recorded.** The issue's Decision Queue asked: CI-push vs server-pull for config sync? This PR answers CI-push. That resolution should appear in ADR-016 (which Markus flagged as missing) so future contributors understand why server-pull was not chosen. - **`/opt/familienarchiv/.env` required-key inventory is undocumented.** The PR description correctly states the file is "managed separately on the server — not written or deleted by CI." But there is no documented list of what keys must be present. A new operator following any runbook would not know. The DEPLOYMENT.md update should include a required-key checklist with example values and the `$$` escaping rule for passwords containing `$`.

marcel added 7 commits 2026-05-16 00:02:16 +02:00

docs(obs): update .env.example for PORT_GRAFANA 3003, POSTGRES_HOST, $$ escaping 448c3cdcdb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(obs): update DEPLOYMENT.md with /opt/familienarchiv/ ops section, env keys, runner restart 7b7d0c92a8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(c4): update l2-containers obs boundary to show /opt/familienarchiv/ permanent path dec6b8139b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(adr): add ADR-016 for obs stack co-location and CI-push config sync 4e94d85d7e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(ci): harden runner-config.yaml security comment for /opt/familienarchiv/ write access c7d2eeb3f0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): add compose config dry-run before obs stack up to catch .env substitution errors df37113d38

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): assert obs-loki/prometheus/grafana/tempo are healthy after stack up

CI / Unit & Component Tests (pull_request) Successful in 2m58s

Details

CI / OCR Service Tests (pull_request) Successful in 17s

Details

CI / Backend Unit Tests (pull_request) Successful in 2m36s

Details

CI / fail2ban Regex (pull_request) Successful in 41s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s

Details

79735e23e0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-16 00:02:35 +02:00

Review concerns addressed

All open reviewer concerns have been resolved in 7 additional commits on this branch.

Blockers resolved

Markus — docs/DEPLOYMENT.md missing ✅

Added production ops subsection to §4: starting the obs stack from /opt/familienarchiv/, the complete /opt/familienarchiv/.env required-key inventory, the $$ escaping rule for $-containing passwords, and a note on what happens when GLITCHTIP_DOMAIN is missing (defaults to http://localhost:3002). Commit 7b7d0c92.

Markus — ADR-016 missing ✅

Added docs/adr/016-obs-stack-co-location-ci-push.md documenting the CI-push model, the co-location decision, and the two rejected alternatives (server-pull, separate directory). The Decision Queue open point from issue #601 is now recorded. Commit 4e94d85d.

Felix — l2-containers.puml not updated ✅

Updated the observability System_Boundary label to show (/opt/familienarchiv/docker-compose.observability.yml). Commit dec6b813.

Suggestions / concerns resolved

Nora — blast-radius comment too weak ✅

Hardened runner-config.yaml options comment with an explicit SECURITY WARNING block calling out (1) root-equivalent Docker socket access and (2) read/write access to /opt/familienarchiv/ including app compose files and Caddy config. Commit c7d2eeb3.

Sara + Elicit — .env.example not updated ✅

Changed PORT_GRAFANA from 3001 to 3003, added POSTGRES_HOST with the default and override context, added $$ escaping note with an example. Commit 448c3cdc.

Markus — no .env correctness validation before up ✅

Added Validate observability compose config step (docker compose config --quiet) immediately before Start observability stack. Catches missing keys, truncated passwords, and YAML errors before any containers start. Commit df37113d.

Sara — services without healthchecks not explicitly verified ✅

Added Assert observability stack health step after up --wait. Iterates obs-loki, obs-prometheus, obs-grafana, obs-tempo; emits a ::error:: annotation and exits non-zero if any container is not healthy. Commit 79735e23.

Tobias — operational actions (not fixable in code)

These two items require manual server-side action before or after merge:

GLITCHTIP_DOMAIN in /opt/familienarchiv/.env — Add GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud to the file on the server before the next nightly run. Now documented in DEPLOYMENT.md §4 required-key table.
Runner restart — Run systemctl restart gitea-runner after this PR is merged so the /opt/familienarchiv mount takes effect. Now documented in DEPLOYMENT.md §3.1.

## Review concerns addressed All open reviewer concerns have been resolved in 7 additional commits on this branch. ### Blockers resolved **Markus — docs/DEPLOYMENT.md missing** ✅ - Added production ops subsection to §4: starting the obs stack from `/opt/familienarchiv/`, the complete `/opt/familienarchiv/.env` required-key inventory, the `$$` escaping rule for `$`-containing passwords, and a note on what happens when `GLITCHTIP_DOMAIN` is missing (defaults to `http://localhost:3002`). Commit `7b7d0c92`. **Markus — ADR-016 missing** ✅ - Added `docs/adr/016-obs-stack-co-location-ci-push.md` documenting the CI-push model, the co-location decision, and the two rejected alternatives (server-pull, separate directory). The Decision Queue open point from issue #601 is now recorded. Commit `4e94d85d`. **Felix — l2-containers.puml not updated** ✅ - Updated the observability `System_Boundary` label to show `(/opt/familienarchiv/docker-compose.observability.yml)`. Commit `dec6b813`. ### Suggestions / concerns resolved **Nora — blast-radius comment too weak** ✅ - Hardened `runner-config.yaml` options comment with an explicit `SECURITY WARNING` block calling out (1) root-equivalent Docker socket access and (2) read/write access to `/opt/familienarchiv/` including app compose files and Caddy config. Commit `c7d2eeb3`. **Sara + Elicit — .env.example not updated** ✅ - Changed `PORT_GRAFANA` from `3001` to `3003`, added `POSTGRES_HOST` with the default and override context, added `$$` escaping note with an example. Commit `448c3cdc`. **Markus — no .env correctness validation before `up`** ✅ - Added `Validate observability compose config` step (`docker compose config --quiet`) immediately before `Start observability stack`. Catches missing keys, truncated passwords, and YAML errors before any containers start. Commit `df37113d`. **Sara — services without healthchecks not explicitly verified** ✅ - Added `Assert observability stack health` step after `up --wait`. Iterates `obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`; emits a `::error::` annotation and exits non-zero if any container is not `healthy`. Commit `79735e23`. ### Tobias — operational actions (not fixable in code) These two items require manual server-side action before or after merge: 1. **`GLITCHTIP_DOMAIN` in `/opt/familienarchiv/.env`** — Add `GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud` to the file on the server before the next nightly run. Now documented in DEPLOYMENT.md §4 required-key table. 2. **Runner restart** — Run `systemctl restart gitea-runner` after this PR is merged so the `/opt/familienarchiv` mount takes effect. Now documented in DEPLOYMENT.md §3.1.

marcel added 6 commits 2026-05-16 00:22:15 +02:00

feat(obs): add obs.env with non-secret config tracked in git b67bd201b2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(obs): add GF_SERVER_ROOT_URL to Grafana service f9baf02b86

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): GitOps obs env split in nightly — obs.env in git, secrets fresh from Gitea c5139851b8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): GitOps obs env split in release — deploy to /opt/familienarchiv/, secrets fresh from Gitea 1ae4bfe325

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(obs): document obs vs main stack env model, obs.env + obs-secrets.env approach d83ed7254d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(obs): set POSTGRES_HOST per environment — staging/prod use compose auto-names not archive-db

CI / Unit & Component Tests (pull_request) Successful in 2m58s

Details

CI / OCR Service Tests (pull_request) Successful in 19s

Details

CI / Backend Unit Tests (pull_request) Successful in 2m39s

Details

CI / fail2ban Regex (pull_request) Successful in 40s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s

Details

53cf1837b2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-16 00:22:30 +02:00

GitOps refactor — obs env model simplified (5 more commits)

After reviewing the secrets flow, the previous approach left a gap: Gitea secrets were written to .env.staging / .env.production but those files were no longer passed to the obs stack, meaning secrets could silently drift. Replaced with a clean two-source model matching how the main app stack works.

What changed

infra/observability/obs.env (new, tracked in git)
All non-secret obs config: ports, public URLs (GLITCHTIP_DOMAIN, GF_SERVER_ROOT_URL). Changes go through PR review. No secrets.

/opt/familienarchiv/obs-secrets.env (written by CI on every deploy)
Only passwords and secret keys, injected fresh from Gitea secrets. Gitea is the single source of truth — rotating a secret takes effect on the next nightly/release run with no manual server action needed.

Both files passed explicitly via --env-file to every obs compose command (validate dry-run + up). No implicit auto-read .env.

release.yml now matches nightly — adds the same "Deploy observability configs" step, deploys from /opt/familienarchiv/, uses the two-file env model.

GF_SERVER_ROOT_URL added to obs-grafana service (defaults to http://localhost:3003 for local dev, set to https://grafana.archiv.raddatz.cloud via obs.env in production).

Dead obs vars removed from .env.staging / .env.production — those files now contain only vars that docker-compose.prod.yml actually uses. GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and port vars are no longer in those files.

POSTGRES_HOST is now per-environment — archiv-staging-db-1 (nightly) and archiv-production-db-1 (release), injected via obs-secrets.env. The old archive-db default was wrong for both CI environments (no container_name is set in the prod compose).

Why obs differs from the main stack (documented in DEPLOYMENT.md §4)

The main app stack has no config-file bind mounts — a workspace wipe doesn't affect running containers. The obs stack bind-mounts prometheus.yml, tempo.yml, Grafana provisioning, etc. into containers. A workspace wipe removes those source paths; the next docker compose up can't find them. That's why obs needs /opt/familienarchiv/ as a permanent home. Everything else (secrets model, env-file pattern) now mirrors the main stack.

## GitOps refactor — obs env model simplified (5 more commits) After reviewing the secrets flow, the previous approach left a gap: Gitea secrets were written to `.env.staging` / `.env.production` but those files were no longer passed to the obs stack, meaning secrets could silently drift. Replaced with a clean two-source model matching how the main app stack works. ### What changed **`infra/observability/obs.env` (new, tracked in git)** All non-secret obs config: ports, public URLs (`GLITCHTIP_DOMAIN`, `GF_SERVER_ROOT_URL`). Changes go through PR review. No secrets. **`/opt/familienarchiv/obs-secrets.env` (written by CI on every deploy)** Only passwords and secret keys, injected fresh from Gitea secrets. Gitea is the single source of truth — rotating a secret takes effect on the next nightly/release run with no manual server action needed. **Both files passed explicitly** via `--env-file` to every obs compose command (validate dry-run + up). No implicit auto-read `.env`. **`release.yml` now matches nightly** — adds the same "Deploy observability configs" step, deploys from `/opt/familienarchiv/`, uses the two-file env model. **`GF_SERVER_ROOT_URL` added** to obs-grafana service (defaults to `http://localhost:3003` for local dev, set to `https://grafana.archiv.raddatz.cloud` via obs.env in production). **Dead obs vars removed** from `.env.staging` / `.env.production` — those files now contain only vars that `docker-compose.prod.yml` actually uses. `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, and port vars are no longer in those files. **`POSTGRES_HOST` is now per-environment** — `archiv-staging-db-1` (nightly) and `archiv-production-db-1` (release), injected via obs-secrets.env. The old `archive-db` default was wrong for both CI environments (no `container_name` is set in the prod compose). ### Why obs differs from the main stack (documented in DEPLOYMENT.md §4) The main app stack has no config-file bind mounts — a workspace wipe doesn't affect running containers. The obs stack bind-mounts `prometheus.yml`, `tempo.yml`, Grafana provisioning, etc. into containers. A workspace wipe removes those source paths; the next `docker compose up` can't find them. That's why obs needs `/opt/familienarchiv/` as a permanent home. Everything else (secrets model, env-file pattern) now mirrors the main stack.

marcel commented

2026-05-16 00:24:41 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The permanent-path approach is the right call. Bind mounts from ephemeral workspace paths have always been a ticking clock, and the CI-push model mirrors the main stack's deploy pattern without introducing new infrastructure. The two-file env split (git-tracked obs.env + CI-injected obs-secrets.env) is clean. The Tempo schema fix and port collision fix are correct.

Blockers

1. ADR-016 "Decision" section describes a secrets model that doesn't exist

docs/adr/016-obs-stack-co-location-ci-push.md (Decision section) says:

Secrets are stored in /opt/familienarchiv/.env on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from /opt/familienarchiv/.

But the actual implementation does the opposite: CI writes /opt/familienarchiv/obs-secrets.env fresh on every run from Gitea secrets. The operator doesn't manage it. Docker Compose doesn't auto-read it — it's passed explicitly via --env-file. This is a significant inaccuracy in the ADR that will mislead the next person trying to understand or troubleshoot the obs stack. The ADR needs to be corrected to match the real obs.env + obs-secrets.env two-source model.

2. release.yml is missing the "Validate observability compose config" and "Assert observability stack health" steps that nightly.yml has

nightly.yml validates the resolved compose config before starting (docker compose config --quiet) and then explicitly asserts the health of the four critical services (obs-loki, obs-prometheus, obs-grafana, obs-tempo) after up --wait. release.yml skips both steps — it just runs up -d --wait --remove-orphans and moves on.

This matters for production: a malformed config file (e.g., a variable left undefined, a typo in a YAML key) could cause up to silently succeed on already-running containers while the new config is never applied. And if a service is unhealthy after the deploy, the release workflow has no gate to catch it. The two steps should be identical between nightly and release.

Suggestions

cp -r leaves deleted files behind: acknowledged in ADR-016 ("Negative" consequences). If a config file is removed from the repo, its stale copy persists at /opt/familienarchiv/infra/observability/ indefinitely. rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/ would give a clean mirror. Low priority given the current change frequency, but worth doing while this code is fresh.
POSTGRES_HOST container names are fragile: archiv-staging-db-1 and archiv-production-db-1 are Docker Compose-derived names that depend on the project name and service name staying stable. A project rename silently breaks the obs stack's DB connection. A comment noting this dependency (and that a project rename requires updating these values) would reduce future confusion.

What's done well

Permanent-path bind mount pattern is solid and well-reasoned.
Runner valid_volumes and security comment in runner-config.yaml are accurate and honest about the blast radius.
obs.env committed to git covers all non-secret config. No operator-managed non-secret files needed.
DEPLOYMENT.md §4 is now genuinely useful as an ops procedure reference.
The GlitchTip db-init container and both glitchtip-web/glitchtip-worker services all updated consistently to use ${POSTGRES_HOST:-archive-db}.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** The permanent-path approach is the right call. Bind mounts from ephemeral workspace paths have always been a ticking clock, and the CI-push model mirrors the main stack's deploy pattern without introducing new infrastructure. The two-file env split (git-tracked `obs.env` + CI-injected `obs-secrets.env`) is clean. The Tempo schema fix and port collision fix are correct. ### Blockers **1. ADR-016 "Decision" section describes a secrets model that doesn't exist** `docs/adr/016-obs-stack-co-location-ci-push.md` (Decision section) says: > Secrets are stored in `/opt/familienarchiv/.env` on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from `/opt/familienarchiv/`. But the actual implementation does the opposite: CI writes `/opt/familienarchiv/obs-secrets.env` fresh on every run from Gitea secrets. The operator doesn't manage it. Docker Compose doesn't auto-read it — it's passed explicitly via `--env-file`. This is a significant inaccuracy in the ADR that will mislead the next person trying to understand or troubleshoot the obs stack. The ADR needs to be corrected to match the real `obs.env` + `obs-secrets.env` two-source model. **2. `release.yml` is missing the "Validate observability compose config" and "Assert observability stack health" steps that `nightly.yml` has** `nightly.yml` validates the resolved compose config before starting (`docker compose config --quiet`) and then explicitly asserts the health of the four critical services (`obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`) after `up --wait`. `release.yml` skips both steps — it just runs `up -d --wait --remove-orphans` and moves on. This matters for production: a malformed config file (e.g., a variable left undefined, a typo in a YAML key) could cause `up` to silently succeed on already-running containers while the new config is never applied. And if a service is unhealthy after the deploy, the release workflow has no gate to catch it. The two steps should be identical between nightly and release. ### Suggestions - **`cp -r` leaves deleted files behind**: acknowledged in ADR-016 ("Negative" consequences). If a config file is removed from the repo, its stale copy persists at `/opt/familienarchiv/infra/observability/` indefinitely. `rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/` would give a clean mirror. Low priority given the current change frequency, but worth doing while this code is fresh. - **POSTGRES_HOST container names are fragile**: `archiv-staging-db-1` and `archiv-production-db-1` are Docker Compose-derived names that depend on the project name and service name staying stable. A project rename silently breaks the obs stack's DB connection. A comment noting this dependency (and that a project rename requires updating these values) would reduce future confusion. ### What's done well - Permanent-path bind mount pattern is solid and well-reasoned. - Runner `valid_volumes` and security comment in `runner-config.yaml` are accurate and honest about the blast radius. - `obs.env` committed to git covers all non-secret config. No operator-managed non-secret files needed. - `DEPLOYMENT.md §4` is now genuinely useful as an ops procedure reference. - The GlitchTip `db-init` container and both `glitchtip-web`/`glitchtip-worker` services all updated consistently to use `${POSTGRES_HOST:-archive-db}`.

marcel commented

2026-05-16 00:24:56 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

This is a pure infrastructure PR — no application code, no frontend, no backend Java. I'll focus on CI YAML quality and configuration file hygiene.

No Blockers

The code is functionally correct. The Tempo schema fix (removing the invalid top-level processors block from tempo.yml) is clearly right — it was a duplicate of the overrides.defaults.metrics_generator.processors block that Tempo 2.7.2 stopped tolerating. The parameterized ${POSTGRES_HOST:-archive-db} replacement hits all three places that hardcoded archive-db (the two DATABASE_URL envs and the db-init command). The port change is consistent across docker-compose.observability.yml, .env.example, and docs/DEPLOYMENT.md.

Suggestions

DRY violation between nightly.yml and release.yml: The "Deploy observability configs" step is nearly identical in both files, but they've already diverged — nightly.yml has "Validate observability compose config" and "Assert observability stack health" steps that release.yml doesn't. CI workflow duplication is the right place to apply the DRY rule: any future change to the obs deploy procedure must now be made in two places, and last time it wasn't (one got the health checks, one didn't). A composite action (uses: ./.gitea/actions/deploy-obs) would be the clean solution; a simpler workaround is a comment at the top of each step saying "Keep in sync with release.yml / nightly.yml step N."

obs.env comment formatting: The comment block at the top of infra/observability/obs.env is clear. The multi-line explanation for POSTGRES_HOST reads well. No issues.

GF_SERVER_ROOT_URL default: The docker-compose.observability.yml change adds:

GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003}

The default http://localhost:3003 is correct for local dev. obs.env sets the production value to https://grafana.archiv.raddatz.cloud. This is well-structured.

What's done well

All three occurrences of hardcoded archive-db are updated atomically — no partial fix.
The heredoc in both workflow files is correctly structured (YAML block scalar strips the common indentation, so the env file content lines reach the shell without leading spaces).
The set -e + accumulate-then-exit pattern in the health assertion step is correct: it checks all four services before exiting, so you get a full picture of which services are unhealthy rather than stopping at the first failure.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** This is a pure infrastructure PR — no application code, no frontend, no backend Java. I'll focus on CI YAML quality and configuration file hygiene. ### No Blockers The code is functionally correct. The Tempo schema fix (removing the invalid top-level `processors` block from `tempo.yml`) is clearly right — it was a duplicate of the `overrides.defaults.metrics_generator.processors` block that Tempo 2.7.2 stopped tolerating. The parameterized `${POSTGRES_HOST:-archive-db}` replacement hits all three places that hardcoded `archive-db` (the two `DATABASE_URL` envs and the `db-init` command). The port change is consistent across `docker-compose.observability.yml`, `.env.example`, and `docs/DEPLOYMENT.md`. ### Suggestions **DRY violation between `nightly.yml` and `release.yml`**: The "Deploy observability configs" step is nearly identical in both files, but they've already diverged — `nightly.yml` has "Validate observability compose config" and "Assert observability stack health" steps that `release.yml` doesn't. CI workflow duplication is the right place to apply the DRY rule: any future change to the obs deploy procedure must now be made in two places, and last time it wasn't (one got the health checks, one didn't). A composite action (`uses: ./.gitea/actions/deploy-obs`) would be the clean solution; a simpler workaround is a comment at the top of each step saying "Keep in sync with release.yml / nightly.yml step N." **`obs.env` comment formatting**: The comment block at the top of `infra/observability/obs.env` is clear. The multi-line explanation for `POSTGRES_HOST` reads well. No issues. **`GF_SERVER_ROOT_URL` default**: The `docker-compose.observability.yml` change adds: ```yaml GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003} ``` The default `http://localhost:3003` is correct for local dev. `obs.env` sets the production value to `https://grafana.archiv.raddatz.cloud`. This is well-structured. ### What's done well - All three occurrences of hardcoded `archive-db` are updated atomically — no partial fix. - The heredoc in both workflow files is correctly structured (YAML block scalar strips the common indentation, so the env file content lines reach the shell without leading spaces). - The `set -e` + accumulate-then-exit pattern in the health assertion step is correct: it checks all four services before exiting, so you get a full picture of which services are unhealthy rather than stopping at the first failure.

marcel commented

2026-05-16 00:25:12 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

Reviewed from a security perspective: secret handling, filesystem exposure, CI job container permissions, and network isolation.

No Blockers

The security posture is appropriate for a single-tenant private runner. The blast-radius warnings in runner-config.yaml are honest and well-placed — whoever adds this runner to an external repo in the future has been explicitly warned.

Observations (non-blocking)

obs-secrets.env file permissions on the host: The cat > /opt/familienarchiv/obs-secrets.env <<EOF step creates the file with default umask permissions (typically 644 — world-readable). The file contains the Grafana admin password, GlitchTip secret key, and the Postgres password. On a single-tenant VPS with no other users this is acceptable, but chmod 600 /opt/familienarchiv/obs-secrets.env immediately after creation would be a free defense-in-depth measure:

run: |
  cat > /opt/familienarchiv/obs-secrets.env <<EOF
  ...
  EOF
  chmod 600 /opt/familienarchiv/obs-secrets.env

Secret rotation surface: The secrets STAGING_POSTGRES_PASSWORD and PROD_POSTGRES_PASSWORD are injected into obs-secrets.env as POSTGRES_PASSWORD. If these Gitea secrets are rotated, the obs stack reconnects correctly on the next CI run (because obs-secrets.env is rewritten). This is correct. The GLITCHTIP_SECRET_KEY and GRAFANA_ADMIN_PASSWORD are shared between nightly and release — documented appropriately in DEPLOYMENT.md.

$$ escaping note in .env.example: This is a good catch. Docker Compose variable interpolation in .env files silently consuming $ in passwords is a real operational hazard. The note is in the right place.

GF_SERVER_ROOT_URL exposure: The new GF_SERVER_ROOT_URL env var passed to the Grafana container is non-sensitive (it's the public URL). No concern.

Port binding: All observability ports remain bound to 127.0.0.1 — Caddy is the only external entry point. The port change from 3001 to 3003 doesn't change the binding scope.

No hardcoded credentials in committed files: obs.env contains no secrets (verified). Secrets flow only through Gitea secrets → CI → ephemeral obs-secrets.env. The model is correct.

What's done well

Zero hardcoded credentials in any committed file.
Security blast-radius warning in runner-config.yaml is explicit and actionable ("Do NOT add this runner to any repo with external contributors").
The two-source env model prevents a scenario where a CI deploy accidentally overwrites operator-set secrets.
ADR-016 explicitly names the security trade-off of mounting /opt/familienarchiv into CI job containers.

## 🔒 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** Reviewed from a security perspective: secret handling, filesystem exposure, CI job container permissions, and network isolation. ### No Blockers The security posture is appropriate for a single-tenant private runner. The blast-radius warnings in `runner-config.yaml` are honest and well-placed — whoever adds this runner to an external repo in the future has been explicitly warned. ### Observations (non-blocking) **obs-secrets.env file permissions on the host**: The `cat > /opt/familienarchiv/obs-secrets.env <<EOF` step creates the file with default umask permissions (typically `644` — world-readable). The file contains the Grafana admin password, GlitchTip secret key, and the Postgres password. On a single-tenant VPS with no other users this is acceptable, but `chmod 600 /opt/familienarchiv/obs-secrets.env` immediately after creation would be a free defense-in-depth measure: ```yaml run: | cat > /opt/familienarchiv/obs-secrets.env <<EOF ... EOF chmod 600 /opt/familienarchiv/obs-secrets.env ``` **Secret rotation surface**: The secrets `STAGING_POSTGRES_PASSWORD` and `PROD_POSTGRES_PASSWORD` are injected into `obs-secrets.env` as `POSTGRES_PASSWORD`. If these Gitea secrets are rotated, the obs stack reconnects correctly on the next CI run (because obs-secrets.env is rewritten). This is correct. The `GLITCHTIP_SECRET_KEY` and `GRAFANA_ADMIN_PASSWORD` are shared between nightly and release — documented appropriately in `DEPLOYMENT.md`. **`$$` escaping note in `.env.example`**: This is a good catch. Docker Compose variable interpolation in `.env` files silently consuming `$` in passwords is a real operational hazard. The note is in the right place. **GF_SERVER_ROOT_URL exposure**: The new `GF_SERVER_ROOT_URL` env var passed to the Grafana container is non-sensitive (it's the public URL). No concern. **Port binding**: All observability ports remain bound to `127.0.0.1` — Caddy is the only external entry point. The port change from 3001 to 3003 doesn't change the binding scope. **No hardcoded credentials in committed files**: `obs.env` contains no secrets (verified). Secrets flow only through Gitea secrets → CI → ephemeral `obs-secrets.env`. The model is correct. ### What's done well - Zero hardcoded credentials in any committed file. - Security blast-radius warning in `runner-config.yaml` is explicit and actionable ("Do NOT add this runner to any repo with external contributors"). - The two-source env model prevents a scenario where a CI deploy accidentally overwrites operator-set secrets. - ADR-016 explicitly names the security trade-off of mounting `/opt/familienarchiv` into CI job containers.

marcel commented

2026-05-16 00:25:26 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

Infrastructure change with no application code impact. Reviewing CI verification coverage, health assertion correctness, and consistency between workflow files.

Concerns (non-blocking but worth addressing)

Missing verification gates in release.yml: nightly.yml has two explicit gates that release.yml lacks:

"Validate observability compose config" (docker compose config --quiet) — this dry-run resolves all variable substitutions and reports missing required keys before any container starts. Without it, a broken config could reach production silently.
"Assert observability stack health" — this explicitly checks obs-loki, obs-prometheus, obs-grafana, and obs-tempo are healthy after up --wait. Without it, an unhealthy service after a production deploy goes undetected until someone checks manually.

This asymmetry means nightly is better-tested than the production deploy path, which is backwards. Both steps should be in release.yml.

Health assertion correctness: The health assertion step in nightly.yml is well-constructed:

status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")

The 2>/dev/null || echo "missing" handles the case where the container doesn't exist. The accumulate-then-exit pattern (unhealthy="$unhealthy $svc") gives a full picture before failing. The explicit list of four services (obs-loki obs-prometheus obs-grafana obs-tempo) matches the services with declared healthchecks in docker-compose.observability.yml — verified correct.

No automated test for the Tempo schema fix: The tempo.yml change (removing the duplicate processors block) is infra config, not application code. There's no unit test possible here. The only verification path is "Tempo starts without the field processors not found error." That's covered by the health assertion step (obs-tempo must become healthy). Acceptable.

docker compose up --wait semantics: The comment in the nightly "Assert observability stack health" step correctly documents the limitation — --wait only covers services WITH healthcheck directives. The four services without healthchecks (obs-promtail, obs-cadvisor, obs-node-exporter, obs-glitchtip-worker) are not asserted. This is an acknowledged gap, documented appropriately.

What's done well

Explicit health assertion for the four critical services is a real improvement over relying solely on --wait.
"Validate observability compose config" as a dry-run gate before the actual up is the right pattern — catches missing variables before containers fail to start.
The test plan in the PR description covers all five scenarios and is specific enough to actually verify.

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** Infrastructure change with no application code impact. Reviewing CI verification coverage, health assertion correctness, and consistency between workflow files. ### Concerns (non-blocking but worth addressing) **Missing verification gates in `release.yml`**: `nightly.yml` has two explicit gates that `release.yml` lacks: 1. **"Validate observability compose config"** (`docker compose config --quiet`) — this dry-run resolves all variable substitutions and reports missing required keys before any container starts. Without it, a broken config could reach production silently. 2. **"Assert observability stack health"** — this explicitly checks `obs-loki`, `obs-prometheus`, `obs-grafana`, and `obs-tempo` are `healthy` after `up --wait`. Without it, an unhealthy service after a production deploy goes undetected until someone checks manually. This asymmetry means nightly is better-tested than the production deploy path, which is backwards. Both steps should be in `release.yml`. **Health assertion correctness**: The health assertion step in `nightly.yml` is well-constructed: ```bash status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing") ``` The `2>/dev/null || echo "missing"` handles the case where the container doesn't exist. The accumulate-then-exit pattern (`unhealthy="$unhealthy $svc"`) gives a full picture before failing. The explicit list of four services (`obs-loki obs-prometheus obs-grafana obs-tempo`) matches the services with declared healthchecks in `docker-compose.observability.yml` — verified correct. **No automated test for the Tempo schema fix**: The `tempo.yml` change (removing the duplicate `processors` block) is infra config, not application code. There's no unit test possible here. The only verification path is "Tempo starts without the `field processors not found` error." That's covered by the health assertion step (obs-tempo must become `healthy`). Acceptable. **`docker compose up --wait` semantics**: The comment in the nightly "Assert observability stack health" step correctly documents the limitation — `--wait` only covers services WITH healthcheck directives. The four services without healthchecks (`obs-promtail`, `obs-cadvisor`, `obs-node-exporter`, `obs-glitchtip-worker`) are not asserted. This is an acknowledged gap, documented appropriately. ### What's done well - Explicit health assertion for the four critical services is a real improvement over relying solely on `--wait`. - "Validate observability compose config" as a dry-run gate before the actual `up` is the right pattern — catches missing variables before containers fail to start. - The test plan in the PR description covers all five scenarios and is specific enough to actually verify.

marcel commented

2026-05-16 00:25:33 +02:00

🎨 Leonie Voss — UI/UX Design Lead

Verdict: ✅ Approved

Pure infrastructure PR — CI workflows, Docker Compose config, observability stack config, documentation, and ADR. No frontend components, no Svelte files, no UI changes, no user-facing routes.

Nothing in this diff touches the user interface, accessibility, design tokens, or brand presentation. From a UX perspective, this is a clean LGTM.

The only tangentially user-relevant change is fixing the observability stack (Grafana, GlitchTip) so it reliably starts and stays healthy — which improves the operator experience for whoever monitors the family archive. The documentation improvements in DEPLOYMENT.md (especially the clear separation of "Dev — start from workspace" vs "Production — managed from /opt/familienarchiv/") will help future operators navigate the setup without frustration.

## 🎨 Leonie Voss — UI/UX Design Lead **Verdict: ✅ Approved** Pure infrastructure PR — CI workflows, Docker Compose config, observability stack config, documentation, and ADR. No frontend components, no Svelte files, no UI changes, no user-facing routes. Nothing in this diff touches the user interface, accessibility, design tokens, or brand presentation. From a UX perspective, this is a clean LGTM. The only tangentially user-relevant change is fixing the observability stack (Grafana, GlitchTip) so it reliably starts and stays healthy — which improves the operator experience for whoever monitors the family archive. The documentation improvements in `DEPLOYMENT.md` (especially the clear separation of "Dev — start from workspace" vs "Production — managed from `/opt/familienarchiv/`") will help future operators navigate the setup without frustration.

marcel commented

2026-05-16 00:25:50 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

The architectural decision is correct: bind-mount sources must be at a path that outlasts the CI workspace lifecycle. The CI-push model to /opt/familienarchiv/ is the right choice given the constraints — it avoids adding server-side automation (cron, git pull), stays consistent with the main stack's deploy pattern, and reflects that the obs stack is architecturally coupled to the main stack (shared archiv-net, shared archive-db). ADR-016 is the right artifact for this decision.

Blockers

ADR-016 "Decision" section contradicts the implementation

The Decision section of docs/adr/016-obs-stack-co-location-ci-push.md says:

Secrets are stored in /opt/familienarchiv/.env on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from /opt/familienarchiv/.

This describes a model where the operator manually manages a .env file and CI doesn't touch it. The actual implementation is the opposite: CI writes /opt/familienarchiv/obs-secrets.env fresh on every deploy from Gitea secrets, and Docker Compose reads it explicitly via --env-file (not auto-read). The operator manages nothing for secrets — CI owns it entirely.

An ADR whose "Decision" section describes the wrong implementation will cause the next person reading it (likely a future incident responder) to look for an operator-managed .env file that doesn't exist, or to believe CI doesn't overwrite the secrets when it does. This needs to be corrected before merge.

The correct description should match what DEPLOYMENT.md §4 already documents accurately: the two-source model where obs.env (git-tracked) contains non-secret config and obs-secrets.env (CI-written from Gitea secrets) contains credentials.

Suggestions

Documentation cross-reference check: Per the architect review checklist, this PR introduces a new infrastructure component path (/opt/familienarchiv/ as a permanent config store) and updates docs/architecture/c4/l2-containers.puml. The container diagram label update (docker-compose.observability.yml → /opt/familienarchiv/docker-compose.observability.yml) is present — good. DEPLOYMENT.md is updated with the full ops procedure — good. ADR-016 exists — good (modulo the content issue above).

Co-location rationale is sound: Rejecting "separate directory (e.g., /opt/obs/)" because GlitchTip shares archive-db and archiv-net with the main stack is the right call. The coupling is real — co-location reflects it honestly.

rsync --delete as future improvement: The ADR names cp -r's inability to remove deleted files as a known consequence. This is acceptable for the current change frequency. If the obs stack config grows significantly, rsync -a --delete would be the natural upgrade path.

What's done well

ADR follows the correct format: Context, Decision drivers, Alternatives considered (with rejection rationale), Decision, Consequences (positive and negative).
The three alternatives are genuinely evaluated, not strawmanned.
The two-source env model cleanly separates concerns: git-tracked (reviewed in PRs) vs CI-managed (from Gitea secrets).
The runner-config security note accurately describes the expanded blast radius.

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** The architectural decision is correct: bind-mount sources must be at a path that outlasts the CI workspace lifecycle. The CI-push model to `/opt/familienarchiv/` is the right choice given the constraints — it avoids adding server-side automation (cron, git pull), stays consistent with the main stack's deploy pattern, and reflects that the obs stack is architecturally coupled to the main stack (shared `archiv-net`, shared `archive-db`). ADR-016 is the right artifact for this decision. ### Blockers **ADR-016 "Decision" section contradicts the implementation** The Decision section of `docs/adr/016-obs-stack-co-location-ci-push.md` says: > Secrets are stored in `/opt/familienarchiv/.env` on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from `/opt/familienarchiv/`. This describes a model where the operator manually manages a `.env` file and CI doesn't touch it. The actual implementation is the opposite: CI writes `/opt/familienarchiv/obs-secrets.env` fresh on every deploy from Gitea secrets, and Docker Compose reads it explicitly via `--env-file` (not auto-read). The operator manages nothing for secrets — CI owns it entirely. An ADR whose "Decision" section describes the wrong implementation will cause the next person reading it (likely a future incident responder) to look for an operator-managed `.env` file that doesn't exist, or to believe CI doesn't overwrite the secrets when it does. This needs to be corrected before merge. The correct description should match what `DEPLOYMENT.md §4` already documents accurately: the two-source model where `obs.env` (git-tracked) contains non-secret config and `obs-secrets.env` (CI-written from Gitea secrets) contains credentials. ### Suggestions **Documentation cross-reference check**: Per the architect review checklist, this PR introduces a new infrastructure component path (`/opt/familienarchiv/` as a permanent config store) and updates `docs/architecture/c4/l2-containers.puml`. The container diagram label update (`docker-compose.observability.yml` → `/opt/familienarchiv/docker-compose.observability.yml`) is present — good. `DEPLOYMENT.md` is updated with the full ops procedure — good. ADR-016 exists — good (modulo the content issue above). **Co-location rationale is sound**: Rejecting "separate directory (e.g., `/opt/obs/`)" because GlitchTip shares `archive-db` and `archiv-net` with the main stack is the right call. The coupling is real — co-location reflects it honestly. **`rsync --delete` as future improvement**: The ADR names `cp -r`'s inability to remove deleted files as a known consequence. This is acceptable for the current change frequency. If the obs stack config grows significantly, `rsync -a --delete` would be the natural upgrade path. ### What's done well - ADR follows the correct format: Context, Decision drivers, Alternatives considered (with rejection rationale), Decision, Consequences (positive and negative). - The three alternatives are genuinely evaluated, not strawmanned. - The two-source env model cleanly separates concerns: git-tracked (reviewed in PRs) vs CI-managed (from Gitea secrets). - The runner-config security note accurately describes the expanded blast radius.

marcel commented

2026-05-16 00:26:06 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

Reviewing from a requirements and documentation completeness perspective: whether the change is traceable, the rationale is captured, the test plan is adequate, and operator-facing documentation leaves no gaps.

Observations

PR description quality: The summary is concise and structured. Each of the three fixes has a one-line root cause ("Tempo 2.7.2 schema breaking change", "hardcoded archive-db", "port 3001 occupied by staging frontend") and a one-line fix. The test plan covers all five observable outcomes. This is a well-scoped PR with clear traceability from problem to fix.

The POSTGRES_HOST naming could cause confusion: The variable is introduced in docker-compose.observability.yml and obs.env as POSTGRES_HOST, but the main app stack also connects to PostgreSQL. A future operator might wonder whether POSTGRES_HOST affects both the main stack and the obs stack, or only one. The documentation in DEPLOYMENT.md and obs.env makes it clear this is obs-specific, but the variable name itself doesn't signal that. OBS_POSTGRES_HOST would be unambiguous. Low priority — the current documentation is adequate — but worth flagging.

$$ escaping note in .env.example: The comment block explaining that passwords containing $ must use $$ is genuinely useful operator documentation. This is the kind of subtle behavior that causes production incidents ("GlitchTip can't authenticate even though the password looks correct"). Correctly placed in .env.example where an operator setting up the file will see it.

obs.env comment for POSTGRES_HOST is accurate: The comment explains why the value isn't a fixed string and exactly what the values are for staging vs production (based on Compose project name + service name). This prevents the "why doesn't archive-db work in staging?" question.

Gap: The DEPLOYMENT.md update references "§4" for the ops procedure (mkdir -p /opt/familienarchiv/infra etc.) but the section containing that code block isn't explicitly labelled "§4" in the diff. If the section numbering has drifted, the cross-references in the ADR and the runner restart warning would point to the wrong section. Low priority, but worth verifying the section numbers are consistent.

What's done well

ADR-016 captures both the rejected alternatives (server-pull model, separate directory, Docker Swarm configs) with clear rejection rationale — this is exactly what ADRs are for.
The PR test plan is specific and complete: each item maps to exactly one of the three fixes.
The two-source env model is documented with a clear table (Source | What it contains | Managed by) in DEPLOYMENT.md — easy for a future operator to understand at a glance.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** Reviewing from a requirements and documentation completeness perspective: whether the change is traceable, the rationale is captured, the test plan is adequate, and operator-facing documentation leaves no gaps. ### Observations **PR description quality**: The summary is concise and structured. Each of the three fixes has a one-line root cause ("Tempo 2.7.2 schema breaking change", "hardcoded `archive-db`", "port 3001 occupied by staging frontend") and a one-line fix. The test plan covers all five observable outcomes. This is a well-scoped PR with clear traceability from problem to fix. **The `POSTGRES_HOST` naming could cause confusion**: The variable is introduced in `docker-compose.observability.yml` and `obs.env` as `POSTGRES_HOST`, but the main app stack also connects to PostgreSQL. A future operator might wonder whether `POSTGRES_HOST` affects both the main stack and the obs stack, or only one. The documentation in `DEPLOYMENT.md` and `obs.env` makes it clear this is obs-specific, but the variable name itself doesn't signal that. `OBS_POSTGRES_HOST` would be unambiguous. Low priority — the current documentation is adequate — but worth flagging. **`$$` escaping note in `.env.example`**: The comment block explaining that passwords containing `$` must use `$$` is genuinely useful operator documentation. This is the kind of subtle behavior that causes production incidents ("GlitchTip can't authenticate even though the password looks correct"). Correctly placed in `.env.example` where an operator setting up the file will see it. **`obs.env` comment for `POSTGRES_HOST` is accurate**: The comment explains *why* the value isn't a fixed string and exactly what the values are for staging vs production (based on Compose project name + service name). This prevents the "why doesn't `archive-db` work in staging?" question. **Gap**: The `DEPLOYMENT.md` update references "§4" for the ops procedure (`mkdir -p /opt/familienarchiv/infra` etc.) but the section containing that code block isn't explicitly labelled "§4" in the diff. If the section numbering has drifted, the cross-references in the ADR and the runner restart warning would point to the wrong section. Low priority, but worth verifying the section numbers are consistent. ### What's done well - ADR-016 captures both the rejected alternatives (server-pull model, separate directory, Docker Swarm configs) with clear rejection rationale — this is exactly what ADRs are for. - The PR test plan is specific and complete: each item maps to exactly one of the three fixes. - The two-source env model is documented with a clear table (Source | What it contains | Managed by) in `DEPLOYMENT.md` — easy for a future operator to understand at a glance.

marcel added 4 commits 2026-05-16 08:54:41 +02:00

docs(adr): correct ADR-016 Decision section to match two-source env model 4c5ee96e36

The Decision section described an operator-managed /opt/familienarchiv/.env
that CI does not touch. The actual implementation is a two-source model:
obs.env (git-tracked, non-secret config) + obs-secrets.env (CI-written
fresh from Gitea secrets on every deploy). Also updates the Consequences
bullet that incorrectly stated secrets are decoupled from CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): add validate + health assertion steps to release.yml f628ab6435

nightly.yml had two observability gates that release.yml lacked:
- "Validate observability compose config" (docker compose config --quiet)
  catches missing env vars and YAML errors before any containers start
- "Assert observability stack health" checks obs-loki/prometheus/grafana/tempo
  are healthy after up --wait, covering services without healthcheck directives

Mirrors the nightly.yml steps verbatim so the production deploy path is at
least as well-verified as the nightly staging path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): chmod 600 obs-secrets.env after creation in both workflows dec0001bd1

The heredoc creates the file with default umask permissions (644 —
world-readable). Setting 600 immediately after creation prevents other
processes on the host from reading the Grafana, GlitchTip, and Postgres
credentials. Defence-in-depth for the single-tenant VPS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): document POSTGRES_HOST derivation from Compose project name

CI / Unit & Component Tests (pull_request) Successful in 5m38s

Details

CI / OCR Service Tests (pull_request) Successful in 45s

Details

CI / Backend Unit Tests (pull_request) Failing after 10m48s

Details

CI / fail2ban Regex (pull_request) Successful in 2m51s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 2m16s

Details

f5c7be932b

The container names archiv-staging-db-1 and archiv-production-db-1 are
derived from the Compose project name + service name. A project rename
silently breaks the obs stack DB connection. Add a comment at the point
of definition so the dependency is obvious when someone changes it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-16 08:54:58 +02:00

Second-round review concerns addressed

All open concerns from the second review cycle resolved in 4 commits.

Blockers resolved

Tobias + Markus — ADR-016 Decision section contradicts the implementation ✅
The Decision section described an operator-managed /opt/familienarchiv/.env that CI never touches. The actual implementation is the opposite. Corrected to accurately describe the two-source model: obs.env (git-tracked, non-secret config) + obs-secrets.env (CI-written fresh from Gitea secrets on every deploy, passed explicitly via --env-file). The stale Consequences bullet ("secrets decoupled from CI — CI cannot accidentally overwrite them") is replaced with the accurate statement: rotating a Gitea secret takes effect on the next deploy with no manual server action. Commit 4c5ee96e.

Tobias + Sara — release.yml missing "Validate observability compose config" and "Assert observability stack health" steps ✅
Both steps added to release.yml, identical to the nightly.yml equivalents. The validate step (docker compose config --quiet) now runs before up on production deploys, and the health assertion loop checks obs-loki, obs-prometheus, obs-grafana, obs-tempo are healthy after up --wait. "Keep in sync with nightly.yml" comments added at each step to make the duplication intentional and visible. Commit f628ab64.

Non-blocking suggestions resolved

Nora — obs-secrets.env created with world-readable permissions ✅
Added chmod 600 /opt/familienarchiv/obs-secrets.env immediately after the heredoc in both nightly.yml and release.yml. Free defence-in-depth on the single-tenant VPS. Commit dec0001b.

Tobias — POSTGRES_HOST container names are fragile ✅
Added a comment at the POSTGRES_HOST assignment in both workflow files explaining the name is derived from the Compose project name + service name and must be updated on a project rename. Commit f5c7be93.

## Second-round review concerns addressed All open concerns from the second review cycle resolved in 4 commits. ### Blockers resolved **Tobias + Markus — ADR-016 Decision section contradicts the implementation** ✅ The Decision section described an operator-managed `/opt/familienarchiv/.env` that CI never touches. The actual implementation is the opposite. Corrected to accurately describe the two-source model: `obs.env` (git-tracked, non-secret config) + `obs-secrets.env` (CI-written fresh from Gitea secrets on every deploy, passed explicitly via `--env-file`). The stale Consequences bullet ("secrets decoupled from CI — CI cannot accidentally overwrite them") is replaced with the accurate statement: rotating a Gitea secret takes effect on the next deploy with no manual server action. Commit `4c5ee96e`. **Tobias + Sara — `release.yml` missing "Validate observability compose config" and "Assert observability stack health" steps** ✅ Both steps added to `release.yml`, identical to the `nightly.yml` equivalents. The validate step (`docker compose config --quiet`) now runs before `up` on production deploys, and the health assertion loop checks `obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo` are `healthy` after `up --wait`. "Keep in sync with nightly.yml" comments added at each step to make the duplication intentional and visible. Commit `f628ab64`. ### Non-blocking suggestions resolved **Nora — `obs-secrets.env` created with world-readable permissions** ✅ Added `chmod 600 /opt/familienarchiv/obs-secrets.env` immediately after the heredoc in both `nightly.yml` and `release.yml`. Free defence-in-depth on the single-tenant VPS. Commit `dec0001b`. **Tobias — POSTGRES_HOST container names are fragile** ✅ Added a comment at the `POSTGRES_HOST` assignment in both workflow files explaining the name is derived from the Compose project name + service name and must be updated on a project rename. Commit `f5c7be93`.

marcel commented

2026-05-16 08:57:46 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

This PR makes the right architectural call: stable host path → bind mounts survive workspace wipes. The two-source env model (git-tracked obs.env + CI-written obs-secrets.env) is clean and auditable. Good work overall.

Blockers

Unquoted heredoc with Gitea secrets — can corrupt secret values

In both nightly.yml and release.yml, obs-secrets.env is written via an unquoted heredoc (<<EOF):

cat > /opt/familienarchiv/obs-secrets.env <<EOF
GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }}
...
EOF

Gitea expands ${{ secrets.X }} at template-render time before the shell runs. If a secret value contains $ (common in strong passwords — e.g. P@$s5w0rd), the shell's heredoc will try to expand $s5w0rd as a shell variable — silently truncating the value. Result: the service starts with the wrong password and fails to connect to the DB.

Fix: Quote the heredoc delimiter to suppress all shell expansion:

cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
...
EOF

<<'EOF' writes the Gitea-substituted value literally, without further shell expansion. Apply to both nightly.yml and release.yml.

Suggestions

cp -r doesn't clean deleted files (acknowledged in ADR-016) — if an obs config file is removed from git, it lives on in /opt/familienarchiv/infra/observability/ until manually deleted. Worth adding a one-liner runbook note in DEPLOYMENT.md §4:

# After removing a config file from git, clean up the permanent copy:
rm /opt/familienarchiv/infra/observability/<path-to-removed-file>

What's done well:

chmod 600 /opt/familienarchiv/obs-secrets.env — correct permissions on the secrets file.
Validate step (docker compose config --quiet) before up is a solid practice — catches undefined var substitutions before containers start.
Assert health step explicitly checks the four services that lack healthchecks in the compose file. The --wait gap was real and this fixes it correctly.
Removing obs env vars from the main app .env block in both workflows is the right cleanup.
ADR-016 is complete and honest about the blast-radius trade-off.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** This PR makes the right architectural call: stable host path → bind mounts survive workspace wipes. The two-source env model (git-tracked `obs.env` + CI-written `obs-secrets.env`) is clean and auditable. Good work overall. --- ### Blockers **Unquoted heredoc with Gitea secrets — can corrupt secret values** In both `nightly.yml` and `release.yml`, `obs-secrets.env` is written via an unquoted heredoc (`<<EOF`): ```yaml cat > /opt/familienarchiv/obs-secrets.env <<EOF GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }} GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }} POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }} ... EOF ``` Gitea expands `${{ secrets.X }}` at template-render time before the shell runs. If a secret value contains `$` (common in strong passwords — e.g. `P@$s5w0rd`), the shell's heredoc will try to expand `$s5w0rd` as a shell variable — silently truncating the value. Result: the service starts with the wrong password and fails to connect to the DB. **Fix:** Quote the heredoc delimiter to suppress all shell expansion: ```yaml cat > /opt/familienarchiv/obs-secrets.env <<'EOF' GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }} ... EOF ``` `<<'EOF'` writes the Gitea-substituted value literally, without further shell expansion. Apply to both `nightly.yml` and `release.yml`. --- ### Suggestions **`cp -r` doesn't clean deleted files** (acknowledged in ADR-016) — if an obs config file is removed from git, it lives on in `/opt/familienarchiv/infra/observability/` until manually deleted. Worth adding a one-liner runbook note in `DEPLOYMENT.md §4`: ```bash # After removing a config file from git, clean up the permanent copy: rm /opt/familienarchiv/infra/observability/<path-to-removed-file> ``` **What's done well:** - `chmod 600 /opt/familienarchiv/obs-secrets.env` — correct permissions on the secrets file. - Validate step (`docker compose config --quiet`) before `up` is a solid practice — catches undefined var substitutions before containers start. - Assert health step explicitly checks the four services that lack healthchecks in the compose file. The `--wait` gap was real and this fixes it correctly. - Removing obs env vars from the main app `.env` block in both workflows is the right cleanup. - ADR-016 is complete and honest about the blast-radius trade-off.

marcel commented

2026-05-16 08:58:00 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

The security posture of this PR is generally solid, but there is one shell injection risk in the CI workflows worth fixing before merge.

Blockers

CWE-78 / Shell variable expansion in unquoted heredoc — secrets silently mangled

Both nightly.yml and release.yml write obs-secrets.env using an unquoted heredoc delimiter (<<EOF). Gitea substitutes ${{ secrets.X }} into the script before the shell runs. If a Gitea secret value contains $, the shell heredoc will perform a second expansion:

# Secret value: "P@$secret_key_42"
# After Gitea render:
GLITCHTIP_SECRET_KEY=P@$secret_key_42
# Shell expands $secret_key_42 → "" (undefined var)
# Written to file: GLITCHTIP_SECRET_KEY=P@

This silently corrupts the secret at rest — GlitchTip would start with a broken SECRET_KEY and invalidate all existing sessions. Severity: Medium (integrity / availability, not confidentiality, since the attacker would need access to the file to read it).

Fix — quote the heredoc delimiter:

cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
...
EOF
chmod 600 /opt/familienarchiv/obs-secrets.env

Single-quoted 'EOF' tells the shell to skip all variable expansion inside the heredoc body. Gitea's ${{ }} expansion has already happened at the template layer, so the secret values are written verbatim. Apply to both workflow files.

Observations (non-blocking)

chmod 600 on obs-secrets.env — correct. The file contains three sensitive values (Grafana admin password, GlitchTip secret key, Postgres password). chmod 600 ensures only the runner's process user can read it.

Blast radius of /opt/familienarchiv mount — acknowledged. The ADR-016 negative consequence is honest: a compromised workflow step could overwrite app compose files, Caddy config, or the secrets file itself. The runner-config.yaml security comment update accurately describes this. Acceptable for a single-tenant, private-repo runner.

No credential leakage through logs. The Gitea ${{ secrets.X }} substitution is masked in CI logs. The heredoc body (containing the expanded values) is not echoed. No set -x in the secret-writing steps. Clean.

Tempo schema fix (tempo.yml - removing duplicate processors block) is purely config — no security implications.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** The security posture of this PR is generally solid, but there is one shell injection risk in the CI workflows worth fixing before merge. --- ### Blockers **CWE-78 / Shell variable expansion in unquoted heredoc — secrets silently mangled** Both `nightly.yml` and `release.yml` write `obs-secrets.env` using an unquoted heredoc delimiter (`<<EOF`). Gitea substitutes `${{ secrets.X }}` into the script *before* the shell runs. If a Gitea secret value contains `$`, the shell heredoc will perform a second expansion: ```bash # Secret value: "P@$secret_key_42" # After Gitea render: GLITCHTIP_SECRET_KEY=P@$secret_key_42 # Shell expands $secret_key_42 → "" (undefined var) # Written to file: GLITCHTIP_SECRET_KEY=P@ ``` This silently corrupts the secret at rest — GlitchTip would start with a broken `SECRET_KEY` and invalidate all existing sessions. Severity: Medium (integrity / availability, not confidentiality, since the attacker would need access to the file to read it). **Fix — quote the heredoc delimiter:** ```yaml cat > /opt/familienarchiv/obs-secrets.env <<'EOF' GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }} GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }} ... EOF chmod 600 /opt/familienarchiv/obs-secrets.env ``` Single-quoted `'EOF'` tells the shell to skip all variable expansion inside the heredoc body. Gitea's `${{ }}` expansion has already happened at the template layer, so the secret values are written verbatim. Apply to both workflow files. --- ### Observations (non-blocking) **`chmod 600` on `obs-secrets.env` — correct.** The file contains three sensitive values (Grafana admin password, GlitchTip secret key, Postgres password). `chmod 600` ensures only the runner's process user can read it. **Blast radius of `/opt/familienarchiv` mount — acknowledged.** The ADR-016 negative consequence is honest: a compromised workflow step could overwrite app compose files, Caddy config, or the secrets file itself. The runner-config.yaml security comment update accurately describes this. Acceptable for a single-tenant, private-repo runner. **No credential leakage through logs.** The Gitea `${{ secrets.X }}` substitution is masked in CI logs. The heredoc body (containing the expanded values) is not echoed. No `set -x` in the secret-writing steps. Clean. **Tempo schema fix** (`tempo.yml` - removing duplicate `processors` block) is purely config — no security implications.

marcel commented

2026-05-16 08:58:13 +02:00

🏛️ Markus Keller — Senior Application Architect

Verdict: ✅ Approved

The architectural decision documented in ADR-016 is sound given the constraints (no new infrastructure dependencies, single-tenant runner, DooD setup). The co-location at /opt/familienarchiv/ is the simplest path that satisfies all four decision drivers, and the alternatives section correctly rejects the alternatives.

Architecture Assessment

Two-source env model is well-scoped. infra/observability/obs.env (git-tracked, no secrets) + /opt/familienarchiv/obs-secrets.env (CI-written, secrets only) separates concerns cleanly. Non-secret config goes through PR review; secrets are rotated via Gitea without touching git. This is the right boundary.

CI-push model mirrors the main app stack pattern. The nightly and release workflows already push the main app compose files via a similar mechanism. Extending this pattern to the obs stack avoids introducing a second deployment mechanism (server-pull, systemd timer, etc.). Boring is good.

GF_SERVER_ROOT_URL addition is necessary. Without it, Grafana generates relative URLs in alert emails and OAuth redirects, which break when accessed through Caddy's reverse proxy. This was a latent bug in the original compose file.

Documentation Checklist

PR contains	Required doc update	Status
New Docker service behavior	`docs/architecture/c4/l2-containers.puml`	✅ Updated — path reference corrected
New infrastructure component pattern	`docs/DEPLOYMENT.md`	✅ Updated — §4 documents the permanent directory setup and manual ops procedure
Architectural decision with lasting consequences	New ADR in `docs/adr/`	✅ ADR-016 added

All required documentation updates are present. No missed updates.

ADR-016 Quality

The ADR follows the established project format (Context, Decision, Consequences). The "Negative" consequences section is unusually honest about the blast radius trade-off — that's the right call. The cp -r / no-delete limitation is the main operational footgun and is correctly flagged. Worth adding a manual cleanup note in DEPLOYMENT.md §4 for completeness (see Tobias's comment), but not a blocker.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** The architectural decision documented in ADR-016 is sound given the constraints (no new infrastructure dependencies, single-tenant runner, DooD setup). The co-location at `/opt/familienarchiv/` is the simplest path that satisfies all four decision drivers, and the alternatives section correctly rejects the alternatives. --- ### Architecture Assessment **Two-source env model is well-scoped.** `infra/observability/obs.env` (git-tracked, no secrets) + `/opt/familienarchiv/obs-secrets.env` (CI-written, secrets only) separates concerns cleanly. Non-secret config goes through PR review; secrets are rotated via Gitea without touching git. This is the right boundary. **CI-push model mirrors the main app stack pattern.** The nightly and release workflows already push the main app compose files via a similar mechanism. Extending this pattern to the obs stack avoids introducing a second deployment mechanism (server-pull, systemd timer, etc.). Boring is good. **`GF_SERVER_ROOT_URL` addition is necessary.** Without it, Grafana generates relative URLs in alert emails and OAuth redirects, which break when accessed through Caddy's reverse proxy. This was a latent bug in the original compose file. --- ### Documentation Checklist | PR contains | Required doc update | Status | |---|---|---| | New Docker service behavior | `docs/architecture/c4/l2-containers.puml` | ✅ Updated — path reference corrected | | New infrastructure component pattern | `docs/DEPLOYMENT.md` | ✅ Updated — §4 documents the permanent directory setup and manual ops procedure | | Architectural decision with lasting consequences | New ADR in `docs/adr/` | ✅ ADR-016 added | All required documentation updates are present. No missed updates. --- ### ADR-016 Quality The ADR follows the established project format (Context, Decision, Consequences). The "Negative" consequences section is unusually honest about the blast radius trade-off — that's the right call. The `cp -r` / no-delete limitation is the main operational footgun and is correctly flagged. Worth adding a manual cleanup note in `DEPLOYMENT.md §4` for completeness (see Tobias's comment), but not a blocker.

marcel commented

2026-05-16 08:58:23 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

This PR touches infrastructure and CI only — no application code, no Svelte components, no Spring Boot services. There's nothing for me to flag on the code quality front. A few observations from a developer perspective:

What I Checked

Bash scripts in CI steps: Readable and intention-revealing. The health-assert loop uses set -e correctly and accumulates failures before exiting — avoiding the "exits on first failure, you don't know what else is broken" problem. Variable names (unhealthy, svc, status) are clear.
obs.env file: Comments explain why values are set, not what the keys mean. That's the right level of comment.
tempo.yml removal of the duplicate processors block: Straightforward schema compliance fix. The comment at the top of the file (# Tempo HTTP API is unauthenticated) is appropriate — it's a non-obvious security posture decision that a reader needs to know.
docker-compose.observability.yml changes: Three substitution sites (DATABASE_URL, worker env, psql -h) are consistent. No site was missed. The -h ${POSTGRES_HOST:-archive-db} pattern correctly preserves the production default while allowing override.

LGTM

No code quality issues. The shell scripts are clean, the env file separation is sensible, and the docs are updated. Ship it once the heredoc quoting concern Tobias raised is resolved.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** This PR touches infrastructure and CI only — no application code, no Svelte components, no Spring Boot services. There's nothing for me to flag on the code quality front. A few observations from a developer perspective: --- ### What I Checked - **Bash scripts in CI steps:** Readable and intention-revealing. The health-assert loop uses `set -e` correctly and accumulates failures before exiting — avoiding the "exits on first failure, you don't know what else is broken" problem. Variable names (`unhealthy`, `svc`, `status`) are clear. - **`obs.env` file:** Comments explain *why* values are set, not *what* the keys mean. That's the right level of comment. - **`tempo.yml` removal of the duplicate `processors` block:** Straightforward schema compliance fix. The comment at the top of the file (`# Tempo HTTP API is unauthenticated`) is appropriate — it's a non-obvious security posture decision that a reader needs to know. - **docker-compose.observability.yml changes:** Three substitution sites (`DATABASE_URL`, worker env, `psql -h`) are consistent. No site was missed. The `-h ${POSTGRES_HOST:-archive-db}` pattern correctly preserves the production default while allowing override. --- ### LGTM No code quality issues. The shell scripts are clean, the env file separation is sensible, and the docs are updated. Ship it once the heredoc quoting concern Tobias raised is resolved.

marcel commented

2026-05-16 08:58:39 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ⚠️ Approved with concerns

No application test changes in this PR (expected for a CI/infra fix). I'm reviewing the CI pipeline itself as the test artifact here.

What's Good

Validate observability compose config step — dry-run docker compose config --quiet before up is a proper preflight gate. It catches undefined variable substitutions, YAML parse errors, and missing required keys before any container starts. This is the right approach and it was missing before.

Assert observability stack health step — this closes a real gap. docker compose up --wait only waits for services with healthcheck directives. The four services without healthchecks (obs-promtail, obs-cadvisor, obs-node-exporter, obs-glitchtip-worker) are correctly noted, and the four that do have healthchecks (obs-loki, obs-prometheus, obs-grafana, obs-tempo) are explicitly asserted before the smoke test proceeds. Good.

The loop pattern is deterministic: accumulate failures into $unhealthy, exit non-zero only after checking all four. This gives a full failure report rather than aborting on the first unhealthy service.

Concerns

No assert step for worker services. obs-glitchtip-worker has no healthcheck directive and is not checked in the assert step. The comment correctly acknowledges it. However, a failed worker (e.g. due to a bad DATABASE_URL) would pass the CI gate silently — you'd only discover it when GlitchTip fails to process error reports in production.

A minimal worker check (verify it's running, not just started) could be:

status=$(docker inspect "obs-glitchtip-worker" --format '{{.State.Status}}' 2>/dev/null || echo "missing")
if [ "$status" != "running" ]; then echo "::warning::obs-glitchtip-worker is not running"; fi

Not a blocker (it's pre-existing and out of scope for this PR), but worth a follow-up issue.

nightly.yml and release.yml diverge only in POSTGRES_HOST and POSTGRES_PASSWORD secret name. The duplicate validate + assert step bodies are identical. The comments say "Keep in sync with the equivalent step in nightly.yml." This is a maintenance risk — if the assert step is updated in one file but not the other, CI diverges silently. A composite action would eliminate the sync burden, but that's a cleanup issue, not a blocker here.

Test Plan Coverage

The PR's test plan is concrete and verifiable:

✅ docker compose config validates — covered by the new validate step in CI
✅ Tempo starts healthy — covered by the health assert step
✅ GlitchTip connects to Postgres with POSTGRES_HOST override — manually verifiable via the configurable default
✅ GlitchTip connects to archive-db by default — docker-compose default ${POSTGRES_HOST:-archive-db} preserves this
✅ Grafana binds to port 3003 — confirmed by compose file change

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ⚠️ Approved with concerns** No application test changes in this PR (expected for a CI/infra fix). I'm reviewing the CI pipeline itself as the test artifact here. --- ### What's Good **Validate observability compose config step** — dry-run `docker compose config --quiet` before `up` is a proper preflight gate. It catches undefined variable substitutions, YAML parse errors, and missing required keys before any container starts. This is the right approach and it was missing before. **Assert observability stack health step** — this closes a real gap. `docker compose up --wait` only waits for services with healthcheck directives. The four services without healthchecks (`obs-promtail`, `obs-cadvisor`, `obs-node-exporter`, `obs-glitchtip-worker`) are correctly noted, and the four that do have healthchecks (`obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`) are explicitly asserted before the smoke test proceeds. Good. The loop pattern is deterministic: accumulate failures into `$unhealthy`, exit non-zero only after checking all four. This gives a full failure report rather than aborting on the first unhealthy service. --- ### Concerns **No assert step for worker services.** `obs-glitchtip-worker` has no healthcheck directive and is not checked in the assert step. The comment correctly acknowledges it. However, a failed worker (e.g. due to a bad `DATABASE_URL`) would pass the CI gate silently — you'd only discover it when GlitchTip fails to process error reports in production. A minimal worker check (verify it's running, not just started) could be: ```bash status=$(docker inspect "obs-glitchtip-worker" --format '{{.State.Status}}' 2>/dev/null || echo "missing") if [ "$status" != "running" ]; then echo "::warning::obs-glitchtip-worker is not running"; fi ``` Not a blocker (it's pre-existing and out of scope for this PR), but worth a follow-up issue. **`nightly.yml` and `release.yml` diverge only in `POSTGRES_HOST` and `POSTGRES_PASSWORD` secret name.** The duplicate validate + assert step bodies are identical. The comments say "Keep in sync with the equivalent step in nightly.yml." This is a maintenance risk — if the assert step is updated in one file but not the other, CI diverges silently. A composite action would eliminate the sync burden, but that's a cleanup issue, not a blocker here. --- ### Test Plan Coverage The PR's test plan is concrete and verifiable: - ✅ `docker compose config` validates — covered by the new validate step in CI - ✅ Tempo starts healthy — covered by the health assert step - ✅ GlitchTip connects to Postgres with POSTGRES_HOST override — manually verifiable via the configurable default - ✅ GlitchTip connects to `archive-db` by default — docker-compose default `${POSTGRES_HOST:-archive-db}` preserves this - ✅ Grafana binds to port 3003 — confirmed by compose file change

marcel commented

2026-05-16 08:58:53 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

This PR closes three distinct bugs discovered during the issue #601 migration. All three are clearly specified in the PR description with root cause, fix, and verifiable outcome. From a requirements standpoint, this is well-scoped and implementation-ready.

Requirements Traceability

Bug	Root cause stated	Fix stated	Verifiable outcome
Tempo `processors` schema error	`metrics_generator.processors` moved under `overrides` in Tempo 2.7.2	Removed duplicate top-level block	Tempo starts without `field processors not found` error
GlitchTip hardcoded `archive-db` hostname	DNS resolution fails when only staging stack is running	`${POSTGRES_HOST:-archive-db}` variable substitution	GlitchTip connects to DB when `POSTGRES_HOST` set in env
Grafana port 3001 conflict	Port 3001 occupied by staging frontend	Default changed to 3003	Grafana binds successfully; `Bind for 127.0.0.1:3001 failed` gone

All three fixes are consistent between code, documentation (DEPLOYMENT.md), env template (.env.example), and the new obs.env config file.

Acceptance Criteria Assessment

The PR test plan maps one-to-one to verifiable system behaviors. No ambiguous terms ("better", "faster", "improved") appear. Each checklist item is a concrete, testable state.

Non-Functional Requirements

Operational continuity: The permanent-path approach (/opt/familienarchiv/) satisfies the unstated NFR that the obs stack survives CI workspace wipes. ADR-016 documents this explicitly.
Secrets management: The two-source env model satisfies the NFR that secrets must not be committed to git while still allowing config-as-code for non-secret values.
Default backward-compatibility: ${POSTGRES_HOST:-archive-db} preserves the production default — no operator action required on existing deployments.

Open Questions

None remaining. ADR-016's open question register is empty and the scope is tightly bounded to the three bugs. The cp -r / stale-file limitation is acknowledged in ADR-016 and is an acceptable trade-off at this project's change frequency.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** This PR closes three distinct bugs discovered during the issue #601 migration. All three are clearly specified in the PR description with root cause, fix, and verifiable outcome. From a requirements standpoint, this is well-scoped and implementation-ready. --- ### Requirements Traceability | Bug | Root cause stated | Fix stated | Verifiable outcome | |---|---|---|---| | Tempo `processors` schema error | `metrics_generator.processors` moved under `overrides` in Tempo 2.7.2 | Removed duplicate top-level block | Tempo starts without `field processors not found` error | | GlitchTip hardcoded `archive-db` hostname | DNS resolution fails when only staging stack is running | `${POSTGRES_HOST:-archive-db}` variable substitution | GlitchTip connects to DB when `POSTGRES_HOST` set in env | | Grafana port 3001 conflict | Port 3001 occupied by staging frontend | Default changed to 3003 | Grafana binds successfully; `Bind for 127.0.0.1:3001 failed` gone | All three fixes are consistent between code, documentation (`DEPLOYMENT.md`), env template (`.env.example`), and the new `obs.env` config file. --- ### Acceptance Criteria Assessment The PR test plan maps one-to-one to verifiable system behaviors. No ambiguous terms ("better", "faster", "improved") appear. Each checklist item is a concrete, testable state. --- ### Non-Functional Requirements - **Operational continuity:** The permanent-path approach (`/opt/familienarchiv/`) satisfies the unstated NFR that the obs stack survives CI workspace wipes. ADR-016 documents this explicitly. - **Secrets management:** The two-source env model satisfies the NFR that secrets must not be committed to git while still allowing config-as-code for non-secret values. - **Default backward-compatibility:** `${POSTGRES_HOST:-archive-db}` preserves the production default — no operator action required on existing deployments. --- ### Open Questions None remaining. ADR-016's open question register is empty and the scope is tightly bounded to the three bugs. The `cp -r` / stale-file limitation is acknowledged in ADR-016 and is an acceptable trade-off at this project's change frequency.

marcel commented

2026-05-16 08:58:58 +02:00

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Verdict: ✅ Approved

This PR modifies CI workflows, Docker Compose configuration, environment files, infrastructure config, and ADR documentation. There are no frontend components, Svelte files, or UI-facing changes in scope.

I checked the diff for any inadvertent frontend impact (e.g. changes to environment variables that affect the SvelteKit frontend or its build):

PORT_GRAFANA change from 3001 → 3003: Grafana is an internal tool only — not user-facing, not part of the SvelteKit frontend. No UI impact.
GF_SERVER_ROOT_URL addition: Grafana-internal config for alert email links. Not user-facing.
POSTGRES_HOST variable: Backend infrastructure only.

LGTM from a UX/UI and accessibility perspective. Nothing to flag.

## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist **Verdict: ✅ Approved** This PR modifies CI workflows, Docker Compose configuration, environment files, infrastructure config, and ADR documentation. There are no frontend components, Svelte files, or UI-facing changes in scope. I checked the diff for any inadvertent frontend impact (e.g. changes to environment variables that affect the SvelteKit frontend or its build): - `PORT_GRAFANA` change from 3001 → 3003: Grafana is an internal tool only — not user-facing, not part of the SvelteKit frontend. No UI impact. - `GF_SERVER_ROOT_URL` addition: Grafana-internal config for alert email links. Not user-facing. - `POSTGRES_HOST` variable: Backend infrastructure only. **LGTM from a UX/UI and accessibility perspective.** Nothing to flag.

marcel added 3 commits 2026-05-16 09:05:02 +02:00

ci(obs): quote heredoc delimiter in nightly obs-secrets.env write 9662ff5f8c

Prevents shell from expanding '$' in Gitea-rendered secret values.
Without the quote, a password like 'P@$s5w0rd' has '$s5w0rd' silently
expanded to '' — writing a truncated value to obs-secrets.env.
'<<'EOF'' suppresses shell expansion; Gitea's '${{ }}' template
rendering already ran before the shell sees the script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci(obs): quote heredoc delimiter in release obs-secrets.env write 25062be657

Same fix as nightly.yml: prevents shell expansion of '$' in secret
values after Gitea renders them. Keep in sync with nightly.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(obs): add cp-r stale-file cleanup note to DEPLOYMENT.md

CI / Unit & Component Tests (pull_request) Successful in 5m39s

Details

CI / OCR Service Tests (pull_request) Successful in 46s

Details

CI / Backend Unit Tests (pull_request) Failing after 9m24s

Details

CI / fail2ban Regex (pull_request) Successful in 2m52s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 2m24s

Details

a7f60ebed8

CI uses 'cp -r' which does not remove deleted files. Documents the
manual cleanup step for config files removed from git.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-16 09:05:13 +02:00

Third-round review concerns addressed

All open concerns from the third review cycle resolved in 3 commits.

Blockers resolved

Tobias + Nora — Unquoted heredoc in nightly.yml and release.yml silently mangles $-containing secrets ✅

Both obs-secrets.env heredocs changed from <<EOF to <<'EOF'. Single-quoted delimiter suppresses shell variable expansion inside the heredoc body. Gitea's ${{ secrets.X }} template rendering already happened before the shell runs, so secret values are written verbatim — a password like P@$s5w0rd no longer has $s5w0rd silently expanded to an empty string. Applied to:

nightly.yml — commit 9662ff5f
release.yml — commit 25062be6

Suggestion resolved

Tobias — Add manual cleanup runbook note for cp -r stale files ✅

Added a note in DEPLOYMENT.md immediately after the manual start/restart command block, explaining that cp -r does not remove deleted files and documenting the manual rm step for config files removed from the repo. Commit a7f60ebe.

## Third-round review concerns addressed All open concerns from the third review cycle resolved in 3 commits. ### Blockers resolved **Tobias + Nora — Unquoted heredoc in `nightly.yml` and `release.yml` silently mangles `$`-containing secrets** ✅ Both `obs-secrets.env` heredocs changed from `<<EOF` to `<<'EOF'`. Single-quoted delimiter suppresses shell variable expansion inside the heredoc body. Gitea's `${{ secrets.X }}` template rendering already happened before the shell runs, so secret values are written verbatim — a password like `P@$s5w0rd` no longer has `$s5w0rd` silently expanded to an empty string. Applied to: - `nightly.yml` — commit `9662ff5f` - `release.yml` — commit `25062be6` ### Suggestion resolved **Tobias — Add manual cleanup runbook note for `cp -r` stale files** ✅ Added a note in `DEPLOYMENT.md` immediately after the manual start/restart command block, explaining that `cp -r` does not remove deleted files and documenting the manual `rm` step for config files removed from the repo. Commit `a7f60ebe`.

marcel commented

2026-05-16 09:08:57 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ✅ Approved

The final blocker (unquoted heredoc) is fixed. <<'EOF' is correct — Gitea renders ${{ secrets.X }} before the shell runs, so single-quoting the delimiter suppresses the second shell expansion without losing the secret values. chmod 600 is in place. Cleanup note in DEPLOYMENT.md added. This is merge-ready.

What's done well

Two-source env model is clean: non-secret config in git (obs.env), credentials freshly written from Gitea secrets each deploy (obs-secrets.env). No operator-managed file to drift.
Named volumes (grafana_data, loki_data, prometheus_data, tempo_data) are Docker-managed and untouched by the copy step — no data migration risk.
All image tags remain pinned. No :latest introduced.
Validate step (docker compose config --quiet) before up is the right pattern. Catches undefined variables and YAML errors before containers start.
Health assertion step in both workflows now. The accumulate-then-exit pattern gives a full failure picture before exiting.

Deployment maintainability suggestions (no blockers)

1. Replace cp -r with rsync -a --delete

The stale-file limitation is documented in both ADR-016 and DEPLOYMENT.md, but it's a one-command fix that eliminates the problem class entirely:

# Instead of:
cp -r infra/observability /opt/familienarchiv/infra/

# Use:
rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/

Now that cp appears in two workflow files and a runbook, fixing it later costs three edits instead of one. Low priority given the current config change frequency — but cheapest to fix here.

2. Wrap the manual start command in a script or Makefile target

The manual restart command in DEPLOYMENT.md is four lines with absolute paths and two --env-file flags. An operator copying it under pressure can easily typo a path:

# scripts/obs-up.sh (or a Makefile obs-up target)
#!/bin/sh
docker compose \
  -f /opt/familienarchiv/docker-compose.observability.yml \
  --env-file /opt/familienarchiv/infra/observability/obs.env \
  --env-file /opt/familienarchiv/obs-secrets.env \
  up -d --wait --remove-orphans

Operators copy the script name, not four lines of flags. The script also stays in sync automatically since it lives in the repo.

3. POSTGRES_USER=archiv hardcoded in the secrets heredoc

POSTGRES_USER is not a secret — it's the application's Postgres role name, as predictable as archive-db. It's currently written into obs-secrets.env alongside the actual secrets. If the role is ever renamed, this is a third place to update (in addition to docker-compose.observability.yml and the main .env.* files). Moving it to obs.env alongside POSTGRES_HOST=archive-db keeps the secrets file to credentials only.

4. Composite action for the duplicated obs deploy steps

The "Keep in sync with nightly.yml / release.yml" comments are better than nothing, but the class of failure (one workflow updated, the other not) is only prevented by the next reviewer's attention. A Gitea composite action at .gitea/actions/deploy-obs/action.yaml (accepting env: staging|production, postgres_password:, etc. as inputs) would make "updated in one place" structurally impossible. Acceptable at current change frequency; the right answer before the obs deploy procedure grows further.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** The final blocker (unquoted heredoc) is fixed. `<<'EOF'` is correct — Gitea renders `${{ secrets.X }}` before the shell runs, so single-quoting the delimiter suppresses the second shell expansion without losing the secret values. `chmod 600` is in place. Cleanup note in DEPLOYMENT.md added. This is merge-ready. ### What's done well - Two-source env model is clean: non-secret config in git (`obs.env`), credentials freshly written from Gitea secrets each deploy (`obs-secrets.env`). No operator-managed file to drift. - Named volumes (`grafana_data`, `loki_data`, `prometheus_data`, `tempo_data`) are Docker-managed and untouched by the copy step — no data migration risk. - All image tags remain pinned. No `:latest` introduced. - Validate step (`docker compose config --quiet`) before `up` is the right pattern. Catches undefined variables and YAML errors before containers start. - Health assertion step in both workflows now. The accumulate-then-exit pattern gives a full failure picture before exiting. --- ### Deployment maintainability suggestions (no blockers) **1. Replace `cp -r` with `rsync -a --delete`** The stale-file limitation is documented in both ADR-016 and DEPLOYMENT.md, but it's a one-command fix that eliminates the problem class entirely: ```yaml # Instead of: cp -r infra/observability /opt/familienarchiv/infra/ # Use: rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/ ``` Now that `cp` appears in two workflow files and a runbook, fixing it later costs three edits instead of one. Low priority given the current config change frequency — but cheapest to fix here. **2. Wrap the manual start command in a script or Makefile target** The manual restart command in DEPLOYMENT.md is four lines with absolute paths and two `--env-file` flags. An operator copying it under pressure can easily typo a path: ```bash # scripts/obs-up.sh (or a Makefile obs-up target) #!/bin/sh docker compose \ -f /opt/familienarchiv/docker-compose.observability.yml \ --env-file /opt/familienarchiv/infra/observability/obs.env \ --env-file /opt/familienarchiv/obs-secrets.env \ up -d --wait --remove-orphans ``` Operators copy the script name, not four lines of flags. The script also stays in sync automatically since it lives in the repo. **3. `POSTGRES_USER=archiv` hardcoded in the secrets heredoc** `POSTGRES_USER` is not a secret — it's the application's Postgres role name, as predictable as `archive-db`. It's currently written into `obs-secrets.env` alongside the actual secrets. If the role is ever renamed, this is a third place to update (in addition to `docker-compose.observability.yml` and the main `.env.*` files). Moving it to `obs.env` alongside `POSTGRES_HOST=archive-db` keeps the secrets file to credentials only. **4. Composite action for the duplicated obs deploy steps** The "Keep in sync with nightly.yml / release.yml" comments are better than nothing, but the class of failure (one workflow updated, the other not) is only prevented by the next reviewer's attention. A Gitea composite action at `.gitea/actions/deploy-obs/action.yaml` (accepting `env: staging|production`, `postgres_password:`, etc. as inputs) would make "updated in one place" structurally impossible. Acceptable at current change frequency; the right answer before the obs deploy procedure grows further.

marcel commented

2026-05-16 09:09:13 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ✅ Approved

Documentation obligations are fully met: ADR-016 is present and accurate, docs/architecture/c4/l2-containers.puml reflects the new operational path, DEPLOYMENT.md has the full ops procedure with the two-source env model table. The architectural choice (CI-push to /opt/familienarchiv/, co-located with the main stack) is sound — it reflects the real coupling (shared archiv-net, shared PostgreSQL) and requires no new infrastructure. Boring and correct.

Architecture assessment

The Decision section of ADR-016 now correctly describes the two-source model — no longer contradicts the implementation. The rejected alternatives are evaluated with genuine rationale, not strawmanned. The cp -r / no-delete limitation is the main operational footgun and is correctly flagged in Consequences.

Deployment maintainability observations (no blockers)

1. obs.env carries POSTGRES_HOST=archive-db — understanding the precedence matters

obs.env sets POSTGRES_HOST=archive-db as the default. In CI, obs-secrets.env overrides this with the environment-specific value (archiv-staging-db-1 / archiv-production-db-1). The file loaded last wins — and obs-secrets.env is listed second in both workflow --env-file chains, so override precedence is correct.

Worth confirming: if these --env-file flags were accidentally swapped, POSTGRES_HOST=archive-db from obs.env would silently win over the CI override, and GlitchTip would fail to connect in both environments. The current order is right; document the ordering contract explicitly in the validate step comment so a future refactor doesn't reverse it.

2. Variable naming ambiguity: POSTGRES_HOST vs OBS_POSTGRES_HOST

POSTGRES_HOST is obs-specific but the name doesn't signal that. The two-source model keeps obs vars isolated (they're only passed to the obs compose command), so there's no actual collision risk today. But if the main app compose ever needed a similar override, the name would conflict. OBS_POSTGRES_HOST is unambiguous at zero implementation cost — worth considering during any future obs-vars cleanup.

3. Docker network dependency is implicit but unavoidable

The obs stack joins archiv-net, which is created by the main app stack. If the main stack is not running when obs starts, Docker emits a clear error ("network archiv-net not found"). This is not a gap introduced by this PR — the compose file already declares external: true for the network. Confirming this is handled correctly.

4. If /opt/familienarchiv/ ever relocates, three places change

The ADR documents the runner restart requirement and cp-r limitation in Consequences, but not the /opt/familienarchiv/ relocation footgun: if the permanent directory moves, you must update runner-config.yaml valid_volumes, both workflow files, and DEPLOYMENT.md §4. A one-sentence Consequences bullet noting this would complete the picture for a future incident responder.

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** Documentation obligations are fully met: ADR-016 is present and accurate, `docs/architecture/c4/l2-containers.puml` reflects the new operational path, `DEPLOYMENT.md` has the full ops procedure with the two-source env model table. The architectural choice (CI-push to `/opt/familienarchiv/`, co-located with the main stack) is sound — it reflects the real coupling (shared `archiv-net`, shared PostgreSQL) and requires no new infrastructure. Boring and correct. ### Architecture assessment The Decision section of ADR-016 now correctly describes the two-source model — no longer contradicts the implementation. The rejected alternatives are evaluated with genuine rationale, not strawmanned. The `cp -r` / no-delete limitation is the main operational footgun and is correctly flagged in Consequences. --- ### Deployment maintainability observations (no blockers) **1. `obs.env` carries `POSTGRES_HOST=archive-db` — understanding the precedence matters** `obs.env` sets `POSTGRES_HOST=archive-db` as the default. In CI, `obs-secrets.env` overrides this with the environment-specific value (`archiv-staging-db-1` / `archiv-production-db-1`). The file loaded last wins — and `obs-secrets.env` is listed second in both workflow `--env-file` chains, so override precedence is correct. Worth confirming: if these `--env-file` flags were accidentally swapped, `POSTGRES_HOST=archive-db` from `obs.env` would silently win over the CI override, and GlitchTip would fail to connect in both environments. The current order is right; document the ordering contract explicitly in the validate step comment so a future refactor doesn't reverse it. **2. Variable naming ambiguity: `POSTGRES_HOST` vs `OBS_POSTGRES_HOST`** `POSTGRES_HOST` is obs-specific but the name doesn't signal that. The two-source model keeps obs vars isolated (they're only passed to the obs compose command), so there's no actual collision risk today. But if the main app compose ever needed a similar override, the name would conflict. `OBS_POSTGRES_HOST` is unambiguous at zero implementation cost — worth considering during any future obs-vars cleanup. **3. Docker network dependency is implicit but unavoidable** The obs stack joins `archiv-net`, which is created by the main app stack. If the main stack is not running when obs starts, Docker emits a clear error ("network archiv-net not found"). This is not a gap introduced by this PR — the compose file already declares `external: true` for the network. Confirming this is handled correctly. **4. If `/opt/familienarchiv/` ever relocates, three places change** The ADR documents the runner restart requirement and cp-r limitation in Consequences, but not the `/opt/familienarchiv/` relocation footgun: if the permanent directory moves, you must update `runner-config.yaml` valid_volumes, both workflow files, and DEPLOYMENT.md §4. A one-sentence Consequences bullet noting this would complete the picture for a future incident responder.

marcel commented

2026-05-16 09:09:31 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

All security concerns from previous cycles are resolved. <<'EOF' correctly suppresses shell expansion — the two-stage expansion issue (Gitea template render → shell execution) is properly handled. chmod 600 on obs-secrets.env is in place. Zero hardcoded credentials in committed files. All image tags remain pinned. Security posture is appropriate for a single-tenant private runner.

What's done well

Single-quoted heredoc delimiter (<<'EOF') is the correct fix. Gitea's ${{ secrets.X }} substitution happens at template render time before the shell executes the script. The single quote tells the shell to treat the heredoc body as a literal string — the already-rendered secret value lands verbatim in the file.
chmod 600 obs-secrets.env immediately after creation. Correct.
No set -x in secret-writing steps — secret values are not echoed to CI logs.
The blast-radius warning in runner-config.yaml is explicit and actionable.
obs.env contains no secrets — verified in the diff.

Deployment security/maintainability observations (no blockers)

1. POSTGRES_USER=archiv in the secrets heredoc mixes credential and non-credential config

The secrets file (obs-secrets.env) currently contains:

POSTGRES_USER=archiv      ← not a secret, it's the application role name
POSTGRES_PASSWORD=...     ← secret
GRAFANA_ADMIN_PASSWORD=.. ← secret
GLITCHTIP_SECRET_KEY=...  ← secret
POSTGRES_HOST=...         ← not a secret, it's a hostname

Keeping POSTGRES_USER and potentially POSTGRES_HOST in the secrets file obscures the security boundary. The clean split would be: obs-secrets.env contains only values that would cause harm if leaked (passwords, secret keys). Everything else lives in obs.env. This also means a future rotation of obs-secrets.env can be audited for "does this file contain only secrets?" without cross-referencing obs.env.

2. obs-secrets.env is rewritten on every deploy — operator manual edits will be lost

This is the intended design and is documented. Confirming the security implication: emergency credential rotation cannot be done by editing /opt/familienarchiv/obs-secrets.env on the server — the next nightly run will overwrite it. Correct procedure: update the Gitea secret, then either wait for the next nightly run or manually trigger the workflow. The DEPLOYMENT.md mentions this indirectly; an explicit sentence ("manual edits to obs-secrets.env are overwritten on the next CI run — rotate secrets via Gitea") would prevent a future operator from wasting time on a server edit.

3. cat > file <<'EOF' creates the file before setting permissions

The sequence is: create file with default umask (likely 644) → write content → chmod 600. On a single-tenant VPS with no other users, the brief world-readable window is negligible. For future hardening: install -m 600 /dev/null /opt/familienarchiv/obs-secrets.env && cat >> /opt/familienarchiv/obs-secrets.env <<'EOF' creates the file with correct permissions before any content is written. Not a current risk; file it as a hardening note.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** All security concerns from previous cycles are resolved. `<<'EOF'` correctly suppresses shell expansion — the two-stage expansion issue (Gitea template render → shell execution) is properly handled. `chmod 600` on `obs-secrets.env` is in place. Zero hardcoded credentials in committed files. All image tags remain pinned. Security posture is appropriate for a single-tenant private runner. ### What's done well - Single-quoted heredoc delimiter (`<<'EOF'`) is the correct fix. Gitea's `${{ secrets.X }}` substitution happens at template render time before the shell executes the script. The single quote tells the shell to treat the heredoc body as a literal string — the already-rendered secret value lands verbatim in the file. - `chmod 600 obs-secrets.env` immediately after creation. Correct. - No `set -x` in secret-writing steps — secret values are not echoed to CI logs. - The blast-radius warning in `runner-config.yaml` is explicit and actionable. - `obs.env` contains no secrets — verified in the diff. --- ### Deployment security/maintainability observations (no blockers) **1. `POSTGRES_USER=archiv` in the secrets heredoc mixes credential and non-credential config** The secrets file (`obs-secrets.env`) currently contains: ``` POSTGRES_USER=archiv ← not a secret, it's the application role name POSTGRES_PASSWORD=... ← secret GRAFANA_ADMIN_PASSWORD=.. ← secret GLITCHTIP_SECRET_KEY=... ← secret POSTGRES_HOST=... ← not a secret, it's a hostname ``` Keeping `POSTGRES_USER` and potentially `POSTGRES_HOST` in the secrets file obscures the security boundary. The clean split would be: `obs-secrets.env` contains only values that would cause harm if leaked (passwords, secret keys). Everything else lives in `obs.env`. This also means a future rotation of `obs-secrets.env` can be audited for "does this file contain only secrets?" without cross-referencing `obs.env`. **2. `obs-secrets.env` is rewritten on every deploy — operator manual edits will be lost** This is the intended design and is documented. Confirming the security implication: emergency credential rotation cannot be done by editing `/opt/familienarchiv/obs-secrets.env` on the server — the next nightly run will overwrite it. Correct procedure: update the Gitea secret, then either wait for the next nightly run or manually trigger the workflow. The DEPLOYMENT.md mentions this indirectly; an explicit sentence ("manual edits to obs-secrets.env are overwritten on the next CI run — rotate secrets via Gitea") would prevent a future operator from wasting time on a server edit. **3. `cat > file <<'EOF'` creates the file before setting permissions** The sequence is: create file with default umask (likely `644`) → write content → `chmod 600`. On a single-tenant VPS with no other users, the brief world-readable window is negligible. For future hardening: `install -m 600 /dev/null /opt/familienarchiv/obs-secrets.env && cat >> /opt/familienarchiv/obs-secrets.env <<'EOF'` creates the file with correct permissions before any content is written. Not a current risk; file it as a hardening note.

marcel commented

2026-05-16 09:09:48 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ✅ Approved

Verification coverage for the obs deploy path is now meaningfully better than before this PR. The validate step (dry-run before up) and explicit health assertion step are both in nightly.yml and release.yml. The heredoc quoting fix eliminates the silent secret corruption that would have caused docker compose config --quiet to pass (expanded variables look like empty strings, not missing ones) but containers to fail at runtime.

Test coverage assessment

Verification	nightly.yml	release.yml
Compose config dry-run before up	✅	✅
healthchecked services asserted (obs-loki, obs-prometheus, obs-grafana, obs-tempo)	✅	✅
obs-glitchtip-worker running state	❌ (pre-existing gap)	❌ (pre-existing gap)
obs-promtail, obs-cadvisor, obs-node-exporter running state	❌ (pre-existing gap)	❌ (pre-existing gap)

Deployment maintainability observations (no blockers)

1. obs-glitchtip-worker crash loop would pass CI undetected

If DATABASE_URL is malformed (e.g. wrong POSTGRES_HOST), the worker process crashes immediately after starting. docker compose up --wait considers it "started" (it has no healthcheck), and the health assertion step only checks the four services with declared healthchecks. A one-liner added to the existing loop would catch immediate crash loops:

for svc in obs-glitchtip-worker obs-promtail; do
  status=$(docker inspect "$svc" --format '{{.State.Status}}' 2>/dev/null || echo "missing")
  if [ "$status" != "running" ]; then
    echo "::warning::$svc is not running (status: $status)"
  fi
done

Using ::warning:: (not ::error::) since these services lack healthchecks — a warning surfaces the state without blocking a deploy that otherwise looks healthy. Worth a follow-up issue rather than a blocker here.

2. Duplicate workflow bodies are the next maintenance risk

The validate and health assertion steps are now byte-for-byte identical in both files (modulo the "Keep in sync" comment). The next time either changes — for example, adding obs-glitchtip or obs-redis to the assertion list — there are two files to update with no automated reminder. The "Keep in sync" comment relies on the next reviewer noticing. A Gitea composite action (.gitea/actions/deploy-obs/) would make the single-place update structurally enforced. Acceptable at current change frequency; the right investment before any further obs deploy procedure changes.

3. docker compose config --quiet passes even with empty variable substitutions

If a Gitea secret is undefined (e.g. GRAFANA_ADMIN_PASSWORD not set), ${{ secrets.GRAFANA_ADMIN_PASSWORD }} evaluates to an empty string. The heredoc then writes GRAFANA_ADMIN_PASSWORD= (empty) into obs-secrets.env. Docker Compose config accepts this — it's a valid empty value, not an undefined variable (which would emit a warning or error). The validate step catches undefined variables (no = at all) but not empty ones. This is a known limitation of the validate-then-up pattern; documenting it in the step comment would set expectations correctly.

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ✅ Approved** Verification coverage for the obs deploy path is now meaningfully better than before this PR. The validate step (dry-run before `up`) and explicit health assertion step are both in `nightly.yml` and `release.yml`. The heredoc quoting fix eliminates the silent secret corruption that would have caused `docker compose config --quiet` to pass (expanded variables look like empty strings, not missing ones) but containers to fail at runtime. ### Test coverage assessment | Verification | nightly.yml | release.yml | |---|---|---| | Compose config dry-run before up | ✅ | ✅ | | healthchecked services asserted (obs-loki, obs-prometheus, obs-grafana, obs-tempo) | ✅ | ✅ | | obs-glitchtip-worker running state | ❌ (pre-existing gap) | ❌ (pre-existing gap) | | obs-promtail, obs-cadvisor, obs-node-exporter running state | ❌ (pre-existing gap) | ❌ (pre-existing gap) | ### Deployment maintainability observations (no blockers) **1. `obs-glitchtip-worker` crash loop would pass CI undetected** If `DATABASE_URL` is malformed (e.g. wrong `POSTGRES_HOST`), the worker process crashes immediately after starting. `docker compose up --wait` considers it "started" (it has no healthcheck), and the health assertion step only checks the four services with declared healthchecks. A one-liner added to the existing loop would catch immediate crash loops: ```bash for svc in obs-glitchtip-worker obs-promtail; do status=$(docker inspect "$svc" --format '{{.State.Status}}' 2>/dev/null || echo "missing") if [ "$status" != "running" ]; then echo "::warning::$svc is not running (status: $status)" fi done ``` Using `::warning::` (not `::error::`) since these services lack healthchecks — a warning surfaces the state without blocking a deploy that otherwise looks healthy. Worth a follow-up issue rather than a blocker here. **2. Duplicate workflow bodies are the next maintenance risk** The validate and health assertion steps are now byte-for-byte identical in both files (modulo the "Keep in sync" comment). The next time either changes — for example, adding `obs-glitchtip` or `obs-redis` to the assertion list — there are two files to update with no automated reminder. The "Keep in sync" comment relies on the next reviewer noticing. A Gitea composite action (`.gitea/actions/deploy-obs/`) would make the single-place update structurally enforced. Acceptable at current change frequency; the right investment before any further obs deploy procedure changes. **3. `docker compose config --quiet` passes even with empty variable substitutions** If a Gitea secret is undefined (e.g. `GRAFANA_ADMIN_PASSWORD` not set), `${{ secrets.GRAFANA_ADMIN_PASSWORD }}` evaluates to an empty string. The heredoc then writes `GRAFANA_ADMIN_PASSWORD=` (empty) into `obs-secrets.env`. Docker Compose `config` accepts this — it's a valid empty value, not an undefined variable (which would emit a warning or error). The validate step catches undefined variables (no `=` at all) but not empty ones. This is a known limitation of the validate-then-up pattern; documenting it in the step comment would set expectations correctly.

marcel commented

2026-05-16 09:10:02 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

Pure infrastructure PR — no application code, no Svelte components, no Spring Boot services. The heredoc quoting fix is correct. Bash scripts are readable and intention-revealing.

What I checked

Heredoc quoting (<<'EOF'): Correct fix. Gitea's ${{ secrets.X }} template substitution runs before the shell sees the script. Single-quoting the delimiter tells the shell to treat the heredoc body as a literal string — the already-substituted secret value is written verbatim. No second shell expansion of $-containing passwords.

Health assertion loop: The set -e + accumulate-then-exit pattern is correct:

unhealthy=""
for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do
  status=$(...)
  if [ "$status" != "healthy" ]; then
    unhealthy="$unhealthy $svc"
  fi
done
[ -z "$unhealthy" ] || exit 1

All four services are checked before the exit, so CI shows the full failure picture rather than stopping at the first unhealthy service. Clear variable names (unhealthy, svc, status).

tempo.yml change: Removing the duplicate top-level processors block is correct. Tempo 2.7.2 enforces that processors lives exclusively under overrides.defaults.metrics_generator.processors, which was already correctly configured. The removed block was the invalid duplicate.

docker-compose.observability.yml: All three substitution sites — both DATABASE_URL envs and the psql -h command in obs-glitchtip-db-init — are updated consistently to ${POSTGRES_HOST:-archive-db}. No site missed.

Deployment maintainability (CI hygiene angle)

The only remaining structural maintenance debt is the duplicated obs deploy section between nightly.yml and release.yml. The "Keep in sync with" comments are a code smell that usually resolves into a composite action once the duplication grows beyond two files or two sections. The obs deploy now has four steps (Deploy configs → Validate config → Start stack → Assert health) duplicated in two workflows — that's eight step definitions to keep synchronized. A composite action at .gitea/actions/deploy-obs/action.yaml accepting environment-specific inputs (secrets, postgres host) would collapse this to two uses: calls. Not a blocker for this PR; the natural trigger is the next time either workflow diverges despite the "Keep in sync" note.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** Pure infrastructure PR — no application code, no Svelte components, no Spring Boot services. The heredoc quoting fix is correct. Bash scripts are readable and intention-revealing. ### What I checked **Heredoc quoting (`<<'EOF'`):** Correct fix. Gitea's `${{ secrets.X }}` template substitution runs before the shell sees the script. Single-quoting the delimiter tells the shell to treat the heredoc body as a literal string — the already-substituted secret value is written verbatim. No second shell expansion of `$`-containing passwords. **Health assertion loop:** The `set -e` + accumulate-then-exit pattern is correct: ```bash unhealthy="" for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do status=$(...) if [ "$status" != "healthy" ]; then unhealthy="$unhealthy $svc" fi done [ -z "$unhealthy" ] || exit 1 ``` All four services are checked before the exit, so CI shows the full failure picture rather than stopping at the first unhealthy service. Clear variable names (`unhealthy`, `svc`, `status`). **`tempo.yml` change:** Removing the duplicate top-level `processors` block is correct. Tempo 2.7.2 enforces that `processors` lives exclusively under `overrides.defaults.metrics_generator.processors`, which was already correctly configured. The removed block was the invalid duplicate. **`docker-compose.observability.yml`:** All three substitution sites — both `DATABASE_URL` envs and the `psql -h` command in `obs-glitchtip-db-init` — are updated consistently to `${POSTGRES_HOST:-archive-db}`. No site missed. --- ### Deployment maintainability (CI hygiene angle) The only remaining structural maintenance debt is the duplicated obs deploy section between `nightly.yml` and `release.yml`. The "Keep in sync with" comments are a code smell that usually resolves into a composite action once the duplication grows beyond two files or two sections. The obs deploy now has four steps (Deploy configs → Validate config → Start stack → Assert health) duplicated in two workflows — that's eight step definitions to keep synchronized. A composite action at `.gitea/actions/deploy-obs/action.yaml` accepting environment-specific inputs (secrets, postgres host) would collapse this to two `uses:` calls. Not a blocker for this PR; the natural trigger is the next time either workflow diverges despite the "Keep in sync" note.

marcel commented

2026-05-16 09:10:19 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

All three root-cause fixes from issue #601 are traceable from stated problem → implementation → verifiable outcome. The PR description test plan maps one-to-one to concrete, testable system states. Documentation chain is complete.

Requirements traceability (final check)

Fix	Root cause	Verifiable outcome	Covered
Tempo `processors` schema	Tempo 2.7.2 moved key to `overrides`	Tempo starts healthy, no `field processors not found`	✅ health assertion step
Configurable `POSTGRES_HOST`	Hardcoded `archive-db` fails when only staging stack runs	GlitchTip connects when `POSTGRES_HOST` set	✅ parameterized substitution at 3 sites
`PORT_GRAFANA` default 3003	Port 3001 occupied by staging frontend	Grafana binds without `port already allocated`	✅ compose file + obs.env + .env.example + DEPLOYMENT.md

Deployment maintainability observations (no blockers)

1. Directory relocation consequence missing from ADR-016

ADR-016 documents the runner restart requirement and the cp -r stale-file limitation in Consequences. It does not mention that if /opt/familienarchiv/ is ever relocated, three artifacts must change simultaneously: runner-config.yaml (valid_volumes), both workflow files (all absolute paths), and DEPLOYMENT.md §4. This is the kind of operational footgun that causes incidents when someone "cleans up" the VPS without reading the ADR. A one-sentence Consequences bullet would close the documentation gap:

"If /opt/familienarchiv/ is relocated, update simultaneously: runner-config.yaml valid_volumes and options, both nightly.yml and release.yml absolute paths, and docs/DEPLOYMENT.md §4. Partial updates will cause the next CI run to fail writing to the old path."

2. POSTGRES_USER and POSTGRES_HOST split across secrets and config files

obs-secrets.env currently carries POSTGRES_USER=archiv alongside actual secrets. From a requirements clarity standpoint: the two-source model's stated contract is "non-secret config in obs.env, secrets in obs-secrets.env." POSTGRES_USER is not a secret — it's a configuration constant. Moving it to obs.env would make the two-source model's contract self-consistent and eliminate one non-secret from the secrets file.

3. The empty-secret failure mode is undocumented

If a Gitea secret is not set (e.g. GRAFANA_ADMIN_PASSWORD missing), ${{ secrets.GRAFANA_ADMIN_PASSWORD }} evaluates to an empty string and the heredoc writes GRAFANA_ADMIN_PASSWORD= into obs-secrets.env. Docker Compose config --quiet accepts this (empty value ≠ undefined variable). Grafana would start with the changeme default password since an empty GF_SECURITY_ADMIN_PASSWORD falls back to the default. The requirement "rotating a secret takes effect on the next deploy" implicitly assumes the secret is set. A pre-flight check — [ -n "$GRAFANA_ADMIN_PASSWORD" ] || { echo "::error::GRAFANA_ADMIN_PASSWORD secret is not set"; exit 1; } — before the heredoc would make this failure fast and loud rather than silent. Worth a follow-up issue.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** All three root-cause fixes from issue #601 are traceable from stated problem → implementation → verifiable outcome. The PR description test plan maps one-to-one to concrete, testable system states. Documentation chain is complete. ### Requirements traceability (final check) | Fix | Root cause | Verifiable outcome | Covered | |---|---|---|---| | Tempo `processors` schema | Tempo 2.7.2 moved key to `overrides` | Tempo starts healthy, no `field processors not found` | ✅ health assertion step | | Configurable `POSTGRES_HOST` | Hardcoded `archive-db` fails when only staging stack runs | GlitchTip connects when `POSTGRES_HOST` set | ✅ parameterized substitution at 3 sites | | `PORT_GRAFANA` default 3003 | Port 3001 occupied by staging frontend | Grafana binds without `port already allocated` | ✅ compose file + obs.env + .env.example + DEPLOYMENT.md | --- ### Deployment maintainability observations (no blockers) **1. Directory relocation consequence missing from ADR-016** ADR-016 documents the runner restart requirement and the `cp -r` stale-file limitation in Consequences. It does not mention that if `/opt/familienarchiv/` is ever relocated, three artifacts must change simultaneously: `runner-config.yaml` (valid_volumes), both workflow files (all absolute paths), and DEPLOYMENT.md §4. This is the kind of operational footgun that causes incidents when someone "cleans up" the VPS without reading the ADR. A one-sentence Consequences bullet would close the documentation gap: > "If `/opt/familienarchiv/` is relocated, update simultaneously: `runner-config.yaml` valid_volumes and options, both `nightly.yml` and `release.yml` absolute paths, and `docs/DEPLOYMENT.md §4`. Partial updates will cause the next CI run to fail writing to the old path." **2. `POSTGRES_USER` and `POSTGRES_HOST` split across secrets and config files** `obs-secrets.env` currently carries `POSTGRES_USER=archiv` alongside actual secrets. From a requirements clarity standpoint: the two-source model's stated contract is "non-secret config in `obs.env`, secrets in `obs-secrets.env`." `POSTGRES_USER` is not a secret — it's a configuration constant. Moving it to `obs.env` would make the two-source model's contract self-consistent and eliminate one non-secret from the secrets file. **3. The empty-secret failure mode is undocumented** If a Gitea secret is not set (e.g. `GRAFANA_ADMIN_PASSWORD` missing), `${{ secrets.GRAFANA_ADMIN_PASSWORD }}` evaluates to an empty string and the heredoc writes `GRAFANA_ADMIN_PASSWORD=` into `obs-secrets.env`. Docker Compose `config --quiet` accepts this (empty value ≠ undefined variable). Grafana would start with the `changeme` default password since an empty `GF_SECURITY_ADMIN_PASSWORD` falls back to the default. The requirement "rotating a secret takes effect on the next deploy" implicitly assumes the secret is set. A pre-flight check — `[ -n "$GRAFANA_ADMIN_PASSWORD" ] || { echo "::error::GRAFANA_ADMIN_PASSWORD secret is not set"; exit 1; }` — before the heredoc would make this failure fast and loud rather than silent. Worth a follow-up issue.

marcel commented

2026-05-16 09:10:28 +02:00

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Verdict: ✅ Approved

Pure infrastructure PR — CI workflows, Docker Compose config, Tempo YAML, env files, ADR, and deployment documentation. No frontend components, no Svelte files, no design tokens, no routes, no user-facing changes.

I reviewed the diff specifically for env vars or config changes that could inadvertently affect the SvelteKit frontend or the operator-facing Grafana/GlitchTip UI:

PORT_GRAFANA 3001 → 3003: Internal observability tooling only. Grafana is not user-facing and is bound to 127.0.0.1. No UI impact.
GF_SERVER_ROOT_URL: Grafana-internal setting for alert email links and OAuth redirects. Not part of the application UI. Correctly set to the production URL in obs.env.
POSTGRES_HOST: Backend infrastructure only. No frontend impact.
GLITCHTIP_DOMAIN: Error reporting ingestion endpoint. Not user-facing in the application UI.

Nothing to flag from a UX, accessibility, or brand perspective. The documentation improvements in DEPLOYMENT.md ("Dev — start from workspace" vs "Production — managed from /opt/familienarchiv/") will help operators navigate the setup without frustration — an ergonomic improvement for the human running this infrastructure.

## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist **Verdict: ✅ Approved** Pure infrastructure PR — CI workflows, Docker Compose config, Tempo YAML, env files, ADR, and deployment documentation. No frontend components, no Svelte files, no design tokens, no routes, no user-facing changes. I reviewed the diff specifically for env vars or config changes that could inadvertently affect the SvelteKit frontend or the operator-facing Grafana/GlitchTip UI: - `PORT_GRAFANA` 3001 → 3003: Internal observability tooling only. Grafana is not user-facing and is bound to `127.0.0.1`. No UI impact. - `GF_SERVER_ROOT_URL`: Grafana-internal setting for alert email links and OAuth redirects. Not part of the application UI. Correctly set to the production URL in `obs.env`. - `POSTGRES_HOST`: Backend infrastructure only. No frontend impact. - `GLITCHTIP_DOMAIN`: Error reporting ingestion endpoint. Not user-facing in the application UI. Nothing to flag from a UX, accessibility, or brand perspective. The documentation improvements in DEPLOYMENT.md ("Dev — start from workspace" vs "Production — managed from `/opt/familienarchiv/`") will help operators navigate the setup without frustration — an ergonomic improvement for the human running this infrastructure.

marcel added 1 commit 2026-05-16 09:20:40 +02:00

chore(obs): improve deploy maintainability from review feedback

CI / Unit & Component Tests (pull_request) Successful in 5m45s

Details

CI / OCR Service Tests (pull_request) Successful in 47s

Details

CI / fail2ban Regex (pull_request) Has been cancelled

Details

CI / Compose Bucket Idempotency (pull_request) Has been cancelled

Details

CI / Backend Unit Tests (pull_request) Has been cancelled

Details

6720a5aeb2

- Move POSTGRES_USER to obs.env (non-secret, constant across envs)
- Replace cp -r with rsync -a --delete so removed config files are
  purged from /opt/familienarchiv on next deploy instead of lingering
- Document --env-file ordering contract in validate + start steps:
  obs.env first (defaults), obs-secrets.env second (wins on dupes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-16 09:24:48 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ✅ Approved

Documentation currency check

Per our doc-update rules:

Change in PR	Required doc	Status
New Docker infrastructure component (obs stack now co-located at `/opt/familienarchiv/`)	`docs/architecture/c4/l2-containers.puml`	✅ Updated
New Docker service management approach + env model	`docs/DEPLOYMENT.md`	✅ Updated (§4 substantially expanded)
Architectural decision with lasting consequences	New ADR in `docs/adr/`	✅ ADR-016 added

Architecture assessment

What's solid:

The two-source env model — git-tracked obs.env for non-secrets, CI-written obs-secrets.env for credentials — is architecturally clean. Secrets flow from Gitea → CI → host file, never through git. Non-secret config is reviewable in PRs. The sources of truth are separated by concern.

Co-locating the obs stack at /opt/familienarchiv/ correctly reflects that GlitchTip shares archive-db and archiv-net with the main app stack. They are the same deployment unit. A separate /opt/obs/ would have implied an independence that doesn't exist.

ADR-016 shows genuine judgment: the alternatives section dismisses server-pull (adds a second deployment mechanism), separate directory (misrepresents coupling), and Docker Swarm configs (disproportionate dependency) with concrete reasons. That's the right format.

The validate-before-start pattern (docker compose ... config --quiet) is exactly right — catch missing variables and YAML errors before containers attempt to start.

Minor inconsistency (non-blocking)

ADR-016 Consequences says: "cp -r does not remove deleted files; a config file removed from the repo persists in /opt/familienarchiv/infra/observability/ until manually deleted."

But the implementation uses rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/ — which does remove deleted files from the config directory. The note is only accurate for the top-level compose file itself (cp docker-compose.observability.yml /opt/familienarchiv/), which uses plain cp.

The ADR should distinguish these two cases: the config tree is clean (rsync handles deletions); only a renamed/deleted top-level compose file would leave a stale copy. This doesn't change the decision — just makes the consequences section more accurate.

Suggested ADR fix:

rsync -a --delete keeps infra/observability/ in sync; deleted config files are removed. However, docker-compose.observability.yml is deployed with cp — if the file is ever renamed or removed from the repo, the stale copy at /opt/familienarchiv/docker-compose.observability.yml must be deleted manually.

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** ### Documentation currency check Per our doc-update rules: | Change in PR | Required doc | Status | |---|---|---| | New Docker infrastructure component (obs stack now co-located at `/opt/familienarchiv/`) | `docs/architecture/c4/l2-containers.puml` | ✅ Updated | | New Docker service management approach + env model | `docs/DEPLOYMENT.md` | ✅ Updated (§4 substantially expanded) | | Architectural decision with lasting consequences | New ADR in `docs/adr/` | ✅ ADR-016 added | ### Architecture assessment **What's solid:** The two-source env model — git-tracked `obs.env` for non-secrets, CI-written `obs-secrets.env` for credentials — is architecturally clean. Secrets flow from Gitea → CI → host file, never through git. Non-secret config is reviewable in PRs. The sources of truth are separated by concern. Co-locating the obs stack at `/opt/familienarchiv/` correctly reflects that GlitchTip shares `archive-db` and `archiv-net` with the main app stack. They are the same deployment unit. A separate `/opt/obs/` would have implied an independence that doesn't exist. ADR-016 shows genuine judgment: the alternatives section dismisses server-pull (adds a second deployment mechanism), separate directory (misrepresents coupling), and Docker Swarm configs (disproportionate dependency) with concrete reasons. That's the right format. The validate-before-start pattern (`docker compose ... config --quiet`) is exactly right — catch missing variables and YAML errors before containers attempt to start. ### Minor inconsistency (non-blocking) ADR-016 Consequences says: *"cp -r does not remove deleted files; a config file removed from the repo persists in `/opt/familienarchiv/infra/observability/` until manually deleted."* But the implementation uses `rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/` — which **does** remove deleted files from the config directory. The note is only accurate for the top-level compose file itself (`cp docker-compose.observability.yml /opt/familienarchiv/`), which uses plain `cp`. The ADR should distinguish these two cases: the config tree is clean (rsync handles deletions); only a renamed/deleted top-level compose file would leave a stale copy. This doesn't change the decision — just makes the consequences section more accurate. Suggested ADR fix: > `rsync -a --delete` keeps `infra/observability/` in sync; deleted config files are removed. However, `docker-compose.observability.yml` is deployed with `cp` — if the file is ever renamed or removed from the repo, the stale copy at `/opt/familienarchiv/docker-compose.observability.yml` must be deleted manually.

marcel commented

2026-05-16 09:25:04 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

This is a pure infrastructure PR — no application code, no testable business logic. TDD doesn't apply here. My focus is on CI script quality and the one code quality issue that stands out.

Blocker

Copy-paste between nightly.yml and release.yml — this is the dominant code smell. The following four steps are near-identical in both files:

"Deploy observability configs"
"Validate observability compose config"
"Start observability stack"
"Assert observability stack health"

That's roughly 130 lines duplicated across two workflow files. The only differences are:

Secret name: STAGING_POSTGRES_PASSWORD vs PROD_POSTGRES_PASSWORD
Host name: archiv-staging-db-1 vs archiv-production-db-1

Gitea Actions supports composite actions. A deploy-obs-stack composite action in .gitea/actions/ would DRY this down to a ~10-line call site in each workflow with the two differing values passed as inputs. When the deploy logic changes (and it will), there's currently two files to update in sync — the "Keep in sync with the equivalent step in nightly.yml" comments are the red flag that says this should be an abstraction.

I understand this might be deferred work (the comment already acknowledges the duplication), but it's a blocker for me because the "keep in sync" comment is a maintenance debt bomb on a critical path.

Suggestions

Heredoc indentation (cosmetic, functionally fine):

cat > /opt/familienarchiv/obs-secrets.env << 'EOF'
          GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
          ...
          EOF

The heredoc content has leading whitespace because it follows the YAML indentation. Docker Compose env file parsers do strip leading whitespace, so this works. But <<-'EOF' with tab-indented content (or restructuring the run: block) would make the generated file cleaner. Minor — not a blocker.

Health check loop — good pattern:

unhealthy=""
for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do ...
done
[ -z "$unhealthy" ] || exit 1

Collecting all failures before exiting (instead of set -e on the first check) gives complete diagnostics. This is the right way to write this kind of loop. ✅

Step comments explain WHY — the # Runs with absolute paths so bind mounts resolve to stable host paths style comments are exactly what I want to see in infra scripts. ✅

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** This is a pure infrastructure PR — no application code, no testable business logic. TDD doesn't apply here. My focus is on CI script quality and the one code quality issue that stands out. ### Blocker **Copy-paste between `nightly.yml` and `release.yml`** — this is the dominant code smell. The following four steps are near-identical in both files: - "Deploy observability configs" - "Validate observability compose config" - "Start observability stack" - "Assert observability stack health" That's roughly 130 lines duplicated across two workflow files. The only differences are: - Secret name: `STAGING_POSTGRES_PASSWORD` vs `PROD_POSTGRES_PASSWORD` - Host name: `archiv-staging-db-1` vs `archiv-production-db-1` Gitea Actions supports composite actions. A `deploy-obs-stack` composite action in `.gitea/actions/` would DRY this down to a ~10-line call site in each workflow with the two differing values passed as inputs. When the deploy logic changes (and it will), there's currently two files to update in sync — the "Keep in sync with the equivalent step in nightly.yml" comments are the red flag that says this should be an abstraction. I understand this might be deferred work (the comment already acknowledges the duplication), but it's a blocker for me because the "keep in sync" comment is a maintenance debt bomb on a critical path. ### Suggestions **Heredoc indentation (cosmetic, functionally fine):** ```yaml cat > /opt/familienarchiv/obs-secrets.env << 'EOF' GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }} ... EOF ``` The heredoc content has leading whitespace because it follows the YAML indentation. Docker Compose env file parsers do strip leading whitespace, so this works. But `<<-'EOF'` with tab-indented content (or restructuring the `run:` block) would make the generated file cleaner. Minor — not a blocker. **Health check loop — good pattern:** ```bash unhealthy="" for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do ... done [ -z "$unhealthy" ] || exit 1 ``` Collecting all failures before exiting (instead of `set -e` on the first check) gives complete diagnostics. This is the right way to write this kind of loop. ✅ **Step comments explain WHY** — the `# Runs with absolute paths so bind mounts resolve to stable host paths` style comments are exactly what I want to see in infra scripts. ✅

marcel commented

2026-05-16 09:25:24 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

This fixes three real production issues cleanly. The overall approach is sound. I have one blocker (CI duplication) and a few concerns worth naming.

Blockers

130 lines of copy-pasted CI steps. The four obs-stack steps in nightly.yml and release.yml are identical except for secret names and container hostnames. The "Keep in sync" comments in release.yml are the CI equivalent of a TODO: fix this. When the deploy script changes — and it will, because obs stacks always need tuning — two files need editing in lockstep, and there is no tooling to enforce that.

Create a composite action: .gitea/actions/deploy-obs-stack/action.yml with inputs postgres_password, postgres_host, and env_file_prefix. Then both workflows call it with their respective values. 10 lines per workflow, one source of truth for the deploy logic.

Concerns

Hardcoded container names are fragile. archiv-staging-db-1 and archiv-production-db-1 in the CI secret files derive from Docker Compose project names (archiv-staging, archiv-production) and service name (db). If the Compose project is renamed, these values silently break. The comment acknowledges this:

A project rename requires updating this value.

Fair — but this is exactly the kind of thing that gets forgotten. Could docker compose -p archiv-staging port db 5432 or a simpler lookup replace the hardcoded name? Probably overkill for this project's change frequency, but worth noting.

cp docker-compose.observability.yml /opt/familienarchiv/ leaves stale copies. This is called out correctly in DEPLOYMENT.md. Since the config tree uses rsync -a --delete, the stale compose file is the only manual cleanup risk. Very low probability event, acceptable given how rarely compose files are renamed.

GlitchTip connectivity is not automatically verified. The health assertion step checks obs-loki obs-prometheus obs-grafana obs-tempo but not GlitchTip. If POSTGRES_HOST is misconfigured, the worker silently fails to connect and the CI run still turns green. The test plan calls this out as a manual check — that's honest, but means a bad POSTGRES_HOST wouldn't be caught until someone looks at GlitchTip.

What's done well

rsync -a --delete for the config tree — clean mirror, handles removed files ✅
chmod 600 on the secrets file — correct ✅
docker compose ... config --quiet validate-before-start — catches env var issues before containers start ✅
PORT_GRAFANA:-3003 default in the compose file is consistent with obs.env ✅
Security warning expanded in runner-config.yaml appropriately reflects the new blast radius ✅
The ${{ secrets.XXX }} + <<'EOF' heredoc correctly uses Gitea template substitution before shell execution — the single-quoted heredoc is fine here ✅
GF_SERVER_ROOT_URL added to the compose file — catches the common Grafana misconfiguration where alert email links use localhost ✅

Note on the Tempo fix

Removing the duplicate metrics_generator.processors block from the top-level config is the correct fix for the Tempo 2.7.2 breaking change. The processors were already defined under overrides.defaults.metrics_generator.processors — the top-level block was redundant and became invalid. Clean removal. ✅

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** This fixes three real production issues cleanly. The overall approach is sound. I have one blocker (CI duplication) and a few concerns worth naming. ### Blockers **130 lines of copy-pasted CI steps.** The four obs-stack steps in `nightly.yml` and `release.yml` are identical except for secret names and container hostnames. The "Keep in sync" comments in `release.yml` are the CI equivalent of a `TODO: fix this`. When the deploy script changes — and it will, because obs stacks always need tuning — two files need editing in lockstep, and there is no tooling to enforce that. Create a composite action: `.gitea/actions/deploy-obs-stack/action.yml` with inputs `postgres_password`, `postgres_host`, and `env_file_prefix`. Then both workflows call it with their respective values. 10 lines per workflow, one source of truth for the deploy logic. ### Concerns **Hardcoded container names are fragile.** `archiv-staging-db-1` and `archiv-production-db-1` in the CI secret files derive from Docker Compose project names (`archiv-staging`, `archiv-production`) and service name (`db`). If the Compose project is renamed, these values silently break. The comment acknowledges this: > A project rename requires updating this value. Fair — but this is exactly the kind of thing that gets forgotten. Could `docker compose -p archiv-staging port db 5432` or a simpler lookup replace the hardcoded name? Probably overkill for this project's change frequency, but worth noting. **`cp docker-compose.observability.yml /opt/familienarchiv/` leaves stale copies.** This is called out correctly in DEPLOYMENT.md. Since the config tree uses `rsync -a --delete`, the stale compose file is the only manual cleanup risk. Very low probability event, acceptable given how rarely compose files are renamed. **GlitchTip connectivity is not automatically verified.** The health assertion step checks `obs-loki obs-prometheus obs-grafana obs-tempo` but not GlitchTip. If `POSTGRES_HOST` is misconfigured, the worker silently fails to connect and the CI run still turns green. The test plan calls this out as a manual check — that's honest, but means a bad `POSTGRES_HOST` wouldn't be caught until someone looks at GlitchTip. ### What's done well - `rsync -a --delete` for the config tree — clean mirror, handles removed files ✅ - `chmod 600` on the secrets file — correct ✅ - `docker compose ... config --quiet` validate-before-start — catches env var issues before containers start ✅ - `PORT_GRAFANA:-3003` default in the compose file is consistent with obs.env ✅ - Security warning expanded in `runner-config.yaml` appropriately reflects the new blast radius ✅ - The `${{ secrets.XXX }}` + `<<'EOF'` heredoc correctly uses Gitea template substitution before shell execution — the single-quoted heredoc is fine here ✅ - `GF_SERVER_ROOT_URL` added to the compose file — catches the common Grafana misconfiguration where alert email links use `localhost` ✅ ### Note on the Tempo fix Removing the duplicate `metrics_generator.processors` block from the top-level config is the correct fix for the Tempo 2.7.2 breaking change. The processors were already defined under `overrides.defaults.metrics_generator.processors` — the top-level block was redundant and became invalid. Clean removal. ✅

marcel commented

2026-05-16 09:25:43 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

No exploitable vulnerabilities in this PR. The secrets handling is well-designed. My concerns are about blast radius — this change expands the attack surface of the CI runner in a way that deserves explicit acknowledgment.

What's secure

Secret management is correct.

obs-secrets.env is written fresh from Gitea secrets on every deploy — Gitea is the single source of truth ✅
chmod 600 prevents other processes from reading it ✅
The <<'EOF' heredoc with ${{ secrets.XXX }} is the right pattern — Gitea template substitution happens before the shell executes, and the single-quoted delimiter prevents shell re-expansion ✅
Secrets are never written to git or to workspace paths ✅

The non-secret/secret split is clean.
POSTGRES_USER=archiv in the git-tracked obs.env is fine — a database username is not a secret. The password is in the CI-injected file. The line is drawn correctly.

The security comment in runner-config.yaml is explicit.
Naming the exact capabilities ("root-equivalent access to the host Docker daemon", "Read/write access to /opt/familienarchiv/") and the exact trust model ("single-tenant, trusted authors only") is exactly the right documentation style. ✅

Concerns (non-blocking but worth tracking)

Expanded blast radius is documented but not mitigated.

The new /opt/familienarchiv mount means any CI job step — including steps that install npm packages, run mvnw, or execute plugin code — runs in a container that can write to /opt/familienarchiv/docker-compose.observability.yml. If a supply-chain compromise reaches any CI dependency (a Maven plugin, an npm package, a base runner image layer), it can overwrite production compose files.

This is accepted risk for a single-tenant private repo. But I'd recommend one lightweight mitigation: file integrity monitoring on /opt/familienarchiv/. Something as simple as a nightly cron that checksums the compose file and alerts if it changes between expected deploy windows. The runner-config.yaml warning is good documentation; a tripwire is an actual control.

obs-secrets.env on the host filesystem.

The file lives at /opt/familienarchiv/obs-secrets.env with chmod 600. Since the CI runner has root-equivalent Docker access, any job container could mount the host filesystem and read it (docker run -v /opt/familienarchiv:/mnt ...). This is the same trust boundary as the docker socket — already accepted in the existing runner-config.yaml threat model. Noting it for completeness so the threat model is fully explicit.

GlitchTip receives DB credentials as environment variables.

In docker-compose.observability.yml:

DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST:-archive-db}:5432/glitchtip

This is standard for containerised apps and acceptable here. GlitchTip is a trusted internal service. Just confirming this is intentional and in scope of the existing threat model.

No action required

The POSTGRES_HOST variable doesn't introduce injection risk — it's interpolated into a connection string, not a shell command. The compose file handles it safely.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** No exploitable vulnerabilities in this PR. The secrets handling is well-designed. My concerns are about blast radius — this change expands the attack surface of the CI runner in a way that deserves explicit acknowledgment. ### What's secure **Secret management is correct.** - `obs-secrets.env` is written fresh from Gitea secrets on every deploy — Gitea is the single source of truth ✅ - `chmod 600` prevents other processes from reading it ✅ - The `<<'EOF'` heredoc with `${{ secrets.XXX }}` is the right pattern — Gitea template substitution happens before the shell executes, and the single-quoted delimiter prevents shell re-expansion ✅ - Secrets are never written to git or to workspace paths ✅ **The non-secret/secret split is clean.** `POSTGRES_USER=archiv` in the git-tracked `obs.env` is fine — a database username is not a secret. The password is in the CI-injected file. The line is drawn correctly. **The security comment in `runner-config.yaml` is explicit.** Naming the exact capabilities ("root-equivalent access to the host Docker daemon", "Read/write access to `/opt/familienarchiv/`") and the exact trust model ("single-tenant, trusted authors only") is exactly the right documentation style. ✅ ### Concerns (non-blocking but worth tracking) **Expanded blast radius is documented but not mitigated.** The new `/opt/familienarchiv` mount means any CI job step — including steps that install npm packages, run `mvnw`, or execute plugin code — runs in a container that can write to `/opt/familienarchiv/docker-compose.observability.yml`. If a supply-chain compromise reaches any CI dependency (a Maven plugin, an npm package, a base runner image layer), it can overwrite production compose files. This is accepted risk for a single-tenant private repo. But I'd recommend one lightweight mitigation: **file integrity monitoring on `/opt/familienarchiv/`**. Something as simple as a nightly cron that checksums the compose file and alerts if it changes between expected deploy windows. The `runner-config.yaml` warning is good documentation; a tripwire is an actual control. **`obs-secrets.env` on the host filesystem.** The file lives at `/opt/familienarchiv/obs-secrets.env` with `chmod 600`. Since the CI runner has root-equivalent Docker access, any job container could mount the host filesystem and read it (`docker run -v /opt/familienarchiv:/mnt ...`). This is the same trust boundary as the docker socket — already accepted in the existing `runner-config.yaml` threat model. Noting it for completeness so the threat model is fully explicit. **GlitchTip receives DB credentials as environment variables.** In `docker-compose.observability.yml`: ```yaml DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST:-archive-db}:5432/glitchtip ``` This is standard for containerised apps and acceptable here. GlitchTip is a trusted internal service. Just confirming this is intentional and in scope of the existing threat model. ### No action required The `POSTGRES_HOST` variable doesn't introduce injection risk — it's interpolated into a connection string, not a shell command. The compose file handles it safely.

marcel commented

2026-05-16 09:25:59 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ⚠️ Approved with concerns

What's good

The PR adds two genuine quality gates that weren't there before:

"Validate observability compose config" — a dry-run that resolves all variable substitutions before containers start. This catches undefined variables and YAML errors early. Exactly the right pattern. ✅
"Assert observability stack health" — explicit health check on obs-loki, obs-prometheus, obs-grafana, obs-tempo after docker compose up --wait. The --wait flag only covers services with healthcheck directives; this step covers the gap for services where healthcheck exists but might be missed. The collect-then-exit pattern (accumulating all unhealthy services before exit 1) gives full diagnostic output rather than stopping at the first failure. ✅

Concerns

GlitchTip connectivity is not automatically verified.

The test plan includes:

GlitchTip connects to Postgres when POSTGRES_HOST is set in .env

But neither obs-glitchtip nor obs-glitchtip-worker have Docker healthchecks, and neither is in the Assert observability stack health step. If POSTGRES_HOST is misconfigured:

docker compose up --wait passes (no healthcheck to fail)
The assertion step passes (not checking GlitchTip)
CI turns green
GlitchTip silently fails to connect to Postgres

This is a gap between the test plan and what CI actually verifies. Adding a healthcheck to the GlitchTip service in the compose file would close it — something like:

obs-glitchtip:
  healthcheck:
    test: ["CMD", "wget", "-qO-", "http://localhost:8000/api/0/"]
    interval: 15s
    timeout: 5s
    retries: 5
    start_period: 60s

Then the assertion step can include obs-glitchtip. Without this, the POSTGRES_HOST test plan item remains manual.

The test plan is checkbox-based but not automated.

Items 3–5 in the test plan are manual checks:

GlitchTip connects to Postgres with POSTGRES_HOST set
GlitchTip connects to archive-db by default
Grafana binds to port 3003

The port binding can be automated in CI:

# After 'Start observability stack'
docker exec obs-grafana sh -c 'wget -qO- http://localhost:3000/api/health' | grep -q '"database": "ok"'

Not blocking — this is a bug-fix PR and manual verification is documented — but these items would be good candidates for automated smoke tests in a follow-up.

No issues with static analysis

The infra/observability/obs.env file is git-tracked non-secret config. It's clean and well-commented. The variable inventory in DEPLOYMENT.md §4 matches the actual file contents. ✅

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ⚠️ Approved with concerns** ### What's good The PR adds two genuine quality gates that weren't there before: 1. **"Validate observability compose config"** — a dry-run that resolves all variable substitutions before containers start. This catches undefined variables and YAML errors early. Exactly the right pattern. ✅ 2. **"Assert observability stack health"** — explicit health check on `obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo` after `docker compose up --wait`. The `--wait` flag only covers services with healthcheck directives; this step covers the gap for services where healthcheck exists but might be missed. The collect-then-exit pattern (accumulating all unhealthy services before `exit 1`) gives full diagnostic output rather than stopping at the first failure. ✅ ### Concerns **GlitchTip connectivity is not automatically verified.** The test plan includes: > - [ ] GlitchTip connects to Postgres when `POSTGRES_HOST` is set in `.env` But neither `obs-glitchtip` nor `obs-glitchtip-worker` have Docker healthchecks, and neither is in the `Assert observability stack health` step. If `POSTGRES_HOST` is misconfigured: - `docker compose up --wait` passes (no healthcheck to fail) - The assertion step passes (not checking GlitchTip) - CI turns green - GlitchTip silently fails to connect to Postgres This is a gap between the test plan and what CI actually verifies. Adding a healthcheck to the GlitchTip service in the compose file would close it — something like: ```yaml obs-glitchtip: healthcheck: test: ["CMD", "wget", "-qO-", "http://localhost:8000/api/0/"] interval: 15s timeout: 5s retries: 5 start_period: 60s ``` Then the assertion step can include `obs-glitchtip`. Without this, the `POSTGRES_HOST` test plan item remains manual. **The test plan is checkbox-based but not automated.** Items 3–5 in the test plan are manual checks: - GlitchTip connects to Postgres with `POSTGRES_HOST` set - GlitchTip connects to `archive-db` by default - Grafana binds to port 3003 The port binding can be automated in CI: ```bash # After 'Start observability stack' docker exec obs-grafana sh -c 'wget -qO- http://localhost:3000/api/health' | grep -q '"database": "ok"' ``` Not blocking — this is a bug-fix PR and manual verification is documented — but these items would be good candidates for automated smoke tests in a follow-up. ### No issues with static analysis The `infra/observability/obs.env` file is git-tracked non-secret config. It's clean and well-commented. The variable inventory in DEPLOYMENT.md §4 matches the actual file contents. ✅

marcel commented

2026-05-16 09:26:11 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

This is a pure infrastructure bug-fix PR with no product-facing requirements implications. The scope is well-contained: three specific breakages, three specific fixes, each documented with the exact error message or symptom that triggered it.

What's clear

The PR description is exemplary requirements documentation for an ops fix:

Tempo fix: quotes the exact error (field processors not found in type generator.Config), names the version where the schema changed (2.7.2), and explains the fix (removed duplicate block, kept the overrides config)
Postgres host fix: describes the exact failure condition (staging-only stack, DNS resolution fails) and the safe default (archive-db preserved for production)
Grafana port fix: names the conflicting process and the symptom (Bind for 127.0.0.1:3001 failed)

The test plan covers all three fixes with checkable items. ✅

One open question (non-blocking)

The POSTGRES_HOST convention — archiv-staging-db-1, archiv-production-db-1 — derives from Docker Compose's container naming scheme: {project_name}-{service}-{instance}. This is an implicit convention, not a configured value.

The comments in CI acknowledge this:

A project rename requires updating this value.

From a requirements perspective: is there a defined constraint on Compose project names for the staging and production stacks? If archiv-staging and archiv-production are established conventions (which the DEPLOYMENT.md suggests they are), then the implicit dependency is acceptable. If the project names could ever change as part of a migration, these hardcoded host names in CI become a silent failure waiting to happen.

Suggestion: document the Compose project names as a first-class configuration constant somewhere (e.g., in DEPLOYMENT.md §3 or as a comment in the CI env files), so a future rename decision knows to also update POSTGRES_HOST in the CI secrets.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** This is a pure infrastructure bug-fix PR with no product-facing requirements implications. The scope is well-contained: three specific breakages, three specific fixes, each documented with the exact error message or symptom that triggered it. ### What's clear The PR description is exemplary requirements documentation for an ops fix: - **Tempo fix**: quotes the exact error (`field processors not found in type generator.Config`), names the version where the schema changed (2.7.2), and explains the fix (removed duplicate block, kept the `overrides` config) - **Postgres host fix**: describes the exact failure condition (staging-only stack, DNS resolution fails) and the safe default (`archive-db` preserved for production) - **Grafana port fix**: names the conflicting process and the symptom (`Bind for 127.0.0.1:3001 failed`) The test plan covers all three fixes with checkable items. ✅ ### One open question (non-blocking) The `POSTGRES_HOST` convention — `archiv-staging-db-1`, `archiv-production-db-1` — derives from Docker Compose's container naming scheme: `{project_name}-{service}-{instance}`. This is an implicit convention, not a configured value. The comments in CI acknowledge this: > A project rename requires updating this value. From a requirements perspective: is there a defined constraint on Compose project names for the staging and production stacks? If `archiv-staging` and `archiv-production` are established conventions (which the DEPLOYMENT.md suggests they are), then the implicit dependency is acceptable. If the project names could ever change as part of a migration, these hardcoded host names in CI become a silent failure waiting to happen. Suggestion: document the Compose project names as a first-class configuration constant somewhere (e.g., in DEPLOYMENT.md §3 or as a comment in the CI env files), so a future rename decision knows to also update `POSTGRES_HOST` in the CI secrets.

marcel commented

2026-05-16 09:26:16 +02:00

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Verdict: ✅ Approved — LGTM

This PR contains zero frontend changes. No Svelte components, no routes, no CSS, no ARIA attributes, no Tailwind classes were modified.

Checked:

No changes to frontend/src/
No changes to any .svelte file
No changes to layout.css or design tokens
No changes to user-facing error messages or i18n strings

Nothing in scope for UI, UX, or accessibility review. Approved unconditionally from a design perspective.

If any follow-up work adds GlitchTip's error reporting UI or Grafana dashboards to the user-facing application surface, loop me in for the UX review at that point.

## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist **Verdict: ✅ Approved — LGTM** This PR contains zero frontend changes. No Svelte components, no routes, no CSS, no ARIA attributes, no Tailwind classes were modified. Checked: - No changes to `frontend/src/` - No changes to any `.svelte` file - No changes to `layout.css` or design tokens - No changes to user-facing error messages or i18n strings Nothing in scope for UI, UX, or accessibility review. Approved unconditionally from a design perspective. --- *If any follow-up work adds GlitchTip's error reporting UI or Grafana dashboards to the user-facing application surface, loop me in for the UX review at that point.*

marcel added 1 commit 2026-05-16 09:35:46 +02:00

ci(obs): reference #603 in keep-in-sync comments; add obs-glitchtip to health assertion

CI / Unit & Component Tests (pull_request) Has been cancelled

Details

CI / OCR Service Tests (pull_request) Has been cancelled

Details

CI / Backend Unit Tests (pull_request) Has been cancelled

Details

CI / fail2ban Regex (pull_request) Has been cancelled

Details

CI / Compose Bucket Idempotency (pull_request) Has been cancelled

Details

b194b565f6

marcel added 1 commit 2026-05-16 09:36:41 +02:00

ci(obs): add obs-glitchtip to health assertion loop (now has /_health/ healthcheck)

CI / Unit & Component Tests (pull_request) Waiting to run

Details

CI / OCR Service Tests (pull_request) Waiting to run

Details

CI / Backend Unit Tests (pull_request) Waiting to run

Details

CI / fail2ban Regex (pull_request) Waiting to run

Details

CI / Compose Bucket Idempotency (pull_request) Waiting to run

Details

0bb0a314ad

marcel added 1 commit 2026-05-16 09:37:20 +02:00

fix(obs): add GlitchTip healthcheck on /_health/ (port 8080)

CI / Unit & Component Tests (push) Waiting to run

Details

CI / Unit & Component Tests (pull_request) Has been cancelled

Details

CI / OCR Service Tests (pull_request) Has been cancelled

Details

CI / Backend Unit Tests (pull_request) Has been cancelled

Details

CI / fail2ban Regex (pull_request) Has been cancelled

Details

CI / Compose Bucket Idempotency (pull_request) Has been cancelled

Details

CI / OCR Service Tests (push) Successful in 42s

Details

CI / fail2ban Regex (push) Has been cancelled

Details

CI / Compose Bucket Idempotency (push) Has been cancelled

Details

CI / Backend Unit Tests (push) Has been cancelled

Details

3658733003

marcel commented

2026-05-16 09:43:25 +02:00

🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ✅ Approved

Well-engineered. Three immediate wins stand out right away:

Secrets out of CI env heredocs — GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, POSTGRES_PASSWORD previously lived in the .env.staging / .env.production heredocs written to the workspace. They're now in /opt/familienarchiv/obs-secrets.env (chmod 600, written fresh from Gitea secrets). Net improvement.
rsync -a --delete for infra/observability/ — stale config files are removed on every run. Clean mirror.
Validate-before-start step — docker compose config --quiet catches missing vars and YAML errors before containers start. This is exactly the right gate.

GlitchTip healthcheck was overdue — docker compose up --wait now actually covers it, and the assert-health step can meaningfully verify its status.

Suggestions (not blockers)

docker-compose.observability.yml uses cp, not rsync.
The infra/observability/ directory gets rsync -a --delete (clean mirror), but the compose file itself is a plain cp. If the file is ever deleted from the repo, the stale copy persists at /opt/familienarchiv/. Acknowledged in ADR-016 and DEPLOYMENT.md. Low risk given how infrequently this file changes, but worth keeping in mind.

POSTGRES_HOST=archive-db in obs.env + manual restarts.
If someone restarts the obs stack on the server using only --env-file obs.env (forgetting the secrets file), POSTGRES_HOST defaults to archive-db. The manual restart command in DEPLOYMENT.md is correct (includes both files). No code change needed — this is a doc awareness note.

/opt/familienarchiv/ directory permissions.
obs-secrets.env gets chmod 600 (good). The directory itself has default permissions. Not a blocker for a single-tenant runner, but worth knowing if another service ever gets write access to the runner host.

Everything else is clean. The two-source env model (git-tracked non-secrets + CI-written secrets) is the right pattern here.

## 🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer **Verdict: ✅ Approved** Well-engineered. Three immediate wins stand out right away: 1. **Secrets out of CI env heredocs** — `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, `POSTGRES_PASSWORD` previously lived in the `.env.staging` / `.env.production` heredocs written to the workspace. They're now in `/opt/familienarchiv/obs-secrets.env` (chmod 600, written fresh from Gitea secrets). Net improvement. 2. **`rsync -a --delete` for `infra/observability/`** — stale config files are removed on every run. Clean mirror. 3. **Validate-before-start step** — `docker compose config --quiet` catches missing vars and YAML errors before containers start. This is exactly the right gate. GlitchTip healthcheck was overdue — `docker compose up --wait` now actually covers it, and the assert-health step can meaningfully verify its status. ### Suggestions (not blockers) **`docker-compose.observability.yml` uses `cp`, not `rsync`.** The `infra/observability/` directory gets `rsync -a --delete` (clean mirror), but the compose file itself is a plain `cp`. If the file is ever deleted from the repo, the stale copy persists at `/opt/familienarchiv/`. Acknowledged in ADR-016 and DEPLOYMENT.md. Low risk given how infrequently this file changes, but worth keeping in mind. **`POSTGRES_HOST=archive-db` in obs.env + manual restarts.** If someone restarts the obs stack on the server using only `--env-file obs.env` (forgetting the secrets file), `POSTGRES_HOST` defaults to `archive-db`. The manual restart command in DEPLOYMENT.md is correct (includes both files). No code change needed — this is a doc awareness note. **`/opt/familienarchiv/` directory permissions.** `obs-secrets.env` gets `chmod 600` (good). The directory itself has default permissions. Not a blocker for a single-tenant runner, but worth knowing if another service ever gets write access to the runner host. Everything else is clean. The two-source env model (git-tracked non-secrets + CI-written secrets) is the right pattern here.

marcel commented

2026-05-16 09:43:36 +02:00

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: ✅ Approved

Pure infrastructure — no Java, TypeScript, or Svelte touched. Reviewing the CI scripts on their own terms.

What I verified

HEREDOC quoting is correct.
The workflow steps use <<'EOF' (single-quoted delimiter), which prevents the shell from expanding anything between the EOF markers. The ${{ secrets.X }} tokens are Gitea Actions template expressions — they're substituted by the Actions engine before the YAML is executed as a shell script. So by the time the shell runs, the secrets are already plain values. The 'EOF' quoting just prevents a second round of shell expansion on those already-expanded values (important if a secret contains $(), backticks, or similar). This is intentionally correct.

Assert-health script logic is clean.

for svc in obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip; do
    status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")
    if [ "$status" != "healthy" ]; then
        echo "::error::$svc is not healthy (status: $status)"
        unhealthy="$unhealthy $svc"
    fi
done
[ -z "$unhealthy" ] || exit 1

Accumulate-then-fail pattern is the right approach — you see all unhealthy services in one run, not just the first. The 2>/dev/null || echo "missing" handles containers that failed to start entirely (no inspect output).

Comments justify WHY, not WHAT.
The multi-line comments per CI step are extensive, but warranted. The env-file ordering rationale, which services lack healthchecks, why absolute paths are needed for bind mounts — these are non-obvious infrastructure constraints that would take time to re-derive. No comment explains what a YAML key does; they all explain the constraint behind the choice. That's the bar.

Suggestion

The Validate observability compose config and Assert observability stack health steps are copy-pasted identically between nightly.yml and release.yml. The comments even note "Keep in sync with the equivalent step in nightly.yml (#603)." This is manual sync debt — if the assert logic changes, it needs to be updated in two places. For now it's fine, but worth extracting into a composite action if this pattern expands further.

## 👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer **Verdict: ✅ Approved** Pure infrastructure — no Java, TypeScript, or Svelte touched. Reviewing the CI scripts on their own terms. ### What I verified **HEREDOC quoting is correct.** The workflow steps use `<<'EOF'` (single-quoted delimiter), which prevents the shell from expanding anything between the EOF markers. The `${{ secrets.X }}` tokens are Gitea Actions template expressions — they're substituted by the Actions engine *before* the YAML is executed as a shell script. So by the time the shell runs, the secrets are already plain values. The `'EOF'` quoting just prevents a second round of shell expansion on those already-expanded values (important if a secret contains `$()`, backticks, or similar). This is intentionally correct. **Assert-health script logic is clean.** ```bash for svc in obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip; do status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing") if [ "$status" != "healthy" ]; then echo "::error::$svc is not healthy (status: $status)" unhealthy="$unhealthy $svc" fi done [ -z "$unhealthy" ] || exit 1 ``` Accumulate-then-fail pattern is the right approach — you see all unhealthy services in one run, not just the first. The `2>/dev/null || echo "missing"` handles containers that failed to start entirely (no inspect output). **Comments justify WHY, not WHAT.** The multi-line comments per CI step are extensive, but warranted. The env-file ordering rationale, which services lack healthchecks, why absolute paths are needed for bind mounts — these are non-obvious infrastructure constraints that would take time to re-derive. No comment explains what a YAML key does; they all explain the constraint behind the choice. That's the bar. ### Suggestion The `Validate observability compose config` and `Assert observability stack health` steps are copy-pasted identically between `nightly.yml` and `release.yml`. The comments even note "Keep in sync with the equivalent step in nightly.yml (#603)." This is manual sync debt — if the assert logic changes, it needs to be updated in two places. For now it's fine, but worth extracting into a composite action if this pattern expands further.

marcel commented

2026-05-16 09:43:51 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

What improved

Secrets out of CI workspace files — significant improvement.
Previously, GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and GLITCHTIP_DOMAIN were written into the CI workspace as part of .env.staging / .env.production. Those workspace files get deleted after the run, but the secrets briefly existed as plaintext in a path that's potentially logged, snapshotted, or cached by the runner. They now live only in /opt/familienarchiv/obs-secrets.env (chmod 600), written directly to the permanent path from Gitea secrets. This is a meaningful reduction in secret exposure surface.

Two-source model is correctly structured.
Non-secret config in git-tracked obs.env (ports, URLs, usernames). Secrets in CI-written obs-secrets.env (passwords, keys). The separation is real — obs.env contains nothing sensitive. Rotating a secret only requires updating the Gitea secret; no PR needed.

HEREDOC quoting (confirmed correct)

<<'EOF' is the right choice here. It prevents the shell from performing a second round of variable expansion on the already-substituted secret values. If a secret contained $(command), backticks, or ${var}, shell expansion without 'EOF' quoting would interpret them. Single-quoting the delimiter is the defensive choice. (CWE-78 prevention at CI layer.)

Runner mount blast radius (acknowledged, acceptable)

ADR-016 and runner-config.yaml both document this explicitly:

"Read/write access to /opt/familienarchiv/ — including the main app's compose files, Caddy config, and observability configs. A malicious workflow step could overwrite any file in that directory."

For a single-tenant runner running trusted, private-repo code only, this is a sound judgment call. The security note in runner-config.yaml is unusually honest and thorough — it correctly names the blast radius and explicitly warns against adding external contributors. That documentation is itself a security control.

Minor observations (not blockers)

/opt/familienarchiv/ directory permissions are default (755). obs-secrets.env is protected at file level (600). Acceptable since the runner is single-tenant.
The POSTGRES_USER=archiv appears in git-tracked obs.env. Usernames are generally not considered secret, so this is standard practice. No issue.
There is no sanitization needed in the healthcheck shell loop — it only uses hardcoded service names, no user input. No injection surface.

## 🔒 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** ### What improved **Secrets out of CI workspace files — significant improvement.** Previously, `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, and `GLITCHTIP_DOMAIN` were written into the CI workspace as part of `.env.staging` / `.env.production`. Those workspace files get deleted after the run, but the secrets briefly existed as plaintext in a path that's potentially logged, snapshotted, or cached by the runner. They now live only in `/opt/familienarchiv/obs-secrets.env` (chmod 600), written directly to the permanent path from Gitea secrets. This is a meaningful reduction in secret exposure surface. **Two-source model is correctly structured.** Non-secret config in git-tracked `obs.env` (ports, URLs, usernames). Secrets in CI-written `obs-secrets.env` (passwords, keys). The separation is real — `obs.env` contains nothing sensitive. Rotating a secret only requires updating the Gitea secret; no PR needed. ### HEREDOC quoting (confirmed correct) `<<'EOF'` is the right choice here. It prevents the shell from performing a second round of variable expansion on the already-substituted secret values. If a secret contained `$(command)`, backticks, or `${var}`, shell expansion without `'EOF'` quoting would interpret them. Single-quoting the delimiter is the defensive choice. (CWE-78 prevention at CI layer.) ### Runner mount blast radius (acknowledged, acceptable) ADR-016 and `runner-config.yaml` both document this explicitly: > "Read/write access to /opt/familienarchiv/ — including the main app's compose files, Caddy config, and observability configs. A malicious workflow step could overwrite any file in that directory." For a single-tenant runner running trusted, private-repo code only, this is a sound judgment call. The security note in runner-config.yaml is unusually honest and thorough — it correctly names the blast radius and explicitly warns against adding external contributors. That documentation is itself a security control. ### Minor observations (not blockers) - `/opt/familienarchiv/` directory permissions are default (755). `obs-secrets.env` is protected at file level (600). Acceptable since the runner is single-tenant. - The `POSTGRES_USER=archiv` appears in git-tracked `obs.env`. Usernames are generally not considered secret, so this is standard practice. No issue. - There is no sanitization needed in the healthcheck shell loop — it only uses hardcoded service names, no user input. No injection surface.

marcel commented

2026-05-16 09:44:02 +02:00

🏛️ Markus Keller (@mkeller) — Application Architect

Verdict: ✅ Approved

ADR quality

ADR-016 is well-constructed. The context explains the specific failure mode (workspace wipe breaks bind-mount sources). Three alternatives are considered with concrete rejection reasons — not hand-waving. The consequences section honestly names both negatives: cp doesn't remove deleted files, and the runner mount expands blast radius. An ADR that hides negatives is worse than no ADR. This one doesn't.

Documentation checklist

Per the doc-update matrix:

Changed	Required doc update	Status
New infrastructure approach / permanent obs path	`docs/architecture/c4/l2-containers.puml`	✓ Updated (System_Boundary path)
New infrastructure approach	`docs/DEPLOYMENT.md`	✓ Comprehensive §4 added
Architectural decision with lasting consequences	New ADR in `docs/adr/`	✓ ADR-016 added

No new runtime services were added (GlitchTip already existed in the C4 diagram). The obs.env two-source model is a configuration pattern, not a new container, so no new C4 node is needed. All required docs are updated.

Architectural observation

The co-location decision at /opt/familienarchiv/ is architecturally correct. GlitchTip is coupled to the main app's PostgreSQL (archive-db) and archiv-net. Keeping it at a separate path (e.g., /opt/obs/) would imply isolation that doesn't actually exist. Co-location reflects the real coupling.

The two-source env model — git-tracked non-secrets + CI-written secrets — is a 12-factor pattern applied appropriately here. The boundary between the two files is principled: anything that can be in a PR review is in obs.env; anything that requires a secret rotation is in obs-secrets.env. No mixing.

Minor suggestion

The "Note: cp -r does not remove deleted files" warning in DEPLOYMENT.md appears in a blockquote after an extensive code block and table. It's important operational knowledge. A > ⚠️ prefix or a dedicated "Pitfalls" callout would make it harder to miss. Low priority.

## 🏛️ Markus Keller (@mkeller) — Application Architect **Verdict: ✅ Approved** ### ADR quality ADR-016 is well-constructed. The context explains the specific failure mode (workspace wipe breaks bind-mount sources). Three alternatives are considered with concrete rejection reasons — not hand-waving. The consequences section honestly names both negatives: `cp` doesn't remove deleted files, and the runner mount expands blast radius. An ADR that hides negatives is worse than no ADR. This one doesn't. ### Documentation checklist Per the doc-update matrix: | Changed | Required doc update | Status | |---|---|---| | New infrastructure approach / permanent obs path | `docs/architecture/c4/l2-containers.puml` | ✓ Updated (System_Boundary path) | | New infrastructure approach | `docs/DEPLOYMENT.md` | ✓ Comprehensive §4 added | | Architectural decision with lasting consequences | New ADR in `docs/adr/` | ✓ ADR-016 added | No new runtime services were added (GlitchTip already existed in the C4 diagram). The `obs.env` two-source model is a configuration pattern, not a new container, so no new C4 node is needed. All required docs are updated. ### Architectural observation The co-location decision at `/opt/familienarchiv/` is architecturally correct. GlitchTip is coupled to the main app's PostgreSQL (`archive-db`) and `archiv-net`. Keeping it at a separate path (e.g., `/opt/obs/`) would imply isolation that doesn't actually exist. Co-location reflects the real coupling. The two-source env model — git-tracked non-secrets + CI-written secrets — is a 12-factor pattern applied appropriately here. The boundary between the two files is principled: anything that can be in a PR review is in `obs.env`; anything that requires a secret rotation is in `obs-secrets.env`. No mixing. ### Minor suggestion The "Note: `cp -r` does not remove deleted files" warning in DEPLOYMENT.md appears in a blockquote after an extensive code block and table. It's important operational knowledge. A `> ⚠️` prefix or a dedicated "Pitfalls" callout would make it harder to miss. Low priority.

marcel commented

2026-05-16 09:44:13 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

Reviewing in Brownfield mode — auditing the PR against issue #601's stated requirements.

PR description quality

Excellent. Each of the three fixes includes:

The exact error message or failure mode (field processors not found in type generator.Config, Bind for 127.0.0.1:3001 failed, DNS resolution failure)
The root cause (schema breaking change in Tempo 2.7.2, hardcoded hostname, port collision)
The specific fix applied and why it's safe (processors were already configured under overrides)

This is the level of context that makes a PR reviewable without chasing down the original issue.

Requirements coverage

Stated bug	Fix	Covered?
Tempo refuses to start	Remove duplicate `metrics_generator.processors` from top-level	✓
GlitchTip fails DNS when only staging running	`${POSTGRES_HOST:-archive-db}` variable substitution	✓
Grafana port 3001 conflicts with staging frontend	Default changed to 3003	✓

No scope creep. No missing requirements.

Test plan completeness

All five checkboxes are verifiable and directly tied to the three bugs. No ambiguity in what "passing" looks like for each. The docker compose config validates-without-errors check is the correct gate for item 1.

One clarifying question (not a blocker)

obs.env sets POSTGRES_HOST=archive-db and the comment says "For local dev, set POSTGRES_HOST in your .env file (defaults to archive-db there)." But obs.env itself sets the value to archive-db. For a developer who runs the obs stack locally with --env-file obs.env, they'd get archive-db as the hostname. Is that the correct container name in the dev compose setup (i.e., does the dev docker-compose define container_name: archive-db explicitly)? If yes, the comment should say "defaults to archive-db via obs.env" rather than "set it in your .env file." Minor phrasing clarification only.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** Reviewing in Brownfield mode — auditing the PR against issue #601's stated requirements. ### PR description quality Excellent. Each of the three fixes includes: - The exact error message or failure mode (`field processors not found in type generator.Config`, `Bind for 127.0.0.1:3001 failed`, DNS resolution failure) - The root cause (schema breaking change in Tempo 2.7.2, hardcoded hostname, port collision) - The specific fix applied and why it's safe (processors were already configured under `overrides`) This is the level of context that makes a PR reviewable without chasing down the original issue. ### Requirements coverage | Stated bug | Fix | Covered? | |---|---|---| | Tempo refuses to start | Remove duplicate `metrics_generator.processors` from top-level | ✓ | | GlitchTip fails DNS when only staging running | `${POSTGRES_HOST:-archive-db}` variable substitution | ✓ | | Grafana port 3001 conflicts with staging frontend | Default changed to 3003 | ✓ | No scope creep. No missing requirements. ### Test plan completeness All five checkboxes are verifiable and directly tied to the three bugs. No ambiguity in what "passing" looks like for each. The `docker compose config` validates-without-errors check is the correct gate for item 1. ### One clarifying question (not a blocker) `obs.env` sets `POSTGRES_HOST=archive-db` and the comment says "For local dev, set POSTGRES_HOST in your .env file (defaults to archive-db there)." But obs.env itself sets the value to `archive-db`. For a developer who runs the obs stack locally with `--env-file obs.env`, they'd get `archive-db` as the hostname. Is that the correct container name in the dev compose setup (i.e., does the dev docker-compose define `container_name: archive-db` explicitly)? If yes, the comment should say "defaults to archive-db via obs.env" rather than "set it in your .env file." Minor phrasing clarification only.

marcel commented

2026-05-16 09:44:26 +02:00

🧪 Sara Holt (@saraholt) — QA Engineer

Verdict: ✅ Approved

No application test code to review. Evaluating the CI verification additions.

What was added for verification

Validate observability compose config step
docker compose ... config --quiet is a static correctness check: resolves all variable substitutions and reports any missing required keys or YAML errors. This runs before any container starts, which is the right gate placement. Catches broken configs fast and cheaply.

Assert observability stack health step
Explicitly loops 5 named services: obs-loki, obs-prometheus, obs-grafana, obs-tempo, obs-glitchtip. The script:

Uses 2>/dev/null || echo "missing" to handle containers that failed to start entirely (inspect fails, status becomes "missing", which != "healthy" — correctly flagged)
Accumulates all failures before exiting — you see every unhealthy service in one run
Prints ::error:: format for Gitea Actions error annotations

The 5 services checked match exactly those described as "healthchecked critical services" in the step comment. The 4 excluded services (promtail, cadvisor, node-exporter, glitchtip-worker) correctly have no healthcheck directive, so asserting their health status would always fail.

GlitchTip healthcheck added
Before this PR, docker compose up --wait silently skipped GlitchTip because it had no healthcheck. Now it has one: wget -qO- http://localhost:8080/_health/ || exit 1, 30s interval, 5 retries, 60s start_period. The 60s start_period is appropriate — GlitchTip runs Django migrations on first start. The assert-health step can now actually verify it.

Known gap (acceptable)

No rollback assertion after failure. If the obs stack fails health checks, the pipeline fails (correct) but does not attempt recovery. Subsequent CI runs will also fail until the issue is resolved manually. Since the obs stack is secondary infrastructure (not the main app pipeline), this is an acceptable operational constraint. Worth documenting in a runbook if it becomes a recurring pain point.

Duplication in nightly.yml / release.yml

The Validate and Assert steps are copy-pasted between both files. The comments acknowledge this ("Keep in sync with the equivalent step in nightly.yml (#603)"). If the assertion logic needs to change (e.g., adding a new service to the health check list), both files must be updated. No immediate action needed, but a composite action would eliminate this debt.

## 🧪 Sara Holt (@saraholt) — QA Engineer **Verdict: ✅ Approved** No application test code to review. Evaluating the CI verification additions. ### What was added for verification **`Validate observability compose config` step** `docker compose ... config --quiet` is a static correctness check: resolves all variable substitutions and reports any missing required keys or YAML errors. This runs before any container starts, which is the right gate placement. Catches broken configs fast and cheaply. **`Assert observability stack health` step** Explicitly loops 5 named services: `obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`, `obs-glitchtip`. The script: - Uses `2>/dev/null || echo "missing"` to handle containers that failed to start entirely (inspect fails, status becomes "missing", which != "healthy" — correctly flagged) - Accumulates all failures before exiting — you see every unhealthy service in one run - Prints `::error::` format for Gitea Actions error annotations The 5 services checked match exactly those described as "healthchecked critical services" in the step comment. The 4 excluded services (promtail, cadvisor, node-exporter, glitchtip-worker) correctly have no healthcheck directive, so asserting their health status would always fail. **GlitchTip healthcheck added** Before this PR, `docker compose up --wait` silently skipped GlitchTip because it had no healthcheck. Now it has one: `wget -qO- http://localhost:8080/_health/ || exit 1`, 30s interval, 5 retries, 60s start_period. The 60s start_period is appropriate — GlitchTip runs Django migrations on first start. The assert-health step can now actually verify it. ### Known gap (acceptable) No rollback assertion after failure. If the obs stack fails health checks, the pipeline fails (correct) but does not attempt recovery. Subsequent CI runs will also fail until the issue is resolved manually. Since the obs stack is secondary infrastructure (not the main app pipeline), this is an acceptable operational constraint. Worth documenting in a runbook if it becomes a recurring pain point. ### Duplication in nightly.yml / release.yml The `Validate` and `Assert` steps are copy-pasted between both files. The comments acknowledge this ("Keep in sync with the equivalent step in nightly.yml (#603)"). If the assertion logic needs to change (e.g., adding a new service to the health check list), both files must be updated. No immediate action needed, but a composite action would eliminate this debt.

marcel commented

2026-05-16 09:44:30 +02:00

🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

No UI changes in this PR. Every touched file is infrastructure:

.gitea/workflows/nightly.yml — CI pipeline
.gitea/workflows/release.yml — CI pipeline
docker-compose.observability.yml — container config
runner-config.yaml — Gitea Act runner config
infra/observability/obs.env — non-secret env config
infra/observability/tempo/tempo.yml — Tempo config
docs/adr/016-obs-stack-co-location-ci-push.md — ADR
docs/DEPLOYMENT.md — ops documentation
docs/architecture/c4/l2-containers.puml — architecture diagram
.env.example — developer onboarding

No Svelte components, no routes, no Tailwind classes, no i18n keys, no design tokens, no accessibility implications. No UX concerns raised.

## 🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** No UI changes in this PR. Every touched file is infrastructure: - `.gitea/workflows/nightly.yml` — CI pipeline - `.gitea/workflows/release.yml` — CI pipeline - `docker-compose.observability.yml` — container config - `runner-config.yaml` — Gitea Act runner config - `infra/observability/obs.env` — non-secret env config - `infra/observability/tempo/tempo.yml` — Tempo config - `docs/adr/016-obs-stack-co-location-ci-push.md` — ADR - `docs/DEPLOYMENT.md` — ops documentation - `docs/architecture/c4/l2-containers.puml` — architecture diagram - `.env.example` — developer onboarding No Svelte components, no routes, no Tailwind classes, no i18n keys, no design tokens, no accessibility implications. No UX concerns raised.

marcel merged commit 3658733003 into main

2026-05-16 09:46:12 +02:00

marcel deleted branch fix/issue-601-obs-stack-permanent

2026-05-16 09:46:14 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#602