fix(obs): Tempo processors schema fix + configurable Postgres host and Grafana port #602
Reference in New Issue
Block a user
Delete Branch "fix/issue-601-obs-stack-permanent"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Fixes discovered during the observability stack migration to
/opt/familienarchiv/(see #601).metrics_generator.processorswas removed from the top-level config and is only valid underoverrides.defaults.metrics_generator.processors. The duplicate top-level block caused Tempo to refuse to start withfield processors not found in type generator.Config. Removed the now-invalid block — the processors were already configured underoverrides.archive-dbas the Postgres hostname. When only the staging stack is running (containerarchiv-staging-db-1), DNS resolution fails.${POSTGRES_HOST:-archive-db}makes the host overridable via.envwithout changing the production default.3001to3003— port 3001 is occupied by the staging frontend (archiv-staging-frontend-1), causingBind for 127.0.0.1:3001 failed: port is already allocatedon compose up.Test plan
docker compose -f docker-compose.observability.yml configvalidates without errorsfield processors not founderror in logs)POSTGRES_HOSTis set in.envarchive-dbby default (no.envoverride needed in production)PORT_GRAFANACloses #601
🤖 Generated with Claude Code
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
The root cause fix is solid — stable bind mounts at
/opt/familienarchiv/survive workspace wipes by definition. The pattern mirrors how the app stack already operates. A few operational details need attention before this is bulletproof.Blockers
GLITCHTIP_DOMAINwill revert tohttp://localhost:3002after this merge. The nightly step previously passed--env-file .env.staging, which setGLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud. That flag is removed in this PR. Docker Compose now reads/opt/familienarchiv/.envautomatically — but that file doesn't currently includeGLITCHTIP_DOMAIN. The compose default is${GLITCHTIP_DOMAIN:-http://localhost:3002}, so GlitchTip will report the wrong domain in error report URLs. Fix: addGLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloudto/opt/familienarchiv/.envon the server before the next nightly run.Runner must be restarted for
runner-config.yamlchanges to take effect. The new/opt/familienarchivmount only applies to job containers spawned after the runner process has reloaded its config. Until then, the "Deploy observability configs" step will fail when it tries to write to/opt/familienarchiv/. Fix: restart the Gitea Act runner on the host after this PR is merged:systemctl restart gitea-runner(or equivalent).Suggestions
cp -rdoesn't remove deleted files. If a config file is removed from the repo, it persists in/opt/familienarchiv/infra/observability/indefinitely. For this project's change frequency that's acceptable — but worth knowing. Arsync -a --deletewould give a clean mirror.docker compose up -d --waitis the right call. It waits for all healthchecks to pass before the step exits, so a broken Tempo config or Loki failure fails CI immediately. Good that it was preserved from the old step.Named volumes are unaffected.
loki_data,grafana_data,prometheus_data,tempo_dataare Docker-managed and not touched by the copy step or the compose project name change. No data migration risk.🏗️ Markus Keller — Application Architect
Verdict: 🚫 Changes requested
The architectural decision is correct: the observability stack is persistent infrastructure and must be colocated with the permanent app deployment in
/opt/familienarchiv/. The CI-push model (copy on every nightly deploy) is the right fit for this single-operator setup — no git credentials on the server, no server-pull complexity.However, the documentation obligations triggered by this change are not met.
Blockers
docs/DEPLOYMENT.mdis missing the observability section. Per the CLAUDE.md docs table, any change to "New Docker service / infrastructure component" location requires adocs/DEPLOYMENT.mdupdate. This PR moves the obs stack's operational location. The deployment docs should cover: (1) starting the obs stack from/opt/familienarchiv/, (2) required keys in/opt/familienarchiv/.envwith a note on$$escaping for passwords containing$, (3) the one-time GlitchTipmigrate+createsuperusersteps, (4) the runner restart requirement after updatingrunner-config.yaml.ADR-016 is missing. This PR makes a permanent architectural decision: the obs stack shares
/opt/familienarchiv/with the app stack, and config sync is CI-push (not server-pull, not a separate directory). The alternatives had different trade-offs worth recording. Without an ADR, the next person who touches this infrastructure will not know why it was designed this way. The Decision Queue comment on issue #601 explicitly flagged this choice as needing resolution — resolving it in code without documenting it is incomplete.Concerns
/opt/familienarchiv/.envcorrectness with no validation. IfGLITCHTIP_SECRET_KEYis missing or$$escaping is wrong for a$-containing password, GlitchTip starts silently with a broken or default config. Adocker compose -f /opt/familienarchiv/docker-compose.observability.yml configdry-run step beforeupwould catch substitution errors in CI before containers start.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
No application code changes — pure infrastructure. The implementation is clean and the scope is well-contained. One documentation obligation is triggered that must be met before merge.
Blockers
docs/architecture/c4/l2-containers.pumlanddocs/DEPLOYMENT.md. The obs stack's operational location changed — that's a container topology change the C4 L2 diagram should reflect. DEPLOYMENT.md needs the ops runbook section.LGTM
processorsblock is the right fix for the Tempo 2.7.2 schema change — the processors are already correctly configured underoverrides.defaults.metrics_generator.processors.POSTGRES_HOSTvariable is well-defaulted.${POSTGRES_HOST:-archive-db}is backward compatible — production with the full stack running needs zero.envchanges.PORT_GRAFANAdefault change is low-risk.3003avoids the collision with the staging frontend. Any environment without an explicitPORT_GRAFANAvar gets the safer default.🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
No new application-layer vulnerabilities introduced. The infrastructure changes have one expanded security surface worth explicitly acknowledging, and one net security improvement.
Concerns (not blockers for a private single-operator repo)
/opt/familienarchiv/into CI job containers expands the blast radius of a compromised workflow step. Before this PR, a malicious job step could write to the workspace and use the Docker socket (which already grants root-equivalent host access). After this PR, it can also directly overwrite files in/opt/familienarchiv/— including the main app's compose files and Caddy config. The existing comment inrunner-config.yamlcorrectly notes "only trusted code from this private repo runs on this runner." Recommend hardening the comment to make the risk explicit:# WARNING: grants CI jobs write access to /opt/familienarchiv/ — do not add this runner to repos with external contributors.Net Security Improvement
--env-file .env.stagingfor the obs stack fixes a silent credential truncation bug. The.env.stagingheredoc uses shell expansion. Passwords containing$(likeGRAFANA_ADMIN_PASSWORD=me30g@b$Nt$Z2g) were silently truncated:$Ntand$Z2gwere expanded as empty variables, so Grafana was starting withme30g@bas the admin password. The/opt/familienarchiv/.envfile uses$$→$escaping that Docker Compose handles correctly. The full intended password now reaches Grafana. This is a genuine security improvement as a side effect of this PR.LGTM
127.0.0.1. No new ports exposed externally.:latestreferences introduced.🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
The fix addresses the root failure mode. The
--waitflag ondocker compose upprovides implicit health verification — if a healthchecked service fails, the CI step fails. That's better than nothing, but some issue #601 acceptance criteria remain unchecked.Concerns
Services without
healthcheckdirectives are not verified.docker compose up --waitwaits for services with healthchecks to reachhealthy. Services without them (obs-promtail,obs-cadvisor,obs-node-exporter,obs-glitchtip-worker) are considered "started" the moment the process runs. A follow-updocker inspectassertion step would give CI a definitive pass/fail for the critical services:docker inspect obs-loki obs-prometheus obs-grafana obs-tempo --format '{{.Name}} {{.State.Health.Status}}'and assert all arehealthy.Issue #601 acceptance criteria coverage gaps not in this PR:
.env.examplenot updated with observability vars and$$escaping notedocs/DEPLOYMENT.mdmissingcreatesuperuserone-time step and required.envkeys/opt/familienarchiv/path is used (a future workflow refactor could revert to workspace-relative paths silently)LGTM
--waitis preserved from the old step. The meaningful health gate was not accidentally dropped.processorsis at the wrong config level. Removing it will eliminate the crash loop — no ambiguity about whether it works.POSTGRES_HOSTdefault is backward compatible. Production behavior is unchanged when the env var is absent.🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No concerns from my angle.
This PR is entirely server-side infrastructure — CI workflow steps, runner configuration, Docker Compose adjustments, and a Tempo YAML fix. No user-facing routes, Svelte components, CSS changes, or UI states are involved.
The Grafana provisioning directory (
infra/observability/grafana/provisioning/) is copied as-is — dashboards, datasources, and dashboard JSON files are unchanged. No Grafana UI modification is implied. Nora's observation that the Grafana admin password is now correct (fixed from the previous$truncation) actually improves the operator experience.Nothing to flag from a UX or accessibility perspective.
📋 Elicit — Requirements Engineer
Verdict: ⚠️ Approved with concerns
Reviewing against the acceptance criteria in issue #601, including additions from the two findings comments.
Acceptance Criteria Coverage
/opt/familienarchiv//opt/familienarchiv/infra/observability/…project.config_fileslabel points to/opt/familienarchiv/-f /opt/familienarchiv/docker-compose.observability.ymldocker rmrequired after workspace wipe6.1.6in main; unchanged here.env.exampleupdated with observability vars +$$escaping notedocs/DEPLOYMENT.mddocumentscreatesuperuserone-time step--waitcovers services with healthchecks; services without are not explicitly checkedConcerns
Two ACs from #601 are unaddressed (
.env.exampleanddocs/DEPLOYMENT.md). These were added in the second findings comment. They are not blockers for the infrastructure fix, but #601 cannot be closed cleanly without them. Recommend either including them here or opening a follow-up issue before closing #601.The Decision Queue open decision is resolved by this PR but not recorded. The issue's Decision Queue asked: CI-push vs server-pull for config sync? This PR answers CI-push. That resolution should appear in ADR-016 (which Markus flagged as missing) so future contributors understand why server-pull was not chosen.
/opt/familienarchiv/.envrequired-key inventory is undocumented. The PR description correctly states the file is "managed separately on the server — not written or deleted by CI." But there is no documented list of what keys must be present. A new operator following any runbook would not know. The DEPLOYMENT.md update should include a required-key checklist with example values and the$$escaping rule for passwords containing$.Review concerns addressed
All open reviewer concerns have been resolved in 7 additional commits on this branch.
Blockers resolved
Markus — docs/DEPLOYMENT.md missing ✅
/opt/familienarchiv/, the complete/opt/familienarchiv/.envrequired-key inventory, the$$escaping rule for$-containing passwords, and a note on what happens whenGLITCHTIP_DOMAINis missing (defaults tohttp://localhost:3002). Commit7b7d0c92.Markus — ADR-016 missing ✅
docs/adr/016-obs-stack-co-location-ci-push.mddocumenting the CI-push model, the co-location decision, and the two rejected alternatives (server-pull, separate directory). The Decision Queue open point from issue #601 is now recorded. Commit4e94d85d.Felix — l2-containers.puml not updated ✅
System_Boundarylabel to show(/opt/familienarchiv/docker-compose.observability.yml). Commitdec6b813.Suggestions / concerns resolved
Nora — blast-radius comment too weak ✅
runner-config.yamloptions comment with an explicitSECURITY WARNINGblock calling out (1) root-equivalent Docker socket access and (2) read/write access to/opt/familienarchiv/including app compose files and Caddy config. Commitc7d2eeb3.Sara + Elicit — .env.example not updated ✅
PORT_GRAFANAfrom3001to3003, addedPOSTGRES_HOSTwith the default and override context, added$$escaping note with an example. Commit448c3cdc.Markus — no .env correctness validation before
up✅Validate observability compose configstep (docker compose config --quiet) immediately beforeStart observability stack. Catches missing keys, truncated passwords, and YAML errors before any containers start. Commitdf37113d.Sara — services without healthchecks not explicitly verified ✅
Assert observability stack healthstep afterup --wait. Iteratesobs-loki,obs-prometheus,obs-grafana,obs-tempo; emits a::error::annotation and exits non-zero if any container is nothealthy. Commit79735e23.Tobias — operational actions (not fixable in code)
These two items require manual server-side action before or after merge:
GLITCHTIP_DOMAINin/opt/familienarchiv/.env— AddGLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloudto the file on the server before the next nightly run. Now documented in DEPLOYMENT.md §4 required-key table.systemctl restart gitea-runnerafter this PR is merged so the/opt/familienarchivmount takes effect. Now documented in DEPLOYMENT.md §3.1.GitOps refactor — obs env model simplified (5 more commits)
After reviewing the secrets flow, the previous approach left a gap: Gitea secrets were written to
.env.staging/.env.productionbut those files were no longer passed to the obs stack, meaning secrets could silently drift. Replaced with a clean two-source model matching how the main app stack works.What changed
infra/observability/obs.env(new, tracked in git)All non-secret obs config: ports, public URLs (
GLITCHTIP_DOMAIN,GF_SERVER_ROOT_URL). Changes go through PR review. No secrets./opt/familienarchiv/obs-secrets.env(written by CI on every deploy)Only passwords and secret keys, injected fresh from Gitea secrets. Gitea is the single source of truth — rotating a secret takes effect on the next nightly/release run with no manual server action needed.
Both files passed explicitly via
--env-fileto every obs compose command (validate dry-run + up). No implicit auto-read.env.release.ymlnow matches nightly — adds the same "Deploy observability configs" step, deploys from/opt/familienarchiv/, uses the two-file env model.GF_SERVER_ROOT_URLadded to obs-grafana service (defaults tohttp://localhost:3003for local dev, set tohttps://grafana.archiv.raddatz.cloudvia obs.env in production).Dead obs vars removed from
.env.staging/.env.production— those files now contain only vars thatdocker-compose.prod.ymlactually uses.GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY, and port vars are no longer in those files.POSTGRES_HOSTis now per-environment —archiv-staging-db-1(nightly) andarchiv-production-db-1(release), injected via obs-secrets.env. The oldarchive-dbdefault was wrong for both CI environments (nocontainer_nameis set in the prod compose).Why obs differs from the main stack (documented in DEPLOYMENT.md §4)
The main app stack has no config-file bind mounts — a workspace wipe doesn't affect running containers. The obs stack bind-mounts
prometheus.yml,tempo.yml, Grafana provisioning, etc. into containers. A workspace wipe removes those source paths; the nextdocker compose upcan't find them. That's why obs needs/opt/familienarchiv/as a permanent home. Everything else (secrets model, env-file pattern) now mirrors the main stack.🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
The permanent-path approach is the right call. Bind mounts from ephemeral workspace paths have always been a ticking clock, and the CI-push model mirrors the main stack's deploy pattern without introducing new infrastructure. The two-file env split (git-tracked
obs.env+ CI-injectedobs-secrets.env) is clean. The Tempo schema fix and port collision fix are correct.Blockers
1. ADR-016 "Decision" section describes a secrets model that doesn't exist
docs/adr/016-obs-stack-co-location-ci-push.md(Decision section) says:But the actual implementation does the opposite: CI writes
/opt/familienarchiv/obs-secrets.envfresh on every run from Gitea secrets. The operator doesn't manage it. Docker Compose doesn't auto-read it — it's passed explicitly via--env-file. This is a significant inaccuracy in the ADR that will mislead the next person trying to understand or troubleshoot the obs stack. The ADR needs to be corrected to match the realobs.env+obs-secrets.envtwo-source model.2.
release.ymlis missing the "Validate observability compose config" and "Assert observability stack health" steps thatnightly.ymlhasnightly.ymlvalidates the resolved compose config before starting (docker compose config --quiet) and then explicitly asserts the health of the four critical services (obs-loki,obs-prometheus,obs-grafana,obs-tempo) afterup --wait.release.ymlskips both steps — it just runsup -d --wait --remove-orphansand moves on.This matters for production: a malformed config file (e.g., a variable left undefined, a typo in a YAML key) could cause
upto silently succeed on already-running containers while the new config is never applied. And if a service is unhealthy after the deploy, the release workflow has no gate to catch it. The two steps should be identical between nightly and release.Suggestions
cp -rleaves deleted files behind: acknowledged in ADR-016 ("Negative" consequences). If a config file is removed from the repo, its stale copy persists at/opt/familienarchiv/infra/observability/indefinitely.rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/would give a clean mirror. Low priority given the current change frequency, but worth doing while this code is fresh.POSTGRES_HOST container names are fragile:
archiv-staging-db-1andarchiv-production-db-1are Docker Compose-derived names that depend on the project name and service name staying stable. A project rename silently breaks the obs stack's DB connection. A comment noting this dependency (and that a project rename requires updating these values) would reduce future confusion.What's done well
valid_volumesand security comment inrunner-config.yamlare accurate and honest about the blast radius.obs.envcommitted to git covers all non-secret config. No operator-managed non-secret files needed.DEPLOYMENT.md §4is now genuinely useful as an ops procedure reference.db-initcontainer and bothglitchtip-web/glitchtip-workerservices all updated consistently to use${POSTGRES_HOST:-archive-db}.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a pure infrastructure PR — no application code, no frontend, no backend Java. I'll focus on CI YAML quality and configuration file hygiene.
No Blockers
The code is functionally correct. The Tempo schema fix (removing the invalid top-level
processorsblock fromtempo.yml) is clearly right — it was a duplicate of theoverrides.defaults.metrics_generator.processorsblock that Tempo 2.7.2 stopped tolerating. The parameterized${POSTGRES_HOST:-archive-db}replacement hits all three places that hardcodedarchive-db(the twoDATABASE_URLenvs and thedb-initcommand). The port change is consistent acrossdocker-compose.observability.yml,.env.example, anddocs/DEPLOYMENT.md.Suggestions
DRY violation between
nightly.ymlandrelease.yml: The "Deploy observability configs" step is nearly identical in both files, but they've already diverged —nightly.ymlhas "Validate observability compose config" and "Assert observability stack health" steps thatrelease.ymldoesn't. CI workflow duplication is the right place to apply the DRY rule: any future change to the obs deploy procedure must now be made in two places, and last time it wasn't (one got the health checks, one didn't). A composite action (uses: ./.gitea/actions/deploy-obs) would be the clean solution; a simpler workaround is a comment at the top of each step saying "Keep in sync with release.yml / nightly.yml step N."obs.envcomment formatting: The comment block at the top ofinfra/observability/obs.envis clear. The multi-line explanation forPOSTGRES_HOSTreads well. No issues.GF_SERVER_ROOT_URLdefault: Thedocker-compose.observability.ymlchange adds:The default
http://localhost:3003is correct for local dev.obs.envsets the production value tohttps://grafana.archiv.raddatz.cloud. This is well-structured.What's done well
archive-dbare updated atomically — no partial fix.set -e+ accumulate-then-exit pattern in the health assertion step is correct: it checks all four services before exiting, so you get a full picture of which services are unhealthy rather than stopping at the first failure.🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
Reviewed from a security perspective: secret handling, filesystem exposure, CI job container permissions, and network isolation.
No Blockers
The security posture is appropriate for a single-tenant private runner. The blast-radius warnings in
runner-config.yamlare honest and well-placed — whoever adds this runner to an external repo in the future has been explicitly warned.Observations (non-blocking)
obs-secrets.env file permissions on the host: The
cat > /opt/familienarchiv/obs-secrets.env <<EOFstep creates the file with default umask permissions (typically644— world-readable). The file contains the Grafana admin password, GlitchTip secret key, and the Postgres password. On a single-tenant VPS with no other users this is acceptable, butchmod 600 /opt/familienarchiv/obs-secrets.envimmediately after creation would be a free defense-in-depth measure:Secret rotation surface: The secrets
STAGING_POSTGRES_PASSWORDandPROD_POSTGRES_PASSWORDare injected intoobs-secrets.envasPOSTGRES_PASSWORD. If these Gitea secrets are rotated, the obs stack reconnects correctly on the next CI run (because obs-secrets.env is rewritten). This is correct. TheGLITCHTIP_SECRET_KEYandGRAFANA_ADMIN_PASSWORDare shared between nightly and release — documented appropriately inDEPLOYMENT.md.$$escaping note in.env.example: This is a good catch. Docker Compose variable interpolation in.envfiles silently consuming$in passwords is a real operational hazard. The note is in the right place.GF_SERVER_ROOT_URL exposure: The new
GF_SERVER_ROOT_URLenv var passed to the Grafana container is non-sensitive (it's the public URL). No concern.Port binding: All observability ports remain bound to
127.0.0.1— Caddy is the only external entry point. The port change from 3001 to 3003 doesn't change the binding scope.No hardcoded credentials in committed files:
obs.envcontains no secrets (verified). Secrets flow only through Gitea secrets → CI → ephemeralobs-secrets.env. The model is correct.What's done well
runner-config.yamlis explicit and actionable ("Do NOT add this runner to any repo with external contributors")./opt/familienarchivinto CI job containers.🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Infrastructure change with no application code impact. Reviewing CI verification coverage, health assertion correctness, and consistency between workflow files.
Concerns (non-blocking but worth addressing)
Missing verification gates in
release.yml:nightly.ymlhas two explicit gates thatrelease.ymllacks:docker compose config --quiet) — this dry-run resolves all variable substitutions and reports missing required keys before any container starts. Without it, a broken config could reach production silently.obs-loki,obs-prometheus,obs-grafana, andobs-tempoarehealthyafterup --wait. Without it, an unhealthy service after a production deploy goes undetected until someone checks manually.This asymmetry means nightly is better-tested than the production deploy path, which is backwards. Both steps should be in
release.yml.Health assertion correctness: The health assertion step in
nightly.ymlis well-constructed:The
2>/dev/null || echo "missing"handles the case where the container doesn't exist. The accumulate-then-exit pattern (unhealthy="$unhealthy $svc") gives a full picture before failing. The explicit list of four services (obs-loki obs-prometheus obs-grafana obs-tempo) matches the services with declared healthchecks indocker-compose.observability.yml— verified correct.No automated test for the Tempo schema fix: The
tempo.ymlchange (removing the duplicateprocessorsblock) is infra config, not application code. There's no unit test possible here. The only verification path is "Tempo starts without thefield processors not founderror." That's covered by the health assertion step (obs-tempo must becomehealthy). Acceptable.docker compose up --waitsemantics: The comment in the nightly "Assert observability stack health" step correctly documents the limitation —--waitonly covers services WITH healthcheck directives. The four services without healthchecks (obs-promtail,obs-cadvisor,obs-node-exporter,obs-glitchtip-worker) are not asserted. This is an acknowledged gap, documented appropriately.What's done well
--wait.upis the right pattern — catches missing variables before containers fail to start.🎨 Leonie Voss — UI/UX Design Lead
Verdict: ✅ Approved
Pure infrastructure PR — CI workflows, Docker Compose config, observability stack config, documentation, and ADR. No frontend components, no Svelte files, no UI changes, no user-facing routes.
Nothing in this diff touches the user interface, accessibility, design tokens, or brand presentation. From a UX perspective, this is a clean LGTM.
The only tangentially user-relevant change is fixing the observability stack (Grafana, GlitchTip) so it reliably starts and stays healthy — which improves the operator experience for whoever monitors the family archive. The documentation improvements in
DEPLOYMENT.md(especially the clear separation of "Dev — start from workspace" vs "Production — managed from/opt/familienarchiv/") will help future operators navigate the setup without frustration.🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
The architectural decision is correct: bind-mount sources must be at a path that outlasts the CI workspace lifecycle. The CI-push model to
/opt/familienarchiv/is the right choice given the constraints — it avoids adding server-side automation (cron, git pull), stays consistent with the main stack's deploy pattern, and reflects that the obs stack is architecturally coupled to the main stack (sharedarchiv-net, sharedarchive-db). ADR-016 is the right artifact for this decision.Blockers
ADR-016 "Decision" section contradicts the implementation
The Decision section of
docs/adr/016-obs-stack-co-location-ci-push.mdsays:This describes a model where the operator manually manages a
.envfile and CI doesn't touch it. The actual implementation is the opposite: CI writes/opt/familienarchiv/obs-secrets.envfresh on every deploy from Gitea secrets, and Docker Compose reads it explicitly via--env-file(not auto-read). The operator manages nothing for secrets — CI owns it entirely.An ADR whose "Decision" section describes the wrong implementation will cause the next person reading it (likely a future incident responder) to look for an operator-managed
.envfile that doesn't exist, or to believe CI doesn't overwrite the secrets when it does. This needs to be corrected before merge.The correct description should match what
DEPLOYMENT.md §4already documents accurately: the two-source model whereobs.env(git-tracked) contains non-secret config andobs-secrets.env(CI-written from Gitea secrets) contains credentials.Suggestions
Documentation cross-reference check: Per the architect review checklist, this PR introduces a new infrastructure component path (
/opt/familienarchiv/as a permanent config store) and updatesdocs/architecture/c4/l2-containers.puml. The container diagram label update (docker-compose.observability.yml→/opt/familienarchiv/docker-compose.observability.yml) is present — good.DEPLOYMENT.mdis updated with the full ops procedure — good. ADR-016 exists — good (modulo the content issue above).Co-location rationale is sound: Rejecting "separate directory (e.g.,
/opt/obs/)" because GlitchTip sharesarchive-dbandarchiv-netwith the main stack is the right call. The coupling is real — co-location reflects it honestly.rsync --deleteas future improvement: The ADR namescp -r's inability to remove deleted files as a known consequence. This is acceptable for the current change frequency. If the obs stack config grows significantly,rsync -a --deletewould be the natural upgrade path.What's done well
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing from a requirements and documentation completeness perspective: whether the change is traceable, the rationale is captured, the test plan is adequate, and operator-facing documentation leaves no gaps.
Observations
PR description quality: The summary is concise and structured. Each of the three fixes has a one-line root cause ("Tempo 2.7.2 schema breaking change", "hardcoded
archive-db", "port 3001 occupied by staging frontend") and a one-line fix. The test plan covers all five observable outcomes. This is a well-scoped PR with clear traceability from problem to fix.The
POSTGRES_HOSTnaming could cause confusion: The variable is introduced indocker-compose.observability.ymlandobs.envasPOSTGRES_HOST, but the main app stack also connects to PostgreSQL. A future operator might wonder whetherPOSTGRES_HOSTaffects both the main stack and the obs stack, or only one. The documentation inDEPLOYMENT.mdandobs.envmakes it clear this is obs-specific, but the variable name itself doesn't signal that.OBS_POSTGRES_HOSTwould be unambiguous. Low priority — the current documentation is adequate — but worth flagging.$$escaping note in.env.example: The comment block explaining that passwords containing$must use$$is genuinely useful operator documentation. This is the kind of subtle behavior that causes production incidents ("GlitchTip can't authenticate even though the password looks correct"). Correctly placed in.env.examplewhere an operator setting up the file will see it.obs.envcomment forPOSTGRES_HOSTis accurate: The comment explains why the value isn't a fixed string and exactly what the values are for staging vs production (based on Compose project name + service name). This prevents the "why doesn'tarchive-dbwork in staging?" question.Gap: The
DEPLOYMENT.mdupdate references "§4" for the ops procedure (mkdir -p /opt/familienarchiv/infraetc.) but the section containing that code block isn't explicitly labelled "§4" in the diff. If the section numbering has drifted, the cross-references in the ADR and the runner restart warning would point to the wrong section. Low priority, but worth verifying the section numbers are consistent.What's done well
DEPLOYMENT.md— easy for a future operator to understand at a glance.Second-round review concerns addressed
All open concerns from the second review cycle resolved in 4 commits.
Blockers resolved
Tobias + Markus — ADR-016 Decision section contradicts the implementation ✅
The Decision section described an operator-managed
/opt/familienarchiv/.envthat CI never touches. The actual implementation is the opposite. Corrected to accurately describe the two-source model:obs.env(git-tracked, non-secret config) +obs-secrets.env(CI-written fresh from Gitea secrets on every deploy, passed explicitly via--env-file). The stale Consequences bullet ("secrets decoupled from CI — CI cannot accidentally overwrite them") is replaced with the accurate statement: rotating a Gitea secret takes effect on the next deploy with no manual server action. Commit4c5ee96e.Tobias + Sara —
release.ymlmissing "Validate observability compose config" and "Assert observability stack health" steps ✅Both steps added to
release.yml, identical to thenightly.ymlequivalents. The validate step (docker compose config --quiet) now runs beforeupon production deploys, and the health assertion loop checksobs-loki,obs-prometheus,obs-grafana,obs-tempoarehealthyafterup --wait. "Keep in sync with nightly.yml" comments added at each step to make the duplication intentional and visible. Commitf628ab64.Non-blocking suggestions resolved
Nora —
obs-secrets.envcreated with world-readable permissions ✅Added
chmod 600 /opt/familienarchiv/obs-secrets.envimmediately after the heredoc in bothnightly.ymlandrelease.yml. Free defence-in-depth on the single-tenant VPS. Commitdec0001b.Tobias — POSTGRES_HOST container names are fragile ✅
Added a comment at the
POSTGRES_HOSTassignment in both workflow files explaining the name is derived from the Compose project name + service name and must be updated on a project rename. Commitf5c7be93.🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This PR makes the right architectural call: stable host path → bind mounts survive workspace wipes. The two-source env model (git-tracked
obs.env+ CI-writtenobs-secrets.env) is clean and auditable. Good work overall.Blockers
Unquoted heredoc with Gitea secrets — can corrupt secret values
In both
nightly.ymlandrelease.yml,obs-secrets.envis written via an unquoted heredoc (<<EOF):Gitea expands
${{ secrets.X }}at template-render time before the shell runs. If a secret value contains$(common in strong passwords — e.g.P@$s5w0rd), the shell's heredoc will try to expand$s5w0rdas a shell variable — silently truncating the value. Result: the service starts with the wrong password and fails to connect to the DB.Fix: Quote the heredoc delimiter to suppress all shell expansion:
<<'EOF'writes the Gitea-substituted value literally, without further shell expansion. Apply to bothnightly.ymlandrelease.yml.Suggestions
cp -rdoesn't clean deleted files (acknowledged in ADR-016) — if an obs config file is removed from git, it lives on in/opt/familienarchiv/infra/observability/until manually deleted. Worth adding a one-liner runbook note inDEPLOYMENT.md §4:What's done well:
chmod 600 /opt/familienarchiv/obs-secrets.env— correct permissions on the secrets file.docker compose config --quiet) beforeupis a solid practice — catches undefined var substitutions before containers start.--waitgap was real and this fixes it correctly..envblock in both workflows is the right cleanup.🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The security posture of this PR is generally solid, but there is one shell injection risk in the CI workflows worth fixing before merge.
Blockers
CWE-78 / Shell variable expansion in unquoted heredoc — secrets silently mangled
Both
nightly.ymlandrelease.ymlwriteobs-secrets.envusing an unquoted heredoc delimiter (<<EOF). Gitea substitutes${{ secrets.X }}into the script before the shell runs. If a Gitea secret value contains$, the shell heredoc will perform a second expansion:This silently corrupts the secret at rest — GlitchTip would start with a broken
SECRET_KEYand invalidate all existing sessions. Severity: Medium (integrity / availability, not confidentiality, since the attacker would need access to the file to read it).Fix — quote the heredoc delimiter:
Single-quoted
'EOF'tells the shell to skip all variable expansion inside the heredoc body. Gitea's${{ }}expansion has already happened at the template layer, so the secret values are written verbatim. Apply to both workflow files.Observations (non-blocking)
chmod 600onobs-secrets.env— correct. The file contains three sensitive values (Grafana admin password, GlitchTip secret key, Postgres password).chmod 600ensures only the runner's process user can read it.Blast radius of
/opt/familienarchivmount — acknowledged. The ADR-016 negative consequence is honest: a compromised workflow step could overwrite app compose files, Caddy config, or the secrets file itself. The runner-config.yaml security comment update accurately describes this. Acceptable for a single-tenant, private-repo runner.No credential leakage through logs. The Gitea
${{ secrets.X }}substitution is masked in CI logs. The heredoc body (containing the expanded values) is not echoed. Noset -xin the secret-writing steps. Clean.Tempo schema fix (
tempo.yml- removing duplicateprocessorsblock) is purely config — no security implications.🏛️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
The architectural decision documented in ADR-016 is sound given the constraints (no new infrastructure dependencies, single-tenant runner, DooD setup). The co-location at
/opt/familienarchiv/is the simplest path that satisfies all four decision drivers, and the alternatives section correctly rejects the alternatives.Architecture Assessment
Two-source env model is well-scoped.
infra/observability/obs.env(git-tracked, no secrets) +/opt/familienarchiv/obs-secrets.env(CI-written, secrets only) separates concerns cleanly. Non-secret config goes through PR review; secrets are rotated via Gitea without touching git. This is the right boundary.CI-push model mirrors the main app stack pattern. The nightly and release workflows already push the main app compose files via a similar mechanism. Extending this pattern to the obs stack avoids introducing a second deployment mechanism (server-pull, systemd timer, etc.). Boring is good.
GF_SERVER_ROOT_URLaddition is necessary. Without it, Grafana generates relative URLs in alert emails and OAuth redirects, which break when accessed through Caddy's reverse proxy. This was a latent bug in the original compose file.Documentation Checklist
docs/architecture/c4/l2-containers.pumldocs/DEPLOYMENT.mddocs/adr/All required documentation updates are present. No missed updates.
ADR-016 Quality
The ADR follows the established project format (Context, Decision, Consequences). The "Negative" consequences section is unusually honest about the blast radius trade-off — that's the right call. The
cp -r/ no-delete limitation is the main operational footgun and is correctly flagged. Worth adding a manual cleanup note inDEPLOYMENT.md §4for completeness (see Tobias's comment), but not a blocker.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This PR touches infrastructure and CI only — no application code, no Svelte components, no Spring Boot services. There's nothing for me to flag on the code quality front. A few observations from a developer perspective:
What I Checked
set -ecorrectly and accumulates failures before exiting — avoiding the "exits on first failure, you don't know what else is broken" problem. Variable names (unhealthy,svc,status) are clear.obs.envfile: Comments explain why values are set, not what the keys mean. That's the right level of comment.tempo.ymlremoval of the duplicateprocessorsblock: Straightforward schema compliance fix. The comment at the top of the file (# Tempo HTTP API is unauthenticated) is appropriate — it's a non-obvious security posture decision that a reader needs to know.DATABASE_URL, worker env,psql -h) are consistent. No site was missed. The-h ${POSTGRES_HOST:-archive-db}pattern correctly preserves the production default while allowing override.LGTM
No code quality issues. The shell scripts are clean, the env file separation is sensible, and the docs are updated. Ship it once the heredoc quoting concern Tobias raised is resolved.
🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
No application test changes in this PR (expected for a CI/infra fix). I'm reviewing the CI pipeline itself as the test artifact here.
What's Good
Validate observability compose config step — dry-run
docker compose config --quietbeforeupis a proper preflight gate. It catches undefined variable substitutions, YAML parse errors, and missing required keys before any container starts. This is the right approach and it was missing before.Assert observability stack health step — this closes a real gap.
docker compose up --waitonly waits for services with healthcheck directives. The four services without healthchecks (obs-promtail,obs-cadvisor,obs-node-exporter,obs-glitchtip-worker) are correctly noted, and the four that do have healthchecks (obs-loki,obs-prometheus,obs-grafana,obs-tempo) are explicitly asserted before the smoke test proceeds. Good.The loop pattern is deterministic: accumulate failures into
$unhealthy, exit non-zero only after checking all four. This gives a full failure report rather than aborting on the first unhealthy service.Concerns
No assert step for worker services.
obs-glitchtip-workerhas no healthcheck directive and is not checked in the assert step. The comment correctly acknowledges it. However, a failed worker (e.g. due to a badDATABASE_URL) would pass the CI gate silently — you'd only discover it when GlitchTip fails to process error reports in production.A minimal worker check (verify it's running, not just started) could be:
Not a blocker (it's pre-existing and out of scope for this PR), but worth a follow-up issue.
nightly.ymlandrelease.ymldiverge only inPOSTGRES_HOSTandPOSTGRES_PASSWORDsecret name. The duplicate validate + assert step bodies are identical. The comments say "Keep in sync with the equivalent step in nightly.yml." This is a maintenance risk — if the assert step is updated in one file but not the other, CI diverges silently. A composite action would eliminate the sync burden, but that's a cleanup issue, not a blocker here.Test Plan Coverage
The PR's test plan is concrete and verifiable:
docker compose configvalidates — covered by the new validate step in CIarchive-dbby default — docker-compose default${POSTGRES_HOST:-archive-db}preserves this📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This PR closes three distinct bugs discovered during the issue #601 migration. All three are clearly specified in the PR description with root cause, fix, and verifiable outcome. From a requirements standpoint, this is well-scoped and implementation-ready.
Requirements Traceability
processorsschema errormetrics_generator.processorsmoved underoverridesin Tempo 2.7.2field processors not founderrorarchive-dbhostname${POSTGRES_HOST:-archive-db}variable substitutionPOSTGRES_HOSTset in envBind for 127.0.0.1:3001 failedgoneAll three fixes are consistent between code, documentation (
DEPLOYMENT.md), env template (.env.example), and the newobs.envconfig file.Acceptance Criteria Assessment
The PR test plan maps one-to-one to verifiable system behaviors. No ambiguous terms ("better", "faster", "improved") appear. Each checklist item is a concrete, testable state.
Non-Functional Requirements
/opt/familienarchiv/) satisfies the unstated NFR that the obs stack survives CI workspace wipes. ADR-016 documents this explicitly.${POSTGRES_HOST:-archive-db}preserves the production default — no operator action required on existing deployments.Open Questions
None remaining. ADR-016's open question register is empty and the scope is tightly bounded to the three bugs. The
cp -r/ stale-file limitation is acknowledged in ADR-016 and is an acceptable trade-off at this project's change frequency.🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR modifies CI workflows, Docker Compose configuration, environment files, infrastructure config, and ADR documentation. There are no frontend components, Svelte files, or UI-facing changes in scope.
I checked the diff for any inadvertent frontend impact (e.g. changes to environment variables that affect the SvelteKit frontend or its build):
PORT_GRAFANAchange from 3001 → 3003: Grafana is an internal tool only — not user-facing, not part of the SvelteKit frontend. No UI impact.GF_SERVER_ROOT_URLaddition: Grafana-internal config for alert email links. Not user-facing.POSTGRES_HOSTvariable: Backend infrastructure only.LGTM from a UX/UI and accessibility perspective. Nothing to flag.
Prevents shell from expanding '$' in Gitea-rendered secret values. Without the quote, a password like 'P@$s5w0rd' has '$s5w0rd' silently expanded to '' — writing a truncated value to obs-secrets.env. '<<'EOF'' suppresses shell expansion; Gitea's '${{ }}' template rendering already ran before the shell sees the script. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>Third-round review concerns addressed
All open concerns from the third review cycle resolved in 3 commits.
Blockers resolved
Tobias + Nora — Unquoted heredoc in
nightly.ymlandrelease.ymlsilently mangles$-containing secrets ✅Both
obs-secrets.envheredocs changed from<<EOFto<<'EOF'. Single-quoted delimiter suppresses shell variable expansion inside the heredoc body. Gitea's${{ secrets.X }}template rendering already happened before the shell runs, so secret values are written verbatim — a password likeP@$s5w0rdno longer has$s5w0rdsilently expanded to an empty string. Applied to:nightly.yml— commit9662ff5frelease.yml— commit25062be6Suggestion resolved
Tobias — Add manual cleanup runbook note for
cp -rstale files ✅Added a note in
DEPLOYMENT.mdimmediately after the manual start/restart command block, explaining thatcp -rdoes not remove deleted files and documenting the manualrmstep for config files removed from the repo. Commita7f60ebe.🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
The final blocker (unquoted heredoc) is fixed.
<<'EOF'is correct — Gitea renders${{ secrets.X }}before the shell runs, so single-quoting the delimiter suppresses the second shell expansion without losing the secret values.chmod 600is in place. Cleanup note in DEPLOYMENT.md added. This is merge-ready.What's done well
obs.env), credentials freshly written from Gitea secrets each deploy (obs-secrets.env). No operator-managed file to drift.grafana_data,loki_data,prometheus_data,tempo_data) are Docker-managed and untouched by the copy step — no data migration risk.:latestintroduced.docker compose config --quiet) beforeupis the right pattern. Catches undefined variables and YAML errors before containers start.Deployment maintainability suggestions (no blockers)
1. Replace
cp -rwithrsync -a --deleteThe stale-file limitation is documented in both ADR-016 and DEPLOYMENT.md, but it's a one-command fix that eliminates the problem class entirely:
Now that
cpappears in two workflow files and a runbook, fixing it later costs three edits instead of one. Low priority given the current config change frequency — but cheapest to fix here.2. Wrap the manual start command in a script or Makefile target
The manual restart command in DEPLOYMENT.md is four lines with absolute paths and two
--env-fileflags. An operator copying it under pressure can easily typo a path:Operators copy the script name, not four lines of flags. The script also stays in sync automatically since it lives in the repo.
3.
POSTGRES_USER=archivhardcoded in the secrets heredocPOSTGRES_USERis not a secret — it's the application's Postgres role name, as predictable asarchive-db. It's currently written intoobs-secrets.envalongside the actual secrets. If the role is ever renamed, this is a third place to update (in addition todocker-compose.observability.ymland the main.env.*files). Moving it toobs.envalongsidePOSTGRES_HOST=archive-dbkeeps the secrets file to credentials only.4. Composite action for the duplicated obs deploy steps
The "Keep in sync with nightly.yml / release.yml" comments are better than nothing, but the class of failure (one workflow updated, the other not) is only prevented by the next reviewer's attention. A Gitea composite action at
.gitea/actions/deploy-obs/action.yaml(acceptingenv: staging|production,postgres_password:, etc. as inputs) would make "updated in one place" structurally impossible. Acceptable at current change frequency; the right answer before the obs deploy procedure grows further.🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
Documentation obligations are fully met: ADR-016 is present and accurate,
docs/architecture/c4/l2-containers.pumlreflects the new operational path,DEPLOYMENT.mdhas the full ops procedure with the two-source env model table. The architectural choice (CI-push to/opt/familienarchiv/, co-located with the main stack) is sound — it reflects the real coupling (sharedarchiv-net, shared PostgreSQL) and requires no new infrastructure. Boring and correct.Architecture assessment
The Decision section of ADR-016 now correctly describes the two-source model — no longer contradicts the implementation. The rejected alternatives are evaluated with genuine rationale, not strawmanned. The
cp -r/ no-delete limitation is the main operational footgun and is correctly flagged in Consequences.Deployment maintainability observations (no blockers)
1.
obs.envcarriesPOSTGRES_HOST=archive-db— understanding the precedence mattersobs.envsetsPOSTGRES_HOST=archive-dbas the default. In CI,obs-secrets.envoverrides this with the environment-specific value (archiv-staging-db-1/archiv-production-db-1). The file loaded last wins — andobs-secrets.envis listed second in both workflow--env-filechains, so override precedence is correct.Worth confirming: if these
--env-fileflags were accidentally swapped,POSTGRES_HOST=archive-dbfromobs.envwould silently win over the CI override, and GlitchTip would fail to connect in both environments. The current order is right; document the ordering contract explicitly in the validate step comment so a future refactor doesn't reverse it.2. Variable naming ambiguity:
POSTGRES_HOSTvsOBS_POSTGRES_HOSTPOSTGRES_HOSTis obs-specific but the name doesn't signal that. The two-source model keeps obs vars isolated (they're only passed to the obs compose command), so there's no actual collision risk today. But if the main app compose ever needed a similar override, the name would conflict.OBS_POSTGRES_HOSTis unambiguous at zero implementation cost — worth considering during any future obs-vars cleanup.3. Docker network dependency is implicit but unavoidable
The obs stack joins
archiv-net, which is created by the main app stack. If the main stack is not running when obs starts, Docker emits a clear error ("network archiv-net not found"). This is not a gap introduced by this PR — the compose file already declaresexternal: truefor the network. Confirming this is handled correctly.4. If
/opt/familienarchiv/ever relocates, three places changeThe ADR documents the runner restart requirement and cp-r limitation in Consequences, but not the
/opt/familienarchiv/relocation footgun: if the permanent directory moves, you must updaterunner-config.yamlvalid_volumes, both workflow files, and DEPLOYMENT.md §4. A one-sentence Consequences bullet noting this would complete the picture for a future incident responder.🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
All security concerns from previous cycles are resolved.
<<'EOF'correctly suppresses shell expansion — the two-stage expansion issue (Gitea template render → shell execution) is properly handled.chmod 600onobs-secrets.envis in place. Zero hardcoded credentials in committed files. All image tags remain pinned. Security posture is appropriate for a single-tenant private runner.What's done well
<<'EOF') is the correct fix. Gitea's${{ secrets.X }}substitution happens at template render time before the shell executes the script. The single quote tells the shell to treat the heredoc body as a literal string — the already-rendered secret value lands verbatim in the file.chmod 600 obs-secrets.envimmediately after creation. Correct.set -xin secret-writing steps — secret values are not echoed to CI logs.runner-config.yamlis explicit and actionable.obs.envcontains no secrets — verified in the diff.Deployment security/maintainability observations (no blockers)
1.
POSTGRES_USER=archivin the secrets heredoc mixes credential and non-credential configThe secrets file (
obs-secrets.env) currently contains:Keeping
POSTGRES_USERand potentiallyPOSTGRES_HOSTin the secrets file obscures the security boundary. The clean split would be:obs-secrets.envcontains only values that would cause harm if leaked (passwords, secret keys). Everything else lives inobs.env. This also means a future rotation ofobs-secrets.envcan be audited for "does this file contain only secrets?" without cross-referencingobs.env.2.
obs-secrets.envis rewritten on every deploy — operator manual edits will be lostThis is the intended design and is documented. Confirming the security implication: emergency credential rotation cannot be done by editing
/opt/familienarchiv/obs-secrets.envon the server — the next nightly run will overwrite it. Correct procedure: update the Gitea secret, then either wait for the next nightly run or manually trigger the workflow. The DEPLOYMENT.md mentions this indirectly; an explicit sentence ("manual edits to obs-secrets.env are overwritten on the next CI run — rotate secrets via Gitea") would prevent a future operator from wasting time on a server edit.3.
cat > file <<'EOF'creates the file before setting permissionsThe sequence is: create file with default umask (likely
644) → write content →chmod 600. On a single-tenant VPS with no other users, the brief world-readable window is negligible. For future hardening:install -m 600 /dev/null /opt/familienarchiv/obs-secrets.env && cat >> /opt/familienarchiv/obs-secrets.env <<'EOF'creates the file with correct permissions before any content is written. Not a current risk; file it as a hardening note.🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
Verification coverage for the obs deploy path is now meaningfully better than before this PR. The validate step (dry-run before
up) and explicit health assertion step are both innightly.ymlandrelease.yml. The heredoc quoting fix eliminates the silent secret corruption that would have causeddocker compose config --quietto pass (expanded variables look like empty strings, not missing ones) but containers to fail at runtime.Test coverage assessment
Deployment maintainability observations (no blockers)
1.
obs-glitchtip-workercrash loop would pass CI undetectedIf
DATABASE_URLis malformed (e.g. wrongPOSTGRES_HOST), the worker process crashes immediately after starting.docker compose up --waitconsiders it "started" (it has no healthcheck), and the health assertion step only checks the four services with declared healthchecks. A one-liner added to the existing loop would catch immediate crash loops:Using
::warning::(not::error::) since these services lack healthchecks — a warning surfaces the state without blocking a deploy that otherwise looks healthy. Worth a follow-up issue rather than a blocker here.2. Duplicate workflow bodies are the next maintenance risk
The validate and health assertion steps are now byte-for-byte identical in both files (modulo the "Keep in sync" comment). The next time either changes — for example, adding
obs-glitchtiporobs-redisto the assertion list — there are two files to update with no automated reminder. The "Keep in sync" comment relies on the next reviewer noticing. A Gitea composite action (.gitea/actions/deploy-obs/) would make the single-place update structurally enforced. Acceptable at current change frequency; the right investment before any further obs deploy procedure changes.3.
docker compose config --quietpasses even with empty variable substitutionsIf a Gitea secret is undefined (e.g.
GRAFANA_ADMIN_PASSWORDnot set),${{ secrets.GRAFANA_ADMIN_PASSWORD }}evaluates to an empty string. The heredoc then writesGRAFANA_ADMIN_PASSWORD=(empty) intoobs-secrets.env. Docker Composeconfigaccepts this — it's a valid empty value, not an undefined variable (which would emit a warning or error). The validate step catches undefined variables (no=at all) but not empty ones. This is a known limitation of the validate-then-up pattern; documenting it in the step comment would set expectations correctly.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
Pure infrastructure PR — no application code, no Svelte components, no Spring Boot services. The heredoc quoting fix is correct. Bash scripts are readable and intention-revealing.
What I checked
Heredoc quoting (
<<'EOF'): Correct fix. Gitea's${{ secrets.X }}template substitution runs before the shell sees the script. Single-quoting the delimiter tells the shell to treat the heredoc body as a literal string — the already-substituted secret value is written verbatim. No second shell expansion of$-containing passwords.Health assertion loop: The
set -e+ accumulate-then-exit pattern is correct:All four services are checked before the exit, so CI shows the full failure picture rather than stopping at the first unhealthy service. Clear variable names (
unhealthy,svc,status).tempo.ymlchange: Removing the duplicate top-levelprocessorsblock is correct. Tempo 2.7.2 enforces thatprocessorslives exclusively underoverrides.defaults.metrics_generator.processors, which was already correctly configured. The removed block was the invalid duplicate.docker-compose.observability.yml: All three substitution sites — bothDATABASE_URLenvs and thepsql -hcommand inobs-glitchtip-db-init— are updated consistently to${POSTGRES_HOST:-archive-db}. No site missed.Deployment maintainability (CI hygiene angle)
The only remaining structural maintenance debt is the duplicated obs deploy section between
nightly.ymlandrelease.yml. The "Keep in sync with" comments are a code smell that usually resolves into a composite action once the duplication grows beyond two files or two sections. The obs deploy now has four steps (Deploy configs → Validate config → Start stack → Assert health) duplicated in two workflows — that's eight step definitions to keep synchronized. A composite action at.gitea/actions/deploy-obs/action.yamlaccepting environment-specific inputs (secrets, postgres host) would collapse this to twouses:calls. Not a blocker for this PR; the natural trigger is the next time either workflow diverges despite the "Keep in sync" note.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
All three root-cause fixes from issue #601 are traceable from stated problem → implementation → verifiable outcome. The PR description test plan maps one-to-one to concrete, testable system states. Documentation chain is complete.
Requirements traceability (final check)
processorsschemaoverridesfield processors not foundPOSTGRES_HOSTarchive-dbfails when only staging stack runsPOSTGRES_HOSTsetPORT_GRAFANAdefault 3003port already allocatedDeployment maintainability observations (no blockers)
1. Directory relocation consequence missing from ADR-016
ADR-016 documents the runner restart requirement and the
cp -rstale-file limitation in Consequences. It does not mention that if/opt/familienarchiv/is ever relocated, three artifacts must change simultaneously:runner-config.yaml(valid_volumes), both workflow files (all absolute paths), and DEPLOYMENT.md §4. This is the kind of operational footgun that causes incidents when someone "cleans up" the VPS without reading the ADR. A one-sentence Consequences bullet would close the documentation gap:2.
POSTGRES_USERandPOSTGRES_HOSTsplit across secrets and config filesobs-secrets.envcurrently carriesPOSTGRES_USER=archivalongside actual secrets. From a requirements clarity standpoint: the two-source model's stated contract is "non-secret config inobs.env, secrets inobs-secrets.env."POSTGRES_USERis not a secret — it's a configuration constant. Moving it toobs.envwould make the two-source model's contract self-consistent and eliminate one non-secret from the secrets file.3. The empty-secret failure mode is undocumented
If a Gitea secret is not set (e.g.
GRAFANA_ADMIN_PASSWORDmissing),${{ secrets.GRAFANA_ADMIN_PASSWORD }}evaluates to an empty string and the heredoc writesGRAFANA_ADMIN_PASSWORD=intoobs-secrets.env. Docker Composeconfig --quietaccepts this (empty value ≠ undefined variable). Grafana would start with thechangemedefault password since an emptyGF_SECURITY_ADMIN_PASSWORDfalls back to the default. The requirement "rotating a secret takes effect on the next deploy" implicitly assumes the secret is set. A pre-flight check —[ -n "$GRAFANA_ADMIN_PASSWORD" ] || { echo "::error::GRAFANA_ADMIN_PASSWORD secret is not set"; exit 1; }— before the heredoc would make this failure fast and loud rather than silent. Worth a follow-up issue.🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved
Pure infrastructure PR — CI workflows, Docker Compose config, Tempo YAML, env files, ADR, and deployment documentation. No frontend components, no Svelte files, no design tokens, no routes, no user-facing changes.
I reviewed the diff specifically for env vars or config changes that could inadvertently affect the SvelteKit frontend or the operator-facing Grafana/GlitchTip UI:
PORT_GRAFANA3001 → 3003: Internal observability tooling only. Grafana is not user-facing and is bound to127.0.0.1. No UI impact.GF_SERVER_ROOT_URL: Grafana-internal setting for alert email links and OAuth redirects. Not part of the application UI. Correctly set to the production URL inobs.env.POSTGRES_HOST: Backend infrastructure only. No frontend impact.GLITCHTIP_DOMAIN: Error reporting ingestion endpoint. Not user-facing in the application UI.Nothing to flag from a UX, accessibility, or brand perspective. The documentation improvements in DEPLOYMENT.md ("Dev — start from workspace" vs "Production — managed from
/opt/familienarchiv/") will help operators navigate the setup without frustration — an ergonomic improvement for the human running this infrastructure.🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
Documentation currency check
Per our doc-update rules:
/opt/familienarchiv/)docs/architecture/c4/l2-containers.pumldocs/DEPLOYMENT.mddocs/adr/Architecture assessment
What's solid:
The two-source env model — git-tracked
obs.envfor non-secrets, CI-writtenobs-secrets.envfor credentials — is architecturally clean. Secrets flow from Gitea → CI → host file, never through git. Non-secret config is reviewable in PRs. The sources of truth are separated by concern.Co-locating the obs stack at
/opt/familienarchiv/correctly reflects that GlitchTip sharesarchive-dbandarchiv-netwith the main app stack. They are the same deployment unit. A separate/opt/obs/would have implied an independence that doesn't exist.ADR-016 shows genuine judgment: the alternatives section dismisses server-pull (adds a second deployment mechanism), separate directory (misrepresents coupling), and Docker Swarm configs (disproportionate dependency) with concrete reasons. That's the right format.
The validate-before-start pattern (
docker compose ... config --quiet) is exactly right — catch missing variables and YAML errors before containers attempt to start.Minor inconsistency (non-blocking)
ADR-016 Consequences says: "cp -r does not remove deleted files; a config file removed from the repo persists in
/opt/familienarchiv/infra/observability/until manually deleted."But the implementation uses
rsync -a --delete infra/observability/ /opt/familienarchiv/infra/observability/— which does remove deleted files from the config directory. The note is only accurate for the top-level compose file itself (cp docker-compose.observability.yml /opt/familienarchiv/), which uses plaincp.The ADR should distinguish these two cases: the config tree is clean (rsync handles deletions); only a renamed/deleted top-level compose file would leave a stale copy. This doesn't change the decision — just makes the consequences section more accurate.
Suggested ADR fix:
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
This is a pure infrastructure PR — no application code, no testable business logic. TDD doesn't apply here. My focus is on CI script quality and the one code quality issue that stands out.
Blocker
Copy-paste between
nightly.ymlandrelease.yml— this is the dominant code smell. The following four steps are near-identical in both files:That's roughly 130 lines duplicated across two workflow files. The only differences are:
STAGING_POSTGRES_PASSWORDvsPROD_POSTGRES_PASSWORDarchiv-staging-db-1vsarchiv-production-db-1Gitea Actions supports composite actions. A
deploy-obs-stackcomposite action in.gitea/actions/would DRY this down to a ~10-line call site in each workflow with the two differing values passed as inputs. When the deploy logic changes (and it will), there's currently two files to update in sync — the "Keep in sync with the equivalent step in nightly.yml" comments are the red flag that says this should be an abstraction.I understand this might be deferred work (the comment already acknowledges the duplication), but it's a blocker for me because the "keep in sync" comment is a maintenance debt bomb on a critical path.
Suggestions
Heredoc indentation (cosmetic, functionally fine):
The heredoc content has leading whitespace because it follows the YAML indentation. Docker Compose env file parsers do strip leading whitespace, so this works. But
<<-'EOF'with tab-indented content (or restructuring therun:block) would make the generated file cleaner. Minor — not a blocker.Health check loop — good pattern:
Collecting all failures before exiting (instead of
set -eon the first check) gives complete diagnostics. This is the right way to write this kind of loop. ✅Step comments explain WHY — the
# Runs with absolute paths so bind mounts resolve to stable host pathsstyle comments are exactly what I want to see in infra scripts. ✅🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This fixes three real production issues cleanly. The overall approach is sound. I have one blocker (CI duplication) and a few concerns worth naming.
Blockers
130 lines of copy-pasted CI steps. The four obs-stack steps in
nightly.ymlandrelease.ymlare identical except for secret names and container hostnames. The "Keep in sync" comments inrelease.ymlare the CI equivalent of aTODO: fix this. When the deploy script changes — and it will, because obs stacks always need tuning — two files need editing in lockstep, and there is no tooling to enforce that.Create a composite action:
.gitea/actions/deploy-obs-stack/action.ymlwith inputspostgres_password,postgres_host, andenv_file_prefix. Then both workflows call it with their respective values. 10 lines per workflow, one source of truth for the deploy logic.Concerns
Hardcoded container names are fragile.
archiv-staging-db-1andarchiv-production-db-1in the CI secret files derive from Docker Compose project names (archiv-staging,archiv-production) and service name (db). If the Compose project is renamed, these values silently break. The comment acknowledges this:Fair — but this is exactly the kind of thing that gets forgotten. Could
docker compose -p archiv-staging port db 5432or a simpler lookup replace the hardcoded name? Probably overkill for this project's change frequency, but worth noting.cp docker-compose.observability.yml /opt/familienarchiv/leaves stale copies. This is called out correctly in DEPLOYMENT.md. Since the config tree usesrsync -a --delete, the stale compose file is the only manual cleanup risk. Very low probability event, acceptable given how rarely compose files are renamed.GlitchTip connectivity is not automatically verified. The health assertion step checks
obs-loki obs-prometheus obs-grafana obs-tempobut not GlitchTip. IfPOSTGRES_HOSTis misconfigured, the worker silently fails to connect and the CI run still turns green. The test plan calls this out as a manual check — that's honest, but means a badPOSTGRES_HOSTwouldn't be caught until someone looks at GlitchTip.What's done well
rsync -a --deletefor the config tree — clean mirror, handles removed files ✅chmod 600on the secrets file — correct ✅docker compose ... config --quietvalidate-before-start — catches env var issues before containers start ✅PORT_GRAFANA:-3003default in the compose file is consistent with obs.env ✅runner-config.yamlappropriately reflects the new blast radius ✅${{ secrets.XXX }}+<<'EOF'heredoc correctly uses Gitea template substitution before shell execution — the single-quoted heredoc is fine here ✅GF_SERVER_ROOT_URLadded to the compose file — catches the common Grafana misconfiguration where alert email links uselocalhost✅Note on the Tempo fix
Removing the duplicate
metrics_generator.processorsblock from the top-level config is the correct fix for the Tempo 2.7.2 breaking change. The processors were already defined underoverrides.defaults.metrics_generator.processors— the top-level block was redundant and became invalid. Clean removal. ✅🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
No exploitable vulnerabilities in this PR. The secrets handling is well-designed. My concerns are about blast radius — this change expands the attack surface of the CI runner in a way that deserves explicit acknowledgment.
What's secure
Secret management is correct.
obs-secrets.envis written fresh from Gitea secrets on every deploy — Gitea is the single source of truth ✅chmod 600prevents other processes from reading it ✅<<'EOF'heredoc with${{ secrets.XXX }}is the right pattern — Gitea template substitution happens before the shell executes, and the single-quoted delimiter prevents shell re-expansion ✅The non-secret/secret split is clean.
POSTGRES_USER=archivin the git-trackedobs.envis fine — a database username is not a secret. The password is in the CI-injected file. The line is drawn correctly.The security comment in
runner-config.yamlis explicit.Naming the exact capabilities ("root-equivalent access to the host Docker daemon", "Read/write access to
/opt/familienarchiv/") and the exact trust model ("single-tenant, trusted authors only") is exactly the right documentation style. ✅Concerns (non-blocking but worth tracking)
Expanded blast radius is documented but not mitigated.
The new
/opt/familienarchivmount means any CI job step — including steps that install npm packages, runmvnw, or execute plugin code — runs in a container that can write to/opt/familienarchiv/docker-compose.observability.yml. If a supply-chain compromise reaches any CI dependency (a Maven plugin, an npm package, a base runner image layer), it can overwrite production compose files.This is accepted risk for a single-tenant private repo. But I'd recommend one lightweight mitigation: file integrity monitoring on
/opt/familienarchiv/. Something as simple as a nightly cron that checksums the compose file and alerts if it changes between expected deploy windows. Therunner-config.yamlwarning is good documentation; a tripwire is an actual control.obs-secrets.envon the host filesystem.The file lives at
/opt/familienarchiv/obs-secrets.envwithchmod 600. Since the CI runner has root-equivalent Docker access, any job container could mount the host filesystem and read it (docker run -v /opt/familienarchiv:/mnt ...). This is the same trust boundary as the docker socket — already accepted in the existingrunner-config.yamlthreat model. Noting it for completeness so the threat model is fully explicit.GlitchTip receives DB credentials as environment variables.
In
docker-compose.observability.yml:This is standard for containerised apps and acceptable here. GlitchTip is a trusted internal service. Just confirming this is intentional and in scope of the existing threat model.
No action required
The
POSTGRES_HOSTvariable doesn't introduce injection risk — it's interpolated into a connection string, not a shell command. The compose file handles it safely.🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
What's good
The PR adds two genuine quality gates that weren't there before:
"Validate observability compose config" — a dry-run that resolves all variable substitutions before containers start. This catches undefined variables and YAML errors early. Exactly the right pattern. ✅
"Assert observability stack health" — explicit health check on
obs-loki,obs-prometheus,obs-grafana,obs-tempoafterdocker compose up --wait. The--waitflag only covers services with healthcheck directives; this step covers the gap for services where healthcheck exists but might be missed. The collect-then-exit pattern (accumulating all unhealthy services beforeexit 1) gives full diagnostic output rather than stopping at the first failure. ✅Concerns
GlitchTip connectivity is not automatically verified.
The test plan includes:
But neither
obs-glitchtipnorobs-glitchtip-workerhave Docker healthchecks, and neither is in theAssert observability stack healthstep. IfPOSTGRES_HOSTis misconfigured:docker compose up --waitpasses (no healthcheck to fail)This is a gap between the test plan and what CI actually verifies. Adding a healthcheck to the GlitchTip service in the compose file would close it — something like:
Then the assertion step can include
obs-glitchtip. Without this, thePOSTGRES_HOSTtest plan item remains manual.The test plan is checkbox-based but not automated.
Items 3–5 in the test plan are manual checks:
POSTGRES_HOSTsetarchive-dbby defaultThe port binding can be automated in CI:
Not blocking — this is a bug-fix PR and manual verification is documented — but these items would be good candidates for automated smoke tests in a follow-up.
No issues with static analysis
The
infra/observability/obs.envfile is git-tracked non-secret config. It's clean and well-commented. The variable inventory in DEPLOYMENT.md §4 matches the actual file contents. ✅📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This is a pure infrastructure bug-fix PR with no product-facing requirements implications. The scope is well-contained: three specific breakages, three specific fixes, each documented with the exact error message or symptom that triggered it.
What's clear
The PR description is exemplary requirements documentation for an ops fix:
field processors not found in type generator.Config), names the version where the schema changed (2.7.2), and explains the fix (removed duplicate block, kept theoverridesconfig)archive-dbpreserved for production)Bind for 127.0.0.1:3001 failed)The test plan covers all three fixes with checkable items. ✅
One open question (non-blocking)
The
POSTGRES_HOSTconvention —archiv-staging-db-1,archiv-production-db-1— derives from Docker Compose's container naming scheme:{project_name}-{service}-{instance}. This is an implicit convention, not a configured value.The comments in CI acknowledge this:
From a requirements perspective: is there a defined constraint on Compose project names for the staging and production stacks? If
archiv-stagingandarchiv-productionare established conventions (which the DEPLOYMENT.md suggests they are), then the implicit dependency is acceptable. If the project names could ever change as part of a migration, these hardcoded host names in CI become a silent failure waiting to happen.Suggestion: document the Compose project names as a first-class configuration constant somewhere (e.g., in DEPLOYMENT.md §3 or as a comment in the CI env files), so a future rename decision knows to also update
POSTGRES_HOSTin the CI secrets.🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved — LGTM
This PR contains zero frontend changes. No Svelte components, no routes, no CSS, no ARIA attributes, no Tailwind classes were modified.
Checked:
frontend/src/.sveltefilelayout.cssor design tokensNothing in scope for UI, UX, or accessibility review. Approved unconditionally from a design perspective.
If any follow-up work adds GlitchTip's error reporting UI or Grafana dashboards to the user-facing application surface, loop me in for the UX review at that point.
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
Well-engineered. Three immediate wins stand out right away:
GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY,POSTGRES_PASSWORDpreviously lived in the.env.staging/.env.productionheredocs written to the workspace. They're now in/opt/familienarchiv/obs-secrets.env(chmod 600, written fresh from Gitea secrets). Net improvement.rsync -a --deleteforinfra/observability/— stale config files are removed on every run. Clean mirror.docker compose config --quietcatches missing vars and YAML errors before containers start. This is exactly the right gate.GlitchTip healthcheck was overdue —
docker compose up --waitnow actually covers it, and the assert-health step can meaningfully verify its status.Suggestions (not blockers)
docker-compose.observability.ymlusescp, notrsync.The
infra/observability/directory getsrsync -a --delete(clean mirror), but the compose file itself is a plaincp. If the file is ever deleted from the repo, the stale copy persists at/opt/familienarchiv/. Acknowledged in ADR-016 and DEPLOYMENT.md. Low risk given how infrequently this file changes, but worth keeping in mind.POSTGRES_HOST=archive-dbin obs.env + manual restarts.If someone restarts the obs stack on the server using only
--env-file obs.env(forgetting the secrets file),POSTGRES_HOSTdefaults toarchive-db. The manual restart command in DEPLOYMENT.md is correct (includes both files). No code change needed — this is a doc awareness note./opt/familienarchiv/directory permissions.obs-secrets.envgetschmod 600(good). The directory itself has default permissions. Not a blocker for a single-tenant runner, but worth knowing if another service ever gets write access to the runner host.Everything else is clean. The two-source env model (git-tracked non-secrets + CI-written secrets) is the right pattern here.
👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
Pure infrastructure — no Java, TypeScript, or Svelte touched. Reviewing the CI scripts on their own terms.
What I verified
HEREDOC quoting is correct.
The workflow steps use
<<'EOF'(single-quoted delimiter), which prevents the shell from expanding anything between the EOF markers. The${{ secrets.X }}tokens are Gitea Actions template expressions — they're substituted by the Actions engine before the YAML is executed as a shell script. So by the time the shell runs, the secrets are already plain values. The'EOF'quoting just prevents a second round of shell expansion on those already-expanded values (important if a secret contains$(), backticks, or similar). This is intentionally correct.Assert-health script logic is clean.
Accumulate-then-fail pattern is the right approach — you see all unhealthy services in one run, not just the first. The
2>/dev/null || echo "missing"handles containers that failed to start entirely (no inspect output).Comments justify WHY, not WHAT.
The multi-line comments per CI step are extensive, but warranted. The env-file ordering rationale, which services lack healthchecks, why absolute paths are needed for bind mounts — these are non-obvious infrastructure constraints that would take time to re-derive. No comment explains what a YAML key does; they all explain the constraint behind the choice. That's the bar.
Suggestion
The
Validate observability compose configandAssert observability stack healthsteps are copy-pasted identically betweennightly.ymlandrelease.yml. The comments even note "Keep in sync with the equivalent step in nightly.yml (#603)." This is manual sync debt — if the assert logic changes, it needs to be updated in two places. For now it's fine, but worth extracting into a composite action if this pattern expands further.🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
What improved
Secrets out of CI workspace files — significant improvement.
Previously,
GRAFANA_ADMIN_PASSWORD,GLITCHTIP_SECRET_KEY, andGLITCHTIP_DOMAINwere written into the CI workspace as part of.env.staging/.env.production. Those workspace files get deleted after the run, but the secrets briefly existed as plaintext in a path that's potentially logged, snapshotted, or cached by the runner. They now live only in/opt/familienarchiv/obs-secrets.env(chmod 600), written directly to the permanent path from Gitea secrets. This is a meaningful reduction in secret exposure surface.Two-source model is correctly structured.
Non-secret config in git-tracked
obs.env(ports, URLs, usernames). Secrets in CI-writtenobs-secrets.env(passwords, keys). The separation is real —obs.envcontains nothing sensitive. Rotating a secret only requires updating the Gitea secret; no PR needed.HEREDOC quoting (confirmed correct)
<<'EOF'is the right choice here. It prevents the shell from performing a second round of variable expansion on the already-substituted secret values. If a secret contained$(command), backticks, or${var}, shell expansion without'EOF'quoting would interpret them. Single-quoting the delimiter is the defensive choice. (CWE-78 prevention at CI layer.)Runner mount blast radius (acknowledged, acceptable)
ADR-016 and
runner-config.yamlboth document this explicitly:For a single-tenant runner running trusted, private-repo code only, this is a sound judgment call. The security note in runner-config.yaml is unusually honest and thorough — it correctly names the blast radius and explicitly warns against adding external contributors. That documentation is itself a security control.
Minor observations (not blockers)
/opt/familienarchiv/directory permissions are default (755).obs-secrets.envis protected at file level (600). Acceptable since the runner is single-tenant.POSTGRES_USER=archivappears in git-trackedobs.env. Usernames are generally not considered secret, so this is standard practice. No issue.🏛️ Markus Keller (@mkeller) — Application Architect
Verdict: ✅ Approved
ADR quality
ADR-016 is well-constructed. The context explains the specific failure mode (workspace wipe breaks bind-mount sources). Three alternatives are considered with concrete rejection reasons — not hand-waving. The consequences section honestly names both negatives:
cpdoesn't remove deleted files, and the runner mount expands blast radius. An ADR that hides negatives is worse than no ADR. This one doesn't.Documentation checklist
Per the doc-update matrix:
docs/architecture/c4/l2-containers.pumldocs/DEPLOYMENT.mddocs/adr/No new runtime services were added (GlitchTip already existed in the C4 diagram). The
obs.envtwo-source model is a configuration pattern, not a new container, so no new C4 node is needed. All required docs are updated.Architectural observation
The co-location decision at
/opt/familienarchiv/is architecturally correct. GlitchTip is coupled to the main app's PostgreSQL (archive-db) andarchiv-net. Keeping it at a separate path (e.g.,/opt/obs/) would imply isolation that doesn't actually exist. Co-location reflects the real coupling.The two-source env model — git-tracked non-secrets + CI-written secrets — is a 12-factor pattern applied appropriately here. The boundary between the two files is principled: anything that can be in a PR review is in
obs.env; anything that requires a secret rotation is inobs-secrets.env. No mixing.Minor suggestion
The "Note:
cp -rdoes not remove deleted files" warning in DEPLOYMENT.md appears in a blockquote after an extensive code block and table. It's important operational knowledge. A> ⚠️prefix or a dedicated "Pitfalls" callout would make it harder to miss. Low priority.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing in Brownfield mode — auditing the PR against issue #601's stated requirements.
PR description quality
Excellent. Each of the three fixes includes:
field processors not found in type generator.Config,Bind for 127.0.0.1:3001 failed, DNS resolution failure)overrides)This is the level of context that makes a PR reviewable without chasing down the original issue.
Requirements coverage
metrics_generator.processorsfrom top-level${POSTGRES_HOST:-archive-db}variable substitutionNo scope creep. No missing requirements.
Test plan completeness
All five checkboxes are verifiable and directly tied to the three bugs. No ambiguity in what "passing" looks like for each. The
docker compose configvalidates-without-errors check is the correct gate for item 1.One clarifying question (not a blocker)
obs.envsetsPOSTGRES_HOST=archive-dband the comment says "For local dev, set POSTGRES_HOST in your .env file (defaults to archive-db there)." But obs.env itself sets the value toarchive-db. For a developer who runs the obs stack locally with--env-file obs.env, they'd getarchive-dbas the hostname. Is that the correct container name in the dev compose setup (i.e., does the dev docker-compose definecontainer_name: archive-dbexplicitly)? If yes, the comment should say "defaults to archive-db via obs.env" rather than "set it in your .env file." Minor phrasing clarification only.🧪 Sara Holt (@saraholt) — QA Engineer
Verdict: ✅ Approved
No application test code to review. Evaluating the CI verification additions.
What was added for verification
Validate observability compose configstepdocker compose ... config --quietis a static correctness check: resolves all variable substitutions and reports any missing required keys or YAML errors. This runs before any container starts, which is the right gate placement. Catches broken configs fast and cheaply.Assert observability stack healthstepExplicitly loops 5 named services:
obs-loki,obs-prometheus,obs-grafana,obs-tempo,obs-glitchtip. The script:2>/dev/null || echo "missing"to handle containers that failed to start entirely (inspect fails, status becomes "missing", which != "healthy" — correctly flagged)::error::format for Gitea Actions error annotationsThe 5 services checked match exactly those described as "healthchecked critical services" in the step comment. The 4 excluded services (promtail, cadvisor, node-exporter, glitchtip-worker) correctly have no healthcheck directive, so asserting their health status would always fail.
GlitchTip healthcheck added
Before this PR,
docker compose up --waitsilently skipped GlitchTip because it had no healthcheck. Now it has one:wget -qO- http://localhost:8080/_health/ || exit 1, 30s interval, 5 retries, 60s start_period. The 60s start_period is appropriate — GlitchTip runs Django migrations on first start. The assert-health step can now actually verify it.Known gap (acceptable)
No rollback assertion after failure. If the obs stack fails health checks, the pipeline fails (correct) but does not attempt recovery. Subsequent CI runs will also fail until the issue is resolved manually. Since the obs stack is secondary infrastructure (not the main app pipeline), this is an acceptable operational constraint. Worth documenting in a runbook if it becomes a recurring pain point.
Duplication in nightly.yml / release.yml
The
ValidateandAssertsteps are copy-pasted between both files. The comments acknowledge this ("Keep in sync with the equivalent step in nightly.yml (#603)"). If the assertion logic needs to change (e.g., adding a new service to the health check list), both files must be updated. No immediate action needed, but a composite action would eliminate this debt.🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No UI changes in this PR. Every touched file is infrastructure:
.gitea/workflows/nightly.yml— CI pipeline.gitea/workflows/release.yml— CI pipelinedocker-compose.observability.yml— container configrunner-config.yaml— Gitea Act runner configinfra/observability/obs.env— non-secret env configinfra/observability/tempo/tempo.yml— Tempo configdocs/adr/016-obs-stack-co-location-ci-push.md— ADRdocs/DEPLOYMENT.md— ops documentationdocs/architecture/c4/l2-containers.puml— architecture diagram.env.example— developer onboardingNo Svelte components, no routes, no Tailwind classes, no i18n keys, no design tokens, no accessibility implications. No UX concerns raised.