# devops: move observability stack out of CI workspace into /opt/familienarchiv/ #601
## Problem
The observability stack (`docker-compose.observability.yml`) is currently managed from the Gitea Act runner's workspace at `/srv/gitea-workspace/marcel/familienarchiv/`. This caused a production outage tonight:

- `obs-loki` kept running — but its bind-mounted config file (`infra/observability/loki/loki-config.yml`) was deleted with the workspace. On recreation, Docker turned the missing source into a directory and Loki failed with `read /etc/loki/loki-config.yml: is a directory`.
- Everything with `depends_on: loki` (Promtail, Grafana, Tempo) failed to start in CI.
- Manual recovery required: stop + rm the container, restore the config file, SCP the compose file back, recreate from compose.
## Root cause
The observability stack should be infrastructure — always-on, independent of CI. Running it from the CI workspace couples a long-running service to an ephemeral directory.
## Goal
Move all observability-related files to `/opt/familienarchiv/` (where the main app stack already lives permanently) and ensure all containers reference that permanent location.

## Acceptance criteria

- `docker-compose.observability.yml` is deployed to and managed from `/opt/familienarchiv/`
- All bind-mount source paths point to `/opt/familienarchiv/infra/observability/…` (or a stable subfolder)
- `obs-*` containers show a `project.config_files` label pointing to `/opt/familienarchiv/`, not `/srv/gitea-workspace/…` (see the label check sketched after this list)
- A deploy mechanism syncs `infra/observability/` from git to the server (so new config changes are applied on deploy, not manually)
- No more manual `docker rm` + `docker run` recovery after a workspace wipe
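The label criterion can be spot-checked via the standard labels Docker Compose puts on every container it creates. A minimal sketch, assuming the `obs-` name prefix used throughout this issue:

```bash
# Each compose-managed container records which compose file created it.
# After migration this must point at /opt/familienarchiv/.
for c in $(docker ps --format '{{.Names}}' --filter name=obs-); do
  docker inspect "$c" \
    --format '{{.Name}}: {{index .Config.Labels "com.docker.compose.project.config_files"}}'
done
```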
## Out of scope

## Context
- Server: `root@raddatz.cloud`
- Permanent location: `/opt/familienarchiv/`
- Ephemeral CI workspace: `/srv/gitea-workspace/marcel/familienarchiv/`
- Affected containers: `obs-loki`, and anything depending on it (`obs-promtail`, `obs-grafana`, `obs-tempo`)

## 🔧 Tobias Wendt — DevOps & Platform Engineer
### Observations
- `docker-compose.observability.yml`: five services have config file bind mounts pointing to `./infra/observability/…` — `prometheus.yml`, `loki-config.yml`, `promtail-config.yml`, `tempo.yml`, and the entire `grafana/provisioning/` directory. All five will produce the same "is a directory" crash if any of them is missing when a container is recreated. Only Loki crashed tonight because it was the first `depends_on` chain to fail, but the others are equally at risk on any future Docker restart/recreate.
- `loki_data`, `prometheus_data`, `grafana_data`, `tempo_data`, `promtail_positions`, `glitchtip_data` are all Docker-managed named volumes. They survive `docker compose down`, `docker rm`, and changing the working directory. Migration will not lose stored data.
- `scripts/` contains data-prep and schema utilities — nothing that deploys or syncs the observability stack to the server.
- `glitchtip/glitchtip:v4` is not pinned. Every other image in the compose uses a specific version (`grafana/loki:3.4.2`, `grafana/grafana-oss:11.6.1`, etc.). `v4` is a moving tag — a future 4.x release will be pulled silently without Renovate noticing a version bump.
- `.env.staging` is currently read from the workspace path. It contains `GLITCHTIP_SECRET_KEY` and `GRAFANA_ADMIN_PASSWORD`. These values need to be present at `/opt/familienarchiv/.env.staging` (or equivalent) before the containers can be recreated from the new location.

### Recommendations
- Move `docker-compose.observability.yml` and `infra/observability/` into `/opt/familienarchiv/`. The compose's working dir becomes `/opt/familienarchiv/` — all relative `./infra/observability/…` paths resolve correctly without changing the compose file.
- Add `scripts/deploy-observability.sh` with three steps: (a) rsync `infra/observability/` from the repo to `/opt/familienarchiv/infra/observability/`, (b) `docker compose -f docker-compose.observability.yml up -d --remove-orphans`, (c) smoke-test that each `obs-*` container is healthy. Wire this into the release workflow after the main stack is confirmed healthy. (A minimal sketch follows this list.)
- Pin `glitchtip/glitchtip:v4` to a specific patch version (e.g. `v4.0.6`) and add it to `renovate.json` so version bumps create PRs.
- Verify `.env.staging` at the new location has all required keys before the migration runs. The deploy script should fail fast with a named error if `GLITCHTIP_SECRET_KEY` is unset rather than silently starting with a broken config.
- `docker compose -f /opt/familienarchiv/docker-compose.observability.yml up -d` should be the documented command in the runbook — not the workspace path.
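A minimal sketch of that script, under the assumptions in this thread (server `root@raddatz.cloud`, env file at `/opt/familienarchiv/.env.staging`; the key list and the no-retry health check are illustrative, not a final design):

```bash
#!/usr/bin/env bash
# Sketch of scripts/deploy-observability.sh: (a) sync configs, (b) recreate, (c) smoke test.
set -euo pipefail   # deliberately no global set -x: secret values must never hit CI logs

SERVER="root@raddatz.cloud"
DEST="/opt/familienarchiv"

# Fail fast if a required key is missing at the destination. Print the name, never the value.
for key in GLITCHTIP_SECRET_KEY GRAFANA_ADMIN_PASSWORD; do
  ssh "$SERVER" "grep -q '^${key}=' '$DEST/.env.staging'" \
    || { echo "ERROR: ${key} missing in $DEST/.env.staging" >&2; exit 1; }
done

# (a) Sync the whole config tree, not a single subdirectory.
rsync -az --delete infra/observability/ "$SERVER:$DEST/infra/observability/"

# (b) Recreate the stack from its permanent working directory.
ssh "$SERVER" "cd '$DEST' && docker compose -f docker-compose.observability.yml up -d --remove-orphans"

# (c) Smoke test: every obs-* container must report a healthy healthcheck.
# (A production version should retry with a timeout — see Sara's wait loop below.)
ssh "$SERVER" '
  for c in $(docker ps --format "{{.Names}}" --filter name=obs-); do
    s=$(docker inspect "$c" --format "{{.State.Health.Status}}" 2>/dev/null || echo none)
    echo "$c: $s"
    [ "$s" = "healthy" ] || exit 1
  done
'
```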
## 🏗️ Markus Keller — Application Architect

### Observations
- The obs stack should be treated like the app stack at `/opt/familienarchiv/`, not like a CI artifact. Running it from the CI workspace created an implicit lifecycle coupling: the monitoring system's uptime depended on an ephemeral directory that CI routinely destroys. This is the architectural bug; the bind-mount directory issue is just how it manifested.
- Once the configs live at `/opt/familienarchiv/`, someone must keep `infra/observability/*.yml` in sync between git and the server when configs change. This sync path currently doesn't exist. Without it, a Prometheus scrape target added in git will silently fail to apply on the server until someone manually copies the file — which defeats the purpose.
- `docs/DEPLOYMENT.md` has no section describing how to operate the observability stack. After this migration, `docs/architecture/c4/l2-containers.puml` should show the obs stack running from `/opt/familienarchiv/` and DEPLOYMENT.md should document the startup procedure and update mechanism.
- Why co-locate under `/opt/familienarchiv/` with the app stack rather than having its own directory or systemd service? The alternatives (e.g., `/opt/familienarchiv-obs/`, a separate git checkout, systemd units) have different trade-offs around access control and env file sharing. Documenting the choice prevents reversals without context.

### Recommendations
- Document the decision to co-locate `docker-compose.observability.yml` with the app stack under `/opt/familienarchiv/`. Key points: the shared `archiv-net` network requires both stacks under Docker Compose, the shared `.env.staging` avoids credential duplication, and a single `docker compose` invocation per stack keeps operations simple. Note the alternative (separate directory per stack) and why it was rejected.
- Extend `docs/DEPLOYMENT.md` to cover: (a) starting the obs stack, (b) how config changes in `infra/observability/` are applied to the server (the deploy script), (c) how to verify the obs stack is healthy.
- Update `docs/architecture/c4/l2-containers.puml` to show obs services (`obs-loki`, `obs-grafana`, etc.) as managed from `/opt/familienarchiv/` — this is a new infrastructure component location that the diagram should reflect.
## 🔐 Nora "NullX" Steiner — Application Security Engineer

### Observations
- Network exposure is unchanged: Grafana binds to `127.0.0.1:3001`, Prometheus to `127.0.0.1:9090`, and no obs services are exposed to the internet. Moving the compose to `/opt/familienarchiv/` doesn't change any of this.
- Grafana's admin password falls back to `changeme`. The compose has `GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}`. If `.env.staging` at the new location is missing this key, Grafana starts with a known default admin password. This is not new, but the migration is the right moment to verify the key is present at the destination before recreating the container.
- `glitchtip/glitchtip:v4` is a supply-chain risk. An unpinned floating tag means any future GlitchTip 4.x release is pulled automatically on `docker compose pull`. GlitchTip processes error reports that may contain application secrets and stack traces. Pin to a specific patch version and route updates through Renovate PRs.
- The deploy script will handle `GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, and `POSTGRES_PASSWORD`. These must not appear in `set -x` output or CI logs. The script should validate that required env vars are non-empty (exit 1 with a name, not a value) rather than echoing them.
- Promtail's access to the Docker daemon means a compromised Promtail has full daemon access. This is an accepted risk given the single-operator context, but it means Promtail's container image should be pinned and Renovate-managed — it already is (`grafana/promtail:3.4.2`). Maintain this discipline after migration.

### Recommendations
- Verify `GRAFANA_ADMIN_PASSWORD` is set to a non-default value in `/opt/familienarchiv/.env.staging`. Add this check to the migration runbook.
- Pin `glitchtip/glitchtip:v4` → `glitchtip/glitchtip:v4.0.6` (or current patch). Add to `renovate.json` under Docker image rules so future patch bumps create PRs.
- In the deploy script, don't enable `set -x` globally. Validate required keys with `[ -n "${GLITCHTIP_SECRET_KEY}" ] || { echo "GLITCHTIP_SECRET_KEY is required"; exit 1; }` — log the key name, never the value. (A generalized form follows this list.)
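Generalizing that one-liner to every secret this stack needs is a small sketch; the variable list comes from this issue, the helper name is ours:

```bash
# Validate required secrets without ever printing their values.
require_env() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name
    [ -n "${!name:-}" ] || { echo "ERROR: ${name} is required but unset/empty" >&2; exit 1; }
  done
}

require_env GLITCHTIP_SECRET_KEY GRAFANA_ADMIN_PASSWORD POSTGRES_PASSWORD
```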
## 🧪 Sara Holt — QA Engineer & Test Strategist

### Observations
- No automated check currently runs `docker ps --filter name=obs-loki` or pings `/ready` on Loki. Tonight's outage was discovered when the nightly CI failed because it depended on Loki — not because a dedicated health check fired. The fix should include a proactive health check step, not just a reactive one.
- `docker compose down` does not delete named volumes by default, and changing the working directory of a compose project doesn't migrate volumes — they stay under their original Docker volume names. This is correct behavior here (existing `loki_data`, `grafana_data` etc. will be found by the new compose invocation because the volume names match), but it should be explicitly validated during the migration, not assumed.
- If the migration fails partway (`docker compose up` succeeds for some services but not others), the issue doesn't describe how to recover. The old compose is gone from the workspace; a rollback would require recreating it from git.

### Recommendations
- Add a post-deploy gate: `docker inspect obs-loki --format '{{.State.Health.Status}}'` returning `healthy` as a required pre-condition. This catches the problem before dependent services fail.
- Validate during the migration that the existing named volumes are picked up unchanged — `docker volume inspect` can confirm this.
- Fail the deploy script if any `obs-*` container is not `healthy` within 60 seconds of `docker compose up` (a wait-loop sketch follows this list).
- Document a rollback path for partial failure — e.g. "`git show HEAD:docker-compose.observability.yml > /srv/gitea-workspace/.../docker-compose.observability.yml` and recreate from there" — or document that the git-tracked file in `/opt/familienarchiv/` IS the rollback artifact.
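A sketch of that wait loop — the container names come from this issue, the 5-second interval and retry count are assumptions:

```bash
# Wait up to 60s for every obs-* container to report a healthy healthcheck.
for c in obs-loki obs-promtail obs-grafana obs-tempo obs-prometheus; do
  ok=""
  for _ in $(seq 1 12); do            # 12 x 5s = 60s budget per container
    s=$(docker inspect "$c" --format '{{.State.Health.Status}}' 2>/dev/null || true)
    if [ "$s" = "healthy" ]; then ok=1; break; fi
    sleep 5
  done
  [ -n "$ok" ] || { echo "FAIL: $c not healthy within 60s (last status: ${s:-unknown})"; exit 1; }
done
echo "all obs-* containers healthy"
```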
## 👨💻 Felix Brandt — Senior Fullstack Developer

### Observations
- This migration requires updates to `docs/architecture/c4/l2-containers.puml` and `docs/DEPLOYMENT.md`. These updates should be part of the same PR that closes this issue.
- The `scripts/` directory has no deployment scripts — only data-prep utilities. The new `scripts/deploy-observability.sh` (or equivalent) will be the first deploy script added. Its naming and placement should follow the existing convention: lowercase, hyphenated, `.sh` suffix.
- The `## Infrastructure` section points readers to `docs/DEPLOYMENT.md` for operational details. That file should gain a section on the obs stack after this migration so future sessions have a reliable reference.

### Recommendations
- Include the `docs/DEPLOYMENT.md` and `docs/architecture/c4/l2-containers.puml` updates in the PR that closes this issue — don't leave them as follow-up. The docs table makes these mandatory, not optional.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist
No concerns from my angle.
This issue is entirely server-side infrastructure — no user-facing routes, Svelte components, or UI states are affected. The observability stack (Grafana, Loki, Prometheus) is an internal operator tool, not part of the family-facing application. Moving its compose file to a different directory on the server has zero impact on the UI, accessibility, or responsiveness of the Familienarchiv frontend.
I reviewed the issue to confirm no frontend or Grafana UI provisioning changes are implied — provisioning files (`datasources.yml`, `dashboards.yml`) are being moved as-is, not modified. Grafana dashboard JSON files (`loki-logs.json`, `node-exporter-full.json`, `spring-boot-observability.json`) move unchanged.

## 📋 Elicit — Requirements Engineer
### Observations
- AC #4's "deploy mechanism" is ambiguous between two models: (A) CI-push — the release workflow copies config files to the server, then runs `docker compose up`; (B) server-pull — a script on the server runs `git pull` and then `docker compose up`. These have different failure modes, auth requirements, and dependency on network topology. The AC should name the mechanism.
- Five config sources are bind-mounted: `loki/loki-config.yml`, `prometheus/prometheus.yml`, `promtail/promtail-config.yml`, `tempo/tempo.yml`, and `grafana/provisioning/` (a directory, not a file). AC #2 says "bind-mount source paths … point to /opt/familienarchiv/infra/observability/…" — this is correct but should be explicit that the grafana provisioning directory is included.
- The cutover requires recreating containers (at minimum `docker rm` to pick up the new compose working directory). Named volumes (`loki_data`, `grafana_data`, `prometheus_data`) must be preserved — 30 days of log history and Grafana dashboard state live in them. This is a "done" condition that's missing from the ACs.
- During the cutover, `obs-*` containers will be briefly offline. If anything in the nightly CI depends on Loki being healthy (it does — tonight's failure proves this), the migration should be timed to avoid CI runs. This is an operational constraint the issue doesn't name.

### Recommendations
- Add an AC: "All named volumes (`loki_data`, `grafana_data`, `prometheus_data`, `tempo_data`, `promtail_positions`, `glitchtip_data`) verified non-empty after migration — no historical data lost." (A verification sketch follows this list.)
- Add an AC: "Every `obs-*` container reports `healthy` (per Docker healthcheck) within 90 seconds of running the deploy script."
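One way to verify the volume AC — a sketch that must run as root on the host, since it reads the volume mountpoints directly:

```bash
# Confirm each named volume still exists and is non-empty after the cutover.
for v in loki_data prometheus_data grafana_data tempo_data promtail_positions glitchtip_data; do
  mp=$(docker volume inspect "$v" --format '{{.Mountpoint}}') || { echo "missing volume: $v"; exit 1; }
  [ -n "$(ls -A "$mp")" ] || { echo "volume is empty: $v"; exit 1; }
done
echo "all volumes present and non-empty"
```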
## 🗳️ Decision Queue — Action Required

1 decision needs your input before implementation starts.
### Infrastructure
How should config changes in `infra/observability/` reach the server? Two models: (A) CI-push — the release workflow rsyncs config files to `/opt/familienarchiv/infra/observability/` as part of deployment, then runs `docker compose up`; (B) server-pull — a script on the server does `git pull` in `/opt/familienarchiv` and then `docker compose up`. Model A is consistent with how the current release workflow already copies built artefacts to the server; model B requires the server to have git credentials and a git remote configured, but avoids an extra rsync step. Model A keeps the server stateless (no git checkout on the server); model B means the server IS the git checkout and changes are always in sync. (Raised by: Elicit)
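For concreteness, the core command of each model under this issue's assumed paths and host (a sketch, not a decision):

```bash
# Model A (CI-push): run from the CI job — one rsync, then recreate on the server.
rsync -az --delete infra/observability/ root@raddatz.cloud:/opt/familienarchiv/infra/observability/
ssh root@raddatz.cloud 'cd /opt/familienarchiv && docker compose -f docker-compose.observability.yml up -d'

# Model B (server-pull): run on the server, which is itself a git checkout.
git -C /opt/familienarchiv pull --ff-only
docker compose -f /opt/familienarchiv/docker-compose.observability.yml up -d
```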
## 🔍 Additional findings from tonight's recovery — update scope

While attempting to bring up the full observability stack after fixing Loki, three more issues were found. All should be addressed in the implementation of this issue.
### 1. All five config bind mounts were affected, not just Loki
When Docker auto-created directories for missing bind-mount sources, the following paths became directories instead of files on the workspace host:
- `infra/observability/prometheus/prometheus.yml` → directory
- `infra/observability/tempo/tempo.yml` → directory
- `infra/observability/promtail/` → directory missing entirely

`obs-tempo` was crash-looping for the same reason as `obs-loki`. `obs-prometheus`, `obs-promtail`, `obs-grafana`, `obs-glitchtip` were all in `Created` state (not running) because their config files were also missing or corrupted.

**Impact on scope:** the deploy script must rsync the entire `infra/observability/` tree, not just the loki subdirectory. (A quick sanity check is sketched below.)
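A post-sync sanity check — file list taken from this issue, assuming the `/opt/familienarchiv` working directory:

```bash
# Every bind-mounted config source must be a regular file (or, for grafana
# provisioning, a directory) — never a stub directory auto-created by Docker.
cd /opt/familienarchiv
for f in infra/observability/prometheus/prometheus.yml \
         infra/observability/loki/loki-config.yml \
         infra/observability/promtail/promtail-config.yml \
         infra/observability/tempo/tempo.yml; do
  [ -f "$f" ] || { echo "not a regular file: $f"; exit 1; }
done
[ -d infra/observability/grafana/provisioning ] || { echo "missing grafana/provisioning"; exit 1; }
```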
### 2. `glitchtip/glitchtip:v4` does not exist on Docker Hub

The compose image tag `glitchtip/glitchtip:v4` fails to pull — `not found`. The running worker was using `6.1.6`. The compose file must be updated to a pinned tag before the migration is run (this also addresses Nora's supply-chain concern).

**Fix:** change `glitchtip/glitchtip:v4` → `glitchtip/glitchtip:6.1.6` in `docker-compose.observability.yml`. Add this image to Renovate.
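Before committing to any pinned tag, its existence on the registry can be confirmed without pulling the image (assumes registry access from the shell):

```bash
# Exits non-zero if the tag does not exist on Docker Hub.
docker manifest inspect glitchtip/glitchtip:6.1.6 > /dev/null && echo "tag exists"
```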
### 3. `.env.staging` was lost with the workspace wipe — secrets must be documented

The env file that GlitchTip and Grafana read (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `POSTGRES_USER`, `POSTGRES_PASSWORD`) was stored only in the workspace and was deleted when CI wiped it. It had to be reconstructed by extracting values from running container environment variables.

**Impact on scope:** the observability-specific env vars (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_DOMAIN`, `PORT_GRAFANA`, `PORT_PROMETHEUS`) must be merged into the persistent `/opt/familienarchiv/.env` file (shared with the main app stack). The `.env.example` should be updated to include these keys with placeholder values.
### 4. GlitchTip superuser has never been created

The `obs-glitchtip-db-init` service creates the PostgreSQL database, but does not create the GlitchTip Django superuser. The command `docker exec obs-glitchtip ./manage.py createsuperuser` must be run manually once after first startup. This should be documented in `docs/DEPLOYMENT.md` as a one-time post-migration step.

### Updated acceptance criteria (additions)
- `glitchtip/glitchtip:v4` updated to a pinned patch version in `docker-compose.observability.yml`
- Observability env vars (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_DOMAIN`, port vars) added to `.env.example` with placeholder values
- `docs/DEPLOYMENT.md` documents the one-time `createsuperuser` step for GlitchTip

## 🔍 Second round of findings — full migration attempt tonight
After the first recovery comment, a full `docker compose up -d` was attempted from `/opt/familienarchiv/`. Five more issues surfaced. All are now resolved in the local compose/config files but not yet committed.

### 5. Tempo 2.7.2 rejects `metrics_generator.processors` at the top level

`obs-tempo` crash-looped on startup: the `processors` list was removed from the top-level `metrics_generator` block in Tempo 2.x. It is only valid under `overrides.defaults.metrics_generator.processors`, which was already present in the config.

**Fix (applied locally):** removed the top-level `metrics_generator.processors` block (lines 39–41) from `infra/observability/tempo/tempo.yml`.
Port 3001 is occupied by
archiv-staging-frontend-1. Ports in use on127.0.0.1: 2019 (Caddy admin), 3001 (staging frontend), 3005 (Gitea), 8081 (staging backend), 9090 (Prometheus).Fix (applied in
.env):PORT_GRAFANA=3003. The compose default (:-3001) should also be updated to:-3003to avoid a future footgun if the env var is missing.Note for Caddy config: The nightly build generates a Caddy entry for Grafana. If it references port 3001, it must be updated to 3003.
7.
### 7. `archive-db` hostname is not resolvable — only the staging stack is running

`obs-glitchtip-db-init` failed with an unresolvable database host. No production stack is currently running on the server — only `archiv-staging-*`. The database container is `archiv-staging-db-1`, and `archive-db` does not exist on `archiv-net`. Verified via `nslookup archive-db` from inside `archiv-net`.

**Fix (applied locally):** `DATABASE_URL` and the `db-init` command in `docker-compose.observability.yml` now use `${POSTGRES_HOST:-archive-db}`. The `.env` file sets `POSTGRES_HOST=archiv-staging-db-1` for the current server state. When the production stack is running, this var can be unset (defaulting back to `archive-db`).

**Long-term:** the deploy script should set `POSTGRES_HOST` based on whether the production or staging DB is the target.
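The in-network DNS check can be reproduced like this (`busybox` is an arbitrary small-image choice; anything with `nslookup` works):

```bash
# Resolve service names from inside the compose network.
docker run --rm --network archiv-net busybox nslookup archive-db           # fails today
docker run --rm --network archiv-net busybox nslookup archiv-staging-db-1  # resolves
```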
### 8. `$` characters in `.env` values must be escaped as `$$`

`GRAFANA_ADMIN_PASSWORD` contains `$` characters. Docker Compose interpolates `$VAR` references inside `.env` values, silently replacing them with empty strings. The value `me30g@b$Nt$Z2g` became `me30g@b` after compose read the file.

**Fix:** use `$$` in `.env` for a literal `$`: `GRAFANA_ADMIN_PASSWORD=me30g@b$$Nt$$Z2g`. Docker Compose renders `$$` → `$` when passing the value to the container.
createsuperuserRunning
### 9. GlitchTip required 104 unapplied migrations before `createsuperuser`

Running `createsuperuser` without first migrating produced a warning about 104 unapplied migrations and refused to create the user.

**Fix:** run `./manage.py migrate` before `./manage.py createsuperuser`. Both are one-time post-migration steps and were applied tonight. Note that `createsuperuser` requires an interactive TTY: without `-it`, the command exits silently. This must be documented in `docs/DEPLOYMENT.md` as a one-time step.
## Current state after tonight's fixes

All containers are now managed from `/opt/familienarchiv/` — the permanent migration from the CI workspace is complete for this session.

### Files changed locally (not yet committed)
- `docker-compose.observability.yml` — GlitchTip `v4` → `6.1.6`, `POSTGRES_HOST` variable, `$$` escaping note
- `infra/observability/tempo/tempo.yml` — removed top-level `metrics_generator.processors`
- `/opt/familienarchiv/.env` — created on server (not in git; contains secrets)