devops: move observability stack out of CI workspace into /opt/familienarchiv/ #601

Closed
opened 2026-05-15 21:17:45 +02:00 by marcel · 10 comments
Owner

Problem

The observability stack (docker-compose.observability.yml) is currently managed from the Gitea Act runner's workspace at /srv/gitea-workspace/marcel/familienarchiv/. This caused a production outage tonight:

  1. The CI wiped the workspace between runs (expected behavior).
  2. obs-loki kept running — but its bind-mounted config file (infra/observability/loki/loki-config.yml) was deleted with the workspace.
  3. On the next Docker operation, Docker auto-created a directory at the now-missing source path (Docker's default behavior when a bind-mount source doesn't exist).
  4. Loki entered a permanent crash loop: read /etc/loki/loki-config.yml: is a directory.
  5. All downstream services that depends_on: loki (Promtail, Grafana, Tempo) failed to start in CI.

Manual recovery required: stop + rm the container, restore the config file, SCP the compose file back, recreate from compose.
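For reference, the failing bind mount looked roughly like this — a sketch reconstructed from the paths named in this issue, not the exact compose file:

```yaml
# Sketch only: reconstructed from the paths in this issue; the real
# docker-compose.observability.yml may differ. With the CI workspace as the
# compose working directory, the relative source resolved to the ephemeral
# /srv/gitea-workspace/... path.
services:
  obs-loki:
    image: grafana/loki:3.4.2
    volumes:
      # When the source file was wiped with the workspace, Docker recreated it
      # as an empty directory on the next operation -> "is a directory" crash.
      - ./infra/observability/loki/loki-config.yml:/etc/loki/loki-config.yml:ro
```

Moving the compose working directory to /opt/familienarchiv/ makes the same relative path resolve to a permanent location.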

Root cause

The observability stack should be infrastructure — always-on, independent of CI. Running it from the CI workspace couples a long-running service to an ephemeral directory.

Goal

Move all observability-related files to /opt/familienarchiv/ (where the main app stack already lives permanently) and ensure all containers reference that permanent location.

Acceptance criteria

  • docker-compose.observability.yml is deployed to and managed from /opt/familienarchiv/
  • All bind-mount source paths in the observability compose point to /opt/familienarchiv/infra/observability/… (or a stable subfolder)
  • All obs-* containers show project.config_files label pointing to /opt/familienarchiv/, not /srv/gitea-workspace/…
  • The CI workspace being wiped has zero effect on the running observability stack
  • A deploy script or Makefile target handles syncing infra/observability/ from git to the server (so new config changes are applied on deploy, not manually)
  • No observability container requires a manual docker rm + docker run after a workspace wipe
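The label criterion can be checked mechanically. A hedged sketch (function names are mine, not repo code; Compose v2 records the compose file path in the `com.docker.compose.project.config_files` label, which is what `project.config_files` refers to above):

```shell
#!/usr/bin/env bash
# Hedged sketch of a label check for the acceptance criteria.
# Function names are illustrative, not existing repo code.

# Look up the compose config_files label on a container.
label_of() {
  docker inspect "$1" \
    --format '{{ index .Config.Labels "com.docker.compose.project.config_files" }}'
}

# Pass if the container is managed from /opt/familienarchiv/, fail otherwise.
check_config_path() {
  case "$(label_of "$1")" in
    /opt/familienarchiv/*) return 0 ;;
    *) echo "ERROR: $1 is still managed from an ephemeral path" >&2; return 1 ;;
  esac
}
```

Running `check_config_path` against every `obs-*` container after migration would verify the third criterion without manual inspection.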

Out of scope

  • Changes to what the observability stack monitors (Prometheus targets, Grafana dashboards, Loki retention, etc.)
  • Moving the app stack itself — only observability

Context

  • Server: root@raddatz.cloud
  • Permanent app location: /opt/familienarchiv/
  • CI workspace (ephemeral): /srv/gitea-workspace/marcel/familienarchiv/
  • Affected containers tonight: obs-loki, and anything depending on it (obs-promtail, obs-grafana, obs-tempo)
  • Fixed tonight by manually restoring files; this issue tracks the permanent fix
marcel added the P1-high, devops labels 2026-05-15 21:17:51 +02:00
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Observations

  • The issue is broader than Loki. Reading docker-compose.observability.yml: five services have config file bind mounts pointing to ./infra/observability/…: prometheus.yml, loki-config.yml, promtail-config.yml, tempo.yml, and the entire grafana/provisioning/ directory. All five will produce the same "is a directory" crash if any of them is missing when a container is recreated. Only Loki crashed tonight because it was the first depends_on chain to fail, but the others are equally at risk on any future Docker restart/recreate.
  • Named data volumes are already safe. loki_data, prometheus_data, grafana_data, tempo_data, promtail_positions, glitchtip_data are all Docker-managed named volumes. They survive docker compose down, docker rm, and changing the working directory. Migration will not lose stored data.
  • No deploy script exists. scripts/ contains data-prep and schema utilities — nothing that deploys or syncs the observability stack to the server.
  • glitchtip/glitchtip:v4 is not pinned. Every other image in the compose uses a specific version (grafana/loki:3.4.2, grafana/grafana-oss:11.6.1, etc.). v4 is a moving tag — a future 4.x release will be pulled silently without Renovate noticing a version bump.
  • .env.staging is currently read from the workspace path. It contains GLITCHTIP_SECRET_KEY and GRAFANA_ADMIN_PASSWORD. These values need to be present at /opt/familienarchiv/.env.staging (or equivalent) before the containers can be recreated from the new location.

Recommendations

  1. Move both files together: copy docker-compose.observability.yml and infra/observability/ into /opt/familienarchiv/. The compose's working dir becomes /opt/familienarchiv/ — all relative ./infra/observability/… paths resolve correctly without changing the compose file.
  2. Add scripts/deploy-observability.sh with three steps: (a) rsync infra/observability/ from the repo to /opt/familienarchiv/infra/observability/, (b) docker compose -f docker-compose.observability.yml up -d --remove-orphans, (c) smoke-test each obs-* container is healthy. Wire this into the release workflow after the main stack is confirmed healthy.
  3. Pin glitchtip/glitchtip:v4 to a specific patch version (e.g. v4.0.6) and add it to renovate.json so version bumps create PRs.
  4. Validate the .env.staging at the new location has all required keys before the migration runs. The deploy script should fail fast with a named error if GLITCHTIP_SECRET_KEY is unset rather than silently starting with a broken config.
  5. Update the container restart order: after migration, docker compose -f /opt/familienarchiv/docker-compose.observability.yml up -d should be the documented command in the runbook — not the workspace path.
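A minimal sketch of what such a deploy script could look like — paths, function names, and structure are my assumptions, not existing repo code:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/deploy-observability.sh -- a starting point,
# not the final implementation. Paths and names are assumptions.
set -euo pipefail

DEPLOY_ROOT="${DEPLOY_ROOT:-/opt/familienarchiv}"

# Fail fast with the *name* of a missing env key -- never echo its value.
require_env() {
  key="$1"
  if [ -z "$(printenv "$key" || true)" ]; then
    echo "ERROR: required env var $key is unset" >&2
    return 1
  fi
}

deploy() {
  require_env GLITCHTIP_SECRET_KEY
  require_env GRAFANA_ADMIN_PASSWORD
  # (a) sync config from the repo checkout to the permanent location
  rsync -a --delete infra/observability/ "$DEPLOY_ROOT/infra/observability/"
  # (b) recreate from the permanent location
  docker compose -f "$DEPLOY_ROOT/docker-compose.observability.yml" \
    up -d --remove-orphans
  # (c) smoke test each obs-* container here before declaring success
}
```

Deliberately no global `set -x`: the env validation reports key names only, so secrets never reach CI logs.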
Author
Owner

🏗️ Markus Keller — Application Architect

Observations

  • Architectural misclassification was the root cause. The observability stack is persistent infrastructure — it should be classified and operated like the app stack in /opt/familienarchiv/, not like a CI artifact. Running it from the CI workspace created an implicit lifecycle coupling: the monitoring system's uptime depended on an ephemeral directory that CI routinely destroys. This is the architectural bug; the bind-mount directory issue is just how it manifested.
  • A new runtime sync dependency is being introduced. Once the obs stack lives at /opt/familienarchiv/, someone must keep infra/observability/*.yml in sync between git and the server when configs change. This sync path currently doesn't exist. Without it, a Prometheus scrape target added in git will silently fail to apply on the server until someone manually copies the file — which defeats the purpose.
  • Documentation gap. The acceptance criteria mention a deploy script, but docs/DEPLOYMENT.md has no section describing how to operate the observability stack. After this migration, docs/architecture/c4/l2-containers.puml should show the obs stack running from /opt/familienarchiv/ and DEPLOYMENT.md should document the startup procedure and update mechanism.
  • This decision warrants a short ADR. Why does the obs stack share /opt/familienarchiv/ with the app stack rather than having its own directory or systemd service? The alternatives (e.g., /opt/familienarchiv-obs/, a separate git checkout, systemd units) have different trade-offs around access control and env file sharing. Documenting the choice prevents reversals without context.

Recommendations

  1. Write ADR-016 (or next available number): document the decision to colocate docker-compose.observability.yml with the app stack under /opt/familienarchiv/. Key points: shared archiv-net network requires both stacks under Docker Compose, shared .env.staging avoids credential duplication, operational simplicity of a single docker compose invocation per stack. Note the alternative (separate directory per stack) and why it was rejected.
  2. Update docs/DEPLOYMENT.md to cover: (a) starting the obs stack, (b) how config changes in infra/observability/ are applied to the server (the deploy script), (c) how to verify the obs stack is healthy.
  3. Update docs/architecture/c4/l2-containers.puml to show obs services (obs-loki, obs-grafana, etc.) as managed from /opt/familienarchiv/ — this is a new infrastructure component location that the diagram should reflect.
  4. The deploy script is mandatory, not optional. Without it, config drift between git and the server is guaranteed. The AC "deploy script or Makefile target" should be a hard requirement, not a nice-to-have.
Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Observations

  • No security regression from the migration itself. The network isolation model is correct and unchanged: Grafana bound to 127.0.0.1:3001, Prometheus to 127.0.0.1:9090, no obs services exposed to the internet. Moving the compose to /opt/familienarchiv/ doesn't change any of this.
  • Grafana admin password defaults to changeme. The compose has GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}. If .env.staging at the new location is missing this key, Grafana starts with a known default admin password. This is not new, but the migration is the right moment to verify the key is present at the destination before recreating the container.
  • glitchtip/glitchtip:v4 is a supply-chain risk. An unpinned floating tag means any future GlitchTip 4.x release is pulled automatically on docker compose pull. GlitchTip processes error reports that may contain application secrets and stack traces. Pin to a specific patch version and route updates through Renovate PRs.
  • Secrets must not be logged during the deploy script. The deploy script will handle GLITCHTIP_SECRET_KEY, GRAFANA_ADMIN_PASSWORD, and POSTGRES_PASSWORD. These must not appear in set -x output or CI logs. The script should validate that required env vars are non-empty (exit 1 with a name, not a value) rather than echoing them.
  • Promtail mounts Docker socket. The compose already documents the accepted risk: # a compromised Promtail has full daemon access. This is an accepted risk given single-operator context, but it means Promtail's container image should be pinned and Renovate-managed — it already is (grafana/promtail:3.4.2). Maintain this discipline after migration.

Recommendations

  1. Before recreating Grafana: verify GRAFANA_ADMIN_PASSWORD is set to a non-default value in /opt/familienarchiv/.env.staging. Add this check to the migration runbook.
  2. Pin GlitchTip: glitchtip/glitchtip:v4 → glitchtip/glitchtip:v4.0.6 (or current patch). Add to renovate.json under Docker image rules so future patch bumps create PRs.
  3. Deploy script secret hygiene: do not use set -x globally. Validate required keys with [ -n "${GLITCHTIP_SECRET_KEY}" ] || { echo "GLITCHTIP_SECRET_KEY is required"; exit 1; } — log the key name, never the value.
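The pin itself is a one-line compose change. Sketch only — the service name is assumed, and v4.0.6 is the example patch named above; use the current patch release at migration time:

```yaml
# Sketch: service name assumed; pin to a concrete patch tag so Renovate
# proposes bumps instead of Docker silently pulling new 4.x releases.
services:
  glitchtip:
    image: glitchtip/glitchtip:v4.0.6  # was: glitchtip/glitchtip:v4 (floating)
```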
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Observations

  • No CI job verifies observability stack health. The nightly and release workflows don't include a step that checks docker ps --filter name=obs-loki or pings /ready on Loki. Tonight's outage was discovered when the nightly CI failed because it depended on Loki — not because a dedicated health check fired. The fix should include a proactive health check step, not just a reactive one.
  • The core acceptance criterion "CI workspace wipe has zero effect" is not mechanically verifiable as written. There's no test that simulates a workspace wipe and confirms the obs stack survives. This criterion will be confirmed by inspection at migration time, but can silently regress if, for example, someone later adds a bind mount that points back into the workspace.
  • Named data volumes are preserved automatically. docker compose down does not delete named volumes by default, and changing the working directory of a compose project doesn't migrate volumes — they stay under their original Docker volume names. This is correct behavior here (existing loki_data, grafana_data etc. will be found by the new compose invocation because the volume names match), but it should be explicitly validated during the migration, not assumed.
  • No rollback procedure defined. If the migration fails halfway (e.g. docker compose up succeeds for some services but not others), the issue doesn't describe how to recover. The old compose is gone from the workspace; a rollback would require recreating it from git.
  • Missing post-migration smoke test definition. The deploy script acceptance criterion says it "handles syncing" but doesn't define what "working" looks like after it runs.

Recommendations

  1. Add a health-check step to the nightly workflow: after its startup phase, add docker inspect obs-loki --format '{{.State.Health.Status}}' returning healthy as a required pre-condition. This catches the problem before dependent services fail.
  2. Add to acceptance criteria: "All named data volumes (loki_data, grafana_data, prometheus_data, tempo_data) verified non-empty after migration." Docker volume inspect can confirm this.
  3. Define the post-deploy smoke test explicitly in the issue: the deploy script should exit non-zero if any obs-* container is not healthy within 60 seconds of docker compose up.
  4. Document the rollback: "If migration fails, restore using git show HEAD:docker-compose.observability.yml > /srv/gitea-workspace/.../docker-compose.observability.yml and recreate from there" — or document that the git-tracked file in /opt/familienarchiv/ IS the rollback artifact.
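The 60-second health gate in recommendation 3 could be sketched as follows. Function names are mine; the `docker inspect` format string is the one quoted above. The status lookup is isolated so it can be swapped out for testing:

```shell
#!/usr/bin/env bash
# Hedged sketch of the post-deploy smoke test; names are illustrative.

# Report a container's healthcheck status, or "unknown" if it doesn't exist.
status_of() {
  docker inspect "$1" --format '{{.State.Health.Status}}' 2>/dev/null \
    || echo unknown
}

# Poll until the container reports healthy, or fail after the timeout.
wait_healthy() {
  name="$1"; timeout="${2:-60}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    [ "$(status_of "$name")" = "healthy" ] && return 0
    sleep 2
    waited=$((waited + 2))
  done
  echo "ERROR: $name not healthy after ${timeout}s" >&2
  return 1
}

# Usage sketch:
#   for c in obs-loki obs-grafana obs-promtail obs-tempo; do
#     wait_healthy "$c" 60 || exit 1
#   done
```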
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • No application code changes are needed for this issue — it's pure infrastructure. But per the documentation table in CLAUDE.md, this change touches "New Docker service / infrastructure component location" which requires updating docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md. These updates should be part of the same PR that closes this issue.
  • The scripts/ directory has no deployment scripts — only data-prep utilities. The new scripts/deploy-observability.sh (or equivalent) will be the first deploy script added. Its naming and placement should follow the existing convention: lowercase, hyphenated, .sh suffix.
  • The CLAUDE.md ## Infrastructure section points readers to docs/DEPLOYMENT.md for operational details. That file should gain a section on the obs stack after this migration so future sessions have a reliable reference.

Recommendations

  • Include the docs/DEPLOYMENT.md and docs/architecture/c4/l2-containers.puml updates in the PR that closes this issue — don't leave them as follow-up. The docs table makes these mandatory, not optional.
  • No concerns from me beyond the docs. The operational scope is Tobias's territory; I'm flagging the documentation obligation.
Author
Owner

🎨 Leonie Voss — UX Designer & Accessibility Strategist

No concerns from my angle.

This issue is entirely server-side infrastructure — no user-facing routes, Svelte components, or UI states are affected. The observability stack (Grafana, Loki, Prometheus) is an internal operator tool, not part of the family-facing application. Moving its compose file to a different directory on the server has zero impact on the UI, accessibility, or responsiveness of the Familienarchiv frontend.

I reviewed the issue to confirm no frontend or Grafana UI provisioning changes are implied — provisioning files (datasources.yml, dashboards.yml) are being moved as-is, not modified. Grafana dashboard JSON files (loki-logs.json, node-exporter-full.json, spring-boot-observability.json) move unchanged.

Author
Owner

## 📋 Elicit — Requirements Engineer

### Observations

- **"Syncing" is ambiguous in AC #5.** The criterion "a deploy script or Makefile target handles syncing infra/observability/ from git to the server" leaves open how the sync happens. Two meaningfully different models: (A) **CI-push**: the release workflow rsyncs config files from the repo to the server as part of deployment, then runs `docker compose up`; (B) **server-pull**: a script on the server runs `git pull` and then `docker compose up`. These have different failure modes, auth requirements, and dependency on network topology. The AC should name the mechanism.
- **AC #2 covers paths but not the full scope of bind mounts.** The compose has bind-mounted config in five locations: `loki/loki-config.yml`, `prometheus/prometheus.yml`, `promtail/promtail-config.yml`, `tempo/tempo.yml`, and `grafana/provisioning/` (a directory, not a file). AC #2 says "bind-mount source paths … point to /opt/familienarchiv/infra/observability/…" — this is correct but should be explicit that the grafana provisioning directory is included.
- **Volume preservation is not covered.** The migration will recreate containers (they need `docker rm` to pick up the new compose working directory). Named volumes (`loki_data`, `grafana_data`, `prometheus_data`) must be preserved — 30 days of log history and Grafana dashboard state live in them. This is a "done" condition that's missing from the ACs.
- **No acceptable downtime window defined.** During migration, `obs-*` containers will be briefly offline. If anything in the nightly CI depends on Loki being healthy (it does — tonight's failure proves this), the migration should be timed to avoid CI runs. This is an operational constraint the issue doesn't name.
- **AC #6 "no manual docker rm + docker run required after workspace wipe" is the right core criterion.** It's machine-verifiable by simulating a wipe and checking container labels. Good.

### Recommendations

1. **Resolve the "syncing" ambiguity** in AC #5: specify whether this is a CI-push rsync step in the release workflow, or a pull-on-server script. Given the existing release workflow structure, CI-push (rsync during deploy) is more consistent with the current pattern.
2. **Add AC:** "All named data volumes (`loki_data`, `grafana_data`, `prometheus_data`, `tempo_data`, `promtail_positions`, `glitchtip_data`) verified non-empty after migration — no historical data lost."
3. **Add AC:** "All obs-* containers are `healthy` (per Docker healthcheck) within 90 seconds of running the deploy script."
4. **Add operational note:** migration should be performed outside the nightly CI window to avoid false positives in the CI health check.
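Recommendation #2 (volume preservation) can be made machine-checkable. A minimal sketch, with stated assumptions: the bare volume names are taken from this comment (Compose usually prefixes them with the project name, so adjust accordingly), and the `RUN_VOLUME_CHECKS` gate is hypothetical:

```shell
#!/usr/bin/env sh
# Post-migration check: fail if any expected named volume is missing or empty.
set -eu

# Return 0 if the directory exists and contains at least one entry.
dir_nonempty() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

check_volume() {
  # Resolve the volume's host mountpoint, then verify it holds data.
  mp=$(docker volume inspect --format '{{ .Mountpoint }}' "$1") || return 1
  dir_nonempty "$mp" || { echo "EMPTY: $1" >&2; return 1; }
  echo "OK: $1"
}

# Gated so the script is a no-op on machines without the Docker daemon.
if [ "${RUN_VOLUME_CHECKS:-0}" = "1" ]; then
  for v in loki_data grafana_data prometheus_data tempo_data promtail_positions glitchtip_data; do
    check_volume "$v"
  done
fi
```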
Author
Owner

## 🗳️ Decision Queue — Action Required

_1 decision needs your input before implementation starts._

### Infrastructure

- **How should config changes in `infra/observability/` reach the server?** Two models: (A) **CI-push** — the release workflow rsyncs config files to `/opt/familienarchiv/infra/observability/` as part of deployment, then runs `docker compose up`; (B) **server-pull** — a script on the server does `git pull /opt/familienarchiv` and then `docker compose up`. Model A is consistent with how the current release workflow already copies built artefacts to the server; model B requires the server to have git credentials and a git remote configured, but avoids an extra rsync step. Model A keeps the server stateless (no git checkout on the server); model B means the server IS the git checkout and changes are always in sync. _(Raised by: Elicit)_
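For concreteness, model A (CI-push) could be sketched as below. `DEPLOY_HOST` is a hypothetical placeholder, and the real release workflow's SSH setup is assumed to be reusable; the sketch builds and prints the two commands rather than executing them:

```shell
#!/usr/bin/env sh
set -eu

DEPLOY_HOST="${DEPLOY_HOST:-deploy@server.invalid}"
APP_DIR="/opt/familienarchiv"

# --delete keeps the server tree in lockstep with git, so stale or
# auto-created paths (like tonight's directory-instead-of-file) cannot linger.
sync_cmd="rsync -az --delete infra/observability/ ${DEPLOY_HOST}:${APP_DIR}/infra/observability/"
up_cmd="ssh ${DEPLOY_HOST} 'cd ${APP_DIR} && docker compose -f docker-compose.observability.yml up -d'"

echo "$sync_cmd"
echo "$up_cmd"
# In the real workflow these would be executed, not echoed.
```

The `--delete` flag is the deliberate design choice here: it makes the server directory a pure mirror of git, which is exactly the property whose absence caused tonight's outage.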
Author
Owner

## 🔍 Additional findings from tonight's recovery — update scope

While attempting to bring up the full observability stack after fixing Loki, four more issues were found. All should be addressed in the implementation of this issue.

### 1. All five config bind mounts were affected, not just Loki

When Docker auto-created directories for missing bind-mount sources, the following paths became directories instead of files on the workspace host:

- `infra/observability/prometheus/prometheus.yml` → directory
- `infra/observability/tempo/tempo.yml` → directory
- `infra/observability/promtail/` → directory missing entirely

`obs-tempo` was crash-looping for the same reason as `obs-loki`. `obs-prometheus`, `obs-promtail`, `obs-grafana`, `obs-glitchtip` were all in `Created` state (not running) because their config files were also missing or corrupted.

**Impact on scope:** the deploy script must rsync the entire `infra/observability/` tree, not just the loki subdirectory.

### 2. `glitchtip/glitchtip:v4` does not exist on Docker Hub

The compose image tag `glitchtip/glitchtip:v4` fails to pull — `not found`. The running worker was using `6.1.6`. The compose file must be updated to a pinned tag before the migration is run (this also addresses Nora's supply-chain concern).

**Fix:** change `glitchtip/glitchtip:v4` → `glitchtip/glitchtip:6.1.6` in `docker-compose.observability.yml`. Add this image to Renovate.

### 3. `.env.staging` was lost with the workspace wipe — secrets must be documented

The env file that GlitchTip and Grafana read (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `POSTGRES_USER`, `POSTGRES_PASSWORD`) was stored only in the workspace and was deleted when CI wiped it. It had to be reconstructed by extracting values from the running containers' environment variables.

**Impact on scope:** the observability-specific env vars (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_DOMAIN`, `PORT_GRAFANA`, `PORT_PROMETHEUS`) must be merged into the persistent `/opt/familienarchiv/.env` file (shared with the main app stack). The `.env.example` should be updated to include these keys with placeholder values.
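A possible shape for the `.env.example` additions — all values are placeholders, and the port values are illustrative (pick free loopback ports on the target server):

```
# Observability stack — copy real values into /opt/familienarchiv/.env
GLITCHTIP_SECRET_KEY=change-me
GLITCHTIP_DOMAIN=https://glitchtip.example.invalid
GRAFANA_ADMIN_PASSWORD=change-me
PORT_GRAFANA=3003
PORT_PROMETHEUS=9090
```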

### 4. GlitchTip superuser has never been created

The `obs-glitchtip-db-init` service creates the PostgreSQL database, but does not create the GlitchTip Django superuser. The command `docker exec obs-glitchtip ./manage.py createsuperuser` must be run manually once after first startup. This should be documented in `docs/DEPLOYMENT.md` as a one-time post-migration step.

### Updated acceptance criteria (additions)

- [ ] `glitchtip/glitchtip:v4` updated to a pinned patch version in `docker-compose.observability.yml`
- [ ] Observability env vars (`GLITCHTIP_SECRET_KEY`, `GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_DOMAIN`, port vars) added to `.env.example` with placeholder values
- [ ] `docs/DEPLOYMENT.md` documents the one-time `createsuperuser` step for GlitchTip
- [ ] All `obs-*` containers verified healthy after the migration, not just `obs-loki`
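The "all obs-* containers verified healthy" criterion pairs with Elicit's 90-second AC. A polling sketch, with stated assumptions: container names are from this issue, and the `RUN_HEALTH_CHECKS` gate is hypothetical:

```shell
#!/usr/bin/env sh
set -eu

# Poll a container's Docker health status until healthy or deadline (seconds).
wait_healthy() {
  name=$1; deadline=$2; status=unknown
  while [ "$deadline" -gt 0 ]; do
    status=$(docker inspect --format '{{ .State.Health.Status }}' "$name" 2>/dev/null || echo missing)
    [ "$status" = healthy ] && return 0
    sleep 3
    deadline=$((deadline - 3))
  done
  echo "TIMEOUT: $name (last status: $status)" >&2
  return 1
}

# Gated so the script is a no-op without the Docker daemon.
if [ "${RUN_HEALTH_CHECKS:-0}" = "1" ]; then
  for c in obs-loki obs-prometheus obs-tempo obs-grafana obs-redis; do
    wait_healthy "$c" 90
  done
fi
```

Note this only covers containers that define a Docker healthcheck; services that merely show `up` (promtail, cadvisor, node-exporter) would need a separate `docker inspect --format '{{ .State.Status }}'` check.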
Author
Owner

## 🔍 Second round of findings — full migration attempt tonight

After the first recovery comment, a full `docker compose up -d` was attempted from `/opt/familienarchiv/`. Five more issues surfaced. All are now resolved in the local compose/config files but not yet committed.


### 5. Tempo 2.7.2 rejects `metrics_generator.processors` at the top level

`obs-tempo` crash-looped with:

```
failed to parse configFile /etc/tempo.yml: field processors not found in type generator.Config
```

The `processors` list was removed from the top-level `metrics_generator` block in Tempo 2.x. It is only valid under `overrides.defaults.metrics_generator.processors`, which was already present in the config.

**Fix (applied locally):** removed lines 39–41 from `infra/observability/tempo/tempo.yml`:

```yaml
# removed:
  processors:
    - service-graphs
    - span-metrics
```
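For reference, a sketch of the shape Tempo 2.x does accept — the `storage` sub-key is illustrative, not taken from the actual `tempo.yml`; only the `overrides` placement of `processors` is the point:

```yaml
metrics_generator:
  # no `processors` key at this level in Tempo 2.x
  storage:
    path: /tmp/tempo/generator/wal

overrides:
  defaults:
    metrics_generator:
      # processors are enabled here instead
      processors:
        - service-graphs
        - span-metrics
```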

### 6. `PORT_GRAFANA=3001` conflicts with the staging frontend

Grafana's bind failed:

```
Bind for 127.0.0.1:3001 failed: port is already allocated
```

Port 3001 is occupied by `archiv-staging-frontend-1`. Ports in use on `127.0.0.1`: 2019 (Caddy admin), 3001 (staging frontend), 3005 (Gitea), 8081 (staging backend), 9090 (Prometheus).

**Fix (applied in `.env`):** `PORT_GRAFANA=3003`. The compose default (`:-3001`) should also be updated to `:-3003` to avoid a future footgun if the env var is missing.

**Note for Caddy config:** the nightly build generates a Caddy entry for Grafana. If it references port 3001, it must be updated to 3003.
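The updated default would look roughly like this in the compose file — the service name is assumed, and `3000` is Grafana's container-internal default port:

```yaml
services:
  grafana:
    ports:
      # loopback-only bind; host default moved off 3001 (taken by the staging frontend)
      - "127.0.0.1:${PORT_GRAFANA:-3003}:3000"
```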


### 7. `archive-db` hostname is not resolvable — only the staging stack is running

`obs-glitchtip-db-init` failed with:

```
psql: error: could not translate host name "archive-db" to address: Try again
```

No production stack is currently running on the server — only `archiv-staging-*`. The database container is `archiv-staging-db-1`, and `archive-db` does not exist on `archiv-net`. Verified via `nslookup archive-db` from inside `archiv-net`.

**Fix (applied locally):** `DATABASE_URL` and the `db-init` command in `docker-compose.observability.yml` now use `${POSTGRES_HOST:-archive-db}`. The `.env` file sets `POSTGRES_HOST=archiv-staging-db-1` for the current server state. When the production stack is running, this var can be unset (defaulting back to `archive-db`).

**Long-term:** the deploy script should set `POSTGRES_HOST` based on whether the production or staging DB is the target.
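A sketch of how the variable could sit in the compose file — database name and port are assumptions, not values from the actual compose:

```yaml
services:
  glitchtip:
    environment:
      # host defaults to the production DB; staging overrides POSTGRES_HOST via .env
      DATABASE_URL: postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST:-archive-db}:5432/glitchtip
```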


### 8. `$` characters in `.env` values must be escaped as `$$`

`GRAFANA_ADMIN_PASSWORD` contains `$` characters. Docker Compose interpolates `$VAR` references inside `.env` values, silently replacing undefined ones with empty strings. The value `me30g@b$Nt$Z2g` became `me30g@b` after Compose read the file.

**Fix:** use `$$` in `.env` for a literal `$`: `GRAFANA_ADMIN_PASSWORD=me30g@b$$Nt$$Z2g`. Docker Compose renders `$$` → `$` when passing the value to the container.

**Impact on deploy script and `.env.example`:** the example file and any generation script must document this escaping rule. Passwords with `$` are otherwise silently truncated with no error — a subtle and dangerous failure mode.
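If a generation script ever writes secrets into `.env`, the escaping can be centralized in a tiny helper (hypothetical, not an existing repo script):

```shell
#!/usr/bin/env sh
set -eu

# Escape a literal value for a Docker Compose .env file: each `$`
# becomes `$$` so Compose interpolation cannot mangle it.
escape_env_value() {
  printf '%s' "$1" | sed 's/\$/$$/g'
}

escape_env_value 'me30g@b$Nt$Z2g'; echo
# → me30g@b$$Nt$$Z2g
```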


### 9. GlitchTip required 104 unapplied migrations before `createsuperuser`

Running `createsuperuser` without first migrating produced a warning about 104 unapplied migrations and refused to create the user.

**Fix:** run `./manage.py migrate` before `./manage.py createsuperuser`. Both are one-time post-migration steps. Applied tonight:

```bash
docker exec obs-glitchtip ./manage.py migrate
```

`createsuperuser` requires an interactive TTY:

```bash
docker exec -it obs-glitchtip ./manage.py createsuperuser
```

Without `-it`, the command exits silently. This must be documented in `docs/DEPLOYMENT.md` as a one-time step.


### Current state after tonight's fixes

| Container | Status |
|---|---|
| obs-loki | ✅ healthy |
| obs-prometheus | ✅ healthy |
| obs-tempo | ✅ healthy |
| obs-grafana | ✅ healthy |
| obs-promtail | ✅ up |
| obs-redis | ✅ healthy |
| obs-glitchtip | ✅ up (migrations applied) |
| obs-glitchtip-worker | ✅ up |
| obs-cadvisor | ✅ up |
| obs-node-exporter | ✅ up |

All containers are now managed from `/opt/familienarchiv/` — the permanent migration from the CI workspace is complete for this session.

### Files changed locally (not yet committed)

- `docker-compose.observability.yml` — GlitchTip `v4` → `6.1.6`, `POSTGRES_HOST` variable, `$$` escaping note
- `infra/observability/tempo/tempo.yml` — removed top-level `metrics_generator.processors`
- `/opt/familienarchiv/.env` — created on server (not in git; contains secrets)
Labels: P1-high, devops
Reference: marcel/familienarchiv#601