fix(ci): sync observability configs to host before docker compose up #599

Merged
marcel merged 5 commits from fix/issue-598-obs-dood-bind-mounts into main 2026-05-15 19:51:05 +02:00
Owner

Summary

Fixes #598 — observability stack fails to start in the DooD CI runner because relative bind mounts resolve to paths the host daemon cannot see.

  • docker-compose.observability.yml: All 5 config bind mounts now use ${OBS_CONFIG_DIR:-./infra/observability} as the base path — defaults to the relative path for local docker compose up, configurable via env var in CI.
  • nightly.yml / release.yml: New "Sync observability configs to host" step copies the config tree from the job container's overlay2 filesystem (visible in the host mount namespace) to a stable host path using the existing nsenter/Alpine pattern. OBS_CONFIG_DIR is injected into the env file.
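Compose's ${VAR:-default} interpolation follows shell parameter-expansion semantics, so the fallback behavior of the new base path can be sanity-checked in any shell (the CI value below is illustrative, not the real host path):

```shell
# Shell-style defaulting, the same semantics Compose applies to ${OBS_CONFIG_DIR:-...}.
unset OBS_CONFIG_DIR
echo "${OBS_CONFIG_DIR:-./infra/observability}"   # local case: falls back to the relative path

OBS_CONFIG_DIR=/srv/obs-configs                   # hypothetical CI value from the env file
echo "${OBS_CONFIG_DIR:-./infra/observability}"   # CI case: the injected host path wins
```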

Root cause

runner-config.yaml only mounts /var/run/docker.sock — no workspace directory is shared with the host. When Compose resolves ./infra/observability/prometheus/prometheus.yml as a bind mount source, it passes the absolute path to the host daemon. The host has no such path, auto-creates an empty directory there, and the container fails with:

error mounting "…/prometheus/prometheus.yml" to rootfs at "/etc/prometheus/prometheus.yml": not a directory
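The resolution step can be sketched in plain shell (paths hypothetical; the real checkout location depends on the runner):

```shell
# Inside the job container, Compose resolves the relative source against the
# project directory and passes the result to the daemon over the socket.
PROJECT_DIR=/workspace/familienarchiv              # hypothetical checkout path
REL_SOURCE=./infra/observability/prometheus/prometheus.yml

ABS_SOURCE="${PROJECT_DIR}/${REL_SOURCE#./}"       # strip leading ./ and prefix
echo "$ABS_SOURCE"
# The HOST daemon receives this absolute path. It does not exist on the host,
# so Docker auto-creates it as an empty directory and the file mount fails.
```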

Sync mechanism

OVERLAY=$(docker inspect "$(hostname)" --format '{{.GraphDriver.Data.MergedDir}}')
SRC="${OVERLAY}$(pwd)/infra/observability"
docker run --rm --privileged --pid=host alpine:3.21@sha256:… \
  sh -c "nsenter -t 1 -m -- sh -c 'mkdir -p /srv/…/obs-configs && cp -r \"${SRC}/.\" /srv/…/obs-configs/'"

The overlay2 MergedDir is the job container's complete filesystem as seen from the host mount namespace — no Docker socket tricks, no custom images. Same Alpine digest already pinned in the Caddy reload step.
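In other words, the host-visible location of any in-container path is simply MergedDir prepended to it, which is exactly the string concatenation SRC performs above (paths below are hypothetical):

```shell
MERGED=/var/lib/docker/overlay2/abc123/merged      # hypothetical MergedDir from docker inspect
IN_CONTAINER=/workspace/familienarchiv/infra/observability

# Same concatenation the sync step uses to build SRC:
HOST_VISIBLE="${MERGED}${IN_CONTAINER}"
echo "$HOST_VISIBLE"
```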

Test plan

  • Nightly CI run starts the observability stack without "not a directory" errors
  • Local docker compose -f docker-compose.observability.yml up still works (no OBS_CONFIG_DIR needed)
  • Production release deploy starts observability stack successfully

🤖 Generated with Claude Code

marcel added 1 commit 2026-05-15 19:03:45 +02:00
fix(ci): sync observability configs to host before docker compose up (#598)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m26s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m40s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
1fc47888d5
DooD runner only shares /var/run/docker.sock — no workspace directory is
mapped to the host daemon. Relative bind mounts in
docker-compose.observability.yml resolved to paths that didn't exist on
the host; Docker auto-created directories in their place, causing
'not a directory' mount failures for all five config files.

Fix:
- docker-compose.observability.yml: replace hardcoded ./infra/observability/
  prefix with ${OBS_CONFIG_DIR:-./infra/observability} so the path is
  configurable while remaining backwards-compatible for local use.
- nightly.yml / release.yml: add a 'Sync observability configs to host'
  step that finds the job container's overlay2 MergedDir (the container's
  full filesystem as seen from the host mount namespace), then uses the
  existing nsenter/alpine pattern to cp the config tree into a stable host
  path (/srv/familienarchiv-{staging,production}/obs-configs).
  OBS_CONFIG_DIR is injected into the env file so Compose picks it up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

🛠️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

This is a pragmatic and well-explained fix for a real DooD limitation. The overlay2 trick is non-obvious but the inline comments explain the why clearly, which is exactly what infrastructure code should do. Here's my full assessment.


What's correct

  • Step ordering is right. Sync observability configs to host is inserted before Start observability stack in both nightly.yml and release.yml. The env file gets OBS_CONFIG_DIR set before the main stack starts, and the configs land on the host before they're bind-mounted. The sequence is sound.
  • Alpine image is SHA-pinned. alpine:3.21@sha256:48b0309… is the same pinned digest already used in the Caddy reload step — good consistency, Renovate will pick it up.
  • docker-compose.observability.yml fallback default is correct. ${OBS_CONFIG_DIR:-./infra/observability} means local docker compose up requires zero config changes. The CI path is injected via env file; local development is unchanged.
  • Comments justify the non-obvious code. The block comment on the step explains the full causal chain (DooD → socket-only mount → host path resolution → empty dir → "not a directory"). A new team member can follow this.

Concerns

Suggestion — Overlay2 driver assumption (non-blocker for a single known runner)

The script assumes the runner's Docker daemon uses the overlay2 storage driver:

OVERLAY=$(docker inspect "$(hostname)" --format '{{.GraphDriver.Data.MergedDir}}')

If the runner is ever migrated to a VFS or devicemapper-backed daemon, MergedDir will be empty or absent and SRC will silently resolve to a garbage path — configs won't sync, and the observability stack will fail with the original "not a directory" error.

Since this is a single known self-hosted runner on a fixed VPS, this is acceptable as-is. But consider adding a guard:

OVERLAY=$(docker inspect "$(hostname)" --format '{{.GraphDriver.Data.MergedDir}}')
[ -d "$OVERLAY" ] || { echo "ERROR: overlay2 MergedDir not found — wrong storage driver?"; exit 1; }

Suggestion — Stale config files accumulate on the host (minor)

cp -r "${SRC}/." /srv/…/obs-configs/ copies INTO the existing directory. If a config file is renamed or removed from the repo, the old filename survives on the host forever. The bind mount will then serve the new file and retain the orphaned old file (which observability services ignore, but it's untidy).
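The accumulation is easy to demonstrate with throwaway directories (file names hypothetical):

```shell
SRC=$(mktemp -d)
DEST=$(mktemp -d)
touch "$DEST/renamed-away.yml"      # leftover from an earlier sync
touch "$SRC/prometheus.yml"         # the only config currently in the repo

cp -r "$SRC/." "$DEST/"             # copies INTO dest; never deletes
ls "$DEST"                          # both files present: the stale one survives
```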

A rsync --delete would be cleaner here, but that requires rsync in the Alpine image. Alternatively:

rm -rf /srv/familienarchiv-staging/obs-configs && mkdir -p /srv/familienarchiv-staging/obs-configs && cp -r "${SRC}/." /srv/familienarchiv-staging/obs-configs/

Suggestion — No post-sync verification before observability stack starts

The sync step has no exit-code check beyond the cp command itself. Adding a quick sanity check (e.g., verify Prometheus config landed) gives a faster, more descriptive failure if the overlay path is wrong:

ls "${DEST}/prometheus/prometheus.yml" || { echo "ERROR: sync failed — prometheus config missing"; exit 1; }

Observation — Hardcoded paths duplicated between workflow and env file

/srv/familienarchiv-staging/obs-configs appears twice in nightly.yml: once in the OBS_CONFIG_DIR=... line in the env file write step, and once inside the nsenter command. These must stay in sync manually. Not a blocker, but if the path ever changes, it's easy to update one and miss the other. Consider a YAML anchor or a local shell variable at the top of the step to DRY it up — but I recognize Gitea Actions YAML is limited here.
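One way to DRY it inside the step body, assuming the whole step is a single run: script (DEST and ENV_FILE are hypothetical names, not what the workflow currently uses):

```shell
DEST=/srv/familienarchiv-staging/obs-configs        # single source of truth for the path
ENV_FILE=$(mktemp)                                  # stand-in for the real env file

printf 'OBS_CONFIG_DIR=%s\n' "$DEST" >> "$ENV_FILE"
# Reuse the same variable when building the nsenter command string:
SYNC_CMD="mkdir -p ${DEST} && cp -r \"\${SRC}/.\" ${DEST}/"
grep "^OBS_CONFIG_DIR=" "$ENV_FILE"
echo "$SYNC_CMD"
```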


Summary

The fix is correct and the comments are good. The overlay2 guard and a post-sync verification are worth adding for robustness before this sees a live production run. The stale-file and path-duplication issues are minor polish items. Approving — but I'd feel better with the verification check before merging to main.

Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

This PR introduces a privileged Docker container operation as part of the CI pipeline. It's solving a real DooD bind-mount problem, but the privilege model deserves explicit acknowledgment. Here's the security read.


What's good

  • Alpine digest is pinned. alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d prevents a supply-chain substitution attack — you can't pull a malicious image by swapping the tag. This is the right approach.
  • Variables are derived from controlled sources. OVERLAY comes from docker inspect "$(hostname)", and $(pwd) is the checked-out workspace. Neither is user-controlled input. No injection risk in the current context.
  • The operation is scoped. The cp -r targets a specific obs-configs directory. The intent is file copy, not arbitrary code execution.

Concern — --privileged --pid=host is a significant privilege level (blocker-class observation, accepted risk)

docker run --rm --privileged --pid=host \
  alpine:3.21@sha256:… \
  sh -c "nsenter -t 1 -m -- sh -c 'mkdir -p /srv/… && cp -r …'"

This container has:

  • --privileged: full capability set, including CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_ADMIN, device access, etc.
  • --pid=host: can kill -9 any process on the host, attach debuggers, read /proc/*/mem
  • nsenter -t 1 -m: enters the host's mount namespace — can read/write anywhere on the host filesystem

This is essentially root shell access on the host machine. The current cp command is benign. But if the CI pipeline were compromised (e.g., via a malicious PR, a compromised secret, or a supply-chain attack on an earlier step), this step could be repurposed to:

  • Write an SSH key to /root/.ssh/authorized_keys
  • Modify /etc/cron.d/ for persistence
  • Exfiltrate production secrets from the .env.production file

This is an accepted architectural risk of DooD — the socket is already shared, and anyone who can run docker run already has effective root on the host. I'm not flagging this as a new vulnerability introduced by this PR. I'm flagging it for explicit acknowledgment: this CI runner is a high-value target and the security posture depends entirely on the strength of the Gitea secrets and the runner's network isolation.

Recommendation: Add a comment in the workflow file (similar to the existing bind-mount comment in docker-compose.observability.yml) documenting the threat model explicitly:

# Security note: --privileged --pid=host grants effective root on the host.
# This is an accepted consequence of DooD; the docker socket already carries
# equivalent privilege. The operation is scoped to copying config files from the
# overlay2 merged directory to a stable host path. If CI secrets or the runner
# are compromised, this step is the highest-impact escalation point in the pipeline.

Concern — docker inspect "$(hostname)" relies on container naming

The "$(hostname)" expansion assumes the job container's hostname matches its Docker container name. This is the Gitea Actions default but is not guaranteed if the runner configuration changes (e.g., custom hostname, network alias). If docker inspect fails silently, OVERLAY will be empty, SRC becomes $(pwd)/infra/observability (a bare workspace path without the overlay prefix), and the nsenter command will try to copy from a path that doesn't exist in the host mount namespace.

A failed copy would not be caught until Start observability stack fails. Adding the exit guard Tobias suggested ([ -d "$OVERLAY" ] || exit 1) provides a faster, clearer failure mode.
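The failure mode is reproducible in isolation by simulating docker inspect returning nothing:

```shell
OVERLAY=""                                   # simulates a failed/empty docker inspect
SRC="${OVERLAY}$(pwd)/infra/observability"   # same concatenation as the sync step

case "$SRC" in
  /var/lib/docker/*) echo "overlay-prefixed host path" ;;
  *) echo "bare workspace path: nothing at this location in the host namespace" ;;
esac
```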


Minor — No set -e or set -o pipefail in the shell script

run: |
  OVERLAY=$(docker inspect "$(hostname)" --format '{{.GraphDriver.Data.MergedDir}}')
  SRC="${OVERLAY}$(pwd)/infra/observability"
  docker run ...

Gitea Actions runs each run: step with bash -e by default, so the first failing command will abort the step. This is fine as-is, but the docker run exit code depends on the sh -c '...' inside — if nsenter exits non-zero (e.g., permission denied in a different environment), bash will surface it. No action required, just confirming the default behavior is adequate here.
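The abort-on-first-failure behavior is straightforward to confirm, assuming the step really does run under bash -e:

```shell
# Under -e, the shell exits at the first failing command; later commands never run.
bash -ec 'false; echo "unreachable"' || echo "step aborted at the failing command"
```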


Summary

The privileged container is the dominant risk in this PR — not a new vulnerability, but an explicit escalation of the CI attack surface that deserves documented acknowledgment. The Alpine image is properly pinned, the variables are not user-controlled, and the operation is scoped to file copy. Approving as a known-and-accepted risk with the recommendation to add the threat-model comment.

Author
Owner

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: Approved

No application code changed. This is pure CI infrastructure — my lens here is code quality within the workflow files and the compose change.


What's good

  • Comments explain the why, not the what. The block comment on the new step documents the causal chain (DooD socket-only mount → host path resolution → empty dir → mount failure). That's exactly the right level of comment — a future reader can understand why this workaround exists without re-deriving the DooD filesystem model from scratch.
  • docker-compose.observability.yml change is minimal and correct. Five bind mounts updated to ${OBS_CONFIG_DIR:-./infra/observability}/.... The fallback default means local docker compose up works without any new env var. The change is mechanical and scoped.
  • The release.yml comment cross-references nightly.yml rather than duplicating the full rationale. That's DRY at the documentation level: one authoritative explanation, one pointer.

Minor observation — Shell variable interpolation in the sh -c string

sh -c "nsenter -t 1 -m -- sh -c 'mkdir -p /srv/familienarchiv-staging/obs-configs && cp -r \"${SRC}/.\" /srv/familienarchiv-staging/obs-configs/'"

The escaped \"${SRC}/.\" expands ${SRC} in the outer shell (the runner) before passing it to the docker run command. This is intentional — SRC is set on the runner and needs to be embedded as a literal path inside the nsenter command. It works correctly. The nested quoting is ugly but there's no simpler way given the escaping requirements across two shell contexts.

No change needed — just confirming I read it and it's correct, not a bug.
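The expansion order can be verified without Docker (the SRC value below is hypothetical):

```shell
SRC=/var/lib/docker/overlay2/abc123/merged/workspace/infra/observability  # hypothetical

# The outer (runner) shell substitutes ${SRC} here, before docker run ever executes:
INNER="nsenter -t 1 -m -- sh -c 'mkdir -p /srv/obs && cp -r \"${SRC}/.\" /srv/obs/'"
echo "$INNER"
# The inner shells receive a literal path; nothing is left for them to expand.
```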


Summary

Nothing here touches application code, services, or domain logic. The workflow changes are well-commented, minimal, and correctly ordered. LGTM from my side.

Author
Owner

🏗️ Markus Keller (@mkeller) — Application Architect

Verdict: Approved

This is a CI pipeline fix with no architectural surface area. Let me run the doc-update checklist and confirm there's nothing missed.


Doc update checklist

  • New Flyway migration → docs/architecture/db/ diagrams: Not applicable
  • New backend package or domain module → CLAUDE.md package table + C4 diagram: Not applicable
  • New controller or service → Matching C4 diagram: Not applicable
  • New SvelteKit route → CLAUDE.md route table + C4 diagram: Not applicable
  • New Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md: See below
  • New external system integrated → docs/architecture/c4/l1-context.puml: Not applicable
  • Auth or upload flow change → Sequence diagrams: Not applicable
  • New ErrorCode or Permission value → CLAUDE.md + docs/ARCHITECTURE.md: Not applicable
  • New domain concept or term → docs/GLOSSARY.md: Not applicable
  • Architectural decision with lasting consequences → New ADR in docs/adr/: See below

Docker service / infrastructure component

No new Docker service is added. The observability stack (prometheus, loki, promtail, tempo, grafana) already exists. The change is to how config files reach those containers in CI — a deployment mechanism change, not a topology change. l2-containers.puml does not need to change.

However, docs/DEPLOYMENT.md likely documents how the observability stack is deployed. If it describes the bind mount paths or the assumption that relative paths work, that section should be updated to reflect the OBS_CONFIG_DIR mechanism. This is a minor concern — if DEPLOYMENT.md has no such detail, no update is needed.

ADR consideration

The DooD overlay2 pattern is a non-obvious architectural workaround with lasting consequences: any future CI workflow that uses bind mounts in an observability stack must follow the same pattern. This is a candidate for a lightweight ADR:

ADR-0XX: Observability config sync via overlay2 merged directory in DooD runners
Context: runner-config.yaml only shares /var/run/docker.sock — no workspace directory.
Decision: Resolve bind-mount sources via container's overlay2 MergedDir and nsenter into host mount namespace.
Alternatives: (a) Add workspace directory to runner mounts — rejected: requires runner operator access. (b) Build configs into a dedicated Docker image — rejected: increases build complexity.
Consequences: Any CI step using bind mounts for non-data volumes must use this pattern. Requires overlay2 storage driver.

This is a suggestion, not a blocker. The inline comment already captures the rationale. An ADR would just make it discoverable for future infrastructure work.


Summary

No blocking doc gaps. The change is correctly scoped to CI plumbing. The optional ADR and a DEPLOYMENT.md spot-check are the only follow-up items worth considering.

Author
Owner

🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

No application tests were added or changed, which is expected for a CI pipeline fix. My concerns are about observability and failure-detection within the workflow itself.


What's covered

  • The PR test plan checklist is present and reasonable:
    • Nightly CI run starts observability stack without "not a directory" errors
    • Local docker compose up still works
    • Production release starts observability stack successfully
  • The fix is self-testing in the sense that the CI run itself will either succeed or fail, providing fast feedback.
  • The env var fallback ${OBS_CONFIG_DIR:-./infra/observability} preserves the local dev testing path — no breakage of developer-side verification.
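
For reference, here is how that fallback pattern would read on one of the five bind mounts. This is a sketch, not the actual diff: the prometheus line is reconstructed from the mount target shown in the PR description's error message, and the `:ro` flag is an assumption.

```yaml
services:
  prometheus:
    volumes:
      # Local dev: OBS_CONFIG_DIR unset -> falls back to ./infra/observability
      # CI (v1 approach): OBS_CONFIG_DIR points at a host path the daemon can see
      - ${OBS_CONFIG_DIR:-./infra/observability}/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
```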

Concern — No verification step between sync and observability stack start (suggestion)

The "Sync observability configs to host" step runs a cp -r inside an nsenter shell. If the overlay2 path is wrong, SRC will be an invalid path and the cp will either silently copy nothing or fail with a non-zero exit code.

If cp exits non-zero, the step fails and CI stops. Good. But if SRC resolves to an existing but wrong directory (edge case), configs could be missing without a clear error message — and the failure would surface later as a cryptic "not a directory" or missing config error inside the observability container startup, which is harder to diagnose.

Suggestion: Add an explicit verification step (or a check at the end of the sync step) before Start observability stack:

# Verify sync succeeded for critical configs
for f in prometheus/prometheus.yml loki/loki-config.yml promtail/promtail-config.yml; do
  test -f "/srv/familienarchiv-staging/obs-configs/${f}" || { echo "ERROR: missing ${f} after sync"; exit 1; }
done

This turns a late, cryptic failure (container startup error) into an early, explicit failure (sync verification error). Five minutes saved per failed CI run.


Concern — No smoke test that observability stack is actually healthy after the deploy

The existing Start observability stack step starts the containers. But there's no assertion that Prometheus, Grafana, and Loki become healthy before the workflow continues. If a config file has a syntax error (e.g., from a future edit), the containers would start and then crash in the background.

Consider adding a health check step after starting the observability stack:

docker compose -f docker-compose.observability.yml ps --format json | \
  jq -e 'all(.[]; .Health == "healthy" or .State == "running")' || \
  { echo "Observability stack unhealthy"; docker compose -f docker-compose.observability.yml logs; exit 1; }

This is a suggestion, not a blocker for this PR — the PR is fixing a regression, not improving the test harness. But worth tracking.


Observation — Test plan item 3 is manual, not automated

Production release deploy starts observability stack successfully

This is verified by running the workflow, which requires a release trigger. There's no automated way to verify this in a PR CI run. Acceptable limitation given the DooD constraint — just noting it as a coverage gap.


Summary

The sync step will fail the workflow if cp exits non-zero, which is the right behavior. The missing piece is an explicit verification that the right files landed at the right paths before the observability stack attempts to mount them. That's a low-cost improvement that would make future failures dramatically easier to diagnose. Approving the fix as-is, with the verification suggestion as a follow-up.

Author
Owner

📋 Elicit — Requirements Engineer

Verdict: Approved

Reviewing this PR from a requirements completeness and traceability perspective.


Issue traceability

  • PR title references fix(ci) and body references Fixes #598. Issue linkage is present.
  • The root cause section in the PR description is precise and matches the fix: the DooD runner shares only /var/run/docker.sock, not the workspace directory, so relative bind mount paths fail on the host daemon.

Requirements coverage assessment

FR-1 (implicit): The CI pipeline must start the observability stack without errors when running under a DooD runner.
→ The PR addresses this via the overlay2 sync mechanism. Covered.

FR-2 (implicit): Local docker compose -f docker-compose.observability.yml up must continue to work without OBS_CONFIG_DIR set.
→ The ${OBS_CONFIG_DIR:-./infra/observability} fallback in docker-compose.observability.yml preserves this. Covered.
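
Compose interpolates `${OBS_CONFIG_DIR:-./infra/observability}` with shell-style parameter expansion, so the FR-2 fallback behaviour can be sanity-checked in any POSIX shell. A quick demo, using the staging path mentioned earlier in this thread as the override value:

```shell
# ${VAR:-default} expands to the default when VAR is unset or empty,
# which is exactly the local-dev case for OBS_CONFIG_DIR.
unset OBS_CONFIG_DIR
local_path="${OBS_CONFIG_DIR:-./infra/observability}"      # local dev: var unset

OBS_CONFIG_DIR=/srv/familienarchiv-staging/obs-configs     # CI: var injected
ci_path="${OBS_CONFIG_DIR:-./infra/observability}"

echo "local: $local_path"
echo "ci:    $ci_path"
```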

FR-3 (implicit): Both staging (nightly) and production (release) workflows must be fixed.
→ Both nightly.yml and release.yml receive the sync step and the OBS_CONFIG_DIR env injection. Covered.


Test plan quality

The three test plan items are correctly scoped and testable:

  1. Nightly CI run (automated, observable in CI logs)
  2. Local docker compose up (manually verifiable by the developer)
  3. Production release deploy (observable on the next release run)

Items are specific enough to verify — no "ensure it works" vagueness.


Open question — Non-overlay2 runner behavior

The fix assumes overlay2 storage driver. There is no fallback and no documented behavior if the driver changes. This is an implicit assumption that is not captured as a constraint in either the issue or the PR description. Worth adding to the issue or as a code comment:

Assumption: The self-hosted Gitea runner uses Docker with the overlay2 storage driver. This fix is invalid under VFS or other drivers.
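
If that assumption were made executable, a guard at the top of the sync step could fail fast instead of producing a cryptic error later. A sketch: the helper name is hypothetical, and in the workflow the driver string would come from `docker info --format '{{.Driver}}'`.

```shell
# Hypothetical guard: refuse to run the overlay2-based sync under any other
# storage driver, where GraphDriver.Data.MergedDir does not exist.
assert_overlay2() {
  if [ "$1" != "overlay2" ]; then
    echo "ERROR: storage driver '$1' is not overlay2; MergedDir sync cannot work" >&2
    return 1
  fi
}

assert_overlay2 "overlay2" && echo "driver ok"
# In the workflow: assert_overlay2 "$(docker info --format '{{.Driver}}')"
```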


Summary

Requirements are fully covered. Traceability from issue to fix is clear. The test plan is actionable. The only gap is the undocumented overlay2 assumption, which is a minor documentation concern, not a functional gap.

Author
Owner

🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead & Accessibility Strategist

Verdict: Approved

This PR makes no changes to the frontend, UI components, Svelte files, or any user-facing behavior. It is entirely contained within CI workflow files and the Docker Compose observability configuration.

I checked:

  • No .svelte files modified
  • No frontend routes, components, or CSS tokens changed
  • No i18n strings added or removed
  • No Tailwind classes or design tokens modified
  • No user-visible error messages or status indicators affected

The observability stack (Prometheus, Grafana, Loki, Tempo, Promtail) is internal infrastructure — it has no direct user interface surface in the Familienarchiv application itself.

LGTM from a UI/UX and accessibility perspective. Nothing to review here.

marcel added 3 commits 2026-05-15 19:39:06 +02:00
Set workdir_parent to /volume1/gitea-workspace so act_runner stores job
workspaces at a real NAS path. Mounting that path at the same absolute
location in job containers means $(pwd) inside any job container resolves
to a host path the daemon can find — no overlay2 tricks needed.

Prerequisite (NAS): mkdir -p /volume1/gitea-workspace and add
  - /volume1/gitea-workspace:/volume1/gitea-workspace
to the runner service volumes in gitea's docker-compose.yml, then restart
the runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
runner-config.yaml: correct path to /srv/gitea-workspace (VPS, not Synology).
docker-compose.observability.yml: revert 5 bind mounts to plain relative paths;
  OBS_CONFIG_DIR variable is no longer needed.
nightly.yml / release.yml: remove OBS_CONFIG_DIR env injection and the
  "Sync observability configs to host" step from both workflows.

With workdir_parent=/srv/gitea-workspace and an identical host<->container
bind mount, $(pwd) inside job containers resolves to a real host path the
daemon can find — no privileged container, no overlay2 inspection, no nsenter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(adr): add ADR-015 for DooD workspace bind-mount approach
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m36s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
15ef079eff
Documents the decision to use workdir_parent + identical host<->container
path instead of the overlay2 MergedDir sync that was in the initial fix.
Captures the alternatives (nsenter sync, image-baked configs, path mismatch)
and the operational consequences (prereq directory, out-of-band compose.yaml).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Implementation update — replaced overlay2 sync with workspace bind mount

After reviewer discussion, the initial overlay2/nsenter sync approach was replaced with a simpler fix that removes the --privileged --pid=host container entirely.

Root insight

The bind-mount resolution problem requires that $(pwd) inside a job container is a real host path. The overlay2 sync worked around this after the fact; the workspace bind mount prevents the problem at the source.

Changes (3 commits, 15ef079e)

fix(ci): configure workspace bind mount for DooD bind-mount resolution — runner-config.yaml

  • workdir_parent: /srv/gitea-workspace — act_runner now stores job workspaces at a real host path
  • /srv/gitea-workspace added to valid_volumes
  • options updated to mount /srv/gitea-workspace:/srv/gitea-workspace (identical path both sides — this is the key constraint)
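
The three settings above combine roughly as follows in act_runner's config. This is a sketch, not the full file: field names follow act_runner's `container` section, and the `docker.sock` mount shown in `options` is an assumption carried over from the existing DooD setup.

```yaml
container:
  # Job workspaces are created under this path INSIDE job containers.
  workdir_parent: /srv/gitea-workspace
  # Whitelist of host paths that workflow steps may reference as volumes.
  valid_volumes:
    - /srv/gitea-workspace
  # Mount the same host path at the same container path: $(pwd) inside a
  # job container is then also a valid path for the host Docker daemon,
  # so relative bind mounts in docker compose resolve correctly.
  options: >-
    -v /srv/gitea-workspace:/srv/gitea-workspace
    -v /var/run/docker.sock:/var/run/docker.sock
```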

fix(ci): replace overlay2 sync with workspace bind mount for DooD — docker-compose.observability.yml, nightly.yml, release.yml

  • Reverted all 5 observability bind mounts to plain ./infra/observability/… — OBS_CONFIG_DIR variable removed
  • Removed "Sync observability configs to host" step from both workflows
  • Removed OBS_CONFIG_DIR env injection from both env-file write steps

docs(adr): add ADR-015 for DooD workspace bind-mount approach — docs/adr/015-dood-workspace-bind-mount.md

  • Documents the decision, the path-identity constraint, alternatives (overlay2 sync, image-baked configs), and consequences

Out-of-band changes applied to the VPS

  • Created /srv/gitea-workspace on raddatz.cloud
  • Added - /srv/gitea-workspace:/srv/gitea-workspace to the runner service in ~/docker/gitea/compose.yaml
  • Updated ~/docker/gitea/runner-config.yaml to match
  • Restarted the runner (docker compose restart runner)

Reviewer concerns addressed

| Concern | Resolution |
|---|---|
| @tobiwendt: overlay2 driver assumption, stale files, exit guard, hardcoded path duplication | Entire sync step removed — none of these apply |
| @nullx: --privileged --pid=host threat model comment | Privileged container removed entirely |
| @nullx: docker inspect "$(hostname)" reliability | Removed |
| @mkeller: ADR for the overlay2 pattern | ADR-015 written for the workspace bind-mount decision instead |
| @saraholt: post-sync file verification | No longer needed — files are in the workspace, not copied |
Author
Owner

🏛️ Markus Keller (@mkeller) — Senior Application Architect

Verdict: ⚠️ Approved with concerns

Blockers

PR description is stale — describes v1, diff implements v2.
The PR body still describes the OBS_CONFIG_DIR env-var approach and the nsenter/Alpine sync step (the "v1" approach). The actual diff contains none of that — it's the workspace bind-mount approach. This mismatch means the PR body actively misleads a reviewer about what is being merged. The body should be rewritten to describe the workspace bind-mount approach before merge.

docs/DEPLOYMENT.md likely needs an update.
runner-config.yaml is infrastructure. The doc table in my review checklist maps "New Docker service or infrastructure component" → docs/DEPLOYMENT.md. The workspace path /srv/gitea-workspace and the runner compose.yaml volume requirement are new operational facts for whoever manages the VPS. If docs/DEPLOYMENT.md or docs/infrastructure/production-compose.md already document the runner setup, this section needs a note. If not, consider whether this is the right moment to add it.

Suggestions

Out-of-band operational prerequisite is documented, but fragile.
The ADR correctly notes:

The runner container's compose.yaml (maintained outside this repo at ~/docker/gitea/compose.yaml on the VPS) must include the - /srv/gitea-workspace:/srv/gitea-workspace volume line.

This is an acceptable constraint for a single-person operation. Good that it's in the ADR. Consider whether a comment in runner-config.yaml (pointing to the ADR) is also warranted for the person who next edits the file.

ADR quality is excellent. The four alternatives table with explicit rejection reasons is exactly the format I want. The "identical host ↔ container path" constraint is correctly identified as the load-bearing invariant and explained with enough depth that a future developer won't accidentally break it by mapping to a different container path. This is a permanent architectural asset.

Author
Owner

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Blockers

PR description describes the wrong implementation.
The body still refers to OBS_CONFIG_DIR, nightly.yml/release.yml workflow steps, and the nsenter/Alpine pattern — none of which appear in the diff. A reviewer who reads the body and then the diff will be confused. Please rewrite the PR body to match the actual changes before merge.

Observations (no code issues)

There's no application code in this PR — just YAML config and a Markdown ADR. Nothing to flag on naming, function size, or TDD evidence. The changed lines are infrastructure config, not business logic.

Comments in runner-config.yaml follow the right pattern.
The new block at runner-config.yaml:8–19 explains why workdir_parent must equal the host path and names the failure mode that motivated it. That's exactly the kind of comment that belongs in a config file. The # SECURITY: note (carried forward from before) is correct and should stay.

ADR references the previous approach honestly.
The alternatives table calls out "the previous approach, see PR #599 v1" with a clear rejection reason. This is the kind of traceability I want: future readers can understand why the simpler-looking fix was tried and abandoned without digging through PR history.

Minor

Nothing else to flag here. The YAML is minimal and the ADR is the right vehicle for this level of explanation.

Author
Owner

🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

Blockers

PR description is completely wrong for what's in the diff.
The body describes the v1 sync approach (OBS_CONFIG_DIR, nsenter Alpine step, nightly.yml/release.yml changes). The actual diff is two files: a new ADR and a runner-config.yaml change. These must match before merge — anyone triaging later will be misled.

Two manual host-side steps are prerequisites for this config to work, and they're easy to miss.

From the ADR:

  1. mkdir -p /srv/gitea-workspace on the VPS
  2. Add - /srv/gitea-workspace:/srv/gitea-workspace to the runner's compose.yaml (outside this repo)

The runner-config.yaml comment documents item 2. But if either prerequisite is missing, the runner silently continues running with the old workspace layout — jobs won't fail at startup, they'll fail when docker compose tries to resolve bind mounts. This is a confusing failure mode. The prerequisite comment is good; I'd also suggest verifying these steps are done before merging, since the test plan checkboxes are all still unchecked.

Suggestions

Workspace persistence under /srv/gitea-workspace needs a cleanup strategy.
The ADR says:

Job workspaces persist across runs under /srv/gitea-workspace. act_runner manages per-run subdirectory cleanup. Orphaned directories from interrupted runs should be cleaned up manually if disk space becomes a concern.

On a CX32 with ~40GB disk, this is survivable — but for a long-running instance, a simple cron like find /srv/gitea-workspace -maxdepth 3 -name "*.lock" -mtime +7 -delete (or a full cleanup of orphaned run dirs) is worth adding to the ops runbook before it becomes an incident. Not a blocker, but worth tracking.
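
A runbook-ready sketch of that cleanup, demonstrated on a throwaway directory. The helper name, the 7-day threshold, and the idea of pruning whole top-level run directories are assumptions from this thread, not documented act_runner behaviour; `touch -d` is GNU coreutils.

```shell
# Prune top-level workspace dirs not modified in more than $2 days.
cleanup_workspaces() {
  find "$1" -mindepth 1 -maxdepth 1 -type d -mtime "+$2" -exec rm -rf {} +
}

# Demo on a temp dir standing in for /srv/gitea-workspace.
demo=$(mktemp -d)
mkdir "$demo/run-old" "$demo/run-new"
touch -d '10 days ago' "$demo/run-old"   # backdate so -mtime +7 matches it
cleanup_workspaces "$demo" 7
ls "$demo"
```

Wired into cron against the real path, the find invocation is the whole job; the function wrapper just keeps the threshold and root path in one place.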

The valid_volumes + options pattern is correct.

  • valid_volumes whitelists paths that workflow steps may reference ✓
  • options mounts them into the job container at startup ✓
  • Identical host ↔ container path for /srv/gitea-workspace is correctly enforced ✓
  • Single-tenant trust model is appropriate here ✓

The config change itself is clean. The main risk is the operational bootstrapping.

Author
Owner

📋 Elicit — Requirements Engineer

Verdict: ⚠️ Approved with concerns

Blockers

The PR's stated requirements and test plan do not match the implementation in the diff.

The PR body defines acceptance criteria anchored to the v1 approach:

  • Nightly CI run starts the observability stack without "not a directory" errors
  • Local docker compose -f docker-compose.observability.yml up still works (no OBS_CONFIG_DIR needed)
  • Production release deploy starts observability stack successfully

The second item explicitly mentions OBS_CONFIG_DIR — a concept that no longer exists in this implementation. The diff contains no changes to docker-compose.observability.yml, nightly.yml, or release.yml, which the body implies were changed.

Revised acceptance criteria for the actual implementation (workspace bind-mount approach):

Given: the VPS has /srv/gitea-workspace created and the runner compose.yaml mounts it
When: a nightly or release CI job runs docker compose up for the observability stack
Then: all five config file bind mounts resolve correctly on the host daemon
And: no "not a directory" mount errors occur at startup

Given: no OBS_CONFIG_DIR env var is set
When: docker compose -f docker-compose.observability.yml up is run locally
Then: bind mounts resolve relative to the checked-out repo path (existing behaviour preserved)

Given: the workspace bind-mount prerequisite is NOT configured on the runner
When: a CI job runs
Then: the failure is detectable (observable symptoms + fallback behaviour documented)

Suggestions

The operational prerequisites are implicit requirements. They are documented in the ADR but they are not in the test plan. Before closing issue #598, confirm:

  • /srv/gitea-workspace created on VPS ✓/✗
  • Runner compose.yaml updated ✓/✗
  • CI job verified to start the observability stack successfully ✓/✗

These items represent the Definition of Done for issue #598 and should be checked before the PR is merged.
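Those Definition-of-Done items can be scripted as a preflight check. A sketch under assumptions: the mount-table format mimics what `docker inspect <runner> --format '{{range .Mounts}}{{.Source}}:{{.Destination}}{{"\n"}}{{end}}'` would emit, the runner container name is deployment-specific, and the demonstration uses a synthetic mount list instead of a live runner:

```shell
#!/bin/sh
# Sketch of a preflight check for the two VPS prerequisites behind ADR-015.
check_prereqs() {
  dir="$1"      # host directory that must exist, e.g. /srv/gitea-workspace
  mounts="$2"   # newline-separated "source:destination" pairs
  [ -d "$dir" ] || { echo "FAIL: host dir $dir missing"; return 1; }
  # The host path must be mounted at the identical container path
  printf '%s\n' "$mounts" | grep -Fqx "$dir:$dir" \
    || { echo "FAIL: runner does not mount $dir at the identical path"; return 1; }
  echo "OK: prerequisites satisfied"
}

# Demonstration with a throwaway directory and a synthetic mount table
tmp="$(mktemp -d)"
check_prereqs "$tmp" "/var/run/docker.sock:/var/run/docker.sock
$tmp:$tmp"
```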


🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

What I checked

Docker socket exposure — unchanged, acceptable for this trust model.
The options line adds -v /srv/gitea-workspace:/srv/gitea-workspace to the existing -v /var/run/docker.sock:/var/run/docker.sock. The socket mount is the dominant privilege grant here; the workspace volume adds no meaningful new attack surface. The # SECURITY: comment correctly scopes the trust model to "only trusted code from this private repo" — this is the right annotation for a future auditor.

Workspace persistence — minor risk, documented.
Job workspaces under /srv/gitea-workspace persist across runs. act_runner creates per-run subdirectories, but orphaned directories from interrupted runs could contain:

  • Build artifacts (JARs, node_modules, Docker layer caches)
  • Intermediate credentials if any CI step wrote env vars to disk

This is not a new attack surface relative to the v1 approach (overlay2 MergedDir was also persistent). The ADR acknowledges it. For a private single-tenant runner it's acceptable. If the runner is ever opened to untrusted contributors, this persistence model would need re-evaluation — but that's explicitly out of scope per ADR-011.

Path traversal concern — not applicable.
The bind-mount source is a fixed path (/srv/gitea-workspace) not derived from user input. No injection vector here.

Privilege escalation via --privileged — removed.
The ADR notes that the v1 approach required --privileged --pid=host (effective root on the host). The v2 approach eliminates this. This is a meaningful security improvement — the blast radius of a compromised job container is now limited to the Docker socket (which was already present) rather than full host filesystem access via nsenter.

Summary

v2 is measurably more secure than v1. The socket mount is the accepted risk for this deployment model. No new findings.


🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

What I checked

Test plan is stale and partially invalid.
The PR test plan references OBS_CONFIG_DIR ("Local docker compose -f docker-compose.observability.yml up still works (no OBS_CONFIG_DIR needed)") — a concept that belongs to v1, which was abandoned. With the v2 workspace bind-mount approach, OBS_CONFIG_DIR was never added to docker-compose.observability.yml, so this criterion no longer makes sense. The test plan needs rewriting to match the actual implementation.

All three test plan checkboxes remain unchecked.

- [ ] Nightly CI run starts the observability stack without "not a directory" errors
- [ ] Local docker compose -f docker-compose.observability.yml up still works
- [ ] Production release deploy starts observability stack successfully

This indicates the fix has not been verified in any environment. The core fix depends on two out-of-band prerequisites on the VPS (the directory creation and runner compose.yaml volume). Without those being applied and verified, this PR can't be considered done.

No automated test coverage possible — expected.
CI config changes and runner YAML are not unit-testable. The test strategy here is necessarily manual verification + CI observation. That's fine. But the manual verification steps must actually be completed and the boxes checked before merge.

Suggestion

Before merging, I'd want to see at minimum:

  1. Checkboxes checked with evidence (e.g., a link to a successful CI run showing observability stack started)
  2. Test plan updated to remove the OBS_CONFIG_DIR reference
  3. Confirmation that the VPS prerequisites are in place

The config change itself looks correct — the risk is that it's been theoretically correct but not yet empirically validated.
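One cheap way to validate empirically before checking the boxes: render the resolved compose config and confirm every bind-mount source actually exists on the host side. A sketch, assuming Compose's canonical `source:` rendering from `docker compose -f docker-compose.observability.yml config`; the demonstration feeds it a synthetic snippet rather than real Compose output:

```shell
#!/bin/sh
# Sketch: check that every bind-mount source in a rendered compose config
# exists on this filesystem (the exact failure mode behind issue #598).
check_bind_sources() {
  missing=0
  while IFS= read -r src; do
    [ -n "$src" ] || continue
    [ -e "$src" ] || { echo "missing bind source: $src"; missing=1; }
  done <<EOF
$(printf '%s\n' "$1" | awk '$1 == "source:" && $2 ~ /^\// {print $2}')
EOF
  [ "$missing" -eq 0 ] && echo "all bind sources present"
}

# Demonstration against a synthetic config snippet
tmpfile="$(mktemp)"
sample="volumes:
  - type: bind
    source: $tmpfile
    target: /etc/prometheus/prometheus.yml"
check_bind_sources "$sample"
```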


🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead

Verdict: Approved

This PR contains no user-facing changes — it's a CI runner configuration change and an Architecture Decision Record document. There are no Svelte components, no route changes, no CSS, and no UI interactions affected.

I reviewed the two changed files to confirm:

  • docs/adr/015-dood-workspace-bind-mount.md — Technical infrastructure documentation, no UX implications
  • runner-config.yaml — CI runner YAML configuration, no UX implications

Nothing for me to flag here. LGTM from a UI/UX perspective.

marcel added 1 commit 2026-05-15 19:47:27 +02:00
docs(ci): document workspace bind-mount setup for DooD runners
Some checks failed
CI / OCR Service Tests (pull_request) Successful in 59s
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Backend Unit Tests (pull_request) Successful in 5m22s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
CI / Unit & Component Tests (pull_request) Failing after 14m44s
e4ac5f08e7
Add the /srv/gitea-workspace prerequisite step to DEPLOYMENT.md §3.1
and a new "Workspace bind-mount setup" subsection plus failure mode 4
to ci-gitea.md, covering the root cause, one-time host setup, disk
management, and troubleshooting for the bind-mount resolution fix
introduced in ADR-015.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit e4ac5f08e7 into main 2026-05-15 19:51:05 +02:00
marcel deleted branch fix/issue-598-obs-dood-bind-mounts 2026-05-15 19:51:05 +02:00
Reference: marcel/familienarchiv#599