Captures the architectural decision, alternatives considered (sudo systemctl, Caddy admin API, SSH), and consequences (symlink contract, Renovate review requirement, step duplication tracked in #539). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# ADR-012: nsenter via privileged sibling container for host service management in CI

## Status

Accepted

## Context
The deploy workflows (`.gitea/workflows/nightly.yml`, `release.yml`) run job steps inside Docker containers under a Docker-out-of-Docker (DooD) setup: the Gitea runner container mounts the host Docker socket, and act_runner spawns a sibling container for each job. That job container also gets the Docker socket mounted (via `valid_volumes` in `runner-config.yaml`).

This architecture has one significant limitation: **job containers cannot manage host services**. Specifically:

- Job containers are not in the host's PID, mount, UTS, network, or IPC namespaces.
- There is no systemd PID 1 inside a job container, so `systemctl` has nothing to talk to.
- `sudo` is not present in standard container images; even if it were, it would not help, because root inside the container is still confined to the container's namespaces.
- Caddy runs as a **host systemd service** (not a Docker container), managing TLS certificates via Let's Encrypt. It must be running on the host to serve port 443.

The deploy workflows need to tell Caddy to reload its config after each deploy so that committed Caddyfile changes are applied before the smoke test validates the public surface. Without a reload step, Caddy silently serves the previous config and the smoke test may pass against stale configuration.
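The missing-systemd point from the list above can be checked directly from inside any container. This is a minimal sketch, assuming the documented `sd_booted(3)` convention that systemd is the running init exactly when `/run/systemd/system` exists. The function name `has_systemd` and its optional path argument are illustrative and not part of the workflows.

```sh
#!/bin/sh
# has_systemd [DIR]: succeed if DIR (default /run/systemd/system) is a directory.
# systemd creates that directory only when it is running as init, which is
# the same condition sd_booted(3) tests.
has_systemd() {
    [ -d "${1:-/run/systemd/system}" ]
}

if has_systemd; then
    echo "systemd is the init system here; systemctl can reach it"
else
    echo "no systemd here; systemctl has nothing to talk to"
fi
```

Run inside a job container, this is expected to take the second branch; run on the VPS host, the first.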
## Decision

Use the host Docker socket (already mounted in every job container via `runner-config.yaml`) to spin up a **privileged sibling container** in the host PID namespace, then use `nsenter` to enter the host's mount, UTS, network, PID, and IPC namespaces and call `systemctl reload caddy`:
```yaml
- name: Reload Caddy
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```
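`nsenter -t 1 -m -u -n -p -i` targets PID 1, which is the host's init process because of `--pid=host`, and enters its mount (`-m`), UTS (`-u`), network (`-n`), PID (`-p`), and IPC (`-i`) namespaces, giving `systemctl` a view of the real host systemd daemon. The cgroup namespace is not entered; that would require `-C`.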
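To make the short flags easier to audit, the same invocation can be spelled with the long options that util-linux `nsenter` accepts. The sketch below only prints the command line rather than executing it, so it is safe to run anywhere. The function name `host_cmd` is hypothetical.

```sh
#!/bin/sh
# host_cmd CMD...: print the nsenter command line that would run CMD in the
# namespaces of PID 1. Long options map one-to-one onto the short flags:
#   --target 1 = -t 1   --mount = -m   --uts = -u
#   --net      = -n     --pid   = -p   --ipc = -i
host_cmd() {
    echo "nsenter --target 1 --mount --uts --net --pid --ipc -- $*"
}

host_cmd /bin/systemctl reload caddy
# prints: nsenter --target 1 --mount --uts --net --pid --ipc -- /bin/systemctl reload caddy
```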

**Alpine is used** instead of Ubuntu: ~5 MB vs ~70 MB pull size, no unnecessary tooling. `util-linux` (which ships `nsenter`) is installed at run time; `apk add` takes ~1 s on the warm VPS cache. The image digest is pinned so any upstream change requires an explicit Renovate bump PR.
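Whether an image reference is digest-pinned can be checked mechanically, for example in a lint step. This is a rough sketch, assuming the `name:tag@sha256:<64 hex characters>` form used in the step above. `is_digest_pinned` is an illustrative name, not an existing script in the repository.

```sh
#!/bin/sh
# is_digest_pinned REF: succeed if REF ends in "@sha256:" followed by exactly
# 64 lowercase hexadecimal characters.
is_digest_pinned() {
    printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

is_digest_pinned "alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d" \
    && echo "pinned"
is_digest_pinned "alpine:3.21" || echo "not pinned"
```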

**`reload` not `restart`**: `systemctl reload` runs the unit's `ExecReload=` command (`caddy reload` in the upstream unit file), so Caddy swaps in the new config in-process without dropping TLS connections or in-flight requests.

**No sudoers entry is required**: the Docker socket already grants root-equivalent host access. This pattern makes existing implicit privileges explicit rather than introducing new ones.

The same step is used in both `nightly.yml` and `release.yml`, since both deploy the app stack and must apply Caddyfile changes before smoke-testing the public surface.
## Alternatives Considered

| Alternative | Why rejected |
|---|---|
| `sudo systemctl reload caddy` in the job container | No systemd PID 1 inside the container, so `systemctl` has nothing to connect to. `sudo` is not present in container images and would not help even if it were. |
| Caddy admin API (`curl localhost:2019/load`) | Job containers do not share the host network namespace; `localhost:2019` on the host is unreachable. Exposing `:2019` on a host-bound port would add a network attack surface with no benefit over the current approach. |
| SSH from the job container to the VPS host | Requires storing an SSH private key as a CI secret, managing `authorized_keys` on the host, and opening an inbound SSH path from the container. Adds key management overhead for a pattern that the Docker socket already enables more directly. |
| Running Caddy as a Docker container (instead of host service) | Caddy manages TLS certificates via Let's Encrypt; running it in Docker complicates certificate persistence and renewal. As a host service, cert storage is straightforward and restarts do not risk rate-limit issues. This would be a larger infrastructure change unrelated to the CI gap. |
## Consequences

- The runner host's Docker socket access is now a capability relied upon for host service management, not just for running `docker compose` commands. This is stated explicitly in the YAML comment so future reviewers understand the trust boundary.
- The Caddyfile symlink on the VPS (`/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile`) is a required contract for CI to succeed. It is documented in `docs/DEPLOYMENT.md §3.1` and `docs/infrastructure/ci-gitea.md`. If the symlink is absent or mis-pointed, `systemctl reload caddy` succeeds but Caddy serves stale config.
- Renovate will create bump PRs when a new Alpine 3.21 digest is published. Because the container runs `--privileged --pid=host`, these bump PRs must be reviewed manually and must not be auto-merged. A `packageRule` in `renovate.json` enforces this.
- The step is duplicated between `nightly.yml` and `release.yml` (tracked in issue #539 for extraction into a composite action).
- If Caddy is not running when the step executes, `systemctl reload` exits non-zero and the workflow aborts before the smoke test, which prevents a misleading "port 443 refused" curl error.
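The symlink contract from the list above could be verified before the reload, so that a missing or mis-pointed link fails loudly instead of leaving Caddy on stale config. This is a minimal sketch under the assumption that it runs where the host filesystem is visible (for example inside the same `nsenter` call). `check_link` is a hypothetical helper; the paths are the ones named in the list.

```sh
#!/bin/sh
# check_link LINK TARGET: succeed only if LINK is a symlink, TARGET exists,
# and both resolve to the same file.
check_link() {
    [ -L "$1" ] && [ -e "$2" ] \
        && [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

if check_link /etc/caddy/Caddyfile /opt/familienarchiv/infra/caddy/Caddyfile; then
    echo "Caddyfile symlink OK"
else
    echo "Caddyfile symlink missing or mis-pointed" >&2
fi
```

In a workflow step the `else` branch would typically end with `exit 1` so the job aborts before the reload; the sketch only reports.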
## References

- `docs/infrastructure/ci-gitea.md` §"Running host-level commands from CI (nsenter pattern)": full operational context, troubleshooting guide
- `docs/DEPLOYMENT.md` §3.1: Caddyfile symlink bootstrap step
- ADR-011: single-tenant runner trust model (Docker socket access scope)