ci(nightly): reload Caddy before smoke test #537
@@ -120,6 +120,39 @@ jobs:
          --profile staging \
          up -d --wait --remove-orphans

      - name: Reload Caddy
        # Apply any committed Caddyfile changes before smoke-testing the
        # public surface. Without this step, a Caddyfile edit lands in the
        # repo but Caddy keeps serving the previous config until someone
        # reloads it manually — the smoke test would then catch a stale
        # header or a still-proxied /actuator route rather than confirming
        # the current config is live.
        #
        # The runner executes job steps inside Docker containers (DooD).
        # `systemctl` is not present in container images and cannot reach
        # the host's systemd directly. We use the Docker socket (mounted
        # into every job container via runner-config.yaml) to spin up a
        # privileged sibling container in the host PID namespace; nsenter
        # then enters the host's namespaces so systemctl talks to the real
        # host systemd daemon. No sudoers entry is required — the Docker
        # socket already grants root-equivalent host access.
        #
        # Alpine is used: ~5 MB vs ~70 MB for Ubuntu, no unnecessary
        # tooling, and the digest is pinned so any upstream change requires
        # an explicit bump PR. util-linux (which ships nsenter) is installed
        # at run time; apk add takes ~1 s on the warm VPS cache.
        #
        # `reload` not `restart`: reload applies the new config in-process
        # (via the unit's ExecReload) without dropping TLS connections.
        # `restart` would briefly stop the service, losing in-flight
        # requests.
        #
        # If Caddy is not running this step fails fast before the smoke test
        # issues a misleading "port 443 refused" error.
        run: |
          docker run --rm --privileged --pid=host \
            alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
            sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'

      - name: Smoke test deployed environment
        # Healthchecks confirm containers are healthy; they do NOT confirm the
        # public surface works. This step catches: Caddy not reloaded, HSTS
@@ -93,6 +93,18 @@ jobs:
          --env-file .env.production \
          up -d --wait --remove-orphans

      - name: Reload Caddy
        # See nightly.yml — same rationale and mechanism: DooD job containers
        # cannot call systemctl directly; nsenter via a privileged sibling
        # container reaches the host systemd. Must run after deploy (so the
        # latest Caddyfile is on disk) and before the smoke test (so the
        # public surface reflects the current config). Alpine with pinned
        # digest; reload not restart — see nightly.yml for full rationale.
        run: |
          docker run --rm --privileged --pid=host \
            alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
            sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'

      - name: Smoke test deployed environment
        # See nightly.yml — same three checks, against the prod vhost.
        # --resolve pins archiv.raddatz.cloud to the runner's loopback so
@@ -151,6 +151,9 @@ ufw default deny incoming && ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 4
apt install caddy

# Use the Caddyfile from the repo (replace path with the runner's clone target)
# CI DEPENDENCY: the nightly and release workflows run `systemctl reload caddy` to
# pick up committed Caddyfile changes. They find the file via this symlink — if it
# is absent or points elsewhere, the reload succeeds but serves stale config.
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy
docs/adr/012-nsenter-for-host-service-management-in-ci.md — new file, 63 lines
@@ -0,0 +1,63 @@
# ADR-012: nsenter via privileged sibling container for host service management in CI

## Status

Accepted

## Context

The deploy workflows (`.gitea/workflows/nightly.yml`, `release.yml`) run job steps inside Docker containers under a Docker-out-of-Docker (DooD) setup: the Gitea runner container mounts the host Docker socket, and act_runner spawns a sibling container for each job. That job container also gets the Docker socket mounted (via `valid_volumes` in `runner-config.yaml`).

This architecture has one significant limitation: **job containers cannot manage host services**. Specifically:

- Job containers are not in the host's PID, mount, UTS, network, or IPC namespaces.
- There is no systemd PID 1 inside a job container — `systemctl` has nothing to talk to.
- `sudo` is not present in standard container images; even if it were, it would not help.
- Caddy runs as a **host systemd service** (not a Docker container), managing TLS certificates via Let's Encrypt. It must be running on the host to serve port 443.

The deploy workflows need to tell Caddy to reload its config after each deploy so that committed Caddyfile changes are applied before the smoke test validates the public surface. Without a reload step, Caddy silently serves the previous config and the smoke test may pass against stale configuration.

## Decision

Use the host Docker socket (already mounted in every job container via `runner-config.yaml`) to spin up a **privileged sibling container** in the host PID namespace, then use `nsenter` to enter the relevant host namespaces and call `systemctl reload caddy`:

```yaml
- name: Reload Caddy
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```

`nsenter -t 1 -m -u -n -p -i` enters the init process's mount, UTS, network, PID, and IPC namespaces, giving `systemctl` a view of the real host systemd daemon.
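These memberships can be inspected directly: every Linux process exposes its namespaces as symlinks under `/proc/<pid>/ns`, and `nsenter -t 1` simply joins PID 1's corresponding entries. A small runnable illustration (listing your own namespaces needs no privileges; the entry names are standard Linux, not project-specific):

```shell
# Each symlink under /proc/self/ns names one namespace this process
# belongs to. `nsenter -t 1 -m -u -n -p -i` joins PID 1's counterparts:
# mnt (-m), uts (-u), net (-n), pid (-p), ipc (-i).
ls /proc/self/ns
```

Inside the sibling container, comparing this listing with `ls -l /proc/1/ns` shows which namespaces still differ from the host before `nsenter` runs.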

**Alpine is used** instead of Ubuntu: ~5 MB vs ~70 MB pull size, no unnecessary tooling. `util-linux` (which ships `nsenter`) is installed at run time; `apk add` takes ~1 s on the warm VPS cache. The image digest is pinned so any upstream change requires an explicit Renovate bump PR.

**`reload` not `restart`**: reload applies the new config in-process (via the unit's `ExecReload`) without dropping TLS connections or in-flight requests.

**No sudoers entry is required**: the Docker socket already grants root-equivalent host access. This pattern makes existing implicit privileges explicit rather than introducing new ones.

This decision applies the same pattern to both `nightly.yml` and `release.yml`, since both deploy the app stack and must apply Caddyfile changes before smoke-testing the public surface.

## Alternatives Considered

| Alternative | Why rejected |
|---|---|
| `sudo systemctl reload caddy` in the job container | No systemd PID 1 inside the container — `systemctl` has nothing to connect to. `sudo` is not present in container images and would not help even if it were. |
| Caddy admin API (`curl localhost:2019/load`) | Job containers do not share the host network namespace; `localhost:2019` on the host is unreachable. Exposing `:2019` on a host-bound port would add a network attack surface with no benefit over the current approach. |
| SSH from the job container to the VPS host | Requires storing an SSH private key as a CI secret, managing authorized_keys on the host, and opening an inbound SSH path from the container. Adds key-management overhead for a pattern the Docker socket already enables more directly. |
| Running Caddy as a Docker container (instead of a host service) | Caddy manages TLS certificates via Let's Encrypt; running it in Docker complicates certificate persistence and renewal. As a host service, cert storage is straightforward and restarts do not risk rate-limit issues. This would be a larger infrastructure change unrelated to the CI gap. |

## Consequences

- The runner host's Docker socket access is now a capability relied upon for host service management, not just for running `docker compose` commands. This is stated explicitly in the YAML comment so future reviewers understand the trust boundary.
- The Caddyfile symlink on the VPS (`/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile`) is a required contract for CI to succeed. It is documented in `docs/DEPLOYMENT.md §3.1` and `docs/infrastructure/ci-gitea.md`. If the symlink is absent or mis-pointed, `systemctl reload caddy` succeeds but Caddy serves stale config.
- Renovate will create bump PRs when a new Alpine 3.21 digest is published. Because the container runs `--privileged --pid=host`, these bump PRs must be reviewed manually and must not be auto-merged. A `packageRule` in `renovate.json` enforces this.
- The step is duplicated between `nightly.yml` and `release.yml` (tracked in issue #539 for extraction into a composite action).
- If Caddy is not running when the step executes, `systemctl reload` exits non-zero and the workflow aborts before the smoke test — preventing a misleading "port 443 refused" curl error.

## References

- `docs/infrastructure/ci-gitea.md` §"Running host-level commands from CI (nsenter pattern)" — full operational context and troubleshooting guide
- `docs/DEPLOYMENT.md` §3.1 — Caddyfile symlink bootstrap step
- ADR-011 — single-tenant runner trust model (Docker socket access scope)
@@ -4,16 +4,109 @@ This document covers the Gitea Actions CI workflow for Familienarchiv, including

---

## Runner Architecture

Familienarchiv uses **two containers** on the same Hetzner VPS:

| Container | Purpose | Config |
|---|---|---|
| `gitea` | Hosts Gitea itself | `infra/gitea/docker-compose.yml` |
| `gitea-runner` | Runs all CI and deploy jobs | `infra/gitea/docker-compose.yml` + `/root/docker/gitea/runner-config.yaml` |

Both containers live in the `gitea_gitea` Docker network on the VPS. The runner connects to Gitea via the LAN IP so job containers (which don't share the `gitea_gitea` network) can also reach it.

### Docker-out-of-Docker (DooD)

The `gitea-runner` container mounts the host Docker socket (`/var/run/docker.sock`). When a workflow job runs, act_runner spawns a **sibling container** for each job. That job container also gets the Docker socket mounted (via `valid_volumes` in `runner-config.yaml`), enabling `docker compose` calls in workflow steps.
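
The `valid_volumes` mechanism lives in the runner's config file. A sketch of the relevant fragment (illustrative only — the surrounding keys are assumptions; this document confirms just the file path and the `valid_volumes` option):

```yaml
# /root/docker/gitea/runner-config.yaml (illustrative excerpt)
container:
  # Host paths that job containers are allowed to mount. Whitelisting the
  # Docker socket is what lets workflow steps call `docker` and spawn the
  # privileged sibling container used for nsenter.
  valid_volumes:
    - /var/run/docker.sock
```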

### Running host-level commands from CI (nsenter pattern)

Job containers are unprivileged and do not share the host's PID/mount/network namespaces. Commands like `systemctl` that target the host daemon are therefore unavailable by default. When a workflow step needs to manage a host service (e.g. `systemctl reload caddy`), it uses the Docker socket to spin up a **privileged sibling container** in the host PID namespace:

```yaml
- name: Reload Caddy
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```

`nsenter -t 1 -m -u -n -p -i` enters the init process's mount, UTS, network, PID, and IPC namespaces, giving `systemctl` a view of the real host systemd. No sudoers entry is required — the Docker socket already grants root-equivalent host access.

Alpine is used instead of Ubuntu: ~5 MB vs ~70 MB, and the digest is pinned to a specific sha256 so any upstream change requires an explicit Renovate bump PR. `util-linux` (which ships `nsenter`) is not part of the Alpine base image but is installed at run time in ~1 s from the warm VPS cache.

#### Why not `sudo systemctl` in the job container?

Job containers run as root inside an unprivileged Docker namespace. There is no systemd PID 1 inside the container — `systemctl` would attempt to reach a socket that does not exist. `sudo` is not present in container images and would not help even if it were.

#### Why not Caddy's admin API?

Caddy ships a localhost admin API at `:2019` by default. Job containers do not share the host network namespace, so they cannot reach `localhost:2019` on the host. Exposing `:2019` on a host-bound port to make it reachable would add a network attack surface with no benefit over the current approach.

### Caddyfile symlink contract

The deploy workflows reload Caddy to pick up committed Caddyfile changes. This relies on a symlink that must exist on the VPS:

```
/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile
```

Created once during server bootstrap (see `docs/DEPLOYMENT.md §3.1`). Verify with:

```bash
ls -la /etc/caddy/Caddyfile
# Expected: lrwxrwxrwx ... /etc/caddy/Caddyfile -> /opt/familienarchiv/infra/caddy/Caddyfile
```
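
A workflow step could also guard against a mis-pointed symlink before reloading. The following is a hypothetical pre-flight check, not an existing step; it is demonstrated against a temp directory so it runs anywhere, with the real VPS paths noted in comments:

```shell
#!/bin/sh
# Hypothetical pre-flight check: refuse to reload if the symlink does not
# resolve to the repo clone. On the VPS the two paths would be
# /etc/caddy/Caddyfile and /opt/familienarchiv/infra/caddy/Caddyfile;
# here we stand in a throwaway temp dir so the sketch is runnable.
set -eu
tmp=$(mktemp -d)
mkdir -p "$tmp/repo" "$tmp/etc"
printf 'archiv.example {\n\trespond 200\n}\n' > "$tmp/repo/Caddyfile"
ln -sf "$tmp/repo/Caddyfile" "$tmp/etc/Caddyfile"

# Compare fully-resolved paths so a chain of symlinks still validates.
if [ "$(readlink -f "$tmp/etc/Caddyfile")" = "$(readlink -f "$tmp/repo/Caddyfile")" ]; then
  result="symlink ok"
else
  result="symlink mis-pointed"
fi
echo "$result"
rm -rf "$tmp"
```

On a correct setup this prints `symlink ok`; in CI the mis-pointed branch would `exit 1` before the reload runs, turning the silent failure mode below into a loud one.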

### Troubleshooting: Reload Caddy step fails

**Failure mode 1 — Caddy is stopped**

Symptom in CI log:

```
Failed to reload caddy.service: Unit caddy.service is not active.
```

Recovery:

```bash
ssh root@<vps>
systemctl start caddy
systemctl status caddy   # confirm Active: active (running)
```

Re-run the workflow via Gitea Actions → "Re-run workflow".

**Failure mode 2 — Caddyfile symlink is missing or mis-pointed**

This failure is silent — `systemctl reload caddy` exits 0 but Caddy reloads whatever `/etc/caddy/Caddyfile` currently resolves to. The smoke test may then pass against stale config.

Symptom: the smoke test fails on the HSTS value or the `/actuator/health → 404` check despite the Reload Caddy step succeeding.

Diagnosis:

```bash
ssh root@<vps>
ls -la /etc/caddy/Caddyfile
# Should be: lrwxrwxrwx ... /etc/caddy/Caddyfile -> /opt/familienarchiv/infra/caddy/Caddyfile
```

Recovery if the symlink is wrong or missing:

```bash
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy
```

**Failure mode 3 — nsenter / Docker socket unavailable**

Symptom in CI log:

```
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
```

or

```
nsenter: failed to execute /bin/systemctl: No such file or directory
```

The first error means the Docker socket is not mounted into the job container — check `valid_volumes` in `/root/docker/gitea/runner-config.yaml` on the VPS. The second means the Alpine image is running but cannot enter the host mount namespace; verify `--privileged` and `--pid=host` are both present in the workflow step.
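
The socket half of this can be checked before anything privileged runs. A hypothetical pre-flight (not in the workflows), demonstrated with a path that certainly does not exist so the sketch is runnable anywhere:

```shell
#!/bin/sh
# Hypothetical pre-flight for the Reload Caddy step: confirm the Docker
# socket is actually mounted before attempting the privileged sibling
# container. On the runner the real path is /var/run/docker.sock; a
# missing socket points at valid_volumes in runner-config.yaml.
check_docker_socket() {
  if [ -S "$1" ]; then
    echo "docker socket present"
  else
    echo "docker socket missing"
    return 1
  fi
}

# Demonstrated against a path that does not exist:
check_docker_socket /nonexistent/docker.sock || result=$?
echo "exit status: ${result:-0}"
```

In CI the step would pass `/var/run/docker.sock` and abort with the `valid_volumes` hint instead of the opaque "Cannot connect to the Docker daemon" error.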

---
@@ -5,6 +5,13 @@
      "matchPackagePatterns": ["^@tiptap/"],
      "groupName": "tiptap",
      "automerge": false
    },
    {
      "description": "Digest bumps for images used in privileged CI steps (--privileged --pid=host) must be reviewed manually — a compromised image has root-equivalent host access.",
      "matchPaths": [".gitea/workflows/**"],
      "matchUpdateTypes": ["digest"],
      "automerge": false,
      "reviewersFromCodeOwners": false
    }
  ]
}