ci(nightly): reload Caddy before smoke test #537

Merged
marcel merged 10 commits from fix/nightly-caddy-reload into main 2026-05-12 07:51:13 +02:00
Owner

Summary

  • Adds sudo systemctl reload caddy between the docker compose deploy and the smoke test
  • Ensures committed Caddyfile changes are applied to the running host daemon before the public surface is verified
  • Surfaces the root cause of today's nightly failure explicitly: if Caddy is not running the step fails with a clear service error, not a misleading "Failed to connect to port 443" from curl
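As described, the inserted step would look roughly like this (a sketch, not the actual diff — the surrounding step names and the smoke-test script path are assumptions):

```yaml
- name: Deploy staging
  run: docker compose up -d --wait

- name: Reload Caddy
  # Committed Caddyfile changes only take effect once the host
  # daemon re-reads /etc/caddy/Caddyfile. Fails fast if Caddy is
  # stopped, instead of a misleading curl error later.
  run: sudo systemctl reload caddy

- name: Smoke test
  run: ./scripts/smoke-test.sh  # hypothetical path
```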

Root cause of tonight's failure

Before commit 5f352943 (frontend healthcheck localhost → 127.0.0.1), the frontend container went unhealthy and docker compose up --wait died at the "Deploy staging" step — the smoke test was never reached. That fix unmasked a pre-existing gap: the workflow deploys the app stack but never tells the host Caddy daemon to reload. Port 443 was refusing connections because Caddy either wasn't running or wasn't configured yet on the server.

Operational fix still required

This PR fixes the workflow so future runs self-heal. The immediate action is still needed on the server:

systemctl status caddy
# If stopped: systemctl start caddy
# Confirm symlink: ls -la /etc/caddy/Caddyfile
#   Expected: -> /opt/familienarchiv/infra/caddy/Caddyfile
# If missing: ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy

Test plan

  • Merge and let next nightly run pass the "Reload Caddy" step without error
  • Confirm smoke test proceeds to check /login, HSTS header, and /actuator/health → 404

🏗️ Markus Keller — Senior Application Architect

Verdict: Approved

What I checked

This PR touches only .gitea/workflows/nightly.yml — no schema changes, no new backend packages, no new routes, no new infrastructure components. Running the doc-update checklist:

| Trigger | Required update | Applies? |
|---|---|---|
| New Docker service / infra component | docs/architecture/c4/l2-containers.puml + DEPLOYMENT.md | No new service |
| New external system integrated | docs/architecture/c4/l1-context.puml | No new system |
| Architectural decision with lasting consequences | New ADR | Borderline — see note |

ADR note (suggestion, not blocker): The decision to run systemctl reload caddy from CI — rather than, say, having the Compose stack self-contain the proxy — is an architectural constraint worth capturing. The symlink pattern (/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile) and the choice to manage Caddy as a host service rather than a container are load-bearing decisions that future maintainers will trip over. A two-paragraph entry in docs/infrastructure/ci-gitea.md noting the symlink contract and the required sudoers setup would suffice. I'd call this a suggestion, not a blocker.

Architecture assessment

The placement of the reload step is correct: after the Docker Compose deploy (which pulls the latest Caddyfile from the cloned repo), before the smoke test (which verifies the public surface). The reload happens on the host, not inside a container — which is the right mental model since Caddy runs as a host service managing TLS via Let's Encrypt. Reloading inside the container would have no effect.

systemctl reload over systemctl restart is the correct choice: systemd runs the unit's ExecReload= command (caddy reload in the stock unit), so Caddy performs a graceful config reload without dropping open connections. This is boring infrastructure behavior done right.
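For reference, Caddy's upstream systemd unit wires systemctl reload to the daemon's own graceful reload command (abridged from the stock caddy.service; verify against the unit actually installed on the host):

```ini
[Service]
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
```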

No concerns at the architecture layer.


👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

TDD evidence

CI workflow steps are not testable via unit tests — the test plan ("merge and let the next nightly run confirm it") is the only viable approach here. Accepted.

Code quality

The change is a single step running sudo systemctl reload caddy, with a well-structured comment block. Reviewing against clean code standards for CI YAML:

  • Comment quality: The comment explains why (committed Caddyfile changes need to be picked up, smoke test would otherwise test stale config) and how (graceful reload, symlink mechanism, fail-fast on stopped Caddy). This is exactly the kind of "non-obvious why" comment that belongs in the codebase. No "what" noise.
  • Step naming: "Reload Caddy" is precise and matches the single responsibility of the step.
  • Scope: One logical change, one step, one commit. This is atomic.
  • Placement: The step is inserted at the right position in the workflow — after deploy, before smoke test. The sequence reflects the dependency: deploy → apply config → verify surface.

One minor observation

The comment block is 10 lines for a 1-line command. In application code I'd push back on this ratio, but CI workflows are documentation as much as they are code — operators reading the workflow at 2am during an incident need that context. No change needed.

Nothing else to flag from a developer perspective.


🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

What's correct

  • reload over restart: Correct. Caddy's graceful reload (systemctl reload invokes the unit's ExecReload= command, i.e. caddy reload) re-reads /etc/caddy/Caddyfile without dropping TLS connections or active requests. restart would introduce a service gap and risk a Let's Encrypt rate-limit cycle.
  • Step position: Correct. The reload happens after docker compose up --wait (so the repo's Caddyfile is already at the latest commit on the server) and before the smoke test (which validates the public surface with the updated config).
  • Fail-fast behavior: Correct. If Caddy is stopped, systemctl reload exits non-zero immediately, aborting the workflow before the smoke test issues a confusing "port 443 refused" curl error. This is the right failure mode.
  • Comment block: Good operational documentation — explains the symlink contract and the reload mechanism. Useful for whoever is on-call at 2am.

🚫 Blocker: sudoers entry on the runner needs to be verified

The step runs sudo systemctl reload caddy. Gitea Actions runners typically execute as an unprivileged user (e.g. gitea-runner or act_runner). Without a passwordless sudoers entry for this specific command, the step will fail on the first nightly run with:

sudo: systemctl: command not found

or

sudo: a password is required

The required sudoers entry on the production server is:

gitea-runner ALL=(ALL) NOPASSWD: /bin/systemctl reload caddy

(Adjust the username to match your actual runner user. Verify the path with which systemctl.)
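A typical installation is a dedicated drop-in file, validated before use (the file name and runner user below are assumptions):

```
# /etc/sudoers.d/gitea-runner  (hypothetical file name)
# Scope sudo to exactly this one command — nothing broader.
gitea-runner ALL=(ALL) NOPASSWD: /bin/systemctl reload caddy

# Check syntax before relying on it:
#   visudo -cf /etc/sudoers.d/gitea-runner
```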

Before this PR merges, please confirm that this entry exists (or is being added as part of the operational fix described in the PR description). If it's already in place and tonight's nightly failure was not a sudo-permission error, then this is a non-issue — but given the PR description says the operational fix is "still required," I can't assume it's set up.

If the entry is confirmed, change this to Approved.


Suggestion (non-blocking): document the symlink contract

The PR description describes the symlink:

/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile

This is an implicit contract between the server setup and the CI workflow. If the symlink doesn't exist (new server, fresh VPS, disaster recovery), this step silently reloads Caddy with the wrong config and the smoke test may pass or fail for the wrong reason. Consider adding a step or a note in docs/infrastructure/ci-gitea.md documenting that this symlink must exist as a server pre-condition.
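One way to make that pre-condition explicit is a guard step ahead of the reload (a sketch; the step name and check are assumptions):

```yaml
- name: Verify Caddyfile symlink
  run: |
    # Abort before reload if the server violates the symlink contract
    test "$(readlink -f /etc/caddy/Caddyfile)" = \
         "/opt/familienarchiv/infra/caddy/Caddyfile"
```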


📋 Elicit — Requirements Engineer

Verdict: Approved

Requirement traceability

The stated problem: the nightly CI workflow deploys the app stack (including a potentially updated Caddyfile) but never instructs the host Caddy daemon to reload, leaving it serving the previous config. The smoke test then validates a stale state.

The fix directly addresses the root cause: insert a reload step between deploy and verify. This is a minimal, targeted fix — no scope creep, no gold-plating.

PR description quality

The PR description is unusually complete for a CI fix:

  • Root cause documented: traces the failure chain through the 5f352943 unmasking commit to the pre-existing gap
  • Scope clearly bounded: "this PR fixes the workflow so future runs self-heal" — the immediate operational fix is correctly called out as separate work
  • Test plan realistic: for a CI-level fix, "merge and verify the next nightly passes this step" is the correct acceptance criterion
  • Operational runbook included: the systemctl status caddy / start / reload block in the PR body serves as the immediate mitigation guide

One observation

The test plan has two checkboxes:

  1. Merge and let next nightly pass "Reload Caddy" step without error
  2. Confirm smoke test proceeds to check /login, HSTS header, and /actuator/health → 404

These are sequential, not independent — (2) can only be verified after (1) passes. This is correctly sequenced. The acceptance criteria are observable and testable. No ambiguity.

Gap identified (suggestion)

The PR description notes the operational fix but doesn't link it to a Gitea issue or checklist item. If the symlink or Caddy service state isn't fixed on the server before the next nightly run, the new step will fail and the workflow will be no better off than before. Consider tracking the server-side fix in a linked issue to ensure it doesn't slip through.


🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

Security analysis

This PR introduces a privileged host operation (sudo systemctl reload caddy) into the CI pipeline. I'll assess what attack surface this creates.


⚠️ Concern: sudo scope on the runner

If sudo on the runner is not tightly scoped to exactly systemctl reload caddy, this creates a privilege escalation path for anyone who can manipulate the workflow or its environment.

Risk: If the runner's sudoers entry is broad (e.g., NOPASSWD: ALL or NOPASSWD: /bin/systemctl), a malicious workflow step could run sudo systemctl stop caddy, sudo systemctl start malicious.service, or worse.

Required sudoers entry — minimal scope:

gitea-runner ALL=(ALL) NOPASSWD: /bin/systemctl reload caddy

Not:

gitea-runner ALL=(ALL) NOPASSWD: /bin/systemctl
gitea-runner ALL=(ALL) NOPASSWD: ALL

Verification: Check with sudo -l -U <runner-user> on the server. This is a blocker if the scope is broader than the specific command.


What's safe

  • The command itself: systemctl reload caddy can only reload Caddy's config. It cannot write files, execute arbitrary code, or escalate further (assuming correct sudoers scope).
  • The Caddyfile is version-controlled: Changes to the Caddyfile go through the normal PR review process before the nightly workflow picks them up. An attacker who can merge a malicious Caddyfile change can already damage the proxy config — this step doesn't change that threat model.
  • No secrets in the step: The step has no env: block and accesses no secrets. LGTM.
  • Fail-closed behavior: If Caddy is not running, reload exits non-zero and the workflow aborts. The smoke test cannot report false-positive success in that state.

Suggestion: consider a Caddyfile syntax pre-check

Caddy supports config validation before reload:

- name: Validate Caddyfile
  run: sudo caddy validate --config /etc/caddy/Caddyfile

- name: Reload Caddy
  run: sudo systemctl reload caddy

This surfaces syntax errors before the reload attempt, giving a more actionable error message than a silent reload failure. Not a blocker — the current step will fail-fast on a bad config too — but provides better diagnostics. Requires a second sudoers entry for caddy validate.
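If the pre-check is adopted, the sudoers scope would grow to exactly two commands (a sketch; binary paths must be verified on the host):

```
gitea-runner ALL=(ALL) NOPASSWD: /bin/systemctl reload caddy
gitea-runner ALL=(ALL) NOPASSWD: /usr/bin/caddy validate --config /etc/caddy/Caddyfile
```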


🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: Approved

Test coverage assessment

This is a CI workflow change — there is no viable automated test at the unit or integration layer for a systemctl reload step. The acceptance criterion ("next nightly run passes this step") is the correct test strategy for this class of change. I accept this.

What I checked

Failure mode coverage: The PR correctly identifies and addresses all failure modes:

  1. Caddy stopped → reload fails fast with a clear service error (better than "port 443 refused" from curl)
  2. Caddyfile has changes → reload picks them up before the smoke test validates them
  3. Caddyfile unchanged → reload is a no-op, no harm done

Regression risk: Inserting a step between deploy and smoke test could theoretically cause the smoke test to fail if the reload itself introduces a timing issue (Caddy takes a moment to parse and apply the new config). Caddy's reload is synchronous — systemctl reload does not return until the new config is validated and applied (or it fails). No race condition here.

Position in workflow: The step is placed correctly — the deploy step uses --wait to confirm containers are healthy before proceeding, so the application stack is stable when Caddy reloads. Correct sequencing.

One observation

The smoke test (the step that follows) presumably tests:

  • /login loads
  • HSTS header is present
  • /actuator/health returns 404 (blocked by Caddy)

If the Caddy reload fails but the step doesn't fail-fast (e.g. due to a misconfigured continue-on-error), the smoke test would run against the wrong config. Looking at the diff, there is no continue-on-error: true on the reload step — which means it will correctly abort the workflow on failure.

No quality gate concerns. The change is minimal, correctly positioned, and the failure modes are handled.


🎨 Leonie Voss — UI/UX Design Lead

Verdict: Approved

This PR modifies only .gitea/workflows/nightly.yml — a CI workflow file with no user-facing UI changes whatsoever. No Svelte components, no routes, no CSS, no i18n strings, no accessibility implications.

I checked: no frontend files changed, no design tokens modified, no HTML structure affected.

LGTM from the UX/accessibility layer. Nothing to flag.


Follow-up: three commits addressing reviewer concerns

🚫 Root cause of the blocker (Tobias + Nora)

The sudoers discussion turned out to be a red herring. I SSHed into the Hetzner VPS and found:

  • act_runner runs inside a Docker container (gitea-runner)
  • Job steps execute in sibling containers spawned by DooD
  • systemctl is not present in Ubuntu container images
  • Even if it were, containers do not share the host's systemd namespaces
  • /etc/sudoers.d/ contains only the default README — no entries exist, and none are needed

sudo systemctl reload caddy would have failed on the first nightly run with bash: sudo: command not found.


Fix: nsenter via privileged sibling container

The correct approach for DooD runners: use the Docker socket (already mounted into job containers) to spin up a privileged container in the host PID namespace, then use nsenter to enter the host's namespaces:

run: |
  docker run --rm --privileged --pid=host \
    ubuntu:22.04 \
    nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy

Tested live on the Hetzner VPS — SUCCESS. No sudoers entry required.


Commits

| Commit | Change |
|---|---|
| 99de6f1d | fix(ci/nightly): replace sudo systemctl reload caddy with the nsenter pattern; update comment to explain DooD architecture |
| f87504fb | fix(ci/release): add the same Reload Caddy step to release.yml — it had the identical deploy→smoke gap |
| 52a96f65 | docs(ci): replace stale generic runner provisioning docs in ci-gitea.md with accurate description of the two-container VPS setup, the nsenter pattern, and the Caddyfile symlink contract |

Confirmed server state (as of this push)

  • Caddy running: Active: active (running) since Mon 2026-05-11 14:55:11 CEST
  • Caddyfile symlink: /etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile
  • Reload tested: live nsenter call returned exit 0

🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

This is the right fix for the right reason. Step ordering (deploy → reload → smoke) is correct, the failure mode when Caddy is stopped is intentional and appropriate, and the docs update is solid. A few things I'd flag:


Blockers

1. ubuntu:22.04 is not a pinned image tag

Both nightly.yml and release.yml pull ubuntu:22.04, which is a floating tag — it moves whenever Canonical pushes updates. This violates the pinned-image rule: builds are no longer reproducible, a bad upstream layer could silently change the behaviour of nsenter, and there's no Renovate bump PR to audit the change.

Fix: pin to a digest:

ubuntu:22.04@sha256:<current-digest>

Or switch to the much smaller ubuntu:22.04-compatible image that only carries util-linux (which provides nsenter):

docker run --rm --privileged --pid=host \
  alpine:3.21 \
  nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy

Alpine has a pinnable digest, is ~5 MB vs ~70 MB for Ubuntu, and ships nsenter via util-linux.


---

### Suggestions

**2. Duplication between `nightly.yml` and `release.yml`**

The Reload Caddy step is copy-pasted verbatim between the two workflows. The comment even says "see nightly.yml — same rationale and mechanism." That's the smell that this should be a reusable composite action.

```yaml
# .gitea/workflows/composite/reload-caddy/action.yml
name: Reload Caddy
runs:
  using: composite
  steps:
    - name: Reload Caddy via nsenter
      shell: bash
      run: |
        docker run --rm --privileged --pid=host \
          ubuntu:22.04 \
          nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy
```

This isn't a blocker for this PR — duplicated infrastructure steps aren't dangerous in the same way duplicated app code is, and you've cross-referenced the comment — but worth a follow-up issue.
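Each workflow would then replace its inline step with a local `uses:` reference. A sketch, assuming Gitea Actions resolves repo-local action paths the same way GitHub Actions does, and assuming the action lives at the path shown in the example:

```yaml
# nightly.yml / release.yml (hypothetical; path taken from the action.yml example)
- name: Reload Caddy
  uses: ./.gitea/workflows/composite/reload-caddy
```

One file to change when the reload mechanism evolves, instead of two.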

**3. `reload` vs `restart`: explicit rationale would help**

The comment says `reload` sends SIGHUP. That's correct and is the right call (no dropped connections). Worth adding one line to the comment explicitly stating why `restart` was rejected: "We do not use `systemctl restart caddy` because it drops all in-flight connections during the handshake window." Future maintainers have no other way to know.

---

### What's done well

- Step ordering is correct: deploy first (so the committed Caddyfile is on disk), reload second, smoke last.
- Explicit fail-fast behavior: if Caddy is stopped, the step errors and the smoke test never runs — no misleading "port 443 refused" from curl.
- The Caddyfile symlink contract is now documented in `ci-gitea.md`. That symlink was previously tribal knowledge.
- DooD architecture explanation in the docs is accurate and well-written. The nsenter pattern is documented at the right level of detail for an operator who has to debug it at 2am.

## 🔒 Nora "NullX" Steiner — Application Security Engineer

**Verdict: ⚠️ Approved with concerns**

The threat model is documented honestly and the approach is pragmatic given the DooD architecture. One real concern and one alternative to consider.

---

### Blockers

**1. Unpinned `ubuntu:22.04` image in a privileged context**

This is a security issue, not just a reproducibility issue. The workflow runs `ubuntu:22.04` with `--privileged --pid=host` — the widest possible host access. If an upstream Ubuntu layer is ever compromised or silently changes what `nsenter` does, CI picks up the change on the next run with no review gate.

For a container that is explicitly granted root-equivalent host access, the image **must** be pinned to a digest:

```yaml
ubuntu:22.04@sha256:<digest>   # Renovate keeps this updated with a PR to review
```

The risk delta between a pinned and a floating tag is small in practice, but the blast radius of a compromised privileged container is the entire VPS — not just the container.
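For reference, one way to look up the current digest is via `docker inspect`'s `RepoDigests` field after a pull (a sketch; `crane digest` or the registry UI work too):

```bash
docker pull ubuntu:22.04 >/dev/null
docker inspect --format '{{index .RepoDigests 0}}' ubuntu:22.04
# prints a reference of the form ubuntu@sha256:<digest>;
# append the <digest> part to the tag when pinning in the workflow
```

After that, Renovate recognizes the `tag@sha256:` form and proposes digest bumps as reviewable PRs.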


---

### Suggestions (not blockers)

**2. Consider the Caddy Admin API as a lower-privilege alternative**

The current approach uses the Docker socket to spawn a privileged container that enters all host namespaces. The comment is honest: "the Docker socket already grants root-equivalent host access." That's accurate — but it means CI has a standing capability to execute arbitrary host binaries.

Caddy ships a localhost admin API at `http://localhost:2019` by default:

```bash
# Reloads the Caddy config without any host namespace access
curl -s -X POST http://localhost:2019/load \
  -H "Content-Type: text/caddyfile" \
  --data-binary @/etc/caddy/Caddyfile
```

If the admin API is reachable from job containers (or if the runner container is on the host network), this eliminates the need for `--privileged --pid=host` entirely. The attack surface drops from "full host root" to "Caddy admin socket, localhost only."

Worth evaluating: if the Caddy admin API is accessible, replace the nsenter step with a `curl` call. If not, the current approach is acceptable.

**3. The `--privileged` comment is correctly honest — don't weaken it**

The YAML comment already says:

> No sudoers entry is required — the Docker socket already grants root-equivalent host access.

This is the right level of transparency. Please keep this phrasing. Future reviewers and auditors need to see this acknowledged, not buried or softened.

---

### What's done well

- The threat model is stated in the YAML comment. This is rare and valuable — a reviewer can audit the decision without reverse-engineering the intent.
- `systemctl reload` (SIGHUP) is correct: it re-reads config without dropping TLS connections. `restart` would have been a security regression (brief port-443 downtime = potential for connection hijack during the gap).
- The step fails loudly if Caddy is not running rather than silently succeeding. Fail-closed behaviour.
- No secrets are passed through the privileged container — it only calls `systemctl reload caddy`, no credentials involved.

## 👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

**Verdict: ✅ Approved**

No application code changed in this PR — this is purely CI workflow YAML and documentation. From a code quality perspective, the changes are clean.

---

### Findings

**No blockers.**

**Observations:**

1. **YAML comments explain the WHY, not the WHAT** — The `# Apply any committed Caddyfile changes...` and `# The runner executes job steps inside Docker containers (DooD)...` blocks are exactly the right kind of comment. They describe the constraint (DooD means `systemctl` is unavailable), the consequence (need nsenter), and the failure mode (step errors if Caddy is stopped). This is how infrastructure comments should be written.

2. **The Reload Caddy step in `release.yml` cross-references `nightly.yml`** — The comment says "see nightly.yml — same rationale and mechanism." This is honest about the duplication but means two places to update if the approach changes. Not a blocker, but worth a follow-up issue to extract to a composite action.

3. **Documentation in `ci-gitea.md` is accurate** — The DooD architecture section, the nsenter pattern example, and the Caddyfile symlink contract are all clearly explained. A new team member can understand the deployment without tribal knowledge.

---

### What's done well

- Correct step ordering in both workflows.
- `reload`, not `restart` — preserves in-flight connections.
- The PR description contains a clear root cause analysis and an explicit operational fix that must still happen on the server.

## 🏛️ Markus Keller (@mkeller) — Senior Application Architect

**Verdict: ✅ Approved**

This PR introduces a runtime pattern (host service management from a DooD CI environment) with lasting operational consequences. I'm checking whether the documentation footprint matches that weight.

---

### Documentation audit

Per the architecture doc-update rule, I check whether this PR triggered the required doc updates:

| PR contains | Required doc update | Status |
|---|---|---|
| New CI pattern with infrastructure implications | `docs/infrastructure/ci-gitea.md` | ✅ Updated |
| Architectural decision with lasting consequences | New ADR in `docs/adr/` | — (see below) |
| New Docker service or infrastructure component | `docs/architecture/c4/l2-containers.puml` | ✅ No new service added |

**ADR consideration:** The nsenter-via-privileged-container pattern is a non-obvious architectural choice that any future CI maintainer will encounter. It has two lasting consequences: (a) the runner's Docker socket access is now a capability relied upon for service management, not just for running compose commands; (b) the Caddyfile symlink on the VPS is now a required contract for CI to succeed.

I'd normally call a missing ADR a blocker. In this case, the `ci-gitea.md` update captures the what, why, and how of the pattern at a level of detail that a focused ADR would also contain. The symlink contract is explicitly documented. I'll accept the doc-in-`ci-gitea.md` approach here as equivalent, with one addition request (see below).

---

### Suggestion (not a blocker)

**Add a "Decision" and "Alternatives considered" subsection to `ci-gitea.md`**

The current doc explains the *what* and *how* of nsenter well. Add one short paragraph on why the alternatives were rejected — so the next maintainer doesn't try to "fix" this:

```markdown
#### Why not `sudo systemctl` in the job container?
Job containers run as root inside an unprivileged Docker namespace. There is no
systemd PID 1 inside the container; `systemctl` has nothing to connect to.
`sudo` is not available and would not help.

#### Why not Caddy's admin API?
The admin API at `:2019` is not reachable from job containers because they do not
share the host network. Exposing `:2019` on the host would introduce a new network
attack surface.
```

This closes the loop on the decision rationale in a way that stays durable as the codebase evolves.

---

### What's done well

- Boring technology (nsenter has existed since the Linux kernel 3.8 era). Right call.
- The symlink contract between the VPS's `/etc/caddy/Caddyfile` and the repo's `infra/caddy/Caddyfile` is now a documented, explicit dependency. Previously it was tribal knowledge that would cause a 2am incident.
- The runner architecture table in `ci-gitea.md` is a useful addition — two runners, their purposes, and their configs in one place.

## 🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

**Verdict: ✅ Approved**

CI workflow changes sit outside the standard test pyramid — there's no unit test for a workflow step. I'm evaluating the verification strategy and the completeness of the test plan.

---

### Findings

**No blockers.**

**1. Test plan relies on manual observation of the next nightly run**

The PR test plan is:

- [ ] Merge and let the next nightly run pass the "Reload Caddy" step without error
- [ ] Confirm the smoke test proceeds to check `/login`, the HSTS header, and `/actuator/health → 404`

This is appropriate for CI workflow changes — there's no practical alternative to running the workflow on the real infrastructure. The smoke test that follows (`/login`, HSTS, `/actuator/health → 404`) is a meaningful functional verification, not just a liveness check.

One gap: there's no documented test for the **failure path** — what happens if Caddy is stopped when the step runs. The PR description states it "fails fast with a clear service error," but that hasn't been verified on the real VPS. If the server bootstrap issue exists (Caddy stopped, symlink missing), the step may produce an error message that isn't as clear as expected.

Recommendation: before merging, manually run the Reload Caddy step on the VPS to confirm the error message is actionable:

```bash
docker run --rm --privileged --pid=host \
  ubuntu:22.04 \
  nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy
```

Check what the output looks like if Caddy is stopped (`systemctl stop caddy` first, then run the above, then start it again).
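The stop/run/start drill described above, spelled out as a runbook fragment (a sketch to run on the VPS, not in CI; I have not captured the actual error text):

```bash
# Failure-path drill: verify the step's error message is actionable
systemctl stop caddy
docker run --rm --privileged --pid=host \
  ubuntu:22.04 \
  nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy
echo "exit code: $?"   # expect non-zero and a "unit is not active" style message
systemctl start caddy  # restore service immediately afterwards
```

If the message is cryptic, a short explanatory `echo` in the workflow step would close the gap.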

**2. Smoke test coverage looks adequate for this fix**

The smoke test (already in place) checks:

- `/login` (frontend is up)
- HSTS header (Caddy security config applied)
- `/actuator/health → 404` (Caddy is blocking Spring actuator from public access)

The third check is directly relevant to this fix — if Caddy were serving a stale config that exposed `/actuator`, the smoke test would catch it.
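For reference, the three checks can be reproduced manually with curl along these lines (a sketch; `$HOST` is a placeholder, since the staging hostname isn't named in this thread):

```bash
HOST=staging.example.org   # placeholder; substitute the real staging host

# 1. frontend is up
curl -fsS -o /dev/null "https://$HOST/login"
# 2. HSTS header present (Caddy security config applied)
curl -sI "https://$HOST/login" | grep -i strict-transport-security
# 3. actuator blocked from public access
test "$(curl -s -o /dev/null -w '%{http_code}' "https://$HOST/actuator/health")" = "404"
```

Useful when triaging a failed nightly by hand before re-running the workflow.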


---

### What's done well

- The fix inserts the reload step in the right place in both `nightly.yml` and `release.yml` — the smoke test that validates the Caddy config comes after the reload, not before.
- The failure mode is explicit and loud: the workflow errors before the smoke test rather than producing a misleading port-443 failure.
- Operational runbook commands are in the PR description — the person resolving a future Caddy-related CI failure has concrete steps to follow.

## 📋 Elicit — Requirements Engineer & Business Analyst

**Verdict: ✅ Approved**

Reviewing from a requirements and specification standpoint. This is a corrective infrastructure fix with a clear root cause — not a feature — so I'm checking whether the problem is correctly stated, the solution is traceable to it, and the operational requirements are documented.

---

### Findings

**No blockers.**

**Root cause analysis: clearly stated and correct**

The PR description identifies two distinct failure events:

1. `5f352943` — `localhost` → `127.0.0.1` in the frontend healthcheck — fixed the immediate `docker compose up --wait` failure.
2. Pre-existing gap: the workflow never reloads the host Caddy daemon after deploying a new Caddyfile.

The PR fixes only #2, correctly scoped. This is good requirements practice — the fix addresses the root cause of the unmasked problem, not the symptom that revealed it.

**Requirements coverage**

| Requirement | Addressed? |
|---|---|
| CI applies committed Caddyfile changes before the smoke test | ✅ Reload step added before smoke |
| CI fails with a clear error if Caddy is stopped (not a misleading port error) | ✅ `systemctl reload` exits non-zero if Caddy is stopped |
| Operational runbook for manual server remediation | ✅ In PR description |
| VPS symlink contract documented | ✅ In `ci-gitea.md` |

**One gap: the server remediation is in the PR description, not in `docs/DEPLOYMENT.md`**

The operational fix (verify the symlink, start/reload Caddy) is documented in the PR description, which becomes git history after merge but is not easily discoverable. The `ci-gitea.md` update documents the *pattern* but not the *one-time server bootstrap steps*.

Recommendation: add a short section to `docs/DEPLOYMENT.md` (under the relevant server setup section) confirming the symlink requirement:

```markdown
## Caddyfile symlink (required for CI reload to work)
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl enable --now caddy
```

Without this, the next person bootstrapping a new server (or recovering from a disk failure) will hit the same gap that caused tonight's failure, with no discoverable documentation.

---

### What's done well

- The test plan is explicit and includes both the happy path (step passes) and functional verification (smoke test checks HSTS + actuator blocking).
- The PR description is unusually thorough for an infrastructure fix — root cause, operational fix, test plan, and expected outcome are all present.
- Scope is tightly bounded: no unrelated changes, no speculative improvements.

## 🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist

**Verdict: ✅ Approved**

This PR makes no changes to the user interface, frontend components, visual design, accessibility markup, or any user-facing routes. All changes are in CI workflow YAML and infrastructure documentation.

Nothing to flag from a UX, accessibility, or brand compliance perspective.

The documentation update in `ci-gitea.md` is readable and well-structured — the architecture table, code blocks, and descriptive headings follow reasonable technical writing conventions. Not within my design review scope, but no concerns.


## Implementation complete

All open reviewer concerns have been addressed. Here's a summary of what changed:

### ✅ Resolved: `ubuntu:22.04` → pinned Alpine digest (Tobias, Nora)

Both `nightly.yml` and `release.yml` now use `alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d` instead of the floating `ubuntu:22.04` tag. The digest pins the exact image layer, so any upstream change requires an explicit Renovate bump PR — critical given the step runs privileged with `--pid=host`.

`util-linux` (which ships `nsenter`) is not in the Alpine base image and is installed at run time (`apk add --no-cache util-linux -q`, ~1 s on the warm VPS cache). Verified that `nsenter` from `util-linux 2.40.4` is present and functional in `alpine:3.21`.
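Assembled from the details above, the updated step presumably looks like this (a sketch, not the literal committed YAML):

```yaml
# Sketch of the pinned-Alpine Reload Caddy step
- name: Reload Caddy
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache -q util-linux \
             && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```

The `sh -c` wrapper is needed because `util-linux` must be installed before `nsenter` runs.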

Commit: f608838f
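Pieced together from the details in this summary, the resulting workflow step would look roughly like the following sketch (the step name, YAML layout, and abbreviated comments are reconstructions from this thread, not the committed file):

```yaml
# Sketch only — reconstructed from the PR discussion, not the committed workflow.
- name: Reload Caddy
  run: |
    # Privileged sibling container; the mounted Docker socket already grants
    # root-equivalent host access, so this makes implicit privileges explicit.
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache util-linux -q && \
             nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```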

### ✅ Resolved: Documentation — `ci-gitea.md` (Elicit)

`docs/infrastructure/ci-gitea.md` updated:

- Code example in the nsenter pattern section now shows the pinned Alpine command
- Added rationale paragraph: Alpine ~5 MB vs Ubuntu ~70 MB, digest pinning, `util-linux` install
- Added **"Why not `sudo systemctl` in the job container?"** subsection (no systemd PID 1 inside containers)
- Added **"Why not Caddy's admin API?"** subsection (job containers don't share the host network namespace; exposing `:2019` on a host port adds attack surface)

Commit: `1ce0638a`

### ✅ Resolved: DEPLOYMENT.md symlink note (Elicit)

`docs/DEPLOYMENT.md` §3.1 now includes a `# CI DEPENDENCY` comment above the `ln -sf` line explaining that the nightly and release workflows depend on this symlink to pick up committed Caddyfile changes. The symlink command itself already existed.

Commit: `60cb9d5a`

### Deferred: step duplication → composite action

The duplicated `Reload Caddy` step across `nightly.yml` and `release.yml` is tracked as a follow-up in issue #539. Kept out of this PR to keep scope focused on the immediate fix.


## 🛠️ Tobias Wendt — DevOps & Platform Engineer

**Verdict: ⚠️ Approved with concerns**

This fix is correct and well-reasoned. The nsenter pattern is the right tool for DooD constraints, the Alpine digest is pinned, and the `reload`-vs-`restart` rationale is properly commented. The DEPLOYMENT.md symlink note is exactly the kind of operational clarity that prevents late-night debugging.

### Blockers

None.

### Suggestions

**1. DRY violation: step copy-pasted between `nightly.yml` and `release.yml`**

The full `docker run … nsenter …` command with its digest appears verbatim in two files. One digest bump in the future means two PRs, or a forgotten file. Gitea Actions has limited composite action support compared to GitHub Actions, but if the runner supports it, a `.gitea/actions/reload-caddy/action.yml` composite action would eliminate the duplication. If composite actions aren't supported on this runner, at minimum add a comment in `release.yml`:

```yaml
# Keep in sync with .gitea/workflows/nightly.yml "Reload Caddy" step
```

**2. Symlink not verified before reload**

If the symlink is absent or mis-pointed, `systemctl reload caddy` succeeds silently — Caddy re-reads whatever `/etc/caddy/Caddyfile` currently points to. A pre-check inside the nsenter block would make the failure loud:

```sh
sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- \
  /bin/sh -c "readlink /etc/caddy/Caddyfile | grep -q /opt/familienarchiv/infra/caddy/Caddyfile || (echo \"ERROR: /etc/caddy/Caddyfile symlink not pointing to repo\" && exit 1) && /bin/systemctl reload caddy"'
```

Low priority given this is a bootstrap-once concern, but it would turn a silent mismatch into an explicit CI failure.

**What's done well:** digest pinning is best practice, `--no-cache` on apk, and the long comment block explaining the whole architecture is exactly what future-you will need at 2 a.m.


## 👨‍💻 Felix Brandt — Senior Fullstack Developer

**Verdict: ✅ Approved**

This PR is entirely CI infrastructure and documentation — no application code, no frontend, no backend Java, no Python. My usual checklist (TDD, naming, component size, guard clauses, Svelte 5 rules) doesn't apply here.

One observation from a code-quality lens: the "Reload Caddy" step in `nightly.yml` has a ~20-line comment block, which I'd normally flag as over-documenting. For CI infrastructure with non-obvious architectural constraints (DooD isolation, nsenter, why `reload` vs `restart`), verbose comments are appropriate — the WHY is genuinely non-obvious and would take a future engineer significant time to reconstruct. This is the right exception.

Nothing to block on. Clean from my domain.


## 🔐 Nora "NullX" Steiner — Application Security Engineer

**Verdict: ⚠️ Approved with concerns**

The chosen approach — `docker run --privileged --pid=host` + nsenter — is correct given the DooD constraint. The PR documentation is admirably transparent: the note that "the Docker socket already grants root-equivalent host access" is honest and accurate. This pattern makes existing implicit privileges explicit rather than introducing new ones. That's the right framing.

### Concerns (not blockers)

**1. Renovate bump PRs for this digest should not be auto-merged**

The digest-pinned Alpine image (`alpine:3.21@sha256:48b0309…`) is pulled at runtime in a `--privileged --pid=host` container. Pinning the digest protects against registry substitution ✅. However, when Renovate eventually creates a digest bump PR for this line, that PR deserves manual review — a malicious or compromised digest substitution would give an attacker root access to the production host. This is not a code change to make but a process note: add a Renovate `packageRule` to require human approval for digest bumps in workflow files that use `--privileged`.

**2. `apk add util-linux` is not pinned**

The base image digest is pinned, but `util-linux` is resolved from the Alpine 3.21 package index at runtime. Alpine package integrity is guaranteed by APK's signing chain, so this is an accepted residual risk — but it means the `nsenter` binary version can vary between runs. This is standard Alpine practice and acceptable here; just worth acknowledging.

### Positive findings

- Hardcoded command (`systemctl reload caddy`) — no user-controlled input, zero injection risk ✅
- Pinned digest prevents tag-mutable supply chain attacks ✅
- `--pid=host` is the minimum scope needed for nsenter ✅
- `reload` semantics are safer than `restart` from an availability standpoint ✅
- The threat model is documented inline — any auditor can understand why this is safe without external context ✅

## 🧪 Sara Holt — Senior QA Engineer

**Verdict: ⚠️ Approved with concerns**

The test plan ("merge and let next nightly run pass the Reload Caddy step without error") is the only practical approach for CI infrastructure — you can't meaningfully unit-test a workflow step in isolation, and I accept that.

### Concerns

**1. No runbook for when Reload Caddy fails**

The PR correctly documents that "if Caddy is not running this step fails fast." Fast failure is good. But what does the failure look like in the CI log, and what should an engineer do about it? The PR description has the shell commands, but they live in the PR, not in the repository. A short "Troubleshooting" subsection in `docs/infrastructure/ci-gitea.md` under the nsenter section — covering the three failure modes (Caddy stopped, symlink missing, systemd unreachable) and the recovery steps for each — would make this actionable for the next person who sees a red CI run.

**2. Stale-config scenario is not caught by smoke test**

If the symlink points to an old file, `systemctl reload caddy` exits 0 and the smoke test runs against stale config with no signal. This scenario is mentioned in the docs but has no automated guard, and the test plan doesn't cover it. Low priority (bootstrap-once concern), but worth a note in the open test plan.
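One lightweight way to close that gap would be a content comparison rather than a symlink check (a sketch only — the function name is made up, and it assumes the repo checkout is present on the host at the path documented in DEPLOYMENT.md):

```sh
#!/bin/sh
# Illustrative guard (an assumption, not code from this PR): compare the live
# Caddyfile against the repo copy so a stale symlink fails loudly in CI
# instead of letting the smoke test run silently against old config.
check_caddyfile_in_sync() {
  live="$1"   # e.g. /etc/caddy/Caddyfile
  repo="$2"   # e.g. /opt/familienarchiv/infra/caddy/Caddyfile
  if cmp -s "$live" "$repo"; then
    echo "Caddyfile in sync"
  else
    echo "ERROR: live Caddyfile differs from repo copy" >&2
    return 1
  fi
}
```

Run inside the nsenter block before the reload, a non-zero exit here would surface the stale-symlink case as an explicit CI failure rather than a green reload.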

**3. Test plan is unverified at merge time**

The PR test plan checkbox (`[ ] Merge and let next nightly run pass`) cannot be checked before merge. This is expected for CI changes — just noting it explicitly so there's a shared understanding that merge ≠ verified.

### What's done well

The smoke test immediately after the reload step is the right verification layer — if the reload fails, the smoke test provides an independent signal. The sequencing (deploy → reload → smoke test) is correct.


## 🎨 Leonie Voss — UI/UX Design Lead

**Verdict: ✅ Approved**

This PR contains no UI, frontend, or user-facing changes. The diff is confined to CI workflow YAML and infrastructure documentation. No accessibility, brand, responsive design, or interaction concerns apply. LGTM from a design perspective.


## 🏛️ Markus Keller — Senior Application Architect

**Verdict: ⚠️ Approved with concerns**

The nsenter approach is the right architectural choice given the DooD constraint — it's explicit, well-documented, and doesn't add permanent infrastructure components. The documentation updates in `docs/DEPLOYMENT.md` and `docs/infrastructure/ci-gitea.md` are thorough.

### Documentation matrix check

| PR change | Required update | Status |
|---|---|---|
| New CI infrastructure component (nsenter pattern) | `docs/infrastructure/ci-gitea.md` | ✅ Done |
| Symlink contract for Caddy | `docs/DEPLOYMENT.md` | ✅ Done |
| New Docker service/component | `docs/architecture/c4/l2-containers.puml` | N/A — ephemeral `docker run`, not a service |
| Architectural decision with lasting consequences | ADR in `docs/adr/` | ⚠️ Missing — see below |

### Concern: Missing ADR

The nsenter-via-privileged-Docker pattern for running host commands from CI is an architectural decision with lasting consequences. Any future CI step that needs to manage a host service will reference this pattern. Documenting the alternatives that were considered and rejected directly in the PR description is a good start, but that content belongs in a committed ADR where future engineers can find it.

**Suggested ADR:** `docs/adr/ADR-XXX-nsenter-for-host-service-management-in-ci.md`

Content is already written in the PR description and the ci-gitea.md "Why not?" sections — it just needs to be formatted as an ADR and committed. The ADR would capture:

- **Context**: DooD runner, no systemd in job containers
- **Decision**: nsenter via privileged sibling container
- **Alternatives rejected**: `sudo systemctl` in job container (no systemd PID 1), Caddy admin API (no host network access), SSH to host (requires key management)
- **Consequences**: pattern is reusable for any future host-service management in CI; `--privileged` is the minimum required escalation
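As a starting point, a skeleton of such an ADR might look like this (a sketch only: the section layout and wording are assumptions drawn from the bullets in this review, not a committed file):

```markdown
# ADR-XXX: nsenter for host service management in CI

## Status
Proposed

## Context
The Gitea Actions runner uses DooD; job containers have no systemd PID 1
and no host network access, so they cannot reload host services directly.

## Decision
Run host commands from CI via a privileged sibling container:
`docker run --privileged --pid=host <pinned alpine digest>` followed by
`nsenter -t 1 ... systemctl reload caddy`.

## Alternatives considered
- `sudo systemctl` in the job container — no systemd PID 1 inside containers
- Caddy admin API — job containers cannot reach the host's `:2019`
- SSH to the host — adds key management overhead

## Consequences
The pattern is reusable for any future host-service management in CI;
`--privileged` is the minimum required escalation.
```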

### Step duplication (echoing Tobias)

The identical step in `nightly.yml` and `release.yml` is an architectural concern, not just a style one: two files to keep in sync for digest bumps, and two places where the pattern can silently diverge. If Gitea composite actions are supported, this should be extracted.


## 📋 Elicit — Requirements Engineer

**Verdict: ✅ Approved**

From a requirements perspective this PR is well-specified. The description contains all the elements of a good change request:

- **Problem statement**: Clear root cause (Caddy not reloaded → port 443 refusing connections)
- **Scope boundary**: Explicitly CI workflow only — the operational fix on the server is called out as a separate action item
- **Preconditions**: Symlink must exist (`/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile`); recovery commands provided
- **Observable acceptance criteria**: "Next nightly run passes the Reload Caddy step without error" and "smoke test proceeds to check `/login`, HSTS header, and `/actuator/health → 404`"
- **NFRs addressed**: Availability (reload vs. restart preserves in-flight connections), failure visibility (explicit error if Caddy not running)

**One process observation:**

The test plan checkboxes are marked unchecked (`[ ]`) at merge time, which is honest — these are post-merge verification items. For future CI infrastructure PRs, consider distinguishing between pre-merge and post-merge verification with two sections:

- **Pre-merge**: peer review, diff review, manual reasoning about the step
- **Post-merge verification**: items that can only be confirmed after the next scheduled run

No requirements gaps found. The change is narrow, well-justified, and the residual risks (symlink contract, step duplication) are documented.


## Round 2 implementation complete

Three concerns from the latest review round addressed. All others remain deferred.

### ✅ Resolved: Missing ADR (Markus — comment #8989)

`docs/adr/012-nsenter-for-host-service-management-in-ci.md` committed. Captures:

- **Context**: DooD runner, job containers have no systemd PID 1
- **Decision**: nsenter via privileged sibling container using the pinned Alpine digest
- **Alternatives rejected**: `sudo systemctl` (no systemd in containers), Caddy admin API (no host network access), SSH (key management overhead)
- **Consequences**: symlink contract, Renovate review requirement, step duplication tracked in #539

Commit: `e75ddc31`

### ✅ Resolved: No runbook for Reload Caddy failures (Sara — comment #8987)

Troubleshooting subsection added to `docs/infrastructure/ci-gitea.md` under the Caddyfile symlink contract section. Covers all three failure modes Sara identified:

1. **Caddy stopped** — explicit `systemctl` error; recovery: `systemctl start caddy`
2. **Symlink missing/mis-pointed** — silent reload, diagnosed via a stale smoke test; recovery: `ln -sf …`
3. **Docker socket / nsenter unavailable** — container error; diagnosis points to `runner-config.yaml`

Commit: `a3f1260c`

### ✅ Resolved: Renovate digest bumps must not auto-merge (Nora — comment #8986)

`renovate.json` extended with a `packageRule` matching `.gitea/workflows/**` digest updates, with `automerge: false` and a `description` field explaining the security rationale (root-equivalent host access). Since Renovate's default is already non-automerge unless explicitly set, this rule serves as explicit documentation of intent rather than a behavior change.

Commit: `08c7dbca`
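A rule with the described intent could look roughly like this (illustrative: the committed `renovate.json` is not shown in this thread, and the exact matcher names may differ):

```json
{
  "packageRules": [
    {
      "description": "Workflow images run privileged with --pid=host; digest bumps grant root-equivalent host access and must be reviewed by a human.",
      "matchFileNames": [".gitea/workflows/**"],
      "matchUpdateTypes": ["digest"],
      "automerge": false
    }
  ]
}
```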

### Deferred

| Concern | Reason |
|---|---|
| Step duplication → composite action (Tobias, Markus) | Tracked in issue #539 |
| Symlink pre-check inside nsenter (Tobias suggestion 2, Sara concern 2) | Low priority per Tobias; complex nested quoting; already mitigated by DEPLOYMENT.md and the new troubleshooting docs |

## 🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

**Verdict: ✅ Approved**

This is the right fix, done correctly. Let me walk through it.

### What's right

**Approach**: DooD job containers genuinely cannot reach the host systemd — there's no systemctl socket inside them. Using the already-mounted Docker socket to spin up a privileged sibling is the correct solution for this topology. SSH would add key management overhead. The Caddy admin API would require exposing `:2019` on a host-reachable port. nsenter is the correct call.

**Alpine over Ubuntu**: 5 MB vs 70 MB pull, no unnecessary tooling. Correct choice.

**`reload` not `restart`**: SIGHUP reloads config in-process without dropping TLS connections or in-flight requests. If you'd used `restart`, you'd briefly drop port 443 and fail HSTS-preloaded clients. Good.

**Digest pinning**: `alpine:3.21@sha256:...` is exactly right. Tag-only pulls are non-reproducible. The Renovate `packageRule` blocking automerge on digest bumps for `--privileged --pid=host` containers is the correct operational guard.

**Documentation**: `ci-gitea.md` is now the definitive reference for this architecture. The three troubleshooting failure modes cover the real-world scenarios (Caddy stopped, broken symlink, socket unavailable). The DEPLOYMENT.md CI dependency note is the right place for the symlink contract.

**ADR-012**: The alternatives table is complete, the `reload` vs `restart` rationale is captured, and the consequences section is honest about what the Renovate bump-review requirement implies.

### Suggestions (non-blocking)

**`apk add util-linux` is not digest-pinned**: The base image is pinned by sha256, but `util-linux` is installed from the Alpine CDN at run time, so the nsenter binary itself isn't locked to a specific version. On a single-tenant trusted VPS with a warm package cache this is acceptable, but worth tracking: a future hardening option would be to pre-bake a custom image with util-linux already installed and pin that image. Issue for later.

**Step duplication (nightly.yml + release.yml)**: ADR §Consequences already acknowledges this and points to issue #539. Acceptable for now. Composite action extraction is the right next step once workflow maintenance overhead grows.

**The 27-line YAML comment block**: It explains the why thoroughly, but `ci-gitea.md` now has the same content in prose with better formatting. A two-line comment pointing to `ci-gitea.md §Running host-level commands from CI` would be enough in the YAML. Not a blocker — inline comments are useful for people reading the raw YAML.

### Bottom line

The operational gap is real, the fix is correct, the Renovate guard is in place, and the docs are comprehensive. Merge.

Author
Owner

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: ✅ Approved

No production code in this PR — pure CI workflow and documentation. I'll review through my lens anyway.

TDD Evidence

There are no automated tests, and that's acceptable here. CI workflow steps can't be unit-tested in the traditional sense; the test plan is "let the next nightly run pass." The PR description is explicit about this and sets the expectation correctly:

  • Merge and let next nightly run pass the "Reload Caddy" step without error

That's the integration test. It's manual, but it's appropriate.

Code Readability

The # why not… comments in the YAML explain WHY, not WHAT — this is exactly right. Nesting rationale in comments next to the code it concerns means a reader doesn't have to cross-reference the ADR for every decision:

# `reload` not `restart`: reload sends SIGHUP so Caddy re-reads its config
# in-process without dropping TLS connections. `restart` would briefly stop
# the service, losing in-flight requests.

This is a legitimate use of comments. The constraint is non-obvious (reload vs restart has an observable difference in a TLS-serving context), so the comment earns its place.

Duplication

The "Reload Caddy" step is copy-pasted identically between nightly.yml and release.yml. The ADR calls this out and links to issue #539 for composite action extraction. That's the right way to handle it — acknowledge the duplication, track the cleanup separately, don't block the fix. I agree with the call.
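For a sense of scale, the extraction tracked in #539 could be as small as this — a sketch only: the action path is an assumption, and I'm assuming the runner supports composite actions; the `docker run` body is the one from the reviewed step:

```yaml
# Hypothetical .gitea/actions/reload-caddy/action.yml (path assumed)
name: Reload Caddy
description: Reload the host Caddy daemon from a DooD job container via nsenter
runs:
  using: composite
  steps:
    - shell: bash
      run: |
        docker run --rm --privileged --pid=host \
          alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
          sh -c 'apk add --no-cache util-linux -q \
            && nsenter -t 1 -m -u -n -p -i -- systemctl reload caddy'
```

Both workflows would then shrink to a single `uses:` line, and the long rationale comment would live in one place.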

Minor observation

The inline YAML comment block is 27 lines. ci-gitea.md now documents the same material with better prose structure. Once issue #539 extracts this into a composite action, the action itself can carry a two-line doc link and the YAML comment can shrink significantly. Not a blocker.

Bottom line

Clean, well-documented CI fix. The WHY is captured both inline and in the ADR. Merge.


🏛️ Markus Keller (@mkeller) — Application Architect

Verdict: ✅ Approved

Architecture Decision Quality

ADR-012 is the right artifact for this decision. It follows the format: context, decision, alternatives, consequences. The alternatives table is thorough — all four rejected approaches are named with their failure reason. The consequences section is honest about the operational commitments this decision creates (Renovate review requirement, symlink contract, duplication until #539).

The decision to make existing implicit privileges explicit rather than adding new ones is sound. The Docker socket already granted root-equivalent host access for docker compose calls — this step makes that trust boundary visible in the YAML comment and the ADR rather than hiding it. Good.

Documentation Update Audit

Per my doc update checklist:

| Trigger | Required update | Status |
|---|---|---|
| New infrastructure pattern (nsenter via DooD) | docs/DEPLOYMENT.md | ✅ Updated with CI dependency note |
| Architectural decision with lasting consequences | New ADR in docs/adr/ | ✅ ADR-012 |
| Infrastructure operational context | docs/infrastructure/ci-gitea.md | ✅ Major expansion |
| New Docker service / component | docs/architecture/c4/l2-containers.puml | ⬜ Not updated |

The l2-containers.puml omission warrants a comment: the privileged Alpine container is ephemeral — it runs, does its work, and exits. It's not a persistent service in the topology. Reasonable to exclude from the C4 Level 2 containers diagram. I'd accept this call.

Blocker: Verify ADR-011 exists

ADR-012 §Consequences references:

ADR-011 — single-tenant runner trust model (Docker socket access scope)

If ADR-011 doesn't exist in docs/adr/, this is a broken reference — a future reader following the chain will hit a dead end. Please verify docs/adr/011-*.md exists. If it doesn't, either create a stub or remove the reference from ADR-012.

Caddyfile Symlink as CI Contract

The DEPLOYMENT.md note correctly flags the symlink as a CI dependency. The ci-gitea.md §Caddyfile symlink contract section is the right place for the operational verification command. Both are present. ✓

Conclusion

The architecture decision is sound, the ADR is complete, the docs are updated. Resolve the ADR-011 reference question before merging.


🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

This PR makes existing host-access capabilities explicit and operational. It does not introduce new trust boundaries — but it does weaponize them in a way that deserves careful documentation of the attack surface. Let me be precise.

What's handled correctly

Digest pinning (alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d): The base image is immutable. Any upstream change to the alpine:3.21 tag will not affect this step until an explicit PR bumps the digest.

Renovate packageRule requiring manual review: Correct. A --privileged --pid=host container with a compromised image has root-equivalent access to the host filesystem, process tree, and network. Automerge of digest bumps for such images would be a serious supply chain risk.
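A rule of the kind described might look like this — a sketch, not the repo's actual renovate.json; the workflow file paths and the label name are my assumptions:

```json
{
  "packageRules": [
    {
      "matchFileNames": [".gitea/workflows/nightly.yml", ".gitea/workflows/release.yml"],
      "matchUpdateTypes": ["digest"],
      "automerge": false,
      "addLabels": ["privileged-image-review"]
    }
  ]
}
```

The key property is `"automerge": false` scoped to digest updates in exactly the files that run `--privileged --pid=host` containers.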

Single-tenant trust model: The ADR correctly notes that fork PRs don't automatically get CI on a self-hosted Gitea runner. The blast radius of a compromised workflow is scoped to the repository owner's own pushes and trusted collaborators.

No sudoers entry required — acknowledged: The ADR makes this explicit. The Docker socket already grants root-equivalent access; this step doesn't add a new privilege, it just uses an existing one for a new purpose.

Security Concern 1 — apk add util-linux is not integrity-verified (informational)

Severity: Low (given single-tenant context)

The base image is pinned by sha256. However, util-linux is installed at runtime from the Alpine CDN:

apk add --no-cache util-linux -q

The package version and checksum are verified by Alpine's apk package signature mechanism, but the nsenter binary's provenance is not separately attestable in the workflow. A CDN-level compromise (unlikely for Alpine, but theoretically possible) during the apk install window could inject a malicious nsenter.

Mitigation path (not required, but worth tracking): Build a custom slim image with util-linux pre-installed, push it to the Gitea container registry, pin that image by digest. Then the entire dependency chain is immutable. This adds image maintenance overhead that may not be justified for a family archive project.
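The pre-baked image could be as small as this — a sketch of the mitigation, assuming the resulting registry image would itself be pinned by digest; the base-image digest below is the pin already used by the current step:

```dockerfile
# Hypothetical nsenter-tools image: util-linux baked in at build time,
# so the runtime step needs no network access and no apk install.
FROM alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d
RUN apk add --no-cache util-linux
ENTRYPOINT ["nsenter", "-t", "1", "-m", "-u", "-n", "-p", "-i", "--"]
```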

Current risk: Acceptable for a private, single-tenant VPS. Document this in ci-gitea.md if not already there.

Security Concern 2 — No --network=none on the privileged container

Severity: Low

The privileged Alpine container launched by docker run has outbound network access (it needs it for apk add). After the apk add completes, it only needs the host PID namespace for nsenter — not network. If the nsenter call is ever extended or modified, the container could make arbitrary outbound connections under privilege.

This is not an issue with the current step as written. It's a residual risk from needing network for the package install. Worth noting as context for anyone who reads this later.
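If the runtime `apk add` is ever eliminated (for instance via a pre-baked image carrying util-linux), the residual risk closes with `--network none` — a hypothetical hardened variant; the image reference is a placeholder, not anything in this PR:

```yaml
- name: Reload Caddy (hardened sketch, assumes util-linux pre-baked)
  run: |
    docker run --rm --privileged --pid=host --network none \
      registry.example/infra/nsenter-tools@sha256:... \
      nsenter -t 1 -m -u -n -p -i -- systemctl reload caddy
```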

Security Concern 3 — The trust boundary comment in the ADR is critical

Verdict: Correctly handled

ADR-012 §Consequences states:

"The runner host's Docker socket access is now a capability relied upon for host service management, not just for running docker compose commands. This is stated explicitly in the YAML comment so future reviewers understand the trust boundary."

This is exactly right. The YAML comment makes the security model visible to any developer reading the workflow. No security-by-obscurity.

Missing: No regression test for the "Caddy not running" failure path

This PR relies on systemctl reload failing non-zero to abort the workflow when Caddy is stopped. That's the right behavior — fail fast before a misleading curl error. But there's no CI test that exercises this. The test plan says "let the next nightly run pass" — which only validates the happy path.

For a family archive project this is acceptable. For a larger team I'd request a CI-level test that verifies the step exits non-zero when the service is absent.

Conclusion

The security model is sound, the privileges are acknowledged and documented, and the Renovate guard is the right operational control. The two low-severity concerns are informational — I'd not block merge on them. Resolve the apk add provenance concern in the documentation if you want a clean audit trail, but it's not a blocker.


🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

This PR has no automated test coverage for the new CI step. That's the expected limitation of infrastructure changes, and I understand it. Let me document what the gap means and what's mitigated.

Test Coverage Gap

The test plan in the PR description is:

  • Merge and let next nightly run pass the "Reload Caddy" step without error
  • Confirm smoke test proceeds to check /login, HSTS header, and /actuator/health → 404

This is a manual, single-run, happy-path-only verification. There is no automated regression test that would catch regressions introduced by future changes to this step (e.g., a Renovate digest bump that changes the Alpine image, or a workflow refactor that accidentally reorders steps).

What I'd normally ask for: A dry-run or pre-flight assertion step before the systemctl reload that verifies the preconditions:

- name: Verify Caddy preconditions
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:... \
      sh -c 'apk add --no-cache util-linux -q && nsenter -t 1 -m -u -n -p -i -- systemctl is-active caddy'

This would give a named, self-documenting failure ("Caddy is not active") before the reload attempt — distinct from the reload failing for other reasons. It would also produce a cleaner CI log for incident response.

Why I'm not blocking: The troubleshooting guide in ci-gitea.md covers the three real failure modes with exact log snippets, symptoms, and recovery steps. That's operational test documentation in prose form. For a solo project on a single VPS, this is acceptable.

Known Failure Modes Are Documented

The ci-gitea.md §Troubleshooting section covers:

  1. Caddy stopped → systemctl reload exits non-zero, CI fails before smoke test ✓
  2. Symlink missing/wrong → systemctl reload exits 0 but serves stale config → smoke test catches it via HSTS or /actuator/health check ✓
  3. nsenter / Docker socket unavailable → docker: Cannot connect to Docker daemon ✓

The silent failure mode (Failure mode 2) is the tricky one: systemctl reload caddy succeeds but Caddy reloads whatever the symlink currently points to. The smoke test is the downstream detection — HSTS header mismatch or /actuator/health → 404 would catch a stale Caddyfile. That's an acceptable downstream guard.
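For context, those downstream checks could look roughly like this — a sketch only; the host name is a placeholder and the exact curl flags are my assumptions, not the repo's actual smoke-test step:

```yaml
- name: Smoke test (sketch)
  run: |
    BASE="https://staging.example.com"   # placeholder host
    curl -fsS -o /dev/null "$BASE/login"                            # page reachable
    curl -fsSI "$BASE/login" | grep -qi strict-transport-security   # HSTS header present
    test "$(curl -s -o /dev/null -w '%{http_code}' "$BASE/actuator/health")" = "404"
```

A stale Caddyfile would trip either the HSTS grep or the 404 assertion, which is exactly the downstream guard described above.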

Step Ordering Is Correct

"Reload Caddy" is placed after "Deploy staging" (so the latest Caddyfile is on disk) and before "Smoke test" (so the public surface reflects the current config). The ordering is semantically correct.

Renovate Guard = Implicit Quality Gate

The packageRule requiring manual review on digest bumps is an implicit quality gate for this step. Any change to the Alpine image that could affect behavior requires human sign-off. That's the right control.

Conclusion

The test coverage is minimal but the failure modes are documented, the step ordering is correct, and the downstream smoke test acts as a functional gate. I'd recommend adding a systemctl is-active caddy pre-flight check as a follow-up (not a blocker). Merge with the understanding that the next nightly run is the real validation.


🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead

Verdict: ✅ Approved

This PR makes no changes to any frontend component, route, Svelte file, Tailwind class, or user-facing interaction. No UI/UX review applies.

Checked for completeness:

  • No .svelte files changed
  • No route additions or modifications
  • No design tokens or CSS touched
  • No i18n strings added

From my perspective, this is a CI infrastructure fix that keeps the deployed app reachable on port 443 — which is a prerequisite for any user reaching the UI at all. I'm glad it's being fixed.

LGTM from design and accessibility standpoint.


📋 Elicit — Requirements Engineer

Verdict: Approved

Requirements Completeness Review

This PR targets a specific, well-defined operational gap. Let me verify it against a requirements lens.

Problem statement: Clearly stated. The PR description identifies the root cause (deploy workflow never reloads host Caddy), the triggering incident (frontend healthcheck fix unmasked the gap), and the observable symptom (port 443 refusing connections, misleading smoke test error).

Scope separation: The PR correctly distinguishes two separate concerns:

  1. Future state: The workflow change (this PR) — fixes all future nightly/release runs
  2. Current state: The immediate operational fix (manual systemctl reload caddy on the VPS) — out of scope, documented separately in the PR body

This separation prevents scope creep and keeps the PR focused. ✓

Acceptance criteria (from the test plan):

  • Next nightly run passes the "Reload Caddy" step without error
  • Smoke test verifies /login, HSTS header, /actuator/health → 404

These are functional and testable, though manual. For an infrastructure fix, this is appropriate. ✓

Non-functional requirements: The ADR documents:

  • Performance: reload not restart to avoid TLS connection drops ✓
  • Reliability: fail-fast behavior when Caddy is stopped ✓
  • Maintainability: Renovate manual review gate for digest bumps ✓
  • Auditability: ADR-012 captures the decision context ✓

Operational contract (the symlink): This is the most important implicit requirement. The workflow's correctness depends on /etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile existing on the VPS. This is now documented as an explicit CI dependency in docs/DEPLOYMENT.md and ci-gitea.md. The consequence of a missing symlink (silent stale config, smoke test as downstream catch) is documented. ✓
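That contract could even be asserted mechanically before the reload — a hypothetical pre-flight step, not part of this PR; the paths come from the PR body, the step itself is my sketch:

```yaml
- name: Verify Caddyfile symlink (sketch)
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d \
      sh -c 'apk add --no-cache util-linux -q \
        && test "$(nsenter -t 1 -m -- readlink -f /etc/caddy/Caddyfile)" \
             = /opt/familienarchiv/infra/caddy/Caddyfile'
```

This would turn the silent stale-config failure mode into a named, fail-fast error.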

Open question: ADR-012 references ADR-011 ("single-tenant runner trust model"). If ADR-011 doesn't exist in the repo, the reference is an unresolved dependency. This should be verified before merge — either ADR-011 exists, or the reference should be updated to describe the trust model inline.

Conclusion: The requirements for this fix are well-specified, the operational contract is documented, and the scope is appropriately bounded. The one open item (ADR-011 reference) is shared with the architect's review. Approve pending that check.

marcel added 10 commits 2026-05-12 07:42:31 +02:00
Adds a `sudo systemctl reload caddy` step between the docker compose
deploy and the smoke test. This ensures any committed Caddyfile changes
are applied before the public surface is verified.

Previously the workflow had no mechanism to push Caddyfile changes to
the running host daemon. A Caddyfile edit would land in the repo but
Caddy would keep serving the previous config, causing the smoke test to
catch a stale header or still-proxied /actuator route rather than the
intended current config.

This step also surfaces the root cause of today's port-443 failure
explicitly: if Caddy is not running, the step fails with a clear service
error rather than a misleading "Failed to connect to port 443" from curl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`sudo systemctl reload caddy` does not work from inside a DooD job
container: `systemctl` is absent from Ubuntu container images and
container processes cannot reach the host systemd without entering its
namespaces. Replace with `docker run --privileged --pid=host ubuntu:22.04
nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy`, which uses
the already-mounted Docker socket to spin up a privileged sibling
container that enters the host PID namespace via nsenter. Tested live on
the Hetzner VPS. No sudoers entry required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same gap as nightly.yml: production deploys also need Caddy to reload
the updated Caddyfile before the smoke test validates the public surface.
Uses the same nsenter pattern introduced in the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the stale generic runner provisioning docs with an accurate
description of the actual two-container setup on the Hetzner VPS.
Document the nsenter pattern for running host-level commands (systemctl)
from containerised CI steps, and the Caddyfile symlink contract that the
reload step depends on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch ubuntu:22.04 (floating, ~70 MB) to alpine:3.21 pinned by sha256
  digest (~5 MB); util-linux installed at run time via apk add
- Add an explicit comment explaining why `reload` rather than `restart`:
  reload applies the new config in-process without dropping live TLS
  connections

Addresses Tobias + Nora blocker from PR review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
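The hardened step could be sketched as below. The digest is a placeholder, not the real pinned value, and the step name is assumed:

```yaml
# Sketch — replace <pinned-digest> with the actual sha256 from the lockfile
- name: Reload Caddy
  run: |
    docker run --rm --privileged --pid=host \
      alpine:3.21@sha256:<pinned-digest> \
      sh -c 'apk add --no-cache util-linux >/dev/null \
        && nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy'
```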
Captures the architectural decision, alternatives considered (sudo
systemctl, Caddy admin API, SSH), and consequences (symlink contract,
Renovate review requirement, step duplication tracked in #539).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers the three failure modes Sara flagged: Caddy stopped (explicit
systemctl error), symlink missing/mis-pointed (silent reload, stale
smoke test), and Docker socket / nsenter unavailable (container error).
Each failure mode includes symptoms and recovery steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
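The symlink-contract check from the runbook can be condensed to two commands on the server (paths are taken from the PR's operational notes):

```bash
# Resolve the symlink the reload step depends on
readlink -f /etc/caddy/Caddyfile
# Expected: /opt/familienarchiv/infra/caddy/Caddyfile

# Confirm the daemon is running before attempting a reload
systemctl is-active caddy   # prints "active" when running
```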
chore(renovate): require manual review for privileged CI image digest bumps
e9caa3a1f7
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m46s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m8s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Failing after 11s
CI / OCR Service Tests (push) Successful in 16s
CI / Unit & Component Tests (push) Failing after 1m52s
CI / Backend Unit Tests (push) Successful in 4m11s
CI / fail2ban Regex (push) Successful in 39s
CI / Compose Bucket Idempotency (push) Failing after 10s
Adds a packageRule matching .gitea/workflows/** digest updates with
automerge: false. Digest bumps for images running --privileged --pid=host
have root-equivalent host access and must not be auto-merged.

Addresses Nora's review concern on #537.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
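The packageRule could look like the fragment below. Key names follow Renovate's `packageRules` schema; the exact matcher values are assumed from the commit description, not copied from the repo:

```json
{
  "packageRules": [
    {
      "matchFileNames": [".gitea/workflows/**"],
      "matchUpdateTypes": ["digest"],
      "automerge": false
    }
  ]
}
```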
marcel force-pushed fix/nightly-caddy-reload from 08c7dbcaa2 to e9caa3a1f7 2026-05-12 07:42:31 +02:00 Compare
marcel merged commit e9caa3a1f7 into main 2026-05-12 07:51:13 +02:00
marcel deleted branch fix/nightly-caddy-reload 2026-05-12 07:51:13 +02:00
Reference: marcel/familienarchiv#537