ci(nightly): reload Caddy before smoke test #537
Reference in New Issue
Block a user
Delete Branch "fix/nightly-caddy-reload"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
sudo systemctl reload caddybetween the docker compose deploy and the smoke testRoot cause of tonight's failure
Before commit
5f352943(frontend healthchecklocalhost→127.0.0.1), the frontend container went unhealthy anddocker compose up --waitdied at the "Deploy staging" step — the smoke test was never reached. That fix unmasked a pre-existing gap: the workflow deploys the app stack but never tells the host Caddy daemon to reload. Port 443 was refusing connections because Caddy either wasn't running or wasn't configured yet on the server.Operational fix still required
This PR fixes the workflow so future runs self-heal. The immediate action is still needed on the server:
Test plan
/login, HSTS header, and/actuator/health → 404🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
What I checked
This PR touches only
.gitea/workflows/nightly.yml— no schema changes, no new backend packages, no new routes, no new infrastructure components. Running the doc-update checklist:docs/architecture/c4/l2-containers.puml+DEPLOYMENT.mddocs/architecture/c4/l1-context.pumlADR note (suggestion, not blocker): The decision to run
systemctl reload caddyfrom CI — rather than, say, having the Compose stack self-contain the proxy — is an architectural constraint worth capturing. The symlink pattern (/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile) and the choice to manage Caddy as a host service rather than a container are load-bearing decisions that future maintainers will trip over. A two-paragraph entry indocs/infrastructure/ci-gitea.mdnoting the symlink contract and the required sudoers setup would suffice. I'd call this a suggestion, not a blocker.Architecture assessment
The placement of the reload step is correct: after the Docker Compose deploy (which pulls the latest Caddyfile from the cloned repo), before the smoke test (which verifies the public surface). The reload happens on the host, not inside a container — which is the right mental model since Caddy runs as a host service managing TLS via Let's Encrypt. Reloading inside the container would have no effect.
systemctl reload(SIGHUP) oversystemctl restartis the correct choice: Caddy performs a graceful config reload without dropping open connections. This is boring infrastructure behavior done right.No concerns at the architecture layer.
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
TDD evidence
CI workflow steps are not testable via unit tests — the test plan ("merge and let the next nightly run confirm it") is the only viable approach here. Accepted.
Code quality
The change is a single
run: sudo systemctl reload caddywith a well-structured comment block. Reviewing against clean code standards for CI YAML:One minor observation
The comment block is 10 lines for a 1-line command. In application code I'd push back on this ratio, but CI workflows are documentation as much as they are code — operators reading the workflow at 2am during an incident need that context. No change needed.
Nothing else to flag from a developer perspective.
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
What's correct
reloadoverrestart: Correct. Caddy's graceful reload via SIGHUP re-reads/etc/caddy/Caddyfilewithout dropping TLS connections or active requests.restartwould introduce a service gap and risk a Let's Encrypt rate-limit cycle.docker compose up --wait(so the repo's Caddyfile is already at the latest commit on the server) and before the smoke test (which validates the public surface with the updated config).systemctl reloadexits non-zero immediately, aborting the workflow before the smoke test issues a confusing "port 443 refused" curl error. This is the right failure mode.🚫 Blocker: sudoers entry on the runner needs to be verified
The step runs
sudo systemctl reload caddy. Gitea Actions runners typically execute as an unprivileged user (e.g.gitea-runneroract_runner). Without a passwordless sudoers entry for this specific command, the step will fail on the first nightly run with:or
The required sudoers entry on the production server is:
(Adjust the username to match your actual runner user. Verify the path with
which systemctl.)Before this PR merges, please confirm that this entry exists (or is being added as part of the operational fix described in the PR description). If it's already in place and tonight's nightly failure was not a sudo-permission error, then this is a non-issue — but given the PR description says the operational fix is "still required," I can't assume it's set up.
If the entry is confirmed, change this to ✅ Approved.
Suggestion (non-blocking): document the symlink contract
The PR description describes the symlink:
This is an implicit contract between the server setup and the CI workflow. If the symlink doesn't exist (new server, fresh VPS, disaster recovery), this step silently reloads Caddy with the wrong config and the smoke test may pass or fail for the wrong reason. Consider adding a step or a note in
docs/infrastructure/ci-gitea.mddocumenting that this symlink must exist as a server pre-condition.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Requirement traceability
The stated problem: the nightly CI workflow deploys the app stack (including a potentially updated Caddyfile) but never instructs the host Caddy daemon to reload, leaving it serving the previous config. The smoke test then validates a stale state.
The fix directly addresses the root cause: insert a reload step between deploy and verify. This is a minimal, targeted fix — no scope creep, no gold-plating.
PR description quality
The PR description is unusually complete for a CI fix:
5f352943unmasking commit to the pre-existing gapsystemctl status caddy / start / reloadblock in the PR body serves as the immediate mitigation guideOne observation
The test plan has two checkboxes:
/login, HSTS header, and/actuator/health → 404These are sequential, not independent — (2) can only be verified after (1) passes. This is correctly sequenced. The acceptance criteria are observable and testable. No ambiguity.
Gap identified (suggestion)
The PR description notes the operational fix but doesn't link it to a Gitea issue or checklist item. If the symlink or Caddy service state isn't fixed on the server before the next nightly run, the new step will fail and the workflow will be no better off than before. Consider tracking the server-side fix in a linked issue to ensure it doesn't slip through.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
Security analysis
This PR introduces a privileged host operation (
sudo systemctl reload caddy) into the CI pipeline. I'll assess what attack surface this creates.⚠️ Concern: sudo scope on the runner
If
sudoon the runner is not tightly scoped to exactlysystemctl reload caddy, this creates a privilege escalation path for anyone who can manipulate the workflow or its environment.Risk: If the runner's sudoers entry is broad (e.g.,
NOPASSWD: ALLorNOPASSWD: /bin/systemctl), a malicious workflow step could runsudo systemctl stop caddy,sudo systemctl start malicious.service, or worse.Required sudoers entry — minimal scope:
Not:
Verification: Check with
sudo -l -U <runner-user>on the server. This is a blocker if the scope is broader than the specific command.✅ What's safe
systemctl reload caddycan only reload Caddy's config. It cannot write files, execute arbitrary code, or escalate further (assuming correct sudoers scope).env:block and accesses no secrets. LGTM.reloadexits non-zero and the workflow aborts. The smoke test cannot report false-positive success in that state.Suggestion: consider a Caddyfile syntax pre-check
Caddy supports config validation before reload:
This surfaces syntax errors before the reload attempt, giving a more actionable error message than a silent reload failure. Not a blocker — the current step will fail-fast on a bad config too — but provides better diagnostics. Requires a second sudoers entry for
caddy validate.🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
Test coverage assessment
This is a CI workflow change — there is no viable automated test at the unit or integration layer for a
systemctl reloadstep. The acceptance criteria ("next nightly run passes this step") is the correct test strategy for this class of change. I accept this.What I checked
Failure mode coverage: The PR correctly identifies and addresses all failure modes:
reloadfails fast with a clear service error (better than "port 443 refused" from curl)Regression risk: Inserting a step between deploy and smoke test could theoretically cause the smoke test to fail if the reload itself introduces a timing issue (Caddy takes a moment to parse and apply the new config). Caddy's SIGHUP handling is synchronous — the process does not return from reload until the new config is validated and applied (or it fails). No race condition here.
Position in workflow: The step is placed correctly — the deploy step uses
--waitto confirm containers are healthy before proceeding, so the application stack is stable when Caddy reloads. Correct sequencing.One observation
The smoke test (the step that follows) presumably tests:
/loginloads/actuator/healthreturns 404 (blocked by Caddy)If the Caddy reload fails but the step doesn't fail-fast (e.g. due to a misconfigured
continue-on-error), the smoke test would catch the wrong config. Looking at the diff, there is nocontinue-on-error: trueon the reload step — which means it will correctly abort the workflow on failure. ✅No quality gate concerns. The change is minimal, correctly positioned, and the failure modes are handled.
🎨 Leonie Voss — UI/UX Design Lead
Verdict: ✅ Approved
This PR modifies only
.gitea/workflows/nightly.yml— a CI workflow file with no user-facing UI changes whatsoever. No Svelte components, no routes, no CSS, no i18n strings, no accessibility implications.I checked: no frontend files changed, no design tokens modified, no HTML structure affected.
LGTM from the UX/accessibility layer. Nothing to flag.
Follow-up: three commits addressing reviewer concerns
🚫 Root cause of the blocker (Tobias + Nora)
The sudoers discussion turned out to be a red herring. I SSHed into the Hetzner VPS and found:
act_runnerruns inside a Docker container (gitea-runner)systemctlis not present in Ubuntu container images/etc/sudoers.d/contains only the defaultREADME— no entries exist, and none are neededsudo systemctl reload caddywould have failed on the first nightly run withbash: sudo: command not found.✅ Fix: nsenter via privileged sibling container
The correct approach for DooD runners: use the Docker socket (already mounted into job containers) to spin up a privileged container in the host PID namespace, then use
nsenterto enter the host's namespaces:Tested live on the Hetzner VPS —
SUCCESS. No sudoers entry required.Commits
99de6f1dfix(ci/nightly): replacesudo systemctl reload caddywith the nsenter pattern; update comment to explain DooD architecturef87504fbfix(ci/release): add the same Reload Caddy step torelease.yml— it had the identical deploy→smoke gap52a96f65docs(ci): replace stale generic runner provisioning docs inci-gitea.mdwith accurate description of the two-container VPS setup, the nsenter pattern, and the Caddyfile symlink contractConfirmed server state (as of this push)
Active: active (running) since Mon 2026-05-11 14:55:11 CEST/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfilensentercall returned exit 0🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is the right fix for the right reason. Step ordering (deploy → reload → smoke) is correct, the failure mode when Caddy is stopped is intentional and appropriate, and the docs update is solid. Two things I'd flag:
Blockers
1.
ubuntu:22.04is not a pinned image tagBoth
nightly.ymlandrelease.ymlpullubuntu:22.04, which is a floating tag — it moves whenever Canonical pushes updates. This violates the pinned-image rule: builds are no longer reproducible, a bad upstream layer could silently change the behaviour ofnsenter, and there's no Renovate bump PR to audit the change.Fix: pin to a digest:
Or switch to the much smaller
ubuntu:22.04-compatible image that only carriesutil-linux(which providesnsenter):Alpine has a pinnable digest, is ~5 MB vs ~70 MB for Ubuntu, and ships
nsenterviautil-linux.Suggestions
2. Duplication between
nightly.ymlandrelease.ymlThe Reload Caddy step is copy-pasted verbatim between the two workflows. The comment even says "see nightly.yml — same rationale and mechanism." That's the smell that this should be a reusable composite action.
This isn't a blocker for this PR — duplicated infrastructure steps aren't dangerous in the same way duplicated app code is, and you've cross-referenced the comment — but worth a follow-up issue.
3.
reloadvsrestart: explicit rationale would helpThe comment says
reloadsends SIGHUP. That's correct and is the right call (no dropped connections). Worth adding one line to the comment explicitly stating whyrestartwas rejected: "We do not usesystemctl restart caddybecause it drops all in-flight connections during the handshake window." Future maintainers have no other way to know.What's done well
ci-gitea.md. That symlink was previously tribal knowledge.🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The threat model is documented honestly and the approach is pragmatic given the DooD architecture. One real concern and one alternative to consider.
Blockers
1. Unpinned
ubuntu:22.04image in a privileged contextThis is a security issue, not just a reproducibility issue. The workflow runs
ubuntu:22.04with--privileged --pid=host— the widest possible host access. If an upstream Ubuntu layer is ever compromised or silently changes whatnsenterdoes, CI picks up the change on the next run with no review gate.For a container that is explicitly granted root-equivalent host access, the image must be pinned to a digest:
The risk delta between a pinned and floating tag is small in practice but the blast radius of a compromised privileged container is the entire VPS — not just the container.
Suggestions (not blockers)
2. Consider the Caddy Admin API as a lower-privilege alternative
The current approach uses the Docker socket to spawn a privileged container that enters all host namespaces. The comment is honest: "the Docker socket already grants root-equivalent host access." That's accurate — but it means CI has a standing capability to execute arbitrary host binaries.
Caddy ships a localhost admin API at
http://localhost:2019by default:If the admin API is reachable from job containers (or if the runner container is on the host network), this eliminates the need for
--privileged --pid=hostentirely. The attack surface drops from "full host root" to "Caddy admin socket, localhost only."Worth evaluating: if the Caddy admin API is accessible, replace the nsenter step with a
curlcall. If not, the current approach is acceptable.3. The
--privilegedcomment is correctly honest — don't weaken itThe YAML comment already says:
This is the right level of transparency. Please keep this phrasing. Future reviewers and auditors need to see this acknowledged, not buried or softened.
What's done well
systemctl reload(SIGHUP) is correct: it re-reads config without dropping TLS connections.restartwould have been a security regression (brief port-443 downtime = potential for connection hijack during the gap).systemctl reload caddy, no credentials involved.👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
No application code changed in this PR — this is purely CI workflow YAML and documentation. From a code quality perspective, the changes are clean.
Findings
No blockers.
Observations:
YAML comments explain the WHY, not the WHAT — The
# Apply any committed Caddyfile changes...and# The runner executes job steps inside Docker containers (DooD)...blocks are exactly the right kind of comment. They describe the constraint (DooD means systemctl is unavailable), the consequence (need nsenter), and the failure mode (step errors if Caddy is stopped). This is how infrastructure comments should be written.The Reload Caddy step in
release.ymlcross-referencesnightly.yml— The comment says "see nightly.yml — same rationale and mechanism." This is honest about the duplication but means two places to update if the approach changes. Not a blocker, but worth a follow-up issue to extract to a composite action.Documentation in
ci-gitea.mdis accurate — The DooD architecture section, the nsenter pattern example, and the Caddyfile symlink contract are all clearly explained. A new team member can understand the deployment without tribal knowledge.What's done well
reloadnotrestart— preserves in-flight connections.🏛️ Markus Keller (@mkeller) — Senior Application Architect
Verdict: ✅ Approved
This PR introduces a runtime pattern (host service management from a DooD CI environment) with lasting operational consequences. I'm checking whether the documentation footprint matches that weight.
Documentation audit
Per the architecture doc-update rule, I check whether this PR triggered the required doc updates:
docs/infrastructure/ci-gitea.mddocs/adr/docs/architecture/c4/l2-containers.pumlADR consideration: The nsenter-via-privileged-container pattern is a non-obvious architectural choice that any future CI maintainer will encounter. It has two lasting consequences: (a) the runner's Docker socket access is now a capability relied upon for service management, not just for running compose commands; (b) the Caddyfile symlink on the VPS is now a required contract for CI to succeed.
I'd normally call a missing ADR a blocker. In this case, the
ci-gitea.mdupdate captures the what, why, and how of the pattern at a level of detail that a focused ADR would also contain. The symlink contract is explicitly documented. I'll accept the doc-in-ci-gitea-md approach here as equivalent, with one addition request (see below).Suggestion (not a blocker)
Add a "Decision" and "Alternatives considered" subsection to
ci-gitea.mdThe current doc explains the what and how of nsenter well. Add one short paragraph on why the alternatives were rejected — so the next maintainer doesn't try to "fix" this:
This closes the loop on the decision rationale in a way that stays durable as the codebase evolves.
What's done well
/etc/caddy/Caddyfileand the repo'sinfra/caddy/Caddyfileis now a documented, explicit dependency. Previously it was tribal knowledge that would cause a 2am incident.ci-gitea.mdis a useful addition — two runners, their purposes, and their configs in one place.🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ✅ Approved
CI workflow changes sit outside the standard test pyramid — there's no unit test for a GitHub Actions step. I'm evaluating the verification strategy and the completeness of the test plan.
Findings
No blockers.
1. Test plan relies on manual observation of the next nightly run
The PR test plan is:
/login, HSTS header, and/actuator/health → 404This is appropriate for CI workflow changes — there's no practical alternative to running the workflow on the real infrastructure. The smoke test that follows (
/login, HSTS,/actuator/health → 404) is a meaningful functional verification, not just a liveness check.One gap: there's no documented test for the failure path — what happens if Caddy is stopped when the step runs. The PR description states it "fails fast with a clear service error," but that hasn't been verified on the real VPS. If the server bootstrap issue exists (Caddy stopped, symlink missing), the step may produce an error message that isn't as clear as expected.
Recommendation: before merging, manually run the Reload Caddy step on the VPS to confirm the error message is actionable:
Check what the output looks like if Caddy is stopped (
systemctl stop caddyfirst, then run the above, then start it again).2. Smoke test coverage looks adequate for this fix
The smoke test (already in place) checks:
/login(frontend is up)/actuator/health → 404(Caddy is blocking Spring actuator from public access)The third check is directly relevant to this fix — if Caddy were serving a stale config that exposed
/actuator, the smoke test would catch it.What's done well
nightly.ymlandrelease.yml— the smoke test that validates the Caddy config comes after the reload, not before.📋 Elicit — Requirements Engineer & Business Analyst
Verdict: ✅ Approved
Reviewing from a requirements and specification standpoint. This is a corrective infrastructure fix with a clear root cause — not a feature — so I'm checking whether the problem is correctly stated, the solution is traceable to it, and the operational requirements are documented.
Findings
No blockers.
Root cause analysis: clearly stated and correct
The PR description identifies two distinct failure events:
5f352943—localhost→127.0.0.1in the frontend healthcheck — fixed the immediatedocker compose up --waitfailure.The PR fixes only #2, correctly scoped. This is good requirements practice — the fix addresses the root cause of the unmasked problem, not the symptom that revealed it.
Requirements coverage
systemctl reloadexits non-zero if Caddy is stoppedci-gitea.mdOne gap: the server remediation is in the PR description, not in
docs/DEPLOYMENT.mdThe operational fix (verify symlink, start/reload Caddy) is documented in the PR description, which becomes git history after merge but is not easily discoverable. The
ci-gitea.mdupdate documents the pattern but not the one-time server bootstrap steps.Recommendation: add a short section to
docs/DEPLOYMENT.md(under the relevant server setup section) confirming the symlink requirement:Without this, the next person bootstrapping a new server (or recovering from a disk failure) will hit the same gap that caused tonight's failure, with no discoverable documentation.
What's done well
🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR makes no changes to the user interface, frontend components, visual design, accessibility markup, or any user-facing routes. All changes are in CI workflow YAML and infrastructure documentation.
Nothing to flag from a UX, accessibility, or brand compliance perspective.
The documentation update in
ci-gitea.mdis readable and well-structured — the architecture table, code blocks, and descriptive headings follow reasonable technical writing conventions. Not within my design review scope, but no concerns.Implementation complete
All open reviewer concerns have been addressed. Here's a summary of what changed:
✅ Resolved: ubuntu:22.04 → pinned Alpine digest (Tobias, Nora)
Both
nightly.ymlandrelease.ymlnow usealpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07dinstead of the floatingubuntu:22.04tag. The digest pins the exact image layer so any upstream change requires an explicit Renovate bump PR — critical given the step runs privileged with--pid=host.util-linux(which shipsnsenter) is not in the Alpine base image and is installed at run time (apk add --no-cache util-linux -q, ~1 s on the warm VPS cache). Verified thatnsenterfromutil-linux 2.40.4is present and functional inalpine:3.21.Commit:
f608838f✅ Resolved: Documentation —
ci-gitea.md(Elicit)docs/infrastructure/ci-gitea.mdupdated:util-linuxinstallsudo systemctlin the job container?" subsection (no systemd PID 1 inside containers):2019on a host port adds attack surface)Commit:
1ce0638a✅ Resolved: DEPLOYMENT.md symlink note (Elicit)
docs/DEPLOYMENT.md§3.1 now includes a# CI DEPENDENCYcomment above theln -sfline explaining that the nightly and release workflows depend on this symlink to pick up committed Caddyfile changes. The symlink command itself already existed.Commit:
60cb9d5aDeferred: step duplication → composite action
The duplicated
Reload Caddystep acrossnightly.ymlandrelease.ymlis tracked as a follow-up in issue #539. Kept out of this PR to keep scope focused on the immediate fix.🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This fix is correct and well-reasoned. The nsenter pattern is the right tool for DooD constraints, the Alpine digest is pinned, and the
reload-vs-restartrationale is properly commented. The DEPLOYMENT.md symlink note is exactly the kind of operational clarity that prevents late-night debugging.Blockers
None.
Suggestions
1. DRY violation: step copy-pasted between nightly.yml and release.yml
The full
docker run … nsenter …command with its digest appears verbatim in two files. One digest bump in the future means two PRs, or a forgotten file. Gitea Actions has limited composite action support compared to GitHub Actions, but if the runner supports it, a.gitea/actions/reload-caddy/action.ymlcomposite action would eliminate the duplication. If composite actions aren't supported on this runner, at minimum add a comment inrelease.yml:2. Symlink not verified before reload
If the symlink is absent or mis-pointed,
systemctl reload caddysucceeds silently — Caddy re-reads whatever/etc/caddy/Caddyfilecurrently points to. A pre-check inside the nsenter block would make the failure loud:Low priority given this is a bootstrap-once concern, but it would turn a silent mismatch into an explicit CI failure.
What's done well: digest pinning is best practice,
--no-cacheon apk, the long comment block explaining the whole architecture is exactly what future-you will need at 2am.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This PR is entirely CI infrastructure and documentation — no application code, no frontend, no backend Java, no Python. My usual checklist (TDD, naming, component size, guard clauses, Svelte 5 rules) doesn't apply here.
One observation from a code-quality lens: the "Reload Caddy" step in
nightly.ymlhas a ~20-line comment block, which I'd normally flag as over-documenting. For CI infrastructure with non-obvious architectural constraints (DooD isolation, nsenter, whyreloadvsrestart), verbose comments are appropriate — the WHY is genuinely non-obvious and would take a future engineer significant time to reconstruct. This is the right exception.Nothing to block on. Clean from my domain.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The chosen approach —
docker run --privileged --pid=host+ nsenter — is correct given the DooD constraint. The PR documentation is admirably transparent: the note that "the Docker socket already grants root-equivalent host access" is honest and accurate. This pattern makes existing implicit privileges explicit rather than introducing new ones. That's the right framing.Concerns (not blockers)
1. Renovate bump PRs for this digest should not be auto-merged
The digest-pinned Alpine image (
alpine:3.21@sha256:48b0309…) is pulled at runtime in a--privileged --pid=hostcontainer. Pinning the digest protects against registry substitution ✅. However, when Renovate eventually creates a digest bump PR for this line, that PR deserves manual review — a malicious or compromised digest substitution would give an attacker root access to the production host. This is not a code change to make, but a process note: add a RenovatepackageRuleto require human approval for digest bumps in workflow files that use--privileged.2.
apk add util-linuxis not pinnedThe base image digest is pinned, but
util-linuxis resolved from the Alpine 3.21 package index at runtime. Alpine package integrity is guaranteed by APK's signing chain, so this is an accepted residual risk — but it means thensenterbinary version can vary between runs. This is standard Alpine practice and acceptable here; just worth acknowledging.Positive findings
systemctl reload caddy) — no user-controlled input, zero injection risk ✅--pid=hostis the minimum scope needed for nsenter ✅reloadsemantics are safer thanrestartfrom an availability standpoint ✅🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
The test plan ("merge and let next nightly run pass the Reload Caddy step without error") is the only practical approach for CI infrastructure — you can't meaningfully unit-test a workflow step in isolation, and I accept that.
Concerns
1. No runbook for when Reload Caddy fails
The PR correctly documents that "if Caddy is not running this step fails fast." Fast failure is good. But what does the failure look like in the CI log, and what should an engineer do about it? The PR description has the shell commands but they live in the PR, not in the repository. A short "Troubleshooting" subsection in
docs/infrastructure/ci-gitea.mdunder the nsenter section — covering the three failure modes (Caddy stopped, symlink missing, systemd unreachable) and the recovery steps for each — would make this actionable for the next person who sees a red CI run.2. Stale-config scenario is not caught by smoke test
If the symlink points to an old file,
systemctl reload caddyexits 0 and the smoke test runs against stale config with no signal. This scenario is mentioned in the docs but has no automated guard. The test plan doesn't cover it. Low priority (bootstrap-once concern), but worth a note in the open test plan.3. Test plan is unverified at merge time
The PR test plan checkbox (
[ ] Merge and let next nightly run pass) cannot be checked before merge. This is expected for CI changes — just noting it explicitly so there's a shared understanding that merge ≠ verified.What's done well
The smoke test immediately after the reload step is the right verification layer — if the reload fails, the smoke test provides an independent signal. The sequencing (deploy → reload → smoke test) is correct.
🎨 Leonie Voss — UI/UX Design Lead
Verdict: ✅ Approved
This PR contains no UI, frontend, or user-facing changes. The diff is confined to CI workflow YAML and infrastructure documentation. No accessibility, brand, responsive design, or interaction concerns apply. LGTM from a design perspective.
🏛️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
The nsenter approach is the right architectural choice given the DooD constraint — it's explicit, well-documented, and doesn't add permanent infrastructure components. The documentation updates in
docs/DEPLOYMENT.mdanddocs/infrastructure/ci-gitea.mdare thorough.Documentation matrix check
docs/infrastructure/ci-gitea.mddocs/DEPLOYMENT.mddocs/architecture/c4/l2-containers.pumldocker run, not a servicedocs/adr/Concern: Missing ADR
The nsenter-via-privileged-Docker pattern for running host commands from CI is an architectural decision with lasting consequences. Any future CI step that needs to manage a host service will reference this pattern. Documenting the alternatives that were considered and rejected directly in the PR description is a good start, but that content belongs in a committed ADR where future engineers can find it:
Suggested ADR:
docs/adr/ADR-XXX-nsenter-for-host-service-management-in-ci.mdContent is already written in the PR description and the ci-gitea.md "Why not?" sections — it just needs to be formatted as an ADR and committed. The ADR would capture:
sudo systemctlin job container (no systemd PID 1), Caddy admin API (no host network access), SSH to host (requires key management)--privilegedis the minimum required escalationStep duplication (echoing Tobias)
The identical step in
nightly.ymlandrelease.ymlis an architectural concern, not just a style one: two files to keep in sync for digest bumps, and two places where the pattern can silently diverge. If Gitea composite actions are supported, this should be extracted.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
From a requirements perspective this PR is well-specified. The description contains all the elements of a good change request:
/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfile); recovery commands provided/login, HSTS header, and/actuator/health → 404"One process observation:
The test plan checkboxes are marked unchecked (
[ ]) at merge time, which is honest — these are post-merge verification items. For future CI infrastructure PRs, consider distinguishing between pre-merge and post-merge verification with two sections:No requirements gaps found. The change is narrow, well-justified, and the residual risks (symlink contract, step duplication) are documented.
Round 2 implementation complete
Three concerns from the latest review round addressed. All others remain deferred.
✅ Resolved: Missing ADR (Markus — comment #8989)
docs/adr/012-nsenter-for-host-service-management-in-ci.mdcommitted. Captures:sudo systemctl(no systemd in containers), Caddy admin API (no host network access), SSH (key management overhead)Commit:
e75ddc31✅ Resolved: No runbook for Reload Caddy failures (Sara — comment #8987)
Troubleshooting subsection added to
docs/infrastructure/ci-gitea.mdunder the Caddyfile symlink contract section. Covers all three failure modes Sara identified:systemctlerror, recovery:systemctl start caddyln -sf …runner-config.yamlCommit:
a3f1260c✅ Resolved: Renovate digest bumps must not auto-merge (Nora — comment #8986)
renovate.jsonextended with apackageRulematching.gitea/workflows/**digest updates withautomerge: falseand adescriptionfield explaining the security rationale (root-equivalent host access). Since Renovate's default is already non-automerge unless explicitly set, this rule serves as explicit documentation of intent rather than a behavior change.Commit:
08c7dbcaDeferred
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
This is the right fix, done correctly. Let me walk through it.
What's right
Approach: DooD job containers genuinely cannot reach the host systemd — there's no systemctl socket inside them. Using the already-mounted Docker socket to spin up a privileged sibling is the correct solution for this topology. SSH would add key management overhead. The Caddy admin API would require exposing
:2019on a host-reachable port. nsenter is the correct call.Alpine over Ubuntu: 5 MB vs 70 MB pull, no unnecessary tooling. Correct choice.
reloadnotrestart: SIGHUP reloads config in-process without dropping TLS connections or in-flight requests. If you'd usedrestart, you'd briefly drop port 443 and fail HSTS preloaded clients. Good.Digest pinning:
alpine:3.21@sha256:...is exactly right. Tag-only pulls are non-reproducible. The RenovatepackageRuleblocking automerge on digest bumps for--privileged --pid=hostcontainers is the correct operational guard.Documentation:
ci-gitea.mdis now the definitive reference for this architecture. The three troubleshooting failure modes cover the real-world scenarios (Caddy stopped, broken symlink, socket unavailable). The DEPLOYMENT.md CI dependency note is the right place for the symlink contract.ADR-012: Alternatives table is complete. The
reloadvsrestartrationale is captured. Consequences are honest about what the Renovate bump-review requirement implies.Suggestions (non-blocking)
apk add util-linuxis not digest-pinned: The base image is pinned by sha256, bututil-linuxis installed from the Alpine CDN at run time. The nsenter binary itself isn't locked to a specific version. In a single-tenant trusted VPS with a warm package cache this is acceptable, but worth tracking: a future hardening option would be to pre-bake a custom image with util-linux already installed and pin that image. Issue for later.Step duplication (nightly.yml + release.yml): ADR §Consequences already acknowledges this and points to issue #539. Acceptable for now. Composite action extraction is the right next step when you have more workflow maintenance overhead.
The 27-line YAML comment block: It explains the why thoroughly, but
ci-gitea.mdnow has the same content in prose with better formatting. A two-line comment pointing toci-gitea.md §Running host-level commands from CIwould be enough in the YAML. Not a blocker — inline comments are useful for people reading the raw YAML.Bottom line
The operational gap is real, the fix is correct, the Renovate guard is in place, and the docs are comprehensive. Merge.
👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
No production code in this PR — pure CI workflow and documentation. I'll review through my lens anyway.
TDD Evidence
There are no automated tests, and that's acceptable here. CI workflow steps can't be unit-tested in the traditional sense; the test plan is "let the next nightly run pass." The PR description is explicit about this and sets the expectation correctly:
That's the integration test. It's manual, but it's appropriate.
Code Readability
The
# why not…comments in the YAML explain WHY, not WHAT — this is exactly right. Nesting rationale in comments next to the code it concerns means a reader doesn't have to cross-reference the ADR for every decision:This is a legitimate use of comments. The constraint is non-obvious (reload vs restart has an observable difference in a TLS-serving context), so the comment earns its place.
Duplication
The "Reload Caddy" step is copy-pasted identically between
nightly.ymlandrelease.yml. The ADR calls this out and links to issue #539 for composite action extraction. That's the right way to handle it — acknowledge the duplication, track the cleanup separately, don't block the fix. I agree with the call.Minor observation
The inline YAML comment block is 27 lines.
ci-gitea.mdnow documents the same material with better prose structure. Once issue #539 extracts this into a composite action, the action itself can carry a two-line doc link and the YAML comment can shrink significantly. Not a blocker.Bottom line
Clean, well-documented CI fix. The WHY is captured both inline and in the ADR. Merge.
🏛️ Markus Keller (@mkeller) — Application Architect
Verdict: ✅ Approved
Architecture Decision Quality
ADR-012 is the right artifact for this decision. It follows the format: context, decision, alternatives, consequences. The alternatives table is thorough — all four rejected approaches are named with their failure reason. The consequences section is honest about the operational commitments this decision creates (Renovate review requirement, symlink contract, duplication until #539).
The decision to make existing implicit privileges explicit rather than adding new ones is sound. The Docker socket already granted root-equivalent host access for
docker composecalls — this step makes that trust boundary visible in the YAML comment and the ADR rather than hiding it. Good.Documentation Update Audit
Per my doc update checklist:
docs/DEPLOYMENT.mddocs/adr/docs/infrastructure/ci-gitea.mddocs/architecture/c4/l2-containers.pumlThe
l2-containers.pumlomission warrants a comment: the privileged Alpine container is ephemeral — it runs, does its work, and exits. It's not a persistent service in the topology. Reasonable to exclude from the C4 Level 2 containers diagram. I'd accept this call.Blocker: Verify ADR-011 exists
ADR-012 §Consequences references:
If ADR-011 doesn't exist in
docs/adr/, this is a broken reference — a future reader following the chain will hit a dead end. Please verifydocs/adr/011-*.mdexists. If it doesn't, either create a stub or remove the reference from ADR-012.Caddyfile Symlink as CI Contract
The DEPLOYMENT.md note correctly flags the symlink as a CI dependency. The ci-gitea.md §Caddyfile symlink contract section is the right place for the operational verification command. Both are present. ✓
Conclusion
The architecture decision is sound, the ADR is complete, the docs are updated. Resolve the ADR-011 reference question before merging.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
This PR makes existing host-access capabilities explicit and operational. It does not introduce new trust boundaries — but it does weaponize them in a way that deserves careful documentation of the attack surface. Let me be precise.
What's handled correctly
Digest pinning (
alpine:3.21@sha256:48b0309ca019d89d40f670aa1bc06e426dc0931948452e8491e3d65087abc07d): The base image is immutable. Any upstream change to the alpine:3.21 tag will not affect this step until an explicit PR bumps the digest.Renovate
packageRulerequiring manual review: Correct. A--privileged --pid=hostcontainer with a compromised image has root-equivalent access to the host filesystem, process tree, and network. Automerge of digest bumps for such images would be a serious supply chain risk.Single-tenant trust model: The ADR correctly notes that fork PRs don't automatically get CI on a self-hosted Gitea runner. The blast radius of a compromised workflow is scoped to the repository owner's own pushes and trusted collaborators.
No sudoers entry required — acknowledged: The ADR makes this explicit. The Docker socket already grants root-equivalent access; this step doesn't add a new privilege, it just uses an existing one for a new purpose.
Security Concern 1 —
apk add util-linuxis not integrity-verified (informational)Severity: Low (given single-tenant context)
The base image is pinned by sha256. However,
util-linuxis installed at runtime from the Alpine CDN:The package version and checksum are verified by Alpine's apk package signature mechanism, but the nsenter binary's provenance is not separately attestable in the workflow. A CDN-level compromise (unlikely for Alpine, but theoretically possible) during the apk install window could inject a malicious nsenter.
Mitigation path (not required, but worth tracking): Build a custom slim image with util-linux pre-installed, push it to the Gitea container registry, pin that image by digest. Then the entire dependency chain is immutable. This adds image maintenance overhead that may not be justified for a family archive project.
Current risk: Acceptable for a private, single-tenant VPS. Document this in ci-gitea.md if not already there.
Security Concern 2 — No
--network=noneon the privileged containerSeverity: Low
The privileged Alpine container launched by
docker runhas outbound network access (it needs it forapk add). After theapk addcompletes, it only needs the host PID namespace for nsenter — not network. If the nsenter call is ever extended or modified, the container could make arbitrary outbound connections under privilege.This is not an issue with the current step as written. It's a residual risk from needing network for the package install. Worth noting as context for anyone who reads this later.
Security Concern 3 — The trust boundary comment in the ADR is critical
Verdict: Correctly handled ✅
ADR-012 §Consequences states:
This is exactly right. The YAML comment makes the security model visible to any developer reading the workflow. No security-by-obscurity.
Missing: No regression test for the "Caddy not running" failure path
This PR relies on
systemctl reloadfailing non-zero to abort the workflow when Caddy is stopped. That's the right behavior — fail fast before a misleading curl error. But there's no CI test that exercises this. The test plan says "let the next nightly run pass" — which only validates the happy path.For a family archive project this is acceptable. For a larger team I'd request a CI-level test that verifies the step exits non-zero when the service is absent.
Conclusion
The security model is sound, the privileges are acknowledged and documented, and the Renovate guard is the right operational control. The two low-severity concerns are informational — I'd not block merge on them. Resolve the
apk addprovenance concern in the documentation if you want a clean audit trail, but it's not a blocker.🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
This PR has no automated test coverage for the new CI step. That's the expected limitation of infrastructure changes, and I understand it. Let me document what the gap means and what's mitigated.
Test Coverage Gap
The test plan in the PR description is:
This is a manual, single-run, happy-path-only verification. There is no automated regression test that would catch regressions introduced by future changes to this step (e.g., a Renovate digest bump that changes the Alpine image, or a workflow refactor that accidentally reorders steps).
What I'd normally ask for: A dry-run or pre-flight assertion step before the
systemctl reloadthat verifies the preconditions:This would give a named, self-documenting failure ("Caddy is not active") before the reload attempt — distinct from the reload failing for other reasons. It would also produce a cleaner CI log for incident response.
Why I'm not blocking: The troubleshooting guide in
ci-gitea.mdcovers the three real failure modes with exact log snippets, symptoms, and recovery steps. That's operational test documentation in prose form. For a solo project on a single VPS, this is acceptable.Known Failure Modes Are Documented ✅
The
ci-gitea.md§Troubleshooting section covers:systemctl reloadexits non-zero, CI fails before smoke test ✓systemctl reloadexits 0 but serves stale config → smoke test catches it via HSTS or/actuator/healthcheck ✓docker: Cannot connect to Docker daemon✓The silent failure mode (Failure mode 2) is the tricky one:
systemctl reload caddysucceeds but Caddy reloads whatever the symlink currently points to. The smoke test is the downstream detection — HSTS header mismatch or/actuator/health → 404would catch a stale Caddyfile. That's an acceptable downstream guard.Step Ordering Is Correct ✅
"Reload Caddy" is placed after "Deploy staging" (so the latest Caddyfile is on disk) and before "Smoke test" (so the public surface reflects the current config). The ordering is semantically correct.
Renovate Guard = Implicit Quality Gate ✅
The
packageRulerequiring manual review on digest bumps is an implicit quality gate for this step. Any change to the Alpine image that could affect behavior requires human sign-off. That's the right control.Conclusion
The test coverage is minimal but the failure modes are documented, the step ordering is correct, and the downstream smoke test acts as a functional gate. I'd recommend adding a
systemctl is-active caddypre-flight check as a follow-up (not a blocker). Merge with the understanding that the next nightly run is the real validation.🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead
Verdict: ✅ Approved
This PR makes no changes to any frontend component, route, Svelte file, Tailwind class, or user-facing interaction. No UI/UX review applies.
Checked for completeness:
.sveltefiles changedFrom my perspective, this is a CI infrastructure fix that keeps the deployed app reachable on port 443 — which is a prerequisite for any user reaching the UI at all. I'm glad it's being fixed.
LGTM from design and accessibility standpoint.
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Requirements Completeness Review
This PR targets a specific, well-defined operational gap. Let me verify it against a requirements lens.
Problem statement: Clearly stated. The PR description identifies the root cause (deploy workflow never reloads host Caddy), the triggering incident (frontend healthcheck fix unmasked the gap), and the observable symptom (port 443 refusing connections, misleading smoke test error).
Scope separation: The PR correctly distinguishes two separate concerns:
systemctl reload caddyon the VPS) — out of scope, documented separately in the PR bodyThis separation prevents scope creep and keeps the PR focused. ✓
Acceptance criteria (from the test plan):
/login, HSTS header,/actuator/health → 404These are functional and testable, though manual. For an infrastructure fix, this is appropriate. ✓
Non-functional requirements: The ADR documents:
reloadnotrestartto avoid TLS connection drops ✓Operational contract (the symlink): This is the most important implicit requirement. The workflow's correctness depends on
/etc/caddy/Caddyfile → /opt/familienarchiv/infra/caddy/Caddyfileexisting on the VPS. This is now documented as an explicit CI dependency indocs/DEPLOYMENT.mdandci-gitea.md. The consequence of a missing symlink (silent stale config, smoke test as downstream catch) is documented. ✓Open question: ADR-012 references ADR-011 ("single-tenant runner trust model"). If ADR-011 doesn't exist in the repo, the reference is an unresolved dependency. This should be verified before merge — either ADR-011 exists, or the reference should be updated to describe the trust model inline.
Conclusion: The requirements for this fix are well-specified, the operational contract is documented, and the scope is appropriately bounded. The one open item (ADR-011 reference) is shared with the architect's review. Approve pending that check.
08c7dbcaa2toe9caa3a1f7