devops: extract composite actions for obs stack deploy steps (#603) #715
Reference in New Issue
Block a user
Delete Branch "feat/issue-603-composite-deploy-actions"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #603.
Extracts the deploy logic shared verbatim by
nightly.ymlandrelease.ymlinto three single-responsibility Gitea composite actions, replacing the unenforced# Keep in sync with nightly.ymlhonour-system invariant with one enforced definition.What changed
New composite actions under
.gitea/actions/:deploy-obsgrafana_admin_password,grafana_db_password,glitchtip_secret_key,postgres_password,postgres_host)run:blocks (deploy configs → validate → start → assert health). Includesgrafana_db_passwordfor the #651 read-only reader role.reload-caddysmoke-testhostBoth workflows now call each via a single
uses: ./.gitea/actions/<name>;nightly.ymlkeeps its#526 /importguard inline.Security-critical details (per the issue's review notes)
obs-secrets.envheredoc switched to unquoted<<EOFwith secrets mapped viaenv:+$VAR— a quoted delimiter would write literal var names andconfig --quietwould pass anyway.grep -Eq "^KEY=.+") runs right after the write;chmod 600is the final operation (no world-readable window).required: truewith nodefault:— a missing secret fails loudly.run:step declaresshell: bash.actions/checkout@v4pinned as step 1 in both workflows (a local action only exists on disk after checkout).Docs / housekeeping
028-pdfjs-wasm-decoders-and-csp-constraint.mdlanded onmainafter the issue was filed; per the sequential-ADR convention this is 029.docs/infrastructure/ci-gitea.mdgains a Composite actions section (checkout-first rule, secrets-via-inputs + unquoted-heredoc constraint, how to add an input)..gitea/actions/**so the relocated alpine digest stays tracked.Verification
Pure behaviour-preserving CI-YAML refactor; no local test harness exists for
.gitea/YAML. All five files parse; AC invariants verified mechanically (4 named blocks, unquoted heredoc, 5-key guard, chmod-600-last, required-no-default inputs, checkout-first, zero "Keep in sync" comments remaining).Pending: a
workflow_dispatchrun ofnightly.ymlending with all 5 obs services healthy (per the issue's acceptance criterion).release.ymlis trusted via its identicaluses:calls. @marcel to trigger the nightly run.🤖 Generated with Claude Code
🏛️ Markus Keller — Architect
✅ Approved with concerns (concerns are suggestions, not blockers).
I reviewed this through the lens I care about: module boundaries, coupling, single-responsibility, the composite-action-vs-reusable-workflow decision, ADR quality, the checkout-first coupling, the three-way split, and duplication elimination. This is a genuinely well-architected refactor. The honour-system
# Keep in syncinvariant is exactly the kind of unenforced cross-module duplication that rots, and replacing it with one enforced definition + a typed input contract is the right call. ADR-029 is one of the better ADRs in this repo — it states the drivers, considers three real alternatives, and is honest about the negative consequences (checkout coupling, secrets-via-inputs, the relocated digest). That is what an ADR is for.What I checked and liked:
deploy → reload → smokesequence relies on in-job step ordering and a shared, already-checked-out workspace.workflow_callruns as a separate job with its own runner context and would break exactly that. The ADR names this precisely. Good.deploy-obs,reload-caddy,smoke-testeach have one reason to change and one responsibility. I considered whetherreload-caddy+smoke-testshould be one "publish-and-verify" action — no. They have independent rationales (config application vs public-surface assertion) anddeploy-obscould plausibly run without a smoke test in some future path. Keeping them separate preserves recombination. Correct SRP boundary.STAGING_*/PROD_*secret names,POSTGRES_HOST, and the smoke host are the only deltas, and all three are nowwith:inputs. This is the duplication elimination done right — one definition, explicit seams.actions/checkout@v4was already step 1 in both workflows before this PR. So theuses: ./.gitea/actions/…dependency adds a constraint to preserve, not a behavioral change — and the PR pins it with a comment in both workflows, in the ADR consequences, and inci-gitea.md. Three places is appropriate for a load-bearing-but-invisible ordering rule. No objection..gitea/actions/**. The pinned alpine digest moved into the action, and the privileged-digest review rule moved with it. A digest that silently goes stale on a--privileged --pid=hostimage is exactly the kind of thing that should never slip — good that this was caught in the same PR rather than discovered later.Concerns (suggestions — none block merge):
The
deploy-obs/validate/start/assertfour-step block is internally coupled toobs-secrets.envas a shared on-disk artifact. That is inherent to the obs-stack deploy and fine inside one action — I only flag it so future maintainers don't try to split these four steps into separate actions. The data dependency (deploywrites the env file,validate/startread it) means they belong together. Consider a one-line comment at the top ofdeploy-obs/action.ymlstating "these four steps shareobs-secrets.envand must stay in one action" so the SRP boundary is explicit and isn't re-litigated.Input naming asymmetry.
postgres_hostis a non-secret, while the other four are secrets, yet they sit in the same flatinputs:block distinguished only by a description. The action correctly maps all five throughenv:, so behaviour is fine. Minor: a comment grouping "secrets" vs "topology" in theinputs:block would make the contract self-documenting at a glance. Not worth a round-trip on its own.POSTGRES_HOSTderivation remains a latent coupling to the Compose project name (archiv-staging/archiv-production). The PR documents this in both callers' comments and the input description — good. This is pre-existing and out of scope here; I note it only so it's on record that a Compose project rename is now a two-file change (thewith:value in each workflow), which is strictly better than the previous in-heredoc literal.Doc-currency check (my standard gate): I walked the table. New ADR — present (029, correctly numbered after 028 landed on main; the PR's note explaining the off-by-one from the issue is the right call).
ci-gitea.mdComposite-actions section — present and thorough. No new Flyway migration, no new backend package/route, no new Docker service, no new ErrorCode/Permission, no new external system. The C4l2-containers.pumldoes not model the CI workflow internals, so no diagram update is triggered. No doc omissions — nothing to block on.Verdict: ship it. The architecture is sound, the decision is documented and defensible, and the duplication is genuinely eliminated rather than relocated. The three concerns above are polish I'd happily take in a follow-up.
— Markus (@mkeller)
Leonie Voss (@leonievoss) — UI/UX & Accessibility
✅ Approved — not applicable to my domain.
I scanned all 8 changed files. This is a pure CI/infrastructure refactor with zero user-facing surface:
.svelte,.css,.ts, no routes, components, or Paraglide message keys..gitea/actions/*/action.yml), the two deploy workflows, ADR-029,docs/infrastructure/ci-gitea.md, andrenovate.json.Nothing here reaches the senior-on-a-small-phone-in-bright-sunlight audience I advocate for, so there's nothing for me to gate on.
One light, optional note on the only surface that is human-readable — the operator-facing CI logs:
::error::lines are clear and actionable (e.g.obs-secrets.env missing or empty: ${key},$svc is not healthy (status: $status)) — good operator UX. The plainecho "ERROR: ..."insmoke-testand theecho "expected 404..."line use bareechorather than the::error::annotation. Optional: switching those two to::error::would surface them as highlighted annotations in the Gitea run summary, consistent with the rest of the action. Purely cosmetic for log readability — not a blocker.LGTM from the UX side.
Requirements Engineering review — "Elicit"
Verdict: ✅ Approved
A clean, behaviour-preserving refactor with full requirements→implementation traceability. I traced every acceptance criterion in #603 against the diff. All 11 closeable ACs are met; the 12th (live nightly run) is correctly deferred to you. No blockers. One naming deviation is handled acceptably, plus two minor documentation-hygiene suggestions below.
Traceability matrix — AC → implementation
deploy-obsruns 4 namedrun:blocks + writes 5 keys- name:steps (Deploy configs / Validate / Start / Assert health), heredoc writes all 5 keys<<EOF+ secrets viaenv:+$VARcat > … <<EOF(not<<'EOF');env:maps all 5 inputs; heredoc references$VAR::error::+ non-zero exitfor key in …; do grep -Eq "^${key}=.+" … || { echo "::error::…"; exit 1; }over all 5 keyschmod 600is the final operationchmod 600is the last line of the first run block, after the guard loopreload-caddy(no inputs) +smoke-test(host) existreload-caddyhas noinputs:,smoke-testhashostonlynightly.ymlrewired (4 obs steps + reload + smoke →uses:)uses: ./.gitea/actions/…calls replace the inline blocksrelease.ymlrewired identicallyuses:calls# Keep in synccomments removedrelease.yml) are deleted; zero remainactions/checkout@v4precedes first localuses:ci-gitea.md"Composite actions" section (checkout-first + how-to-add-input)docs/adr/029-…mdadded, Accepted, body matches the #603 draftworkflow_dispatchnightly run, 5 obs services healthyNotes beyond the checklist
The ADR-028 → 029 deviation is handled acceptably. #603 specified
028-…md, but028-pdfjs-wasm-decoders-…landed onmainafter the issue was filed. Renumbering to 029 is the right call: ADR numbers are an append-only, collision-free sequence, and reusing 028 would create two ADR-028s. The PR body and the workflow header comments both reference ADR-029 consistently, and the ADR body is otherwise verbatim from the issue draft. This is a deviation from the literal AC text but satisfies its intent (an Accepted ADR documenting the decision). No change required — though I'd record it as a closed deviation so the audit trail explains why the issue says 028 and the repo says 029. The PR body already does this; that is sufficient.The enforced-invariant goal is genuinely achieved. The core requirement behind #603 was replacing an honour-system "Keep in sync" invariant with an enforced single definition. The diff delivers exactly that: per-environment variation is now a typed
inputs:contract (5 + 0 + 1 inputs), and the only remaining per-environment forks are thewith:values — which is the intended, minimal surface. The#526 /importguard correctly stayed inline innightly.ymlper the issue's explicit instruction; good discipline not to over-extract.required: true/ nodefault:fail-loud requirement is met for all secret inputs — a missing secret aborts rather than silently writing an empty key, and the non-empty guard is a second line of defence. This is the kind of unwanted-behaviour ("If a secret is missing, the system shall fail loudly") that the issue called out as the single most likely bug; both the heredoc-delimiter fix and the guard address it.Suggestions (non-blocking)
workflow_dispatchonnightly.yml, (b) confirm all 5 obs services report healthy, and (c) paste the run link into this PR before merge, so the traceability record is closed by evidence rather than by trust.release.ymlriding on the identicaluses:calls is a sound argument — no separate prod run needed.ci-gitea.mdnow encodes the five-key-guard invariant in prose. That is itself a new "keep in sync" point (the guard loop and the heredoc and the docs must all agree). Minor, and arguably unavoidable, but worth a glance during the next obs-secret change.Nothing here blocks merge. The requirements are fully traceable to the implementation; the one open item is a deferred verification, not a defect.
— Elicit (Requirements Engineer)
Nora "NullX" Steiner — Application Security Engineer · Ethical Hacker
⚠️ Approved with concerns
Solid refactor with security-aware comments — the
chmod 600-last ordering, the non-empty guard,required: truewith no default, and the Renovate watch extension are all genuinely good controls and I want to call them out as correct. But the headline change — switching the secrets heredoc from a quoted delimiter (<<'EOF') to an unquoted one (<<EOF) — changes the trust semantics of how secret bytes are interpreted, and the current guard does not cover the new failure mode. One blocker, a couple of concerns, the rest is praise.🚫 BLOCKER 1 — Unquoted heredoc mangles secrets containing backslashes (
.gitea/actions/deploy-obs/action.yml, lines ~57–63)This is the load-bearing change and it's the one place the threat model shifted. Let me be precise about why, because the PR description's reasoning is half-right and the dangerous half is the part it gets wrong.
Old code (quoted
<<'EOF'): the${{ secrets.X }}tokens were substituted by Gitea's template engine before the shell ever ran. The shell saw an already-final literal line and — because the delimiter was quoted — performed zero expansion. Secret bytes were written verbatim. Safe for any byte.New code (unquoted
<<EOF+$VAR): the shell now performs parameter expansion on the heredoc body. Two things are true here, and only one of them is the comment's concern:✅ Command substitution / backticks are NOT a risk. The result of
$VARexpansion is not re-scanned — a secret value of$(rm -rf /)or`id`is written literally, never executed. The comment's worry about "command substitution" is unfounded; POSIX word expansion does not recurse. Good news.🚫 Backslash and
$-sequences in the secret ARE mangled. In an unquoted heredoc the shell still processes\as an escape character for$,`,\, and newline, and still expands$othersequences inside the secret value itself. Concrete payloads — all of which BCrypt-grade / DjangoSECRET_KEY/openssl randoutput can plausibly contain:s3cret\(trailing backslash) → the\+ newline is consumed as a line-continuation; theGLITCHTIP_SECRET_KEY=line swallows the next line (POSTGRES_PASSWORD=…). File now has the wrong number of keys and a corrupted value.a\$b→ written asa$b(the\$collapses).pre$HOME/postorkey$POSTGRES_PASSWORD→$HOME/$POSTGRES_PASSWORDexpands inside the secret, silently substituting another value into this one.tab\there→\t… (POSIX heredoc doesn't interpret\t, but the\is still consumed before thetis not special — net: a dropped backslash).Impact: a deploy writes a silently wrong secret. Grafana/GlitchTip/Postgres come up with corrupted credentials, or — in the trailing-backslash case —
obs-secrets.envis structurally broken (a key swallowed). This is exactly the class of bug the five-key guard was meant to catch, and it doesn't (see Blocker 2). CWE-150 (improper neutralization of escape sequences) / config-injection.The PR description claims "a quoted delimiter would write the literal string and
config --quietwould still pass." That's the reasoning for why you moved off quoting — but it conflates two different quotings. The old quoted heredoc was paired with Gitea-side${{ }}substitution, so it did NOT write literal$VAR— it wrote the real secret. The premise that you had to go unquoted is only true becausesecrets.*is unavailable in composite actions and you're now forced throughenv:+$VAR. Fair. But the cost is that you've re-introduced shell escaping into the secret path, and that needs neutralizing.Fix — don't hand secret bytes to the heredoc parser at all. Use a tool that does no expansion. Two clean options:
Option A —
printfwith positional args (no expansion of the values, only of the format string):%swith a quoted argument writes the value byte-for-byte: no backslash processing, no$re-expansion, no line-continuation. This is the standard safe idiom for emitting untrusted values into a.envfile.Option B — keep the heredoc but feed values that can't be re-interpreted is not possible cleanly; prefer A.
If you keep the heredoc for readability, you must at minimum document that every secret is assumed free of
\,$, and trailing backslash — but that's an unenforceable honour-system constraint on secret content, which is exactly the kind of invariant this PR set out to abolish. Useprintf.🚫 BLOCKER 2 — The five-key guard validates the wrong file and can't detect mangling (
deploy-obs/action.yml, lines ~65–70)grep -Eq "^${key}=.+"asserts each key is present and non-empty. It does not assert the value is correct. The mangling failures in Blocker 1 produce non-empty (but wrong) values —GLITCHTIP_SECRET_KEY=a$b,POSTGRES_PASSWORD=…corrupted— all of which sail past.+. And the trailing-backslash case that swallows the next line would actually be caught here (POSTGRES_PASSWORD key goes missing) — so the guard catches the structural break but not the silent-corruption cases, which are worse because the stack comes up with bad auth instead of failing loudly.This isn't a reason to drop the guard — it's a defense-in-depth control and I like it. But it cannot be relied on as the mitigation for Blocker 1. Fix Blocker 1 at the write layer (printf); keep this guard as a second line. Once the write is byte-safe, the guard's remaining job (catch a genuinely empty/missing secret) is exactly right.
One concrete hardening while you're here: the guard only proves the key exists in the file — it can't prove the byte content survived. After switching to
printf, consider asserting key count too ([ "$(grep -cE '^[A-Z_]+=.+' obs-secrets.env)" -eq 5 ]) so a swallowed-line regression trips immediately.⚠️ CONCERN 3 — Gitea log-masking through input → env →
$VARis plausible but unverifiedThe comment correctly warns "do not stage these into intermediate variables, or Gitea log masking can be lost." Good instinct. But the masking guarantee for the input →
env:→$VARpath specifically — where the secret crosses thesecrets.* → with: → inputs.* → env:boundary into a composite action — is exactly the kind of thing that's framework-version-dependent and that I never assume without verification (per my own rule: "assuming framework defaults are secure without verification").Two real exposure points to confirm on the pending
workflow_dispatchrun:config --quietandup -d:docker composereads--env-fileand on a parse/substitution error can echo offending lines.--quietsuppresses the rendered config on success, but an error path may not be quiet. Verify a deliberately-broken secret does not dump the file to the log.set -e+ a failinggrep/printf: confirm the runner's masking registers values that entered viainputs.*(not just top-levelsecrets.*). If the nightly run shows any of the five values unmasked anywhere, that's a P1 — Gitea Actions log masking keys off registered secret values, and the registration path for composite-action inputs is the open question.Not a blocker on code, but it must be on the acceptance checklist for the pending nightly run, not just "all 5 services healthy."
✅ What's correct (credit where due)
chmod 600as the final operation — there is genuinely no world-readable window;cat/printfcreate the file with the umask-default perms then immediately tighten. Correct ordering, and the comment explains why it's load-bearing. (Belt-and-suspenders nit: aumask 077before the write would make even the creation instant-private, defending against a crash between write and chmod. Optional.)required: truewith nodefault:— fails closed on a missing secret. Textbook. Don't let anyone "simplify" a default in later.--privileged --pid=hostCaddy reload + pinned digest + Renovate watch on.gitea/actions/**— moving the digest into the composite action without extending the RenovatematchPathswould have created a stale-digest blind spot on a root-equivalent-host-access image. You caught it and extended the watch in the same PR. That's the single most important supply-chain control in this diff and it's done right. The--privileged --pid=hostcontainer is root-on-the-host by construction, but that's inherent to the DooD-reload design (pre-existing, well-documented in ADR/ci-gitea.md), not introduced here — and the pinned-by-digest + manual-review-only Renovate rule is the correct compensating control.checkoutpinned as step 1 with an explanatory comment — correct, and the ordering constraint is real./actuator → 404. These are real security regressions it will now catch. Nice.Summary
\,$, trailing backslash → silent-wrong / swallowed secret. Fix withprintf '%s\n'.$VARlog masking +configerror-path quietness on the pending nightly run.Fix Blocker 1 (one
printfblock, ~6 lines) and the trust semantics return to the byte-faithful behaviour the old quoted heredoc had. Happy to re-review the moment that lands. Bugs are systemic, not personal — this one is a genuinely subtle shell-escaping trap and the comments show you were already thinking about it.— Nora "NullX" Steiner
👨💻 Felix Brandt — Senior Fullstack Developer
⚠️ Approved with concerns. This is a genuinely clean DRY refactor — the duplication is gone, the per-environment variance is a typed input contract, and the action boundaries are named in one or two words (
deploy-obs,reload-caddy,smoke-test), exactly the single-responsibility split I'd ask for. My one real concern is that the most fragile, security-critical invariant in the whole PR is still guarded only by a comment and a runtime check, when this repo already has the pattern to enforce it structurally in CI. That's a suggestion I feel strongly about, not a blocker.What I like (no changes needed)
release.ymlshrank by 253 lines andnightly.ymllost its honour-system# Keep in sync with nightly.ymlcomments. The "five keys vs four" drift that ADR-029 cites is now structurally impossible — one definition, both callers. This is the right mechanism.deploy-obs/action.ymlkeeps its fourrun:blocks (deploy → validate → start → assert health) as separately-named steps rather than one mega-script — they stay individually readable and produce their own CI log section. Good.uses: ./…, and you've over-documented it rather than under-documented it. Correct trade-off.required: truewith nodefault:on the secret inputs is exactly the fail-loud posture I want. A missing secret must not fall back to empty.028-pdfjs-wasm-decoders-and-csp-constraint.mdis onmain, so this genuinely is 029. The PR body flags the issue-said-028 discrepancy explicitly rather than quietly renumbering. That's the right way to handle it.Comment quality — mostly load-bearing, one trim
The comments here are overwhelmingly why, not what, which is the standard I hold. The heredoc-delimiter comment (
deploy-obs/action.yml, the<<EOFblock) and the--resolve-as-bash-array comment (smoke-test/action.yml) are load-bearing — they encode non-obvious failure modes that a future editor would otherwise reintroduce. Keep them.One observation, not a request:
reload-caddy/action.ymlhas a ~25-line comment in front of a 3-linerun:. Every sentence in it is genuinely load-bearing (DooD, why-not-systemctl, why-alpine, reload-not-restart), and it's now the single home for that rationale instead of being duplicated across two workflows — so I'd actually leave it. The duplication this PR removed is what makes a big comment block acceptable: it pays for itself once, not twice.Concern (not a blocker) — the unquoted-heredoc invariant deserves a CI guard
This is the part I want to push on, and it's directly the TDD-applicability question for a pure-YAML refactor.
The PR is right that there's no local harness for
.gitea/YAML, and I agree TDD-by-unit-test doesn't apply here — there's nothing to red/green against on a dev machine. But "no local harness" isn't the same as "unguardable." This repo already runs grep-based structural assertions inci.yml— and at least one of them is self-testing:ci.yml:48"Assert no banned vi.mock patterns"ci.yml:68"Assert no raw document date via {@html}" — and look atci.yml:79-85: it feeds the regex a known-bad input, asserts it matches, feeds a known-good input, asserts it doesn't. That is red/green for a CI invariant.ci.yml:91"Assert no upload/download-artifact past v3"The single most dangerous failure mode in this PR is precisely the one the code comment warns about: if someone "fixes" the heredoc back to
<<'EOF', the file gets written with literal$GRAFANA_ADMIN_PASSWORDstrings anddocker compose config --quietstill passes — present-but-wrong. The five-keygrep -Eq "^KEY=.+"runtime guard does not catch this either, becauseGRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORDis a non-empty line. So today the only thing standing between us and a silently-broken-auth deploy is a comment.Given the existing precedent, this is cheap to enforce:
That converts the load-bearing comment into an enforced invariant, in the same idiom the repo already uses three times over. I'd also accept a one-line
actionlintstep inci.ymlas a broader structural net for all of.gitea/actions/**(it catches theshell:-missing and malformed-uses:classes for free), but the targeted grep is the one that protects the actual secret-exposure path. Suggestion, not a gate — the comment is honest about the risk; I'd just rather the CI enforce it than trust the next editor to read the comment.Minor nits (non-blocking)
smoke-test/action.yml— theawk/proc/net/routegateway one-liner is dense, but it's commented field-by-field and identical to the pre-refactor code, so I won't ask you to extract it. Behaviour-preserving, fine.deploy-obsstep 1 does three jobs (copy configs, write+guard secrets, chmod) in onerun:. In Java I'd split this; in a compositerun:block the cohesion is acceptable because the chmod-600-last ordering is itself the security property and splitting steps would let a future reorder break it. Leaving the chmod adjacent to the write is the right call — just flagging I considered it.Verdict
DRY goal achieved cleanly, boundaries are right, comments are load-bearing, the ADR/docs trail is thorough and honest. Approving with one concern I'd like addressed before this becomes the template every future composite action copies: add the CI grep guard for the unquoted heredoc so the invariant is enforced, not trusted. The repo already proves that's the house style for un-unit-testable invariants. Everything else is ship-ready.
QA Review — Sara Holt (@saraholt), Senior QA Engineer
⚠️ Approved with concerns.
The refactor is clean and behaviour-preserving on inspection, and the security ordering (guard-then-
chmod 600-last) is correct. My concern is not the code — it's the verification story. This PR ships untestable-by-design YAML whose only proof is a single liveworkflow_dispatchon the VPS, and several of the failure modes the PR claims to defend against are exactly the ones a one-shot manual run will not exercise. Below: the gap, the failure-mode coverage, what the human must watch in the nightly run, and what we can assert statically so the next refactor isn't this hand-verified.1. The verification gap (the headline)
There is no test harness for
.gitea/at any layer — not static, not unit, not integration. The PR's "Verification" section is honest about this ("AC invariants verified mechanically"), but "verified mechanically" here means a human grepped the diff once. That is not a regression gate; it evaporates the moment this file is touched again.Concretely, the test pyramid for this change is empty at every level:
action.yml, missingshell:, bad input refs<<'EOF', checkout-not-firstconfig --quietparityworkflow_dispatchonly, staging onlyThe single live run validates staging via
nightly.yml.release.yml(production) is asserted "trusted via its identicaluses:calls" — which is true only as long as the inputs are wired identically, and that wiring is hand-authored per workflow (different secret names, differentpostgres_host, differenthost). Production'sdeploy-obs/smoke-testinvocation is never executed before it runs for real against prod. That's the riskiest unverified path in the PR and it's getting zero coverage.2. Failure-mode coverage
Five-key guard (
grep -Eq "^${key}=.+") — partially effective..+requires ≥1 char, so it correctly fails on an empty/missing secret (KEY=or absent line). Good — that's the documented intent and it holds.But the guard does not catch the very regression the unquoted heredoc exists to prevent. If someone re-quotes the delimiter back to
<<'EOF'(an easy "tidy-up" mistake, since<<'EOF'is the more common idiom and looks safer), the file gets written as:That line is non-empty, so the five-key guard passes.
docker compose config --quietpasses (the var is present). The stack starts. Grafana/GlitchTip come up with literal-string passwords as their actual credentials — and the smoke test in this PR checks the public app surface (login/HSTS/Permissions-Policy/actuator), none of which touch obs auth. So a re-quoted-delimiter regression sails through all four named blocks and the smoke test, green. The guard defends against empty, not wrong. The PR's own comment even admits "config --quietwould still pass (the var is present, just wrong)" — but the added guard doesn't close that hole; it closes a different (empty) one. This is the single most important thing to understand about this change's safety envelope.Unquoted-heredoc → injection footgun (suggestion, not blocker). Unquoted
<<EOFalso means a$or backtick inside a secret value is now shell-interpreted. These are generated high-entropy secrets so practically low-risk, but it's a new property the quoted form didn't have. Worth a one-line note in the action comment.Behaviour parity old→new — I checked the diff line-by-line; it's faithful, with two cosmetic drifts, both benign:
Start observability stackcomment said "survive workspace wipes between nightly runs"; the action generalizes to "between runs". Correct (it's shared now) — matches release's original wording. No functional change.# Keep in sync with nightly.yml (#603)comments are dropped fromrelease.yml. That's the point of the PR (the action is now the single definition), so this is intended, not a silent loss.#526 /importguard and the app-compose check correctly stay inline (now labelled "nightly-only — stays inline"). Good — no obs logic leaked out, no nightly-specific logic leaked in.I found no silent divergence between the two callers and the extracted actions. The inputs map 1:1 to the former inline literals (
STAGING_*/archiv-staging-db-1/staging.raddatz.cloudvsPROD_*/archiv-production-db-1/archiv.raddatz.cloud).One pre-existing observation (not introduced here): the
Deploy observability configsstep has noset -e. The first three commands (rm/mkdir/cp -r) can fail without aborting — only the heredoc keys are guarded, not the config-tree copy. Ifcp -r infra/observability/.partially fails, you'd start the stack against a stale/incomplete config tree and the guard wouldn't notice. This matched the old inline step too, so it's parity, not a regression — but since you're already touching this block and extracting it as the canonical definition, addingset -euo pipefailto the top of thatrun:is the cheap moment to harden it. (smoke-testandassert healthalready doset -e; this step is the odd one out.)3. What the human must watch on the nightly
workflow_dispatchBecause this is the only gate, the run needs to be read actively, not just "did it go green":
healthyin theAssert observability stack healthblock (obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip). Amissingstatus here means a container never started — distinct fromunhealthy.deploy-obsstep's working directory. Compositerun:steps execute in$GITHUB_WORKSPACE(the checkout), socp -r infra/observability/.resolves relative to the repo root. Verify the copy log shows the real tree, not an empty dir — a wrong CWD would silently copy nothing and the guard wouldn't fire (it only checksobs-secrets.env, written by absolute path).actions/checkout@v4is ever not step 1, the firstuses: ./.gitea/actions/…fails with "action not found" — a clear error, but watch for it on the first run.release.ymlrun is the first execution of the proddeploy-obs/smoke-testwiring. Treat that release as a verification event: have someone on hand and check prod Grafana auth + the five obs services immediately after.4. What we can assert statically (the real fix for the gap) — suggestions
This change is the perfect argument for a tiny CI lint job. None of these need the VPS; they run on any runner in seconds and would have caught the regressions above mechanically instead of relying on a human grep:
actionlintover.gitea/workflows/**and.gitea/actions/**. Catches missingshell:, badinputs.*/secrets.*references, malformed composite syntax — the entire class of "only fails at runtime on the VPS" errors. ~2s, zero infra.yamllintfor plain YAML well-formedness as a cheap base.<<EOF(unquoted) is present. This is the test that should ship with this PR — it's the executable form of the invariant the whole change rests on.chmod 600-is-last assertion and a checkout-first assertion (grep thatactions/checkoutprecedes the firstuses: ./.gitea/actions/in each workflow). Both are currently convention-only.These would convert "verified mechanically by a human, once" into "verified by CI, every time" — which is the difference between a refactor that stays correct and one that quietly rots the next time someone touches it.
Verdict
⚠️ Approved with concerns — no blockers; the code is correct and faithful to the originals.
set -euo pipefailto theDeploy observability configsstep (parity hardening at the canonical site).actionlint+ grep-guard CI job. The unquoted-heredoc invariant is load-bearing and currently has zero automated protection — a static grep guard is ~5 lines and closes the empty-vs-wrong gap the runtime guard can't.release.ymlas the production verification event, since prod's wiring is executed for the first time there.— Sara
DevOps & Platform review — Tobias Wendt (@tobiwendt)
⚠️ Approved with concerns — the refactor is structurally correct and behaviour-preserving on every axis I could verify statically. The single real concern is a narrow shell-expansion edge in the unquoted heredoc that did not exist in the old form. It does not block merge, but it must be on the record and the acceptance
workflow_dispatchrun is still mandatory before this is "done."What's correct (verified, not assumed)
using: composite, everyrun:step carriesshell: bash, inputs map toenv:and are referenced as$VAR, callers useuses: ./.gitea/actions/<name>relative paths.actions/checkout@v4is step 1 in bothnightly.ymlandrelease.yml— correct, because a localuses: ./…action only exists on disk post-checkout. The checkout-first invariant is also documented in both workflow headers andci-gitea.md. Good.with:blocks: the only deltas are the three documented ones —STAGING_POSTGRES_PASSWORD→PROD_POSTGRES_PASSWORD,postgres_host(archiv-staging-db-1→archiv-production-db-1), and smoke-testhost(staging.raddatz.cloud→archiv.raddatz.cloud). Nothing else forks. This is precisely the duplication ADR-029 set out to kill, and it's killed cleanly — zero# Keep in synccomments remain (git grepconfirms).#526 /importguard correctly stays inline innightly.yml— it's an app-compose assertion, nightly-only, not shared deploy logic. Right call not to drag it into a composite.grep -Eq "^${key}=.+"runs immediately after the heredoc write and rejects an emptyKEY=line (the.+is load-bearing — a bare presence check would passKEY=).chmod 600is the final operation, so there's no world-readable window. Required inputs carry nodefault:, so a missing secret fails loudly rather than silently writing an empty value.set -eparity holds — and this is the subtle one. The old "Deploy / Validate / Start" inline steps had no explicitset -e, and neither do the composite versions. That's fine: per the Actions runner spec, both a defaultrun:block AND an explicitshell: bashresolve tobash --noprofile --norc -e -o pipefail {0}. So-e -o pipefailis active in the composite step exactly as it was inline —cp/mkdir/catfailures are not swallowed. The explicitset -ein the assert-health and smoke-test steps is belt-and-suspenders, consistent with the originals.|block scalar strips common indentation to column 0 before the shell sees it, soKEY=valueis written flush-left. The plain<<EOF(not<<-EOF) is correct here precisely because YAML already normalised the indentation. Matches the original.matchPathsnow covers.gitea/actions/**alongside.gitea/workflows/**, so the relocated pinnedalpine:3.21@sha256:48b0…digest inreload-caddystays under the manual-review (no-automerge) digest rule. Without this the privileged-image digest would have gone dark. Good catch — exactly the kind of thing that silently rots.028-pdfjs-wasm-decoders-and-csp-constraint.mdlanded onmainafter #603 was filed, so per the sequential-ADR convention the new one is correctly029, not028. No collision.Concern (non-blocking, but record it)
<<'EOF'(quoted) with${{ secrets.X }}— Gitea expanded the secret into the YAML before the shell ran, and the quoted delimiter meant the shell did zero further interpretation of the value. The new form must use<<EOF(unquoted) so$VARexpands, becausesecrets.*isn't available inside a composite action. That's the correct and only mechanism — but it means a secret value containing a backtick,$(…), or a trailing backslash would now be subject to command substitution / line continuation at heredoc time, where before it was inert. For the five values here (two passwords, a Django key, a generated PG password, a static hostname) this is low-probability, but it's a genuine behaviour delta, not a no-op. Worth one sentence in thedeploy-obscomment block noting "secret values must not contain unescaped backticks/$(...)— heredoc is unquoted by necessity," so the next person rotating a password doesn't generate one with a backtick and get a baffling deploy failure (or worse, silent corruption that the five-key guard wouldn't catch, sinceKEY=<garbage>still matches.+).Suggestions (cosmetic, ignore freely)
POSTGRES_HOSTis not actually set ininfra/observability/obs.env— it's only mentioned in obs.env's comments. So there's nothing to override; obs-secrets.env is the sole source. This inaccuracy predates the PR (copied straight from the inline steps), so it's not a regression and I won't hold the PR for it — but since you're already touching these lines, it'd be a clean moment to drop the "overrides" clause.postgres_hostis plain input, correctly not a secret — and the comment says so. Fine. Just noting I checked it isn't accidentally being masked/required-as-secret.The honest bottom line
This can only be truly verified by a real
workflow_dispatchrun ofnightly.ymlending with all five healthchecked obs services (obs-loki,obs-prometheus,obs-grafana,obs-tempo,obs-glitchtip) green — exactly as the PR body acknowledges. Static review can confirm the YAML parses, the wiring matches, and parity holds (all done above), but composite-action input plumbing and the unquoted-heredoc expansion only prove themselves on the runner.release.ymlriding on identicaluses:calls is a reasonable trust argument, but the nightly run is non-negotiable before merge. @marcel — please trigger it and confirm the assert-health step passes; that flips this to a clean ✅ from me.Every added tool is a new failure mode — but a composite action that removes a copy-paste invariant and an honour-system comment is the rare case where the abstraction pays for itself. Nice, minimal refactor.
The unquoted <<EOF delimiter is load-bearing — under a composite action secrets come from $VAR (env), not Gitea ${{ secrets }} substitution, so a re-quote to <<'EOF' would write literal $VAR strings and the five-key non-empty guard would not catch it. Adds a self-testing grep guard (matching the ci.yml 'Assert no X' convention) so a future re-quote fails CI instead of shipping broken obs auth. Raised in review (Felix, Sara, Tobias). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>Review resolution
Thanks all. Addressed below; two new commits pushed.
🔒 Nora's blocker (unquoted-heredoc secret mangling) — verified as a non-issue
The claim that
<<EOF+$VARmangles secrets containing$,`,$(…), or backslashes was tested empirically against the exactKEY=$VARpattern the action uses:obs-secrets.envpa$HOME wordpa$HOME word✓a${POSTGRES_PASSWORD}ba${POSTGRES_PASSWORD}b✓a$(whoami)ba$(whoami)b✓a`whoami`ba`whoami`b✓secretval\(trailing\)secretval\— next key stays on its own line ✓When
$VARis expanded inside an unquoted heredoc, the value is inserted literally and not re-scanned — no recursive expansion, no command substitution, no backslash processing on the value. Backslash escaping applies only to the literal heredoc text (\$,\`,\\), which here is justKEY=$VARwith no backslashes. Output is byte-identical to the old<<'EOF'+ Gitea-substitution form for all realistic and adversarial single-line secrets. Noprintfrewrite needed.Fixes applied
a4756493—set -euo pipefailon the deploy-configs step. A failedcp/mkdirwas previously swallowed;-ualso catches a typo'd env var. (Sara, Tobias)0e06626e— CI guard asserting the heredoc stays unquoted. Felix/Sara/Tobias correctly noted the<<EOFinvariant was guarded only by a comment, and the five-key non-empty guard would not catch a re-quote (KEY=$VARis non-empty). Added a self-testing grep guard toci.ymlmatching the existing "Assert no X" convention — a future<<'EOF'now fails CI instead of shipping broken obs auth green. Verified it passes on the current tree and trips on a re-quoted file.Deferred / acknowledged
echo "ERROR…"→::error::in smoke-test (Leonie) — cosmetic; pre-existing comment carried over verbatim. Leaving for a follow-up to keep this PR's diff scoped to the extraction.workflow_dispatchnightly run ending with all 5 obs services healthy + a manual Grafana/GlitchTip auth check. Will post the run link before merge.Requirements traceability: all 11 closeable ACs confirmed met (Elicit ✅).
👨💻 Felix Brandt — Senior Fullstack Developer (re-review)
✅ Approved — my concern is fully resolved. The unquoted-heredoc invariant now has a structural CI guard built in the repo's own "Assert no X" convention. Upgrading from ⚠️ to ✅.
My round-1 concern is closed
I asked for exactly one thing: the load-bearing
<<EOFinvariant was guarded only by a comment + a runtime--quietcheck that would pass a re-quoted file (KEY=$VARis non-empty), when this repo already enforces fragile invariants structurally inci.yml. Commit0e06626eadds that guard — and it's not a token gesture, it's done right.It follows the existing convention precisely. The new "Assert deploy-obs writes obs-secrets.env via an unquoted heredoc" step has the same three-part shape as its two neighbours (the
{@html}CWE-79 guard from #666 and the artifact-@v4guard from #557): a positive self-test, a negative self-test, then the real assertion. A reviewer who knows those two guards reads this one for free. That consistency is worth as much as the guard itself.The regex is sound — I verified it empirically, not by eye:
obs-secrets.env <<'EOF'<<EOF<<-?EOF\bsed'd<<'EOF')<<-'EOF',<< 'EOF', and<<"EOF"(both\x27/\x22)So it answers both halves of the question I cared about: it would catch a re-quote and it passes clean on the current unquoted file. The double-quote variant coverage is a nice touch I didn't ask for — a future editor reaching for
<<"EOF"is caught too.The companion fix is also correct
a4756493addsset -euo pipefailto the deploy-configs step. I'd called the comment-quality good in round 1; the-uhere is the right belt — a typo'd$GRFANA_ADMIN_PASSWORDnow fails loud instead of writing an empty value past the heredoc and tripping the five-key guard with a confusing message. The corrected comment block at the top of that step still reads as why, not what. Clean.Disposition on the rest
Nora's backslash claim was tested empirically by the author and is a non-issue:
$VARexpansion in an unquoted heredoc inserts the value literally and does not re-scan it, so backticks /$(…)/$OTHER/ trailing\all survive byte-identical. That matches my own understanding of POSIX word expansion — the result of a parameter expansion is not re-parsed. Correctly resolved, not papered over.The two deferred cosmetic items (stale POSTGRES_HOST comment, bare
echo→::error::in smoke-test) are pre-existing carry-overs and rightly scoped out to keep this an extraction diff. No objection.Nothing left on my side. Clean DRY refactor, single-responsibility action split, and the one fragile invariant is now enforced in CI instead of trusted to a comment. Ship it after the
workflow_dispatchnightly run you already committed to.🏛️ Markus Keller — Architect (re-review)
✅ Approved — concerns from round 1 cleared; the heredoc CI guard sits at the right layer and matches the established ADR-driven precedent. No blockers.
I re-assessed the four fix commits on the current branch through the one lens that's mine to gate on: does invariant-enforcement-in-CI belong here, and is the new guard placed at the right architectural boundary? Verdict: yes on both. Below is the reasoning, because the placement question is the only thing that was genuinely open.
The guard sits alongside the existing "Assert no X" guards — that is the correct layer, not a boundary smell
This was the question I wanted to answer concretely, so I read
ci.ymlon disk. The new step (Assert deploy-obs writes obs-secrets.env via an unquoted heredoc (#603), lines 111–135) lands immediately after three peer guards in the sameunit-testsjob:Assert no banned vi.mock patterns(ADR-012 / #553)Assert no raw document date via {@html}(CWE-79 / #666)Assert no (upload|download)-artifact past v3(ADR-014 / #557)Each is an ADR- or issue-anchored, repo-wide structural invariant enforced by grep, and the two most recent ones are self-testing (feed the regex a known-bad and known-good input, assert match/no-match). The new guard is a faithful member of that family: ADR-029 anchor in the comment, a self-test that proves the regex catches
<<'EOF'and ignores<<EOF, a positive assertion that the unquoted heredoc is present at all, a negative assertion that the quoted form never appears, and::error::annotations. This is the house style for "un-unit-testable invariant," and the PR copied it precisely rather than inventing a new mechanism. That is exactly what I want from the repo's first composite-action PR — it sets the convention by following the existing one.Is "enforce the invariant in CI" the right pattern at all? Yes — and it's the architecturally consistent choice
The alternative was to leave the
<<EOFinvariant guarded only by a comment indeploy-obs/action.yml. That is precisely the honour-system "keep in sync" failure mode ADR-029 was written to abolish — and leaving the single most fragile line in the whole refactor protected by prose alone would have been self-contradictory with the ADR it ships under. Promoting it to an executable guard makes the PR internally consistent: the same PR that replaces an unenforced cross-workflow invariant with a typed contract now also replaces an unenforced intra-action invariant with an executable check. Same principle, applied twice. Good.It's also the right layer for this specific invariant. There's no local harness for
.gitea/YAML and no dev-machine red/green to run against, so pushing the check "down" to a unit test or a constraint isn't available the way I'd normally prefer. CI is the lowest trustworthy layer that actually sees this file. The check is cheap, infra-free, runs on every push, and fails closed. That's the correct placement given the constraints.One boundary observation — noted, not blocking
The guard physically lives in the
unit-testsjob but asserts over a.gitea/actions/deploy artifact — a frontend/test job reaching across into infra YAML. Strictly, this is a mild cohesion smell: a future reader scanningunit-testswouldn't expect a deploy-secrets assertion there. But it's the same placement the three sibling guards already use (the artifact-version guard greps all of.gitea/workflows/from this very job), so the new step is consistent with the established boundary rather than introducing a new one. Co-locating all the repo's grep guards in one early, fast, no-infra job is a defensible convention — they share a runner profile (no Docker, no VPS, seconds to run) more than they share a subject domain. I'd only revisit this if the guard set grows enough to warrant a dedicatedrepo-invariants/lintjob; at four guards it's not worth the churn. Consistency with precedent wins here. No change requested.Round-1 concerns — status
set -euo pipefailon the deploy-configs step (commita4756493) — addressed; the canonical definition now fails loudly on acp/mkdirerror instead of silently starting against a stale config tree. Good hardening at the single shared site.deploy-obs, secrets-vs-topology input grouping, thePOSTGRES_HOSTCompose-name coupling) were explicitly non-blocking polish and remain fine to defer.Doc-currency gate
ADR-029 already documents the secrets-via-
env/unquoted-heredoc constraint andci-gitea.mddescribes the composite-action contract. The new guard is an enforcement of an already-documented invariant, not a new architectural decision, so it triggers no further doc update. Renumbering ADR-028→029 remains correct. No doc omissions.Verdict: ship it. The fix is placed at the right layer, in the repo's established idiom, and makes the PR internally consistent with the very ADR it lands under. The architecture was sound in round 1 and is now better guarded.
— Markus (@mkeller)
⚙️ Tobias Wendt — DevOps (re-review)
⚠️ Approved — all three of my prior concerns resolved; the only open item is the still-pending live
workflow_dispatchrun, which keeps this at ⚠️ until it's green.Re-assessed on the current branch. The four fix commits land cleanly. Walking my round-1 list:
Prior concerns — status
set -ein the deploy-configs step → RESOLVED. Commita4756493addsset -euo pipefailas the first line of thedeploy-obs"Deploy observability configs"run:block (.gitea/actions/deploy-obs/action.yml:56). A failedcp/mkdirnow aborts instead of being swallowed, and-ucatches a typo'd env var before it writes a half-emptyobs-secrets.env. Note for the record: as I said in round 1, the runner already resolvesshell: bashtobash --noprofile --norc -e -o pipefail {0}, so-e -o pipefailwas technically already active — but the explicitset -euo pipefailadds-u(genuinely new) and makes the guarantee local to the file instead of relying on runner-default knowledge. Worth having. ✓Stale "overrides POSTGRES_HOST in obs.env" comment → RESOLVED in the action. The
deploy-obsvalidate step now reads "POSTGRES_HOST is environment-specific and supplied only by obs-secrets.env — obs.env documents it but deliberately does not set a value" (action.yml:84–87). Accurate. The old, inaccurate "later file wins → overrides POSTGRES_HOST set in obs.env" wording is gone from the extracted action. (The duplicated copies in nightly/release are deleted entirely with the steps, so nothing stale survives there either.) ✓The CI re-quote guard (the new one you added on Felix/Sara/my note) → CORRECT and effective. I didn't just read it, I exercised it:
printf "obs-secrets.env <<'EOF'"is caught by the regex;printf 'obs-secrets.env <<EOF'is correctly ignored. Both self-test branches pass againstgrep -P.obs-secrets\.env\s*<<-?\s*[\x27\x22]—\x27/\x22are single/double quote,<<-?covers both<<and<<-heredoc operators,\s*tolerates spacing. It matches a re-quoted delimiter and only a re-quoted delimiter.sed-flipped<<EOF→<<'EOF', and ran the guard: the negativegrep -nPfires (exit 1) AND the positive<<-?EOF\bcheck also fails — defence in depth. So a future re-quote fails CI red instead of shipping broken obs auth green, which is exactly the gap (the five-key non-empty guard passes a literalKEY=$VARbecause it's non-empty). This closes the one structural hole the round-1 reviewers flagged. ✓Parity & wiring — re-confirmed on current branch
with:blocks are identical except the three documented deltas:STAGING_POSTGRES_PASSWORD→PROD_POSTGRES_PASSWORD,postgres_host(archiv-staging-db-1→archiv-production-db-1), smokehost(staging.raddatz.cloud→archiv.raddatz.cloud). Nothing else forks.actions/checkout@v4is step 1 in both deploy jobs — required, sinceuses: ./.gitea/actions/…only exists on disk post-checkout.# Keep in synccomments remain anywhere under.gitea/(grep -rnconfirms). The honour-system invariant ADR-029 set out to kill is gone.matchPathsstill covers.gitea/actions/**, so the relocated pinnedalpinedigest inreload-caddystays under manual review. Unchanged, still correct.My one deferred round-1 concern (the unquoted-heredoc injection surface)
I'm satisfied with the author's empirical resolution. A
$VARexpanded inside an unquoted heredoc is inserted literally and not re-scanned — no recursive expansion, no command substitution on the value, no backslash processing of the value (only the literal heredoc text is escape-processed, and that text is justKEY=$VAR). The five adversarial test values (pa$HOME,a$(whoami)b, backtick, trailing backslash,${...}) all round-trip byte-identical to the old<<'EOF'+ Gitea-substitution form. So it's a true no-op behaviour-wise, not a new injection surface. Good to defer the explanatory one-liner; not a blocker.Bottom line
Statically this is now clean on every axis I check. The verdict stays ⚠️ for one reason only, unchanged from round 1: the abstraction's input plumbing and the unquoted-heredoc expansion only truly prove themselves on the runner. A real
workflow_dispatchnightly run ending with all five healthchecked obs services (obs-loki,obs-prometheus,obs-grafana,obs-tempo,obs-glitchtip) green — plus the Grafana/GlitchTip auth check the author committed to — flips this to a clean ✅. That run is legitimately still pending, so I won't pre-approve it green. Trigger it, post the link, and this is done.Solid round-2. The refactor removes a copy-paste invariant and now enforces its one load-bearing detail in CI — that's the abstraction paying for itself.
🧪 Sara Holt — QA (re-review)
✅ Approved — both of my round-1 concerns are resolved; the remaining gap is an operational watch-item for the human, not a code blocker.
I re-pulled the current branch and verified the two fixes on disk, including running the guard's logic and a mutation test against the real action file.
Resolved
1. The re-quote regression (my headline empty-vs-wrong concern) — closed at the only layer it can be.
Commit
0e06626eadds an enforced CI gate inci.yml(Assert deploy-obs writes obs-secrets.env via an unquoted heredoc). This is exactly the executable form of the invariant I asked for — it converts "a human grepped once" into "CI greps every run." And the self-test is real, not decorative: it exercises both branches withprintffixtures (regex must catch<<'EOF'and must ignore<<EOF), so a future edit that breaks the guard itself also fails CI. I verified mechanically on the current tree:<<EOFpresent in the real action): PASS<<EOF→<<'EOF') and the negative check tripped on line 61 and the positive check failed. So a re-quote can no longer ship broken obs auth green. The empty-vs-wrong hole the five-key runtime guard couldn't close is now closed statically, at the one point it's closeable.actions/checkout@v4(line 18 → guard line 111, same job), so the relative-path grep resolves. Correct ordering.2.
set -euo pipefailon the deploy-configs step — done.Commit
a4756493. Therm/mkdir/cp -rconfig-tree copy is no longer un-guarded;-ualso catches a typo'd env var. This was the one block doing un-aborted work while every sibling step hadset -e. Parity hardening at the canonical site, as requested.I also confirm Nora's empirical heredoc table:
$VARexpanded inside an unquoted heredoc is inserted literally and not re-scanned, so$/backtick/$(…)/trailing-\in a secret value are byte-identical to the old<<'EOF'form. No injection footgun. Good due diligence.Residual — what the human must still own (not a blocker, but do not skip)
The static gate closes the regression path. It does not substitute for the live-run gate, which is still the only proof the production path works at all:
workflow_dispatchis still the sole functional gate and is still pending (author acknowledges; will post the run link before merge). Read it actively: all five obs serviceshealthy(amissing≠unhealthy), and confirm the deploy step'scp -rlog shows the real tree, not an empty dir.smoke-test/action.ymltouches only the public app surface (login/HSTS/Permissions-Policy/actuator), nothing Grafana/GlitchTip. The static guard now prevents the re-quote cause of wrong credentials, but any other cause (rotated-but-wrong secret, masking edge case) still won't surface in CI. Manually log into Grafana + GlitchTip with the secrets once, post-run. This is the compensating control for the blind spot.release.yml'sdeploy-obs/smoke-testwiring (different secret names,archiv-production-db-1,archiv.raddatz.cloud) executes for the first time on the next real release. Treat that release as a verification event: someone on hand, check prod Grafana auth + the five obs services immediately after.Verdict
✅ Approved. The two things I flagged as actionable (static heredoc guard +
set -euo pipefail) are both shipped and verified. No test-layer blockers remain. The live nightly run and the first prod release stay on the human's checklist — necessary because.gitea/deploy YAML is untestable below the live layer, not because anything in this PR is wrong.— Sara
🔒 Nora Steiner — Security (re-review)
✅ Approved — both prior blockers withdrawn.
🚫→✅ BLOCKER 1 (unquoted-heredoc secret mangling) — WITHDRAWN. My prior analysis was incorrect.
I made a real shell-semantics error and I want to be precise about what I got wrong, because the correction is the load-bearing part.
My claim was: in an unquoted heredoc, the shell processes
\and re-expands$inside the secret value. That is false. The shell processes expansions and escapes on the literal heredoc template text (KEY=$VAR) — it does one pass. The result of$VARexpansion is inserted and never re-scanned. Since the template is justGRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD— one expansion, zero backslashes — there is nothing for the shell to mangle in the value. I conflated "backslashes in the heredoc text" with "backslashes in the value". Different things.I re-ran every adversarial payload from my original blocker against the action's exact
KEY=$VAR-in-<<EOFform:pa$HOME wordpa$HOME word$HOMEexpansiona${OTHER}ba${OTHER}ba$(whoami)ba$(whoami)ba`whoami`ba`whoami`bsecretval\(trailing\)secretval\a\$ba\$btab\theretab\therea\\ba\\bThe trailing-backslash line-continuation swallow I warned about does not happen — the
\is part of the expanded value, not the template, so it is not consumed as a continuation. Output is byte-identical to the old<<'EOF'+ Gitea-substitution form. The author's resolution table is correct. Noprintfrewrite is needed, and importantly, switching toprintf '%s\n'would have been a lateral move, not a fix — both are byte-faithful here.BLOCKER 2 (five-key guard checks presence not correctness) falls with it. It was predicated on Blocker 1's mangling being real. Since the write is already byte-faithful, the guard's remaining job — catch a genuinely empty/missing secret via
^KEY=.+— is exactly the right scope. It stands as correct defense-in-depth, not a failed mitigation. Withdrawn.✅ New CI guard (commit
0e06626e) — reviewed independently, it is sound.I verified the
Assert deploy-obs writes obs-secrets.env via an unquoted heredocstep end-to-end on the current tree, not just by reading it:<<'EOF'and ignores<<EOF. These inline self-tests are the right touch: they fail CI if the detection regex itself rots, so the guard can't silently degrade to a no-op.sed-re-quoted the delimiter to<<'EOF', and the negative assertion fired (exit 1). A future re-quote genuinely fails CI instead of shipping broken obs auth green.This directly closes the gap Felix/Sara/Tobias flagged (the
<<EOFinvariant was comment-only) and it does so with a self-testing check — exactly the "automate detection of the vulnerability class" control I'd ask for. Residual: the regex requiresobs-secrets.envand<<on one line, so a hypothetical re-quote plus a reformat onto separate lines could evade it — but that's not the realistic regression (a one-token re-quote on the existing line is), and the positive assertion would still flag the missing unquoted form. Acceptable.✅
set -euo pipefail(commita4756493) — genuine hardening.A failed
cp/mkdir/chmodwas previously swallowed, which could leave a stale or wrongobs-secrets.envon disk while the stack started anyway.-enow aborts the secret-write path on any failed file op. I confirmed all five$VARreferences are backed by theenv:block, so-uis safe (no unset-var foot-gun) and adds protection against a future typo'd var name. Good.⚠️ One concern carries forward (not a blocker, acceptance-checklist item)
Log masking on the
input → env: → $VARpath is still framework-dependent and unverified on a live run. On the pendingworkflow_dispatchnightly, confirm: (a) none of the five values appear unmasked anywhere in the log, and (b)docker compose config's error path doesn't echoobs-secrets.envcontents on a bad secret. The author already has this on the deferred list — keep it there as a merge gate, not just "5 services healthy".Summary
set -euo pipefailreal reliability gain,-usafe.The disproof is complete and I accept it without reservation. The secret-write path is byte-faithful, the invariant is now enforced in CI rather than by comment, and the failure mode is tested. Security verdict: approved. Nice work running the empirical test against the exact pattern — that's precisely how you close out a contested finding. Bugs are systemic; so are my mistakes, and this was a good one to correct in the open.
— Nora "NullX" Steiner