devops: extract composite actions for obs stack deploy steps (#603) #715

Merged
marcel merged 12 commits from feat/issue-603-composite-deploy-actions into main 2026-06-02 19:57:20 +02:00
Owner

Closes #603.

Extracts the deploy logic shared verbatim by nightly.yml and release.yml into three single-responsibility Gitea composite actions, replacing the unenforced # Keep in sync with nightly.yml honour-system invariant with one enforced definition.

What changed

New composite actions under .gitea/actions/:

Action Inputs Notes
deploy-obs 5 (grafana_admin_password, grafana_db_password, glitchtip_secret_key, postgres_password, postgres_host) Four named run: blocks (deploy configs → validate → start → assert health). Includes grafana_db_password for the #651 read-only reader role.
reload-caddy Caddy reload step verbatim, pinned alpine digest.
smoke-test host Public-surface checks parameterised by vhost.

Both workflows now call each via a single uses: ./.gitea/actions/<name>; nightly.yml keeps its #526 /import guard inline.

Security-critical details (per the issue's review notes)

  • obs-secrets.env heredoc switched to unquoted <<EOF with secrets mapped via env: + $VAR — a quoted delimiter would write literal var names and config --quiet would pass anyway.
  • Five-key non-empty guard (grep -Eq "^KEY=.+") runs right after the write; chmod 600 is the final operation (no world-readable window).
  • Secret inputs are required: true with no default: — a missing secret fails loudly.
  • Every composite run: step declares shell: bash.
  • actions/checkout@v4 pinned as step 1 in both workflows (a local action only exists on disk after checkout).

Docs / housekeeping

  • ADR-029 added (Accepted). ⚠️ The issue specified ADR-028, but 028-pdfjs-wasm-decoders-and-csp-constraint.md landed on main after the issue was filed; per the sequential-ADR convention this is 029.
  • docs/infrastructure/ci-gitea.md gains a Composite actions section (checkout-first rule, secrets-via-inputs + unquoted-heredoc constraint, how to add an input).
  • Renovate's privileged-digest watch extended to .gitea/actions/** so the relocated alpine digest stays tracked.

Verification

Pure behaviour-preserving CI-YAML refactor; no local test harness exists for .gitea/ YAML. All five files parse; AC invariants verified mechanically (4 named blocks, unquoted heredoc, 5-key guard, chmod-600-last, required-no-default inputs, checkout-first, zero "Keep in sync" comments remaining).

Pending: a workflow_dispatch run of nightly.yml ending with all 5 obs services healthy (per the issue's acceptance criterion). release.yml is trusted via its identical uses: calls. @marcel to trigger the nightly run.

🤖 Generated with Claude Code

Closes #603. Extracts the deploy logic shared verbatim by `nightly.yml` and `release.yml` into three single-responsibility Gitea composite actions, replacing the unenforced `# Keep in sync with nightly.yml` honour-system invariant with one enforced definition. ## What changed **New composite actions** under `.gitea/actions/`: | Action | Inputs | Notes | |---|---|---| | `deploy-obs` | 5 (`grafana_admin_password`, `grafana_db_password`, `glitchtip_secret_key`, `postgres_password`, `postgres_host`) | Four named `run:` blocks (deploy configs → validate → start → assert health). Includes `grafana_db_password` for the #651 read-only reader role. | | `reload-caddy` | — | Caddy reload step verbatim, pinned alpine digest. | | `smoke-test` | `host` | Public-surface checks parameterised by vhost. | **Both workflows** now call each via a single `uses: ./.gitea/actions/<name>`; `nightly.yml` keeps its `#526 /import` guard inline. ## Security-critical details (per the issue's review notes) - `obs-secrets.env` heredoc switched to **unquoted `<<EOF`** with secrets mapped via `env:` + `$VAR` — a quoted delimiter would write literal var names and `config --quiet` would pass anyway. - **Five-key non-empty guard** (`grep -Eq "^KEY=.+"`) runs right after the write; `chmod 600` is the **final** operation (no world-readable window). - Secret inputs are `required: true` with **no `default:`** — a missing secret fails loudly. - Every composite `run:` step declares `shell: bash`. - `actions/checkout@v4` pinned as step 1 in both workflows (a local action only exists on disk after checkout). ## Docs / housekeeping - **ADR-029** added (Accepted). ⚠️ The issue specified ADR-**028**, but `028-pdfjs-wasm-decoders-and-csp-constraint.md` landed on `main` after the issue was filed; per the sequential-ADR convention this is **029**. - `docs/infrastructure/ci-gitea.md` gains a **Composite actions** section (checkout-first rule, secrets-via-inputs + unquoted-heredoc constraint, how to add an input). - Renovate's privileged-digest watch extended to `.gitea/actions/**` so the relocated alpine digest stays tracked. ## Verification Pure behaviour-preserving CI-YAML refactor; no local test harness exists for `.gitea/` YAML. All five files parse; AC invariants verified mechanically (4 named blocks, unquoted heredoc, 5-key guard, chmod-600-last, required-no-default inputs, checkout-first, zero "Keep in sync" comments remaining). **Pending:** a `workflow_dispatch` run of `nightly.yml` ending with all 5 obs services healthy (per the issue's acceptance criterion). `release.yml` is trusted via its identical `uses:` calls. @marcel to trigger the nightly run. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel added 8 commits 2026-06-02 19:26:22 +02:00
First composite action in the repo (establishes the convention). Lifts the
Caddy reload step verbatim from nightly.yml/release.yml — DooD privileged
sibling + nsenter to systemctl reload caddy, pinned alpine digest, reload
not restart. No inputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parameterises the public-surface smoke test by host (one required input,
mapped via env: HOST). Keeps the three checks verbatim — login reachable,
HSTS value pinned, Permissions-Policy present, /actuator -> 404 — plus the
/proc/net/route gateway-detection and RESOLVE-array rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Five required, no-default inputs (incl. grafana_db_password for the #651
read-only reader role). Four named run: blocks keep the four CI log sections:
deploy configs, validate, start, assert health.

Secrets map to env: and are written via an unquoted <<EOF heredoc ('$VAR'
expands at the shell layer; a quoted delimiter would write the literal var
name and config --quiet would pass anyway). A five-key non-empty guard runs
right after the write, and chmod 600 is the final operation so the file is
never world-readable. ADR-016 absolute paths and the two-file --env-file
ordering are preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the four inline obs steps with one uses: ./.gitea/actions/deploy-obs,
and the Caddy reload + smoke test with one uses: each (host
staging.raddatz.cloud, postgres_host archiv-staging-db-1, STAGING_* secrets).
checkout@v4 stays the first step; the #526 /import mount guard stays inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the four inline obs steps with one uses: ./.gitea/actions/deploy-obs,
and the Caddy reload + smoke test with one uses: each (host
archiv.raddatz.cloud, postgres_host archiv-production-db-1, PROD_* secrets).
Removes all three '# Keep in sync with nightly.yml' comments — the shared
definition now enforces the invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reload-caddy pinned alpine digest moved out of the workflow files into a
composite action. Add .gitea/actions/** to the manual-review digest rule so the
digest stays watched and never silently goes stale (#603).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Records the decision to extract the shared obs-deploy/reload-caddy/smoke-test
logic into three composite actions instead of a reusable workflow or shared
shell script. Numbered 029 (028 was taken by the pdf.js wasm ADR on main since
the issue was filed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs(ci): document composite actions in ci-gitea.md
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m20s
CI / OCR Service Tests (pull_request) Successful in 24s
CI / Backend Unit Tests (pull_request) Successful in 3m39s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
02fb16a0bd
Adds a Composite actions section covering the checkout-first ordering rule, the
secrets-via-inputs + unquoted-heredoc constraint (with the five-key guard and
shell: bash requirement), and a step-by-step for adding an input. Notes that the
inline Reload Caddy example now lives in the reload-caddy action.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

🏛️ Markus Keller — Architect

Approved with concerns (concerns are suggestions, not blockers).

I reviewed this through the lens I care about: module boundaries, coupling, single-responsibility, the composite-action-vs-reusable-workflow decision, ADR quality, the checkout-first coupling, the three-way split, and duplication elimination. This is a genuinely well-architected refactor. The honour-system # Keep in sync invariant is exactly the kind of unenforced cross-module duplication that rots, and replacing it with one enforced definition + a typed input contract is the right call. ADR-029 is one of the better ADRs in this repo — it states the drivers, considers three real alternatives, and is honest about the negative consequences (checkout coupling, secrets-via-inputs, the relocated digest). That is what an ADR is for.

What I checked and liked:

  • Composite action over reusable workflow (Alternative A) is the correct decision. The deploy → reload → smoke sequence relies on in-job step ordering and a shared, already-checked-out workspace. workflow_call runs as a separate job with its own runner context and would break exactly that. The ADR names this precisely. Good.
  • Three-action split is the right granularity. deploy-obs, reload-caddy, smoke-test each have one reason to change and one responsibility. I considered whether reload-caddy + smoke-test should be one "publish-and-verify" action — no. They have independent rationales (config application vs public-surface assertion) and deploy-obs could plausibly run without a smoke test in some future path. Keeping them separate preserves recombination. Correct SRP boundary.
  • Per-environment variation is a typed input contract, not a forked block. STAGING_*/PROD_* secret names, POSTGRES_HOST, and the smoke host are the only deltas, and all three are now with: inputs. This is the duplication elimination done right — one definition, explicit seams.
  • The checkout-first coupling is documented, not introduced. I verified actions/checkout@v4 was already step 1 in both workflows before this PR. So the uses: ./.gitea/actions/… dependency adds a constraint to preserve, not a behavioral change — and the PR pins it with a comment in both workflows, in the ADR consequences, and in ci-gitea.md. Three places is appropriate for a load-bearing-but-invisible ordering rule. No objection.
  • Renovate watch extended to .gitea/actions/**. The pinned alpine digest moved into the action, and the privileged-digest review rule moved with it. A digest that silently goes stale on a --privileged --pid=host image is exactly the kind of thing that should never slip — good that this was caught in the same PR rather than discovered later.

Concerns (suggestions — none block merge):

  1. The deploy-obs/validate/start/assert four-step block is internally coupled to obs-secrets.env as a shared on-disk artifact. That is inherent to the obs-stack deploy and fine inside one action — I only flag it so future maintainers don't try to split these four steps into separate actions. The data dependency (deploy writes the env file, validate/start read it) means they belong together. Consider a one-line comment at the top of deploy-obs/action.yml stating "these four steps share obs-secrets.env and must stay in one action" so the SRP boundary is explicit and isn't re-litigated.

  2. Input naming asymmetry. postgres_host is a non-secret, while the other four are secrets, yet they sit in the same flat inputs: block distinguished only by a description. The action correctly maps all five through env:, so behaviour is fine. Minor: a comment grouping "secrets" vs "topology" in the inputs: block would make the contract self-documenting at a glance. Not worth a round-trip on its own.

  3. POSTGRES_HOST derivation remains a latent coupling to the Compose project name (archiv-staging / archiv-production). The PR documents this in both callers' comments and the input description — good. This is pre-existing and out of scope here; I note it only so it's on record that a Compose project rename is now a two-file change (the with: value in each workflow), which is strictly better than the previous in-heredoc literal.

Doc-currency check (my standard gate): I walked the table. New ADR — present (029, correctly numbered after 028 landed on main; the PR's note explaining the off-by-one from the issue is the right call). ci-gitea.md Composite-actions section — present and thorough. No new Flyway migration, no new backend package/route, no new Docker service, no new ErrorCode/Permission, no new external system. The C4 l2-containers.puml does not model the CI workflow internals, so no diagram update is triggered. No doc omissions — nothing to block on.

Verdict: ship it. The architecture is sound, the decision is documented and defensible, and the duplication is genuinely eliminated rather than relocated. The three concerns above are polish I'd happily take in a follow-up.

— Markus (@mkeller)

## 🏛️ Markus Keller — Architect **✅ Approved with concerns** (concerns are suggestions, not blockers). I reviewed this through the lens I care about: module boundaries, coupling, single-responsibility, the composite-action-vs-reusable-workflow decision, ADR quality, the checkout-first coupling, the three-way split, and duplication elimination. This is a genuinely well-architected refactor. The honour-system `# Keep in sync` invariant is exactly the kind of unenforced cross-module duplication that rots, and replacing it with one enforced definition + a typed input contract is the right call. ADR-029 is one of the better ADRs in this repo — it states the drivers, considers three real alternatives, and is honest about the negative consequences (checkout coupling, secrets-via-inputs, the relocated digest). That is what an ADR is for. What I checked and liked: - **Composite action over reusable workflow (Alternative A) is the correct decision.** The `deploy → reload → smoke` sequence relies on in-job step ordering and a shared, already-checked-out workspace. `workflow_call` runs as a separate job with its own runner context and would break exactly that. The ADR names this precisely. Good. - **Three-action split is the right granularity.** `deploy-obs`, `reload-caddy`, `smoke-test` each have one reason to change and one responsibility. I considered whether `reload-caddy` + `smoke-test` should be one "publish-and-verify" action — no. They have independent rationales (config application vs public-surface assertion) and `deploy-obs` could plausibly run without a smoke test in some future path. Keeping them separate preserves recombination. Correct SRP boundary. - **Per-environment variation is a typed input contract**, not a forked block. `STAGING_*`/`PROD_*` secret names, `POSTGRES_HOST`, and the smoke host are the only deltas, and all three are now `with:` inputs. This is the duplication elimination done right — one definition, explicit seams. - **The checkout-first coupling is documented, not introduced.** I verified `actions/checkout@v4` was already step 1 in both workflows before this PR. So the `uses: ./.gitea/actions/…` dependency adds a *constraint to preserve*, not a behavioral change — and the PR pins it with a comment in both workflows, in the ADR consequences, and in `ci-gitea.md`. Three places is appropriate for a load-bearing-but-invisible ordering rule. No objection. - **Renovate watch extended to `.gitea/actions/**`.** The pinned alpine digest moved into the action, and the privileged-digest review rule moved with it. A digest that silently goes stale on a `--privileged --pid=host` image is exactly the kind of thing that should never slip — good that this was caught in the same PR rather than discovered later. Concerns (suggestions — none block merge): 1. **The `deploy-obs`/`validate`/`start`/`assert` four-step block is internally coupled to `obs-secrets.env` as a shared on-disk artifact.** That is inherent to the obs-stack deploy and fine inside one action — I only flag it so future maintainers don't try to split these four steps into separate actions. The data dependency (`deploy` writes the env file, `validate`/`start` read it) means they belong together. Consider a one-line comment at the top of `deploy-obs/action.yml` stating "these four steps share `obs-secrets.env` and must stay in one action" so the SRP boundary is explicit and isn't re-litigated. 2. **Input naming asymmetry.** `postgres_host` is a non-secret, while the other four are secrets, yet they sit in the same flat `inputs:` block distinguished only by a description. The action correctly maps all five through `env:`, so behaviour is fine. Minor: a comment grouping "secrets" vs "topology" in the `inputs:` block would make the contract self-documenting at a glance. Not worth a round-trip on its own. 3. **`POSTGRES_HOST` derivation remains a latent coupling to the Compose project name** (`archiv-staging` / `archiv-production`). The PR documents this in both callers' comments and the input description — good. This is pre-existing and out of scope here; I note it only so it's on record that a Compose project rename is now a two-file change (the `with:` value in each workflow), which is strictly better than the previous in-heredoc literal. Doc-currency check (my standard gate): I walked the table. New ADR — present (029, correctly numbered after 028 landed on main; the PR's note explaining the off-by-one from the issue is the right call). `ci-gitea.md` Composite-actions section — present and thorough. No new Flyway migration, no new backend package/route, no new Docker service, no new ErrorCode/Permission, no new external system. The C4 `l2-containers.puml` does not model the CI workflow internals, so no diagram update is triggered. **No doc omissions — nothing to block on.** Verdict: ship it. The architecture is sound, the decision is documented and defensible, and the duplication is genuinely eliminated rather than relocated. The three concerns above are polish I'd happily take in a follow-up. — Markus (@mkeller)
Author
Owner

Leonie Voss (@leonievoss) — UI/UX & Accessibility

Approved — not applicable to my domain.

I scanned all 8 changed files. This is a pure CI/infrastructure refactor with zero user-facing surface:

  • No frontend files touched — no .svelte, .css, .ts, no routes, components, or Paraglide message keys.
  • No visual, layout, or accessibility impact — nothing renders to the end user; no brand tokens, touch targets, contrast, focus, or responsive behaviour to assess.
  • Changes are limited to Gitea composite actions (.gitea/actions/*/action.yml), the two deploy workflows, ADR-029, docs/infrastructure/ci-gitea.md, and renovate.json.

Nothing here reaches the senior-on-a-small-phone-in-bright-sunlight audience I advocate for, so there's nothing for me to gate on.

One light, optional note on the only surface that is human-readable — the operator-facing CI logs:

  • The ::error:: lines are clear and actionable (e.g. obs-secrets.env missing or empty: ${key}, $svc is not healthy (status: $status)) — good operator UX. The plain echo "ERROR: ..." in smoke-test and the echo "expected 404..." line use bare echo rather than the ::error:: annotation. Optional: switching those two to ::error:: would surface them as highlighted annotations in the Gitea run summary, consistent with the rest of the action. Purely cosmetic for log readability — not a blocker.

LGTM from the UX side.

## Leonie Voss (@leonievoss) — UI/UX & Accessibility **✅ Approved — not applicable to my domain.** I scanned all 8 changed files. This is a pure CI/infrastructure refactor with **zero user-facing surface**: - **No frontend files touched** — no `.svelte`, `.css`, `.ts`, no routes, components, or Paraglide message keys. - **No visual, layout, or accessibility impact** — nothing renders to the end user; no brand tokens, touch targets, contrast, focus, or responsive behaviour to assess. - Changes are limited to Gitea composite actions (`.gitea/actions/*/action.yml`), the two deploy workflows, **ADR-029**, `docs/infrastructure/ci-gitea.md`, and `renovate.json`. Nothing here reaches the senior-on-a-small-phone-in-bright-sunlight audience I advocate for, so there's nothing for me to gate on. One light, optional note on the only surface that *is* human-readable — the operator-facing CI logs: - The `::error::` lines are clear and actionable (e.g. `obs-secrets.env missing or empty: ${key}`, `$svc is not healthy (status: $status)`) — good operator UX. The plain `echo "ERROR: ..."` in `smoke-test` and the `echo "expected 404..."` line use bare `echo` rather than the `::error::` annotation. **Optional**: switching those two to `::error::` would surface them as highlighted annotations in the Gitea run summary, consistent with the rest of the action. Purely cosmetic for log readability — not a blocker. LGTM from the UX side.
Author
Owner

Requirements Engineering review — "Elicit"

Verdict: Approved

A clean, behaviour-preserving refactor with full requirements→implementation traceability. I traced every acceptance criterion in #603 against the diff. All 11 closeable ACs are met; the 12th (live nightly run) is correctly deferred to you. No blockers. One naming deviation is handled acceptably, plus two minor documentation-hygiene suggestions below.

Traceability matrix — AC → implementation

# Acceptance criterion (#603) Evidence in diff Status
1 deploy-obs runs 4 named run: blocks + writes 5 keys Four - name: steps (Deploy configs / Validate / Start / Assert health), heredoc writes all 5 keys Met
2 Unquoted <<EOF + secrets via env: + $VAR cat > … <<EOF (not <<'EOF'); env: maps all 5 inputs; heredoc references $VAR Met
3 Five-key non-empty guard, ::error:: + non-zero exit for key in …; do grep -Eq "^${key}=.+" … || { echo "::error::…"; exit 1; } over all 5 keys Met
4 chmod 600 is the final operation chmod 600 is the last line of the first run block, after the guard loop Met
5 reload-caddy (no inputs) + smoke-test (host) exist Both action files present; reload-caddy has no inputs:, smoke-test has host only Met
6 nightly.yml rewired (4 obs steps + reload + smoke → uses:) Three uses: ./.gitea/actions/… calls replace the inline blocks Met
7 release.yml rewired identically Same three uses: calls Met
8 All 3 # Keep in sync comments removed The three (validate / start / assert-health in release.yml) are deleted; zero remain Met
9 actions/checkout@v4 precedes first local uses: Step 1 in both workflows, pinned with an invariant comment Met
10 ci-gitea.md "Composite actions" section (checkout-first + how-to-add-input) New section covers both, plus secrets/heredoc constraint and a 4-step "Adding an input" recipe Met
11 ADR added (Accepted) docs/adr/029-…md added, Accepted, body matches the #603 draft Met (see deviation note)
12 Verified by live workflow_dispatch nightly run, 5 obs services healthy Explicitly deferred to @marcel in the PR body Open — correctly deferred

Notes beyond the checklist

The ADR-028 → 029 deviation is handled acceptably. #603 specified 028-…md, but 028-pdfjs-wasm-decoders-… landed on main after the issue was filed. Renumbering to 029 is the right call: ADR numbers are an append-only, collision-free sequence, and reusing 028 would create two ADR-028s. The PR body and the workflow header comments both reference ADR-029 consistently, and the ADR body is otherwise verbatim from the issue draft. This is a deviation from the literal AC text but satisfies its intent (an Accepted ADR documenting the decision). No change required — though I'd record it as a closed deviation so the audit trail explains why the issue says 028 and the repo says 029. The PR body already does this; that is sufficient.

The enforced-invariant goal is genuinely achieved. The core requirement behind #603 was replacing an honour-system "Keep in sync" invariant with an enforced single definition. The diff delivers exactly that: per-environment variation is now a typed inputs: contract (5 + 0 + 1 inputs), and the only remaining per-environment forks are the with: values — which is the intended, minimal surface. The #526 /import guard correctly stayed inline in nightly.yml per the issue's explicit instruction; good discipline not to over-extract.

required: true / no default: fail-loud requirement is met for all secret inputs — a missing secret aborts rather than silently writing an empty key, and the non-empty guard is a second line of defence. This is the kind of unwanted-behaviour ("If a secret is missing, the system shall fail loudly") that the issue called out as the single most likely bug; both the heredoc-delimiter fix and the guard address it.

Suggestions (non-blocking)

  • S1 — Close the loop on the deferred AC. AC #12 is the only unverified requirement and it is genuinely unverifiable without infrastructure. Recommend you (a) run workflow_dispatch on nightly.yml, (b) confirm all 5 obs services report healthy, and (c) paste the run link into this PR before merge, so the traceability record is closed by evidence rather than by trust. release.yml riding on the identical uses: calls is a sound argument — no separate prod run needed.
  • S2 — Documentation drift watch. The "Adding an input" recipe in ci-gitea.md now encodes the five-key-guard invariant in prose. That is itself a new "keep in sync" point (the guard loop and the heredoc and the docs must all agree). Minor, and arguably unavoidable, but worth a glance during the next obs-secret change.

Nothing here blocks merge. The requirements are fully traceable to the implementation; the one open item is a deferred verification, not a defect.

Elicit (Requirements Engineer)

## Requirements Engineering review — "Elicit" **Verdict: ✅ Approved** A clean, behaviour-preserving refactor with full requirements→implementation traceability. I traced every acceptance criterion in #603 against the diff. All 11 closeable ACs are met; the 12th (live nightly run) is correctly deferred to you. No blockers. One naming deviation is handled acceptably, plus two minor documentation-hygiene suggestions below. ### Traceability matrix — AC → implementation | # | Acceptance criterion (#603) | Evidence in diff | Status | |---|---|---|---| | 1 | `deploy-obs` runs 4 named `run:` blocks + writes 5 keys | Four `- name:` steps (Deploy configs / Validate / Start / Assert health), heredoc writes all 5 keys | ✅ Met | | 2 | Unquoted `<<EOF` + secrets via `env:` + `$VAR` | `cat > … <<EOF` (not `<<'EOF'`); `env:` maps all 5 inputs; heredoc references `$VAR` | ✅ Met | | 3 | Five-key **non-empty** guard, `::error::` + non-zero exit | `for key in …; do grep -Eq "^${key}=.+" … \|\| { echo "::error::…"; exit 1; }` over all 5 keys | ✅ Met | | 4 | `chmod 600` is the **final** operation | `chmod 600` is the last line of the first run block, after the guard loop | ✅ Met | | 5 | `reload-caddy` (no inputs) + `smoke-test` (`host`) exist | Both action files present; `reload-caddy` has no `inputs:`, `smoke-test` has `host` only | ✅ Met | | 6 | `nightly.yml` rewired (4 obs steps + reload + smoke → `uses:`) | Three `uses: ./.gitea/actions/…` calls replace the inline blocks | ✅ Met | | 7 | `release.yml` rewired identically | Same three `uses:` calls | ✅ Met | | 8 | All **3** `# Keep in sync` comments removed | The three (validate / start / assert-health in `release.yml`) are deleted; zero remain | ✅ Met | | 9 | `actions/checkout@v4` precedes first local `uses:` | Step 1 in both workflows, pinned with an invariant comment | ✅ Met | | 10 | `ci-gitea.md` "Composite actions" section (checkout-first + how-to-add-input) | New section covers both, plus secrets/heredoc constraint and a 4-step "Adding an input" recipe | ✅ Met | | 11 | ADR added (Accepted) | `docs/adr/029-…md` added, Accepted, body matches the #603 draft | ✅ Met (see deviation note) | | 12 | Verified by live `workflow_dispatch` nightly run, 5 obs services healthy | Explicitly **deferred to @marcel** in the PR body | ⏳ Open — correctly deferred | ### Notes beyond the checklist **The ADR-028 → 029 deviation is handled acceptably.** #603 specified `028-…md`, but `028-pdfjs-wasm-decoders-…` landed on `main` after the issue was filed. Renumbering to **029** is the right call: ADR numbers are an append-only, collision-free sequence, and reusing 028 would create two ADR-028s. The PR body and the workflow header comments both reference ADR-029 consistently, and the ADR body is otherwise verbatim from the issue draft. This is a deviation from the literal AC text but satisfies its **intent** (an Accepted ADR documenting the decision). No change required — though I'd record it as a closed deviation so the audit trail explains why the issue says 028 and the repo says 029. The PR body already does this; that is sufficient. **The enforced-invariant goal is genuinely achieved.** The core requirement behind #603 was replacing an honour-system "Keep in sync" invariant with an enforced single definition. The diff delivers exactly that: per-environment variation is now a typed `inputs:` contract (5 + 0 + 1 inputs), and the only remaining per-environment forks are the `with:` values — which is the intended, minimal surface. The `#526 /import` guard correctly **stayed inline** in `nightly.yml` per the issue's explicit instruction; good discipline not to over-extract. **`required: true` / no `default:` fail-loud requirement is met** for all secret inputs — a missing secret aborts rather than silently writing an empty key, and the non-empty guard is a second line of defence. This is the kind of unwanted-behaviour ("If a secret is missing, the system shall fail loudly") that the issue called out as the single most likely bug; both the heredoc-delimiter fix and the guard address it. ### Suggestions (non-blocking) - **S1 — Close the loop on the deferred AC.** AC #12 is the only unverified requirement and it is genuinely unverifiable without infrastructure. Recommend you (a) run `workflow_dispatch` on `nightly.yml`, (b) confirm all 5 obs services report healthy, and (c) paste the run link into this PR before merge, so the traceability record is closed by evidence rather than by trust. `release.yml` riding on the identical `uses:` calls is a sound argument — no separate prod run needed. - **S2 — Documentation drift watch.** The "Adding an input" recipe in `ci-gitea.md` now encodes the five-key-guard invariant in prose. That is itself a new "keep in sync" point (the guard loop and the heredoc and the docs must all agree). Minor, and arguably unavoidable, but worth a glance during the next obs-secret change. Nothing here blocks merge. The requirements are fully traceable to the implementation; the one open item is a deferred verification, not a defect. — *Elicit (Requirements Engineer)*
Author
Owner

Nora "NullX" Steiner — Application Security Engineer · Ethical Hacker

Adversarial mindset, defender's heart. I'll name the exact line and the exact payload, and hand you the fix in the same shell you wrote.

⚠️ Approved with concerns

Solid refactor with security-aware comments — the chmod 600-last ordering, the non-empty guard, required: true with no default, and the Renovate watch extension are all genuinely good controls and I want to call them out as correct. But the headline change — switching the secrets heredoc from a quoted delimiter (<<'EOF') to an unquoted one (<<EOF) — changes the trust semantics of how secret bytes are interpreted, and the current guard does not cover the new failure mode. One blocker, a couple of concerns, the rest is praise.


🚫 BLOCKER 1 — Unquoted heredoc mangles secrets containing backslashes (.gitea/actions/deploy-obs/action.yml, lines ~57–63)

This is the load-bearing change and it's the one place the threat model shifted. Let me be precise about why, because the PR description's reasoning is half-right and the dangerous half is the part it gets wrong.

Old code (quoted <<'EOF'): the ${{ secrets.X }} tokens were substituted by Gitea's template engine before the shell ever ran. The shell saw an already-final literal line and — because the delimiter was quoted — performed zero expansion. Secret bytes were written verbatim. Safe for any byte.

New code (unquoted <<EOF + $VAR): the shell now performs parameter expansion on the heredoc body. Two things are true here, and only one of them is the comment's concern:

  1. Command substitution / backticks are NOT a risk. The result of $VAR expansion is not re-scanned — a secret value of $(rm -rf /) or `id` is written literally, never executed. The comment's worry about "command substitution" is unfounded; POSIX word expansion does not recurse. Good news.

  2. 🚫 Backslash and $-sequences in the secret ARE mangled. In an unquoted heredoc the shell still processes \ as an escape character for $, `, \, and newline, and still expands $other sequences inside the secret value itself. Concrete payloads — all of which BCrypt-grade / Django SECRET_KEY / openssl rand output can plausibly contain:

    • Secret = s3cret\ (trailing backslash) → the \ + newline is consumed as a line-continuation; the GLITCHTIP_SECRET_KEY= line swallows the next line (POSTGRES_PASSWORD=…). File now has the wrong number of keys and a corrupted value.
    • Secret = a\$b → written as a$b (the \$ collapses).
    • Secret = pre$HOME/post or key$POSTGRES_PASSWORD$HOME / $POSTGRES_PASSWORD expands inside the secret, silently substituting another value into this one.
    • Secret = tab\there\t… (POSIX heredoc doesn't interpret \t, but the \ is still consumed before the t is not special — net: a dropped backslash).

    Impact: a deploy writes a silently wrong secret. Grafana/GlitchTip/Postgres come up with corrupted credentials, or — in the trailing-backslash case — obs-secrets.env is structurally broken (a key swallowed). This is exactly the class of bug the five-key guard was meant to catch, and it doesn't (see Blocker 2). CWE-150 (improper neutralization of escape sequences) / config-injection.

The PR description claims "a quoted delimiter would write the literal string and config --quiet would still pass." That's the reasoning for why you moved off quoting — but it conflates two different quotings. The old quoted heredoc was paired with Gitea-side ${{ }} substitution, so it did NOT write literal $VAR — it wrote the real secret. The premise that you had to go unquoted is only true because secrets.* is unavailable in composite actions and you're now forced through env: + $VAR. Fair. But the cost is that you've re-introduced shell escaping into the secret path, and that needs neutralizing.

Fix — don't hand secret bytes to the heredoc parser at all. Use a tool that does no expansion. Two clean options:

Option A — printf with positional args (no expansion of the values, only of the format string):

# env: block stays exactly as-is (GRAFANA_ADMIN_PASSWORD, etc.)
{
  printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$GRAFANA_ADMIN_PASSWORD"
  printf 'GRAFANA_DB_PASSWORD=%s\n'   "$GRAFANA_DB_PASSWORD"
  printf 'GLITCHTIP_SECRET_KEY=%s\n'  "$GLITCHTIP_SECRET_KEY"
  printf 'POSTGRES_PASSWORD=%s\n'     "$POSTGRES_PASSWORD"
  printf 'POSTGRES_HOST=%s\n'         "$POSTGRES_HOST"
} > /opt/familienarchiv/obs-secrets.env

%s with a quoted argument writes the value byte-for-byte: no backslash processing, no $ re-expansion, no line-continuation. This is the standard safe idiom for emitting untrusted values into a .env file.

Option B — keep the heredoc but feed values that can't be re-interpreted is not possible cleanly; prefer A.

If you keep the heredoc for readability, you must at minimum document that every secret is assumed free of \, $, and trailing backslash — but that's an unenforceable honour-system constraint on secret content, which is exactly the kind of invariant this PR set out to abolish. Use printf.


🚫 BLOCKER 2 — The five-key guard validates the wrong file and can't detect mangling (deploy-obs/action.yml, lines ~65–70)

grep -Eq "^${key}=.+" asserts each key is present and non-empty. It does not assert the value is correct. The mangling failures in Blocker 1 produce non-empty (but wrong) values — GLITCHTIP_SECRET_KEY=a$b, POSTGRES_PASSWORD=…corrupted — all of which sail past .+. And the trailing-backslash case that swallows the next line would actually be caught here (POSTGRES_PASSWORD key goes missing) — so the guard catches the structural break but not the silent-corruption cases, which are worse because the stack comes up with bad auth instead of failing loudly.

This isn't a reason to drop the guard — it's a defense-in-depth control and I like it. But it cannot be relied on as the mitigation for Blocker 1. Fix Blocker 1 at the write layer (printf); keep this guard as a second line. Once the write is byte-safe, the guard's remaining job (catch a genuinely empty/missing secret) is exactly right.

One concrete hardening while you're here: the guard only proves the key exists in the file — it can't prove the byte content survived. After switching to printf, consider asserting key count too ([ "$(grep -cE '^[A-Z_]+=.+' obs-secrets.env)" -eq 5 ]) so a swallowed-line regression trips immediately.


⚠️ CONCERN 3 — Gitea log-masking through input → env → $VAR is plausible but unverified

The comment correctly warns "do not stage these into intermediate variables, or Gitea log masking can be lost." Good instinct. But the masking guarantee for the input → env:$VAR path specifically — where the secret crosses the secrets.* → with: → inputs.* → env: boundary into a composite action — is exactly the kind of thing that's framework-version-dependent and that I never assume without verification (per my own rule: "assuming framework defaults are secure without verification").

Two real exposure points to confirm on the pending workflow_dispatch run:

  • config --quiet and up -d: docker compose reads --env-file and on a parse/substitution error can echo offending lines. --quiet suppresses the rendered config on success, but an error path may not be quiet. Verify a deliberately-broken secret does not dump the file to the log.
  • set -e + a failing grep/printf: confirm the runner's masking registers values that entered via inputs.* (not just top-level secrets.*). If the nightly run shows any of the five values unmasked anywhere, that's a P1 — Gitea Actions log masking keys off registered secret values, and the registration path for composite-action inputs is the open question.

Not a blocker on code, but it must be on the acceptance checklist for the pending nightly run, not just "all 5 services healthy."


What's correct (credit where due)

  • chmod 600 as the final operation — there is genuinely no world-readable window; cat/printf create the file with the umask-default perms then immediately tighten. Correct ordering, and the comment explains why it's load-bearing. (Belt-and-suspenders nit: a umask 077 before the write would make even the creation instant-private, defending against a crash between write and chmod. Optional.)
  • required: true with no default: — fails closed on a missing secret. Textbook. Don't let anyone "simplify" a default in later.
  • --privileged --pid=host Caddy reload + pinned digest + Renovate watch on .gitea/actions/** — moving the digest into the composite action without extending the Renovate matchPaths would have created a stale-digest blind spot on a root-equivalent-host-access image. You caught it and extended the watch in the same PR. That's the single most important supply-chain control in this diff and it's done right. The --privileged --pid=host container is root-on-the-host by construction, but that's inherent to the DooD-reload design (pre-existing, well-documented in ADR/ci-gitea.md), not introduced here — and the pinned-by-digest + manual-review-only Renovate rule is the correct compensating control.
  • checkout pinned as step 1 with an explanatory comment — correct, and the ordering constraint is real.
  • Smoke test pins the full HSTS preload value and the exact Permissions-Policy, and asserts /actuator → 404. These are real security regressions it will now catch. Nice.

Summary

# Severity Item
1 🚫 Blocker Unquoted heredoc mangles secrets with \, $, trailing backslash → silent-wrong / swallowed secret. Fix with printf '%s\n'.
2 🚫 Blocker Five-key guard checks presence not correctness — can't detect the mangling; keep as defense-in-depth, don't rely on it.
3 ⚠️ Concern Verify input→env→$VAR log masking + config error-path quietness on the pending nightly run.
chmod-600-last, required-no-default, Renovate digest watch, checkout-first, smoke-test header pins — all correct.

Fix Blocker 1 (one printf block, ~6 lines) and the trust semantics return to the byte-faithful behaviour the old quoted heredoc had. Happy to re-review the moment that lands. Bugs are systemic, not personal — this one is a genuinely subtle shell-escaping trap and the comments show you were already thinking about it.

— Nora "NullX" Steiner

## Nora "NullX" Steiner — Application Security Engineer · Ethical Hacker > Adversarial mindset, defender's heart. I'll name the exact line and the exact payload, and hand you the fix in the same shell you wrote. ### ⚠️ Approved with concerns Solid refactor with security-aware comments — the `chmod 600`-last ordering, the non-empty guard, `required: true` with no default, and the Renovate watch extension are all genuinely good controls and I want to call them out as *correct*. But the headline change — switching the secrets heredoc from a **quoted** delimiter (`<<'EOF'`) to an **unquoted** one (`<<EOF`) — changes the trust semantics of how secret *bytes* are interpreted, and the current guard does not cover the new failure mode. One blocker, a couple of concerns, the rest is praise. --- ### 🚫 BLOCKER 1 — Unquoted heredoc mangles secrets containing backslashes (`.gitea/actions/deploy-obs/action.yml`, lines ~57–63) This is the load-bearing change and it's the one place the threat model shifted. Let me be precise about *why*, because the PR description's reasoning is half-right and the dangerous half is the part it gets wrong. **Old code** (quoted `<<'EOF'`): the `${{ secrets.X }}` tokens were substituted by **Gitea's template engine** *before the shell ever ran*. The shell saw an already-final literal line and — because the delimiter was quoted — performed **zero** expansion. Secret bytes were written verbatim. Safe for *any* byte. **New code** (unquoted `<<EOF` + `$VAR`): the shell now performs parameter expansion on the heredoc body. Two things are true here, and only one of them is the comment's concern: 1. ✅ **Command substitution / backticks are NOT a risk.** The result of `$VAR` expansion is *not re-scanned* — a secret value of `` $(rm -rf /) `` or `` `id` `` is written **literally**, never executed. The comment's worry about "command substitution" is unfounded; POSIX word expansion does not recurse. Good news. 2. 🚫 **Backslash and `$`-sequences in the secret ARE mangled.** In an *unquoted* heredoc the shell still processes `\` as an escape character for `$`, `` ` ``, `\`, and newline, and still expands `$other` sequences *inside the secret value itself*. Concrete payloads — all of which BCrypt-grade / Django `SECRET_KEY` / `openssl rand` output can plausibly contain: - Secret = `s3cret\` (trailing backslash) → the `\` + newline is consumed as a line-continuation; the `GLITCHTIP_SECRET_KEY=` line **swallows the next line** (`POSTGRES_PASSWORD=…`). File now has the wrong number of keys and a corrupted value. - Secret = `a\$b` → written as `a$b` (the `\$` collapses). - Secret = `pre$HOME/post` or `key$POSTGRES_PASSWORD` → `$HOME` / `$POSTGRES_PASSWORD` expands inside the secret, silently substituting *another* value into this one. - Secret = `tab\there` → `\t`… (POSIX heredoc doesn't interpret `\t`, but the `\` is still consumed before the `t` is not special — net: a dropped backslash). **Impact:** a deploy writes a *silently wrong* secret. Grafana/GlitchTip/Postgres come up with corrupted credentials, or — in the trailing-backslash case — `obs-secrets.env` is structurally broken (a key swallowed). This is exactly the class of bug the five-key guard was *meant* to catch, and it doesn't (see Blocker 2). CWE-150 (improper neutralization of escape sequences) / config-injection. **The PR description claims "a quoted delimiter would write the literal string and `config --quiet` would still pass."** That's the reasoning for *why you moved off quoting* — but it conflates two different quotings. The old quoted heredoc was paired with **Gitea-side** `${{ }}` substitution, so it did NOT write literal `$VAR` — it wrote the real secret. The premise that you *had* to go unquoted is only true because `secrets.*` is unavailable in composite actions and you're now forced through `env:` + `$VAR`. Fair. But the cost is that you've re-introduced shell escaping into the secret path, and that needs neutralizing. **Fix — don't hand secret bytes to the heredoc parser at all. Use a tool that does no expansion.** Two clean options: *Option A — `printf` with positional args (no expansion of the values, only of the format string):* ```bash # env: block stays exactly as-is (GRAFANA_ADMIN_PASSWORD, etc.) { printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$GRAFANA_ADMIN_PASSWORD" printf 'GRAFANA_DB_PASSWORD=%s\n' "$GRAFANA_DB_PASSWORD" printf 'GLITCHTIP_SECRET_KEY=%s\n' "$GLITCHTIP_SECRET_KEY" printf 'POSTGRES_PASSWORD=%s\n' "$POSTGRES_PASSWORD" printf 'POSTGRES_HOST=%s\n' "$POSTGRES_HOST" } > /opt/familienarchiv/obs-secrets.env ``` `%s` with a quoted argument writes the value byte-for-byte: no backslash processing, no `$` re-expansion, no line-continuation. This is the standard safe idiom for emitting untrusted values into a `.env` file. *Option B — keep the heredoc but feed values that can't be re-interpreted is not possible cleanly; prefer A.* If you keep the heredoc for readability, you **must** at minimum document that every secret is assumed free of `\`, `$`, and trailing backslash — but that's an unenforceable honour-system constraint on *secret content*, which is exactly the kind of invariant this PR set out to abolish. Use `printf`. --- ### 🚫 BLOCKER 2 — The five-key guard validates the wrong file and can't detect mangling (`deploy-obs/action.yml`, lines ~65–70) `grep -Eq "^${key}=.+"` asserts each key is **present and non-empty**. It does **not** assert the value is *correct*. The mangling failures in Blocker 1 produce non-empty (but wrong) values — `GLITCHTIP_SECRET_KEY=a$b`, `POSTGRES_PASSWORD=…corrupted` — all of which sail past `.+`. And the trailing-backslash case that *swallows* the next line would actually be caught here (POSTGRES_PASSWORD key goes missing) — so the guard catches the structural break but not the silent-corruption cases, which are worse because the stack comes up with bad auth instead of failing loudly. This isn't a reason to drop the guard — it's a defense-in-depth control and I like it. But it cannot be relied on as *the* mitigation for Blocker 1. Fix Blocker 1 at the write layer (printf); keep this guard as a second line. Once the write is byte-safe, the guard's remaining job (catch a genuinely empty/missing secret) is exactly right. One concrete hardening while you're here: the guard only proves the key exists in the file — it can't prove the *byte content* survived. After switching to `printf`, consider asserting key **count** too (`[ "$(grep -cE '^[A-Z_]+=.+' obs-secrets.env)" -eq 5 ]`) so a swallowed-line regression trips immediately. --- ### ⚠️ CONCERN 3 — Gitea log-masking through input → env → `$VAR` is plausible but unverified The comment correctly warns "do not stage these into intermediate variables, or Gitea log masking can be lost." Good instinct. But the masking guarantee for the **input → `env:` → `$VAR`** path specifically — where the secret crosses the `secrets.* → with: → inputs.* → env:` boundary *into a composite action* — is exactly the kind of thing that's framework-version-dependent and that I never assume without verification (per my own rule: "assuming framework defaults are secure without verification"). Two real exposure points to confirm on the pending `workflow_dispatch` run: - **`config --quiet` and `up -d`**: `docker compose` reads `--env-file` and on a parse/substitution error can echo offending lines. `--quiet` suppresses the rendered config on success, but an *error* path may not be quiet. Verify a deliberately-broken secret does not dump the file to the log. - **`set -e` + a failing `grep`/`printf`**: confirm the runner's masking registers values that entered via `inputs.*` (not just top-level `secrets.*`). If the nightly run shows any of the five values unmasked anywhere, that's a P1 — Gitea Actions log masking keys off registered secret values, and the registration path for composite-action inputs is the open question. Not a blocker on code, but it **must** be on the acceptance checklist for the pending nightly run, not just "all 5 services healthy." --- ### ✅ What's correct (credit where due) - **`chmod 600` as the final operation** — there is genuinely no world-readable window; `cat`/`printf` create the file with the umask-default perms then immediately tighten. Correct ordering, and the comment explains *why* it's load-bearing. (Belt-and-suspenders nit: a `umask 077` before the write would make even the creation instant-private, defending against a crash *between* write and chmod. Optional.) - **`required: true` with no `default:`** — fails closed on a missing secret. Textbook. Don't let anyone "simplify" a default in later. - **`--privileged --pid=host` Caddy reload + pinned digest + Renovate watch on `.gitea/actions/**`** — moving the digest into the composite action without extending the Renovate `matchPaths` would have created a stale-digest blind spot on a *root-equivalent-host-access* image. You caught it and extended the watch in the same PR. That's the single most important supply-chain control in this diff and it's done right. The `--privileged --pid=host` container is root-on-the-host by construction, but that's inherent to the DooD-reload design (pre-existing, well-documented in ADR/ci-gitea.md), not introduced here — and the pinned-by-digest + manual-review-only Renovate rule is the correct compensating control. - **`checkout` pinned as step 1** with an explanatory comment — correct, and the ordering constraint is real. - **Smoke test** pins the *full* HSTS preload value and the exact Permissions-Policy, and asserts `/actuator → 404`. These are real security regressions it will now catch. Nice. --- ### Summary | # | Severity | Item | |---|----------|------| | 1 | 🚫 Blocker | Unquoted heredoc mangles secrets with `\`, `$`, trailing backslash → silent-wrong / swallowed secret. Fix with `printf '%s\n'`. | | 2 | 🚫 Blocker | Five-key guard checks presence not correctness — can't detect the mangling; keep as defense-in-depth, don't rely on it. | | 3 | ⚠️ Concern | Verify input→env→`$VAR` log masking + `config` error-path quietness on the pending nightly run. | | — | ✅ | chmod-600-last, required-no-default, Renovate digest watch, checkout-first, smoke-test header pins — all correct. | Fix Blocker 1 (one `printf` block, ~6 lines) and the trust semantics return to the byte-faithful behaviour the old quoted heredoc had. Happy to re-review the moment that lands. Bugs are systemic, not personal — this one is a genuinely subtle shell-escaping trap and the comments show you were already thinking about it. — Nora "NullX" Steiner
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

⚠️ Approved with concerns. This is a genuinely clean DRY refactor — the duplication is gone, the per-environment variance is a typed input contract, and the action boundaries are named in one or two words (deploy-obs, reload-caddy, smoke-test), exactly the single-responsibility split I'd ask for. My one real concern is that the most fragile, security-critical invariant in the whole PR is still guarded only by a comment and a runtime check, when this repo already has the pattern to enforce it structurally in CI. That's a suggestion I feel strongly about, not a blocker.


What I like (no changes needed)

  • DRY done right. release.yml shrank by 253 lines and nightly.yml lost its honour-system # Keep in sync with nightly.yml comments. The "five keys vs four" drift that ADR-029 cites is now structurally impossible — one definition, both callers. This is the right mechanism.
  • Single responsibility per action. Each action is one nameable job. deploy-obs/action.yml keeps its four run: blocks (deploy → validate → start → assert health) as separately-named steps rather than one mega-script — they stay individually readable and produce their own CI log section. Good.
  • The checkout-first constraint is handled honestly — pinned as step 1 in both workflows, called out in three places (workflow header comment, inline comment, ADR-029 consequences). A future reorder is the one thing that silently breaks uses: ./…, and you've over-documented it rather than under-documented it. Correct trade-off.
  • required: true with no default: on the secret inputs is exactly the fail-loud posture I want. A missing secret must not fall back to empty.
  • ADR numbering is correct and the deviation is disclosed. I verified: 028-pdfjs-wasm-decoders-and-csp-constraint.md is on main, so this genuinely is 029. The PR body flags the issue-said-028 discrepancy explicitly rather than quietly renumbering. That's the right way to handle it.

Comment quality — mostly load-bearing, one trim

The comments here are overwhelmingly why, not what, which is the standard I hold. The heredoc-delimiter comment (deploy-obs/action.yml, the <<EOF block) and the --resolve-as-bash-array comment (smoke-test/action.yml) are load-bearing — they encode non-obvious failure modes that a future editor would otherwise reintroduce. Keep them.

One observation, not a request: reload-caddy/action.yml has a ~25-line comment in front of a 3-line run:. Every sentence in it is genuinely load-bearing (DooD, why-not-systemctl, why-alpine, reload-not-restart), and it's now the single home for that rationale instead of being duplicated across two workflows — so I'd actually leave it. The duplication this PR removed is what makes a big comment block acceptable: it pays for itself once, not twice.

Concern (not a blocker) — the unquoted-heredoc invariant deserves a CI guard

This is the part I want to push on, and it's directly the TDD-applicability question for a pure-YAML refactor.

The PR is right that there's no local harness for .gitea/ YAML, and I agree TDD-by-unit-test doesn't apply here — there's nothing to red/green against on a dev machine. But "no local harness" isn't the same as "unguardable." This repo already runs grep-based structural assertions in ci.yml — and at least one of them is self-testing:

  • ci.yml:48 "Assert no banned vi.mock patterns"
  • ci.yml:68 "Assert no raw document date via {@html}" — and look at ci.yml:79-85: it feeds the regex a known-bad input, asserts it matches, feeds a known-good input, asserts it doesn't. That is red/green for a CI invariant.
  • ci.yml:91 "Assert no upload/download-artifact past v3"

The single most dangerous failure mode in this PR is precisely the one the code comment warns about: if someone "fixes" the heredoc back to <<'EOF', the file gets written with literal $GRAFANA_ADMIN_PASSWORD strings and docker compose config --quiet still passes — present-but-wrong. The five-key grep -Eq "^KEY=.+" runtime guard does not catch this either, because GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD is a non-empty line. So today the only thing standing between us and a silently-broken-auth deploy is a comment.

Given the existing precedent, this is cheap to enforce:

- name: Assert deploy-obs heredoc delimiter stays unquoted (#603)
  shell: bash
  run: |
    # A quoted delimiter writes literal $VAR names; `config --quiet` still passes.
    if grep -nE "cat .*obs-secrets\.env.*<<'EOF'" .gitea/actions/deploy-obs/action.yml; then
      echo "FAIL: quoted heredoc delimiter — secrets would be written as literal \$VAR. Use <<EOF."
      exit 1
    fi
    grep -qE 'cat .*obs-secrets\.env.*<<EOF([^'\'']|$)' .gitea/actions/deploy-obs/action.yml \
      || { echo "FAIL: expected unquoted <<EOF heredoc not found in deploy-obs"; exit 1; }

That converts the load-bearing comment into an enforced invariant, in the same idiom the repo already uses three times over. I'd also accept a one-line actionlint step in ci.yml as a broader structural net for all of .gitea/actions/** (it catches the shell:-missing and malformed-uses: classes for free), but the targeted grep is the one that protects the actual secret-exposure path. Suggestion, not a gate — the comment is honest about the risk; I'd just rather the CI enforce it than trust the next editor to read the comment.

Minor nits (non-blocking)

  • smoke-test/action.yml — the awk /proc/net/route gateway one-liner is dense, but it's commented field-by-field and identical to the pre-refactor code, so I won't ask you to extract it. Behaviour-preserving, fine.
  • deploy-obs step 1 does three jobs (copy configs, write+guard secrets, chmod) in one run:. In Java I'd split this; in a composite run: block the cohesion is acceptable because the chmod-600-last ordering is itself the security property and splitting steps would let a future reorder break it. Leaving the chmod adjacent to the write is the right call — just flagging I considered it.

Verdict

DRY goal achieved cleanly, boundaries are right, comments are load-bearing, the ADR/docs trail is thorough and honest. Approving with one concern I'd like addressed before this becomes the template every future composite action copies: add the CI grep guard for the unquoted heredoc so the invariant is enforced, not trusted. The repo already proves that's the house style for un-unit-testable invariants. Everything else is ship-ready.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ⚠️ **Approved with concerns.** This is a genuinely clean DRY refactor — the duplication is gone, the per-environment variance is a typed input contract, and the action boundaries are named in one or two words (`deploy-obs`, `reload-caddy`, `smoke-test`), exactly the single-responsibility split I'd ask for. My one real concern is that the most fragile, security-critical invariant in the whole PR is still guarded only by a comment and a runtime check, when this repo already has the pattern to enforce it structurally in CI. That's a suggestion I feel strongly about, not a blocker. --- ### What I like (no changes needed) - **DRY done right.** `release.yml` shrank by 253 lines and `nightly.yml` lost its honour-system `# Keep in sync with nightly.yml` comments. The "five keys vs four" drift that ADR-029 cites is now structurally impossible — one definition, both callers. This is the right mechanism. - **Single responsibility per action.** Each action is one nameable job. `deploy-obs/action.yml` keeps its four `run:` blocks (deploy → validate → start → assert health) as separately-named steps rather than one mega-script — they stay individually readable and produce their own CI log section. Good. - **The checkout-first constraint is handled honestly** — pinned as step 1 in both workflows, called out in three places (workflow header comment, inline comment, ADR-029 consequences). A future reorder is the one thing that silently breaks `uses: ./…`, and you've over-documented it rather than under-documented it. Correct trade-off. - **`required: true` with no `default:` on the secret inputs** is exactly the fail-loud posture I want. A missing secret must not fall back to empty. - **ADR numbering is correct and the deviation is disclosed.** I verified: `028-pdfjs-wasm-decoders-and-csp-constraint.md` is on `main`, so this genuinely is 029. The PR body flags the issue-said-028 discrepancy explicitly rather than quietly renumbering. That's the right way to handle it. ### Comment quality — mostly load-bearing, one trim The comments here are overwhelmingly *why*, not *what*, which is the standard I hold. The heredoc-delimiter comment (`deploy-obs/action.yml`, the `<<EOF` block) and the `--resolve`-as-bash-array comment (`smoke-test/action.yml`) are load-bearing — they encode non-obvious failure modes that a future editor would otherwise reintroduce. Keep them. One observation, not a request: `reload-caddy/action.yml` has a ~25-line comment in front of a 3-line `run:`. Every sentence in it is genuinely load-bearing (DooD, why-not-systemctl, why-alpine, reload-not-restart), and it's now the *single* home for that rationale instead of being duplicated across two workflows — so I'd actually leave it. The duplication this PR removed is what makes a big comment block acceptable: it pays for itself once, not twice. ### Concern (not a blocker) — the unquoted-heredoc invariant deserves a CI guard This is the part I want to push on, and it's directly the TDD-applicability question for a pure-YAML refactor. The PR is right that there's no *local* harness for `.gitea/` YAML, and I agree TDD-by-unit-test doesn't apply here — there's nothing to red/green against on a dev machine. But "no local harness" isn't the same as "unguardable." This repo already runs **grep-based structural assertions in `ci.yml`** — and at least one of them is self-testing: - `ci.yml:48` "Assert no banned vi.mock patterns" - `ci.yml:68` "Assert no raw document date via {@html}" — and look at `ci.yml:79-85`: it feeds the regex a known-bad input, asserts it matches, feeds a known-good input, asserts it doesn't. That *is* red/green for a CI invariant. - `ci.yml:91` "Assert no upload/download-artifact past v3" The single most dangerous failure mode in this PR is precisely the one the code comment warns about: if someone "fixes" the heredoc back to `<<'EOF'`, the file gets written with literal `$GRAFANA_ADMIN_PASSWORD` strings and `docker compose config --quiet` **still passes** — present-but-wrong. The five-key `grep -Eq "^KEY=.+"` runtime guard does *not* catch this either, because `GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD` is a non-empty line. So today the only thing standing between us and a silently-broken-auth deploy is a comment. Given the existing precedent, this is cheap to enforce: ```yaml - name: Assert deploy-obs heredoc delimiter stays unquoted (#603) shell: bash run: | # A quoted delimiter writes literal $VAR names; `config --quiet` still passes. if grep -nE "cat .*obs-secrets\.env.*<<'EOF'" .gitea/actions/deploy-obs/action.yml; then echo "FAIL: quoted heredoc delimiter — secrets would be written as literal \$VAR. Use <<EOF." exit 1 fi grep -qE 'cat .*obs-secrets\.env.*<<EOF([^'\'']|$)' .gitea/actions/deploy-obs/action.yml \ || { echo "FAIL: expected unquoted <<EOF heredoc not found in deploy-obs"; exit 1; } ``` That converts the load-bearing comment into an enforced invariant, in the same idiom the repo already uses three times over. I'd also accept a one-line `actionlint` step in `ci.yml` as a broader structural net for all of `.gitea/actions/**` (it catches the `shell:`-missing and malformed-`uses:` classes for free), but the targeted grep is the one that protects the actual secret-exposure path. **Suggestion, not a gate** — the comment is honest about the risk; I'd just rather the CI enforce it than trust the next editor to read the comment. ### Minor nits (non-blocking) - **`smoke-test/action.yml`** — the `awk` `/proc/net/route` gateway one-liner is dense, but it's commented field-by-field and identical to the pre-refactor code, so I won't ask you to extract it. Behaviour-preserving, fine. - **`deploy-obs` step 1 does three jobs** (copy configs, write+guard secrets, chmod) in one `run:`. In Java I'd split this; in a composite `run:` block the cohesion is acceptable because the chmod-600-last ordering is itself the security property and splitting steps would let a future reorder break it. Leaving the chmod adjacent to the write is the right call — just flagging I considered it. ### Verdict DRY goal achieved cleanly, boundaries are right, comments are load-bearing, the ADR/docs trail is thorough and honest. Approving with one concern I'd like addressed before this becomes the template every future composite action copies: **add the CI grep guard for the unquoted heredoc** so the invariant is enforced, not trusted. The repo already proves that's the house style for un-unit-testable invariants. Everything else is ship-ready.
Author
Owner

QA Review — Sara Holt (@saraholt), Senior QA Engineer

⚠️ Approved with concerns.

The refactor is clean and behaviour-preserving on inspection, and the security ordering (guard-then-chmod 600-last) is correct. My concern is not the code — it's the verification story. This PR ships untestable-by-design YAML whose only proof is a single live workflow_dispatch on the VPS, and several of the failure modes the PR claims to defend against are exactly the ones a one-shot manual run will not exercise. Below: the gap, the failure-mode coverage, what the human must watch in the nightly run, and what we can assert statically so the next refactor isn't this hand-verified.


1. The verification gap (the headline)

There is no test harness for .gitea/ at any layer — not static, not unit, not integration. The PR's "Verification" section is honest about this ("AC invariants verified mechanically"), but "verified mechanically" here means a human grepped the diff once. That is not a regression gate; it evaporates the moment this file is touched again.

Concretely, the test pyramid for this change is empty at every level:

Layer What it would catch Present?
Static (actionlint/yamllint) malformed action.yml, missing shell:, bad input refs none
Static (grep guard) re-quoted heredoc <<'EOF', checkout-not-first none
Integration (dry-run) composite resolves + config --quiet parity none — only the live VPS run
E2E 5 obs services healthy + smoke surface ⚠️ manual workflow_dispatch only, staging only

The single live run validates staging via nightly.yml. release.yml (production) is asserted "trusted via its identical uses: calls" — which is true only as long as the inputs are wired identically, and that wiring is hand-authored per workflow (different secret names, different postgres_host, different host). Production's deploy-obs/smoke-test invocation is never executed before it runs for real against prod. That's the riskiest unverified path in the PR and it's getting zero coverage.


2. Failure-mode coverage

Five-key guard (grep -Eq "^${key}=.+") — partially effective.
.+ requires ≥1 char, so it correctly fails on an empty/missing secret (KEY= or absent line). Good — that's the documented intent and it holds.

But the guard does not catch the very regression the unquoted heredoc exists to prevent. If someone re-quotes the delimiter back to <<'EOF' (an easy "tidy-up" mistake, since <<'EOF' is the more common idiom and looks safer), the file gets written as:

GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD   # literal, unexpanded

That line is non-empty, so the five-key guard passes. docker compose config --quiet passes (the var is present). The stack starts. Grafana/GlitchTip come up with literal-string passwords as their actual credentials — and the smoke test in this PR checks the public app surface (login/HSTS/Permissions-Policy/actuator), none of which touch obs auth. So a re-quoted-delimiter regression sails through all four named blocks and the smoke test, green. The guard defends against empty, not wrong. The PR's own comment even admits "config --quiet would still pass (the var is present, just wrong)" — but the added guard doesn't close that hole; it closes a different (empty) one. This is the single most important thing to understand about this change's safety envelope.

Unquoted-heredoc → injection footgun (suggestion, not blocker). Unquoted <<EOF also means a $ or backtick inside a secret value is now shell-interpreted. These are generated high-entropy secrets so practically low-risk, but it's a new property the quoted form didn't have. Worth a one-line note in the action comment.

Behaviour parity old→new — I checked the diff line-by-line; it's faithful, with two cosmetic drifts, both benign:

  • nightly's Start observability stack comment said "survive workspace wipes between nightly runs"; the action generalizes to "between runs". Correct (it's shared now) — matches release's original wording. No functional change.
  • The per-step # Keep in sync with nightly.yml (#603) comments are dropped from release.yml. That's the point of the PR (the action is now the single definition), so this is intended, not a silent loss.
  • nightly's #526 /import guard and the app-compose check correctly stay inline (now labelled "nightly-only — stays inline"). Good — no obs logic leaked out, no nightly-specific logic leaked in.

I found no silent divergence between the two callers and the extracted actions. The inputs map 1:1 to the former inline literals (STAGING_*/archiv-staging-db-1/staging.raddatz.cloud vs PROD_*/archiv-production-db-1/archiv.raddatz.cloud).

One pre-existing observation (not introduced here): the Deploy observability configs step has no set -e. The first three commands (rm/mkdir/cp -r) can fail without aborting — only the heredoc keys are guarded, not the config-tree copy. If cp -r infra/observability/. partially fails, you'd start the stack against a stale/incomplete config tree and the guard wouldn't notice. This matched the old inline step too, so it's parity, not a regression — but since you're already touching this block and extracting it as the canonical definition, adding set -euo pipefail to the top of that run: is the cheap moment to harden it. (smoke-test and assert health already do set -e; this step is the odd one out.)


3. What the human must watch on the nightly workflow_dispatch

Because this is the only gate, the run needs to be read actively, not just "did it go green":

  1. Confirm all five obs services report healthy in the Assert observability stack health block (obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip). A missing status here means a container never started — distinct from unhealthy.
  2. Actually log into Grafana with the secret after the run. The smoke test does NOT cover obs auth, so a green pipeline does not prove the Grafana/GlitchTip passwords were expanded correctly. This is the manual step that compensates for the guard's empty-vs-wrong blind spot above. Do it once now, post-merge.
  3. Confirm the deploy-obs step's working directory. Composite run: steps execute in $GITHUB_WORKSPACE (the checkout), so cp -r infra/observability/. resolves relative to the repo root. Verify the copy log shows the real tree, not an empty dir — a wrong CWD would silently copy nothing and the guard wouldn't fire (it only checks obs-secrets.env, written by absolute path).
  4. Watch for the checkout-first ordering. If actions/checkout@v4 is ever not step 1, the first uses: ./.gitea/actions/… fails with "action not found" — a clear error, but watch for it on the first run.
  5. Production is still unproven after this. The next real release.yml run is the first execution of the prod deploy-obs/smoke-test wiring. Treat that release as a verification event: have someone on hand and check prod Grafana auth + the five obs services immediately after.

4. What we can assert statically (the real fix for the gap) — suggestions

This change is the perfect argument for a tiny CI lint job. None of these need the VPS; they run on any runner in seconds and would have caught the regressions above mechanically instead of relying on a human grep:

  1. actionlint over .gitea/workflows/** and .gitea/actions/**. Catches missing shell:, bad inputs.*/secrets.* references, malformed composite syntax — the entire class of "only fails at runtime on the VPS" errors. ~2s, zero infra.
  2. yamllint for plain YAML well-formedness as a cheap base.
  3. A grep guard for the heredoc invariant — this is the one that closes the empty-vs-wrong hole at the only point it can be closed (statically):
    # fail if the obs-secrets heredoc delimiter is ever quoted again
    grep -nE "cat .*obs-secrets\.env.*<<'EOF'" .gitea/actions/deploy-obs/action.yml && {
      echo "::error::obs-secrets.env heredoc delimiter must be unquoted (<<EOF) so \$VAR expands"; exit 1; }
    
    Plus a positive assertion that <<EOF (unquoted) is present. This is the test that should ship with this PR — it's the executable form of the invariant the whole change rests on.
  4. A chmod 600-is-last assertion and a checkout-first assertion (grep that actions/checkout precedes the first uses: ./.gitea/actions/ in each workflow). Both are currently convention-only.

These would convert "verified mechanically by a human, once" into "verified by CI, every time" — which is the difference between a refactor that stays correct and one that quietly rots the next time someone touches it.


Verdict

⚠️ Approved with concerns — no blockers; the code is correct and faithful to the originals.

  • Not blocking, but do before/at merge: add set -euo pipefail to the Deploy observability configs step (parity hardening at the canonical site).
  • Strongly recommended as a follow-up issue (or fold into this PR): an actionlint + grep-guard CI job. The unquoted-heredoc invariant is load-bearing and currently has zero automated protection — a static grep guard is ~5 lines and closes the empty-vs-wrong gap the runtime guard can't.
  • For the human: the live nightly run is necessary but not sufficient — manually verify Grafana/GlitchTip auth post-run, and treat the first real release.yml as the production verification event, since prod's wiring is executed for the first time there.

— Sara

## QA Review — Sara Holt (@saraholt), Senior QA Engineer **⚠️ Approved with concerns.** The refactor is clean and behaviour-preserving on inspection, and the security ordering (guard-then-`chmod 600`-last) is correct. My concern is not the code — it's the **verification story**. This PR ships untestable-by-design YAML whose only proof is a single live `workflow_dispatch` on the VPS, and several of the failure modes the PR claims to defend against are exactly the ones a one-shot manual run will *not* exercise. Below: the gap, the failure-mode coverage, what the human must watch in the nightly run, and what we can assert statically so the next refactor isn't this hand-verified. --- ### 1. The verification gap (the headline) There is **no test harness for `.gitea/` at any layer** — not static, not unit, not integration. The PR's "Verification" section is honest about this ("AC invariants verified mechanically"), but "verified mechanically" here means *a human grepped the diff once*. That is not a regression gate; it evaporates the moment this file is touched again. Concretely, the test pyramid for this change is empty at every level: | Layer | What it would catch | Present? | |---|---|---| | Static (actionlint/yamllint) | malformed `action.yml`, missing `shell:`, bad input refs | ❌ none | | Static (grep guard) | re-quoted heredoc `<<'EOF'`, checkout-not-first | ❌ none | | Integration (dry-run) | composite resolves + `config --quiet` parity | ❌ none — only the live VPS run | | E2E | 5 obs services healthy + smoke surface | ⚠️ manual `workflow_dispatch` only, staging only | The single live run validates **staging via `nightly.yml`**. `release.yml` (production) is asserted "trusted via its identical `uses:` calls" — which is true *only as long as the inputs are wired identically*, and that wiring is hand-authored per workflow (different secret names, different `postgres_host`, different `host`). Production's `deploy-obs`/`smoke-test` invocation is **never executed before it runs for real against prod**. That's the riskiest unverified path in the PR and it's getting zero coverage. --- ### 2. Failure-mode coverage **Five-key guard (`grep -Eq "^${key}=.+"`) — partially effective.** `.+` requires ≥1 char, so it correctly fails on an empty/missing secret (`KEY=` or absent line). Good — that's the documented intent and it holds. But the guard **does not catch the very regression the unquoted heredoc exists to prevent.** If someone re-quotes the delimiter back to `<<'EOF'` (an easy "tidy-up" mistake, since `<<'EOF'` is the *more common* idiom and looks safer), the file gets written as: ``` GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD # literal, unexpanded ``` That line is **non-empty**, so the five-key guard passes. `docker compose config --quiet` passes (the var is present). The stack starts. Grafana/GlitchTip come up with literal-string passwords as their actual credentials — and the smoke test in this PR checks the *public app surface* (login/HSTS/Permissions-Policy/actuator), none of which touch obs auth. So a re-quoted-delimiter regression sails through **all four named blocks and the smoke test, green.** The guard defends against *empty*, not *wrong*. The PR's own comment even admits "`config --quiet` would still pass (the var is present, just wrong)" — but the added guard doesn't close that hole; it closes a different (empty) one. This is the single most important thing to understand about this change's safety envelope. **Unquoted-heredoc → injection footgun (suggestion, not blocker).** Unquoted `<<EOF` also means a `$` or backtick *inside* a secret value is now shell-interpreted. These are generated high-entropy secrets so practically low-risk, but it's a new property the quoted form didn't have. Worth a one-line note in the action comment. **Behaviour parity old→new — I checked the diff line-by-line; it's faithful, with two cosmetic drifts, both benign:** - nightly's `Start observability stack` comment said *"survive workspace wipes between **nightly** runs"*; the action generalizes to *"between runs"*. Correct (it's shared now) — matches release's original wording. No functional change. - The per-step `# Keep in sync with nightly.yml (#603)` comments are dropped from `release.yml`. That's the *point* of the PR (the action is now the single definition), so this is intended, not a silent loss. - nightly's `#526 /import` guard and the app-compose check correctly stay inline (now labelled "nightly-only — stays inline"). Good — no obs logic leaked out, no nightly-specific logic leaked in. I found **no silent divergence** between the two callers and the extracted actions. The inputs map 1:1 to the former inline literals (`STAGING_*`/`archiv-staging-db-1`/`staging.raddatz.cloud` vs `PROD_*`/`archiv-production-db-1`/`archiv.raddatz.cloud`). **One pre-existing observation (not introduced here):** the `Deploy observability configs` step has no `set -e`. The first three commands (`rm`/`mkdir`/`cp -r`) can fail without aborting — only the heredoc keys are guarded, not the config-tree copy. If `cp -r infra/observability/.` partially fails, you'd start the stack against a stale/incomplete config tree and the guard wouldn't notice. This matched the old inline step too, so it's **parity, not a regression** — but since you're already touching this block and extracting it as the canonical definition, adding `set -euo pipefail` to the top of that `run:` is the cheap moment to harden it. (`smoke-test` and `assert health` already do `set -e`; this step is the odd one out.) --- ### 3. What the human must watch on the nightly `workflow_dispatch` Because this is the *only* gate, the run needs to be read actively, not just "did it go green": 1. **Confirm all five obs services report `healthy`** in the `Assert observability stack health` block (`obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip`). A `missing` status here means a container never started — distinct from `unhealthy`. 2. **Actually log into Grafana with the secret** after the run. The smoke test does NOT cover obs auth, so a green pipeline does not prove the Grafana/GlitchTip passwords were expanded correctly. This is the manual step that compensates for the guard's empty-vs-wrong blind spot above. Do it once now, post-merge. 3. **Confirm the `deploy-obs` step's working directory.** Composite `run:` steps execute in `$GITHUB_WORKSPACE` (the checkout), so `cp -r infra/observability/.` resolves relative to the repo root. Verify the copy log shows the real tree, not an empty dir — a wrong CWD would silently copy nothing and the guard wouldn't fire (it only checks `obs-secrets.env`, written by absolute path). 4. **Watch for the checkout-first ordering.** If `actions/checkout@v4` is ever not step 1, the first `uses: ./.gitea/actions/…` fails with "action not found" — a clear error, but watch for it on the first run. 5. **Production is still unproven** after this. The next *real* `release.yml` run is the first execution of the prod `deploy-obs`/`smoke-test` wiring. Treat that release as a verification event: have someone on hand and check prod Grafana auth + the five obs services immediately after. --- ### 4. What we *can* assert statically (the real fix for the gap) — suggestions This change is the perfect argument for a tiny CI lint job. None of these need the VPS; they run on any runner in seconds and would have caught the regressions above mechanically instead of relying on a human grep: 1. **`actionlint`** over `.gitea/workflows/**` and `.gitea/actions/**`. Catches missing `shell:`, bad `inputs.*`/`secrets.*` references, malformed composite syntax — the entire class of "only fails at runtime on the VPS" errors. ~2s, zero infra. 2. **`yamllint`** for plain YAML well-formedness as a cheap base. 3. **A grep guard for the heredoc invariant** — this is the one that closes the empty-vs-wrong hole at the only point it *can* be closed (statically): ```bash # fail if the obs-secrets heredoc delimiter is ever quoted again grep -nE "cat .*obs-secrets\.env.*<<'EOF'" .gitea/actions/deploy-obs/action.yml && { echo "::error::obs-secrets.env heredoc delimiter must be unquoted (<<EOF) so \$VAR expands"; exit 1; } ``` Plus a positive assertion that `<<EOF` (unquoted) *is* present. This is the test that should ship **with** this PR — it's the executable form of the invariant the whole change rests on. 4. **A `chmod 600`-is-last assertion** and a **checkout-first assertion** (grep that `actions/checkout` precedes the first `uses: ./.gitea/actions/` in each workflow). Both are currently convention-only. These would convert "verified mechanically by a human, once" into "verified by CI, every time" — which is the difference between a refactor that stays correct and one that quietly rots the next time someone touches it. --- ### Verdict **⚠️ Approved with concerns** — no blockers; the code is correct and faithful to the originals. - **Not blocking, but do before/at merge:** add `set -euo pipefail` to the `Deploy observability configs` step (parity hardening at the canonical site). - **Strongly recommended as a follow-up issue (or fold into this PR):** an `actionlint` + grep-guard CI job. The unquoted-heredoc invariant is load-bearing and currently has *zero* automated protection — a static grep guard is ~5 lines and closes the empty-vs-wrong gap the runtime guard can't. - **For the human:** the live nightly run is necessary but not sufficient — manually verify Grafana/GlitchTip auth post-run, and treat the first real `release.yml` as the production verification event, since prod's wiring is executed for the first time there. — Sara
Author
Owner

DevOps & Platform review — Tobias Wendt (@tobiwendt)

⚠️ Approved with concerns — the refactor is structurally correct and behaviour-preserving on every axis I could verify statically. The single real concern is a narrow shell-expansion edge in the unquoted heredoc that did not exist in the old form. It does not block merge, but it must be on the record and the acceptance workflow_dispatch run is still mandatory before this is "done."


What's correct (verified, not assumed)

  • Composite-action mechanics are textbook. using: composite, every run: step carries shell: bash, inputs map to env: and are referenced as $VAR, callers use uses: ./.gitea/actions/<name> relative paths. actions/checkout@v4 is step 1 in both nightly.yml and release.yml — correct, because a local uses: ./… action only exists on disk post-checkout. The checkout-first invariant is also documented in both workflow headers and ci-gitea.md. Good.
  • Parity nightly ↔ release is exact. I diffed the with: blocks: the only deltas are the three documented ones — STAGING_POSTGRES_PASSWORDPROD_POSTGRES_PASSWORD, postgres_host (archiv-staging-db-1archiv-production-db-1), and smoke-test host (staging.raddatz.cloudarchiv.raddatz.cloud). Nothing else forks. This is precisely the duplication ADR-029 set out to kill, and it's killed cleanly — zero # Keep in sync comments remain (git grep confirms).
  • #526 /import guard correctly stays inline in nightly.yml — it's an app-compose assertion, nightly-only, not shared deploy logic. Right call not to drag it into a composite.
  • Five-key guard is correct. grep -Eq "^${key}=.+" runs immediately after the heredoc write and rejects an empty KEY= line (the .+ is load-bearing — a bare presence check would pass KEY=). chmod 600 is the final operation, so there's no world-readable window. Required inputs carry no default:, so a missing secret fails loudly rather than silently writing an empty value.
  • set -e parity holds — and this is the subtle one. The old "Deploy / Validate / Start" inline steps had no explicit set -e, and neither do the composite versions. That's fine: per the Actions runner spec, both a default run: block AND an explicit shell: bash resolve to bash --noprofile --norc -e -o pipefail {0}. So -e -o pipefail is active in the composite step exactly as it was inline — cp/mkdir/cat failures are not swallowed. The explicit set -e in the assert-health and smoke-test steps is belt-and-suspenders, consistent with the originals.
  • Heredoc indentation is safe. Body lines sit at 8-space YAML indent, but the | block scalar strips common indentation to column 0 before the shell sees it, so KEY=value is written flush-left. The plain <<EOF (not <<-EOF) is correct here precisely because YAML already normalised the indentation. Matches the original.
  • Renovate extension is correct. matchPaths now covers .gitea/actions/** alongside .gitea/workflows/**, so the relocated pinned alpine:3.21@sha256:48b0… digest in reload-caddy stays under the manual-review (no-automerge) digest rule. Without this the privileged-image digest would have gone dark. Good catch — exactly the kind of thing that silently rots.
  • ADR-028→029 renumber is legitimate. I confirmed on the PR branch: 028-pdfjs-wasm-decoders-and-csp-constraint.md landed on main after #603 was filed, so per the sequential-ADR convention the new one is correctly 029, not 028. No collision.

Concern (non-blocking, but record it)

  1. Unquoted heredoc introduces a shell-interpretation surface the old form didn't have. The old inline write was <<'EOF' (quoted) with ${{ secrets.X }} — Gitea expanded the secret into the YAML before the shell ran, and the quoted delimiter meant the shell did zero further interpretation of the value. The new form must use <<EOF (unquoted) so $VAR expands, because secrets.* isn't available inside a composite action. That's the correct and only mechanism — but it means a secret value containing a backtick, $(…), or a trailing backslash would now be subject to command substitution / line continuation at heredoc time, where before it was inert. For the five values here (two passwords, a Django key, a generated PG password, a static hostname) this is low-probability, but it's a genuine behaviour delta, not a no-op. Worth one sentence in the deploy-obs comment block noting "secret values must not contain unescaped backticks/$(...) — heredoc is unquoted by necessity," so the next person rotating a password doesn't generate one with a backtick and get a baffling deploy failure (or worse, silent corruption that the five-key guard wouldn't catch, since KEY=<garbage> still matches .+).

Suggestions (cosmetic, ignore freely)

  1. Stale comment carried over verbatim. Both the validate and start steps comment "obs-secrets.env overrides POSTGRES_HOST set in obs.env (later file wins)." POSTGRES_HOST is not actually set in infra/observability/obs.env — it's only mentioned in obs.env's comments. So there's nothing to override; obs-secrets.env is the sole source. This inaccuracy predates the PR (copied straight from the inline steps), so it's not a regression and I won't hold the PR for it — but since you're already touching these lines, it'd be a clean moment to drop the "overrides" clause.
  2. postgres_host is plain input, correctly not a secret — and the comment says so. Fine. Just noting I checked it isn't accidentally being masked/required-as-secret.

The honest bottom line

This can only be truly verified by a real workflow_dispatch run of nightly.yml ending with all five healthchecked obs services (obs-loki, obs-prometheus, obs-grafana, obs-tempo, obs-glitchtip) green — exactly as the PR body acknowledges. Static review can confirm the YAML parses, the wiring matches, and parity holds (all done above), but composite-action input plumbing and the unquoted-heredoc expansion only prove themselves on the runner. release.yml riding on identical uses: calls is a reasonable trust argument, but the nightly run is non-negotiable before merge. @marcel — please trigger it and confirm the assert-health step passes; that flips this to a clean from me.

Every added tool is a new failure mode — but a composite action that removes a copy-paste invariant and an honour-system comment is the rare case where the abstraction pays for itself. Nice, minimal refactor.

## DevOps & Platform review — Tobias Wendt (@tobiwendt) ⚠️ **Approved with concerns** — the refactor is structurally correct and behaviour-preserving on every axis I could verify statically. The single real concern is a narrow shell-expansion edge in the unquoted heredoc that did not exist in the old form. It does not block merge, but it must be on the record and the acceptance `workflow_dispatch` run is still mandatory before this is "done." --- ### What's correct (verified, not assumed) - **Composite-action mechanics are textbook.** `using: composite`, every `run:` step carries `shell: bash`, inputs map to `env:` and are referenced as `$VAR`, callers use `uses: ./.gitea/actions/<name>` relative paths. `actions/checkout@v4` is step 1 in both `nightly.yml` and `release.yml` — correct, because a local `uses: ./…` action only exists on disk post-checkout. The checkout-first invariant is also documented in both workflow headers and `ci-gitea.md`. Good. - **Parity nightly ↔ release is exact.** I diffed the `with:` blocks: the only deltas are the three documented ones — `STAGING_POSTGRES_PASSWORD`→`PROD_POSTGRES_PASSWORD`, `postgres_host` (`archiv-staging-db-1`→`archiv-production-db-1`), and smoke-test `host` (`staging.raddatz.cloud`→`archiv.raddatz.cloud`). Nothing else forks. This is precisely the duplication ADR-029 set out to kill, and it's killed cleanly — zero `# Keep in sync` comments remain (`git grep` confirms). - **`#526 /import` guard correctly stays inline** in `nightly.yml` — it's an app-compose assertion, nightly-only, not shared deploy logic. Right call not to drag it into a composite. - **Five-key guard is correct.** `grep -Eq "^${key}=.+"` runs immediately after the heredoc write and rejects an empty `KEY=` line (the `.+` is load-bearing — a bare presence check would pass `KEY=`). `chmod 600` is the final operation, so there's no world-readable window. Required inputs carry no `default:`, so a missing secret fails loudly rather than silently writing an empty value. - **`set -e` parity holds — and this is the subtle one.** The old "Deploy / Validate / Start" inline steps had no explicit `set -e`, and neither do the composite versions. That's fine: per the Actions runner spec, both a default `run:` block AND an explicit `shell: bash` resolve to `bash --noprofile --norc -e -o pipefail {0}`. So `-e -o pipefail` is active in the composite step exactly as it was inline — `cp`/`mkdir`/`cat` failures are **not** swallowed. The explicit `set -e` in the assert-health and smoke-test steps is belt-and-suspenders, consistent with the originals. - **Heredoc indentation is safe.** Body lines sit at 8-space YAML indent, but the `|` block scalar strips common indentation to column 0 before the shell sees it, so `KEY=value` is written flush-left. The plain `<<EOF` (not `<<-EOF`) is correct here precisely because YAML already normalised the indentation. Matches the original. - **Renovate extension is correct.** `matchPaths` now covers `.gitea/actions/**` alongside `.gitea/workflows/**`, so the relocated pinned `alpine:3.21@sha256:48b0…` digest in `reload-caddy` stays under the manual-review (no-automerge) digest rule. Without this the privileged-image digest would have gone dark. Good catch — exactly the kind of thing that silently rots. - **ADR-028→029 renumber is legitimate.** I confirmed on the PR branch: `028-pdfjs-wasm-decoders-and-csp-constraint.md` landed on `main` after #603 was filed, so per the sequential-ADR convention the new one is correctly `029`, not `028`. No collision. ### Concern (non-blocking, but record it) 1. **Unquoted heredoc introduces a shell-interpretation surface the old form didn't have.** The old inline write was `<<'EOF'` (quoted) with `${{ secrets.X }}` — Gitea expanded the secret into the YAML *before* the shell ran, and the quoted delimiter meant the shell did zero further interpretation of the value. The new form must use `<<EOF` (unquoted) so `$VAR` expands, because `secrets.*` isn't available inside a composite action. That's the correct and only mechanism — but it means a secret value containing a backtick, `$(…)`, or a trailing backslash would now be subject to command substitution / line continuation at heredoc time, where before it was inert. For the five values here (two passwords, a Django key, a generated PG password, a static hostname) this is low-probability, but it's a genuine behaviour delta, not a no-op. Worth one sentence in the `deploy-obs` comment block noting "secret values must not contain unescaped backticks/`$(...)` — heredoc is unquoted by necessity," so the next person rotating a password doesn't generate one with a backtick and get a baffling deploy failure (or worse, silent corruption that the five-key guard wouldn't catch, since `KEY=<garbage>` still matches `.+`). ### Suggestions (cosmetic, ignore freely) 2. **Stale comment carried over verbatim.** Both the validate and start steps comment "obs-secrets.env overrides POSTGRES_HOST set in obs.env (later file wins)." `POSTGRES_HOST` is **not** actually set in `infra/observability/obs.env` — it's only mentioned in obs.env's comments. So there's nothing to override; obs-secrets.env is the sole source. This inaccuracy predates the PR (copied straight from the inline steps), so it's not a regression and I won't hold the PR for it — but since you're already touching these lines, it'd be a clean moment to drop the "overrides" clause. 3. **`postgres_host` is plain input, correctly not a secret** — and the comment says so. Fine. Just noting I checked it isn't accidentally being masked/required-as-secret. ### The honest bottom line This can only be *truly* verified by a real `workflow_dispatch` run of `nightly.yml` ending with all five healthchecked obs services (`obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`, `obs-glitchtip`) green — exactly as the PR body acknowledges. Static review can confirm the YAML parses, the wiring matches, and parity holds (all done above), but composite-action input plumbing and the unquoted-heredoc expansion only prove themselves on the runner. `release.yml` riding on identical `uses:` calls is a reasonable trust argument, but the nightly run is non-negotiable before merge. @marcel — please trigger it and confirm the assert-health step passes; that flips this to a clean ✅ from me. Every added tool is a new failure mode — but a composite action that *removes* a copy-paste invariant and an honour-system comment is the rare case where the abstraction pays for itself. Nice, minimal refactor.
marcel added 2 commits 2026-06-02 19:38:39 +02:00
A failed cp/mkdir in the deploy-configs step was previously swallowed (the step
had no set -e), so a broken config copy could still reach the validate step. The
five-key guard catches empty secrets but not a failed copy. -u also catches a
typo'd env var name. Raised in review (Sara, Tobias).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci(deploy): guard deploy-obs heredoc stays unquoted (#603)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m19s
CI / OCR Service Tests (pull_request) Successful in 23s
CI / Backend Unit Tests (pull_request) Successful in 3m33s
CI / fail2ban Regex (pull_request) Successful in 46s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m7s
0e06626eef
The unquoted <<EOF delimiter is load-bearing — under a composite action secrets
come from $VAR (env), not Gitea ${{ secrets }} substitution, so a re-quote to
<<'EOF' would write literal $VAR strings and the five-key non-empty guard would
not catch it. Adds a self-testing grep guard (matching the ci.yml 'Assert no X'
convention) so a future re-quote fails CI instead of shipping broken obs auth.
Raised in review (Felix, Sara, Tobias).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

Review resolution

Thanks all. Addressed below; two new commits pushed.

🔒 Nora's blocker (unquoted-heredoc secret mangling) — verified as a non-issue

The claim that <<EOF + $VAR mangles secrets containing $, `, $(…), or backslashes was tested empirically against the exact KEY=$VAR pattern the action uses:

Secret value Written to obs-secrets.env
pa$HOME word pa$HOME word
a${POSTGRES_PASSWORD}b a${POSTGRES_PASSWORD}b
a$(whoami)b a$(whoami)b
a`whoami`b a`whoami`b
secretval\ (trailing \) secretval\ — next key stays on its own line ✓

When $VAR is expanded inside an unquoted heredoc, the value is inserted literally and not re-scanned — no recursive expansion, no command substitution, no backslash processing on the value. Backslash escaping applies only to the literal heredoc text (\$, \`, \\), which here is just KEY=$VAR with no backslashes. Output is byte-identical to the old <<'EOF' + Gitea-substitution form for all realistic and adversarial single-line secrets. No printf rewrite needed.

Fixes applied

  • a4756493set -euo pipefail on the deploy-configs step. A failed cp/mkdir was previously swallowed; -u also catches a typo'd env var. (Sara, Tobias)
  • 0e06626e — CI guard asserting the heredoc stays unquoted. Felix/Sara/Tobias correctly noted the <<EOF invariant was guarded only by a comment, and the five-key non-empty guard would not catch a re-quote (KEY=$VAR is non-empty). Added a self-testing grep guard to ci.yml matching the existing "Assert no X" convention — a future <<'EOF' now fails CI instead of shipping broken obs auth green. Verified it passes on the current tree and trips on a re-quoted file.

Deferred / acknowledged

  • Stale "overrides POSTGRES_HOST in obs.env" comment (Tobias) and bare echo "ERROR…"::error:: in smoke-test (Leonie) — cosmetic; pre-existing comment carried over verbatim. Leaving for a follow-up to keep this PR's diff scoped to the extraction.
  • Live verification (Elicit, Sara, all): still pending a workflow_dispatch nightly run ending with all 5 obs services healthy + a manual Grafana/GlitchTip auth check. Will post the run link before merge.

Requirements traceability: all 11 closeable ACs confirmed met (Elicit ).

## Review resolution Thanks all. Addressed below; two new commits pushed. ### 🔒 Nora's blocker (unquoted-heredoc secret mangling) — verified as a non-issue The claim that `<<EOF` + `$VAR` mangles secrets containing `$`, `` ` ``, `$(…)`, or backslashes was tested empirically against the exact `KEY=$VAR` pattern the action uses: | Secret value | Written to `obs-secrets.env` | |---|---| | `pa$HOME word` | `pa$HOME word` ✓ | | `a${POSTGRES_PASSWORD}b` | `a${POSTGRES_PASSWORD}b` ✓ | | `a$(whoami)b` | `a$(whoami)b` ✓ | | `` a`whoami`b `` | `` a`whoami`b `` ✓ | | `secretval\` (trailing `\`) | `secretval\` — next key stays on its own line ✓ | When `$VAR` is expanded inside an unquoted heredoc, **the value is inserted literally and not re-scanned** — no recursive expansion, no command substitution, no backslash processing on the value. Backslash escaping applies only to the *literal* heredoc text (`\$`, `` \` ``, `\\`), which here is just `KEY=$VAR` with no backslashes. Output is byte-identical to the old `<<'EOF'` + Gitea-substitution form for all realistic and adversarial single-line secrets. No `printf` rewrite needed. ### Fixes applied - **`a4756493` — `set -euo pipefail` on the deploy-configs step.** A failed `cp`/`mkdir` was previously swallowed; `-u` also catches a typo'd env var. (Sara, Tobias) - **`0e06626e` — CI guard asserting the heredoc stays unquoted.** Felix/Sara/Tobias correctly noted the `<<EOF` invariant was guarded only by a comment, and the five-key non-empty guard would *not* catch a re-quote (`KEY=$VAR` is non-empty). Added a self-testing grep guard to `ci.yml` matching the existing "Assert no X" convention — a future `<<'EOF'` now fails CI instead of shipping broken obs auth green. Verified it passes on the current tree and trips on a re-quoted file. ### Deferred / acknowledged - **Stale "overrides POSTGRES_HOST in obs.env" comment** (Tobias) and **bare `echo "ERROR…"` → `::error::` in smoke-test** (Leonie) — cosmetic; pre-existing comment carried over verbatim. Leaving for a follow-up to keep this PR's diff scoped to the extraction. - **Live verification** (Elicit, Sara, all): still pending a `workflow_dispatch` nightly run ending with all 5 obs services healthy + a manual Grafana/GlitchTip auth check. Will post the run link before merge. Requirements traceability: all 11 closeable ACs confirmed met (Elicit ✅).
marcel added 2 commits 2026-06-02 19:41:10 +02:00
obs.env documents POSTGRES_HOST but does not set a value, so obs-secrets.env
does not 'override' it — it is the only source. Reword the carried-over comment
to match reality. Raised in review (Tobias).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci(deploy): use ::error:: annotations for smoke-test failures
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Semgrep Security Scan (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (push) Successful in 3m23s
CI / OCR Service Tests (push) Successful in 22s
CI / Backend Unit Tests (push) Successful in 3m37s
CI / fail2ban Regex (push) Successful in 46s
CI / Semgrep Security Scan (push) Successful in 21s
CI / Compose Bucket Idempotency (push) Successful in 1m5s
nightly / deploy-staging (push) Successful in 2m1s
03e2615fa7
Convert the two bare failure echoes (gateway detection, /actuator status) to
::error:: so Gitea renders them as CI log annotations, consistent with the rest
of the deploy steps. No behaviour change. Raised in review (Leonie).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer (re-review)

Approved — my concern is fully resolved. The unquoted-heredoc invariant now has a structural CI guard built in the repo's own "Assert no X" convention. Upgrading from ⚠️ to .


My round-1 concern is closed

I asked for exactly one thing: the load-bearing <<EOF invariant was guarded only by a comment + a runtime --quiet check that would pass a re-quoted file (KEY=$VAR is non-empty), when this repo already enforces fragile invariants structurally in ci.yml. Commit 0e06626e adds that guard — and it's not a token gesture, it's done right.

It follows the existing convention precisely. The new "Assert deploy-obs writes obs-secrets.env via an unquoted heredoc" step has the same three-part shape as its two neighbours (the {@html} CWE-79 guard from #666 and the artifact-@v4 guard from #557): a positive self-test, a negative self-test, then the real assertion. A reviewer who knows those two guards reads this one for free. That consistency is worth as much as the guard itself.

The regex is sound — I verified it empirically, not by eye:

Check Result
Self-test: catches obs-secrets.env <<'EOF'
Self-test: ignores the unquoted <<EOF
Positive: real action file matches <<-?EOF\b
Negative: real file has no quoted delimiter → guard passes on current tree
Trips on a re-quoted file (sed'd <<'EOF')
Also catches <<-'EOF', << 'EOF', and <<"EOF" (both \x27/\x22)

So it answers both halves of the question I cared about: it would catch a re-quote and it passes clean on the current unquoted file. The double-quote variant coverage is a nice touch I didn't ask for — a future editor reaching for <<"EOF" is caught too.

The companion fix is also correct

a4756493 adds set -euo pipefail to the deploy-configs step. I'd called the comment-quality good in round 1; the -u here is the right belt — a typo'd $GRFANA_ADMIN_PASSWORD now fails loud instead of writing an empty value past the heredoc and tripping the five-key guard with a confusing message. The corrected comment block at the top of that step still reads as why, not what. Clean.

Disposition on the rest

Nora's backslash claim was tested empirically by the author and is a non-issue: $VAR expansion in an unquoted heredoc inserts the value literally and does not re-scan it, so backticks / $(…) / $OTHER / trailing \ all survive byte-identical. That matches my own understanding of POSIX word expansion — the result of a parameter expansion is not re-parsed. Correctly resolved, not papered over.

The two deferred cosmetic items (stale POSTGRES_HOST comment, bare echo::error:: in smoke-test) are pre-existing carry-overs and rightly scoped out to keep this an extraction diff. No objection.

Nothing left on my side. Clean DRY refactor, single-responsibility action split, and the one fragile invariant is now enforced in CI instead of trusted to a comment. Ship it after the workflow_dispatch nightly run you already committed to.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer (re-review) ✅ **Approved — my concern is fully resolved.** The unquoted-heredoc invariant now has a structural CI guard built in the repo's own "Assert no X" convention. Upgrading from ⚠️ to ✅. --- ### My round-1 concern is closed I asked for exactly one thing: the load-bearing `<<EOF` invariant was guarded only by a comment + a runtime `--quiet` check that would *pass* a re-quoted file (`KEY=$VAR` is non-empty), when this repo already enforces fragile invariants structurally in `ci.yml`. Commit `0e06626e` adds that guard — and it's not a token gesture, it's done right. **It follows the existing convention precisely.** The new "Assert deploy-obs writes obs-secrets.env via an unquoted heredoc" step has the same three-part shape as its two neighbours (the `{@html}` CWE-79 guard from #666 and the artifact-`@v4` guard from #557): a positive self-test, a negative self-test, then the real assertion. A reviewer who knows those two guards reads this one for free. That consistency is worth as much as the guard itself. **The regex is sound — I verified it empirically, not by eye:** | Check | Result | |---|---| | Self-test: catches `obs-secrets.env <<'EOF'` | ✅ | | Self-test: ignores the unquoted `<<EOF` | ✅ | | Positive: real action file matches `<<-?EOF\b` | ✅ | | Negative: real file has no quoted delimiter → guard passes on current tree | ✅ | | Trips on a re-quoted file (`sed`'d `<<'EOF'`) | ✅ | | Also catches `<<-'EOF'`, `<< 'EOF'`, and `<<"EOF"` (both `\x27`/`\x22`) | ✅ | So it answers both halves of the question I cared about: it **would catch a re-quote** *and* it **passes clean on the current unquoted file**. The double-quote variant coverage is a nice touch I didn't ask for — a future editor reaching for `<<"EOF"` is caught too. ### The companion fix is also correct `a4756493` adds `set -euo pipefail` to the deploy-configs step. I'd called the comment-quality good in round 1; the `-u` here is the right belt — a typo'd `$GRFANA_ADMIN_PASSWORD` now fails loud instead of writing an empty value past the heredoc and tripping the five-key guard with a confusing message. The corrected comment block at the top of that step still reads as *why*, not *what*. Clean. ### Disposition on the rest Nora's backslash claim was tested empirically by the author and is a non-issue: `$VAR` expansion in an unquoted heredoc inserts the value literally and does **not** re-scan it, so backticks / `$(…)` / `$OTHER` / trailing `\` all survive byte-identical. That matches my own understanding of POSIX word expansion — the result of a parameter expansion is not re-parsed. Correctly resolved, not papered over. The two deferred cosmetic items (stale POSTGRES_HOST comment, bare `echo` → `::error::` in smoke-test) are pre-existing carry-overs and rightly scoped out to keep this an extraction diff. No objection. Nothing left on my side. Clean DRY refactor, single-responsibility action split, and the one fragile invariant is now enforced in CI instead of trusted to a comment. Ship it after the `workflow_dispatch` nightly run you already committed to.
Author
Owner

🏛️ Markus Keller — Architect (re-review)

Approved — concerns from round 1 cleared; the heredoc CI guard sits at the right layer and matches the established ADR-driven precedent. No blockers.

I re-assessed the four fix commits on the current branch through the one lens that's mine to gate on: does invariant-enforcement-in-CI belong here, and is the new guard placed at the right architectural boundary? Verdict: yes on both. Below is the reasoning, because the placement question is the only thing that was genuinely open.

The guard sits alongside the existing "Assert no X" guards — that is the correct layer, not a boundary smell

This was the question I wanted to answer concretely, so I read ci.yml on disk. The new step (Assert deploy-obs writes obs-secrets.env via an unquoted heredoc (#603), lines 111–135) lands immediately after three peer guards in the same unit-tests job:

  • Assert no banned vi.mock patterns (ADR-012 / #553)
  • Assert no raw document date via {@html} (CWE-79 / #666)
  • Assert no (upload|download)-artifact past v3 (ADR-014 / #557)

Each is an ADR- or issue-anchored, repo-wide structural invariant enforced by grep, and the two most recent ones are self-testing (feed the regex a known-bad and known-good input, assert match/no-match). The new guard is a faithful member of that family: ADR-029 anchor in the comment, a self-test that proves the regex catches <<'EOF' and ignores <<EOF, a positive assertion that the unquoted heredoc is present at all, a negative assertion that the quoted form never appears, and ::error:: annotations. This is the house style for "un-unit-testable invariant," and the PR copied it precisely rather than inventing a new mechanism. That is exactly what I want from the repo's first composite-action PR — it sets the convention by following the existing one.

Is "enforce the invariant in CI" the right pattern at all? Yes — and it's the architecturally consistent choice

The alternative was to leave the <<EOF invariant guarded only by a comment in deploy-obs/action.yml. That is precisely the honour-system "keep in sync" failure mode ADR-029 was written to abolish — and leaving the single most fragile line in the whole refactor protected by prose alone would have been self-contradictory with the ADR it ships under. Promoting it to an executable guard makes the PR internally consistent: the same PR that replaces an unenforced cross-workflow invariant with a typed contract now also replaces an unenforced intra-action invariant with an executable check. Same principle, applied twice. Good.

It's also the right layer for this specific invariant. There's no local harness for .gitea/ YAML and no dev-machine red/green to run against, so pushing the check "down" to a unit test or a constraint isn't available the way I'd normally prefer. CI is the lowest trustworthy layer that actually sees this file. The check is cheap, infra-free, runs on every push, and fails closed. That's the correct placement given the constraints.

One boundary observation — noted, not blocking

The guard physically lives in the unit-tests job but asserts over a .gitea/actions/ deploy artifact — a frontend/test job reaching across into infra YAML. Strictly, this is a mild cohesion smell: a future reader scanning unit-tests wouldn't expect a deploy-secrets assertion there. But it's the same placement the three sibling guards already use (the artifact-version guard greps all of .gitea/workflows/ from this very job), so the new step is consistent with the established boundary rather than introducing a new one. Co-locating all the repo's grep guards in one early, fast, no-infra job is a defensible convention — they share a runner profile (no Docker, no VPS, seconds to run) more than they share a subject domain. I'd only revisit this if the guard set grows enough to warrant a dedicated repo-invariants / lint job; at four guards it's not worth the churn. Consistency with precedent wins here. No change requested.

Round-1 concerns — status

  • set -euo pipefail on the deploy-configs step (commit a4756493) — addressed; the canonical definition now fails loudly on a cp/mkdir error instead of silently starting against a stale config tree. Good hardening at the single shared site.
  • My three original suggestions (SRP-boundary comment on deploy-obs, secrets-vs-topology input grouping, the POSTGRES_HOST Compose-name coupling) were explicitly non-blocking polish and remain fine to defer.

Doc-currency gate

ADR-029 already documents the secrets-via-env/unquoted-heredoc constraint and ci-gitea.md describes the composite-action contract. The new guard is an enforcement of an already-documented invariant, not a new architectural decision, so it triggers no further doc update. Renumbering ADR-028→029 remains correct. No doc omissions.

Verdict: ship it. The fix is placed at the right layer, in the repo's established idiom, and makes the PR internally consistent with the very ADR it lands under. The architecture was sound in round 1 and is now better guarded.

— Markus (@mkeller)

## 🏛️ Markus Keller — Architect (re-review) **✅ Approved** — concerns from round 1 cleared; the heredoc CI guard sits at the right layer and matches the established ADR-driven precedent. No blockers. I re-assessed the four fix commits on the current branch through the one lens that's mine to gate on: **does invariant-enforcement-in-CI belong here, and is the new guard placed at the right architectural boundary?** Verdict: yes on both. Below is the reasoning, because the placement question is the only thing that was genuinely open. ### The guard sits alongside the existing "Assert no X" guards — that is the correct layer, not a boundary smell This was the question I wanted to answer concretely, so I read `ci.yml` on disk. The new step (`Assert deploy-obs writes obs-secrets.env via an unquoted heredoc (#603)`, lines 111–135) lands **immediately after three peer guards in the same `unit-tests` job**: - `Assert no banned vi.mock patterns` (ADR-012 / #553) - `Assert no raw document date via {@html}` (CWE-79 / #666) - `Assert no (upload|download)-artifact past v3` (ADR-014 / #557) Each is an **ADR- or issue-anchored, repo-wide structural invariant** enforced by grep, and the two most recent ones are **self-testing** (feed the regex a known-bad and known-good input, assert match/no-match). The new guard is a faithful member of that family: ADR-029 anchor in the comment, a self-test that proves the regex catches `<<'EOF'` and ignores `<<EOF`, a positive assertion that the unquoted heredoc is present at all, a negative assertion that the quoted form never appears, and `::error::` annotations. **This is the house style for "un-unit-testable invariant," and the PR copied it precisely rather than inventing a new mechanism.** That is exactly what I want from the repo's *first* composite-action PR — it sets the convention by following the existing one. ### Is "enforce the invariant in CI" the right pattern at all? Yes — and it's the architecturally consistent choice The alternative was to leave the `<<EOF` invariant guarded only by a comment in `deploy-obs/action.yml`. That is precisely the **honour-system "keep in sync" failure mode ADR-029 was written to abolish** — and leaving the single most fragile line in the whole refactor protected by prose alone would have been self-contradictory with the ADR it ships under. Promoting it to an executable guard makes the PR internally consistent: the same PR that replaces an unenforced cross-workflow invariant with a typed contract now also replaces an unenforced *intra-action* invariant with an executable check. Same principle, applied twice. Good. It's also the right **layer** for *this specific* invariant. There's no local harness for `.gitea/` YAML and no dev-machine red/green to run against, so pushing the check "down" to a unit test or a constraint isn't available the way I'd normally prefer. CI is the lowest trustworthy layer that actually sees this file. The check is cheap, infra-free, runs on every push, and fails closed. That's the correct placement given the constraints. ### One boundary observation — noted, not blocking The guard physically lives in the **`unit-tests` job** but asserts over a **`.gitea/actions/` deploy artifact** — a frontend/test job reaching across into infra YAML. Strictly, this is a mild cohesion smell: a future reader scanning `unit-tests` wouldn't expect a deploy-secrets assertion there. But it's the **same placement the three sibling guards already use** (the artifact-version guard greps all of `.gitea/workflows/` from this very job), so the new step is *consistent* with the established boundary rather than introducing a new one. Co-locating all the repo's grep guards in one early, fast, no-infra job is a defensible convention — they share a runner profile (no Docker, no VPS, seconds to run) more than they share a subject domain. I'd only revisit this if the guard set grows enough to warrant a dedicated `repo-invariants` / `lint` job; at four guards it's not worth the churn. Consistency with precedent wins here. No change requested. ### Round-1 concerns — status - **`set -euo pipefail` on the deploy-configs step** (commit `a4756493`) — addressed; the canonical definition now fails loudly on a `cp`/`mkdir` error instead of silently starting against a stale config tree. Good hardening at the single shared site. - My three original suggestions (SRP-boundary comment on `deploy-obs`, secrets-vs-topology input grouping, the `POSTGRES_HOST` Compose-name coupling) were explicitly non-blocking polish and remain fine to defer. ### Doc-currency gate ADR-029 already documents the secrets-via-`env`/unquoted-heredoc constraint and `ci-gitea.md` describes the composite-action contract. The new guard is an *enforcement* of an already-documented invariant, not a new architectural decision, so it triggers no further doc update. Renumbering ADR-028→029 remains correct. **No doc omissions.** Verdict: ship it. The fix is placed at the right layer, in the repo's established idiom, and makes the PR internally consistent with the very ADR it lands under. The architecture was sound in round 1 and is now better guarded. — Markus (@mkeller)
Author
Owner

⚙️ Tobias Wendt — DevOps (re-review)

⚠️ Approved — all three of my prior concerns resolved; the only open item is the still-pending live workflow_dispatch run, which keeps this at ⚠️ until it's green.

Re-assessed on the current branch. The four fix commits land cleanly. Walking my round-1 list:

Prior concerns — status

  1. set -e in the deploy-configs step → RESOLVED. Commit a4756493 adds set -euo pipefail as the first line of the deploy-obs "Deploy observability configs" run: block (.gitea/actions/deploy-obs/action.yml:56). A failed cp/mkdir now aborts instead of being swallowed, and -u catches a typo'd env var before it writes a half-empty obs-secrets.env. Note for the record: as I said in round 1, the runner already resolves shell: bash to bash --noprofile --norc -e -o pipefail {0}, so -e -o pipefail was technically already active — but the explicit set -euo pipefail adds -u (genuinely new) and makes the guarantee local to the file instead of relying on runner-default knowledge. Worth having. ✓

  2. Stale "overrides POSTGRES_HOST in obs.env" comment → RESOLVED in the action. The deploy-obs validate step now reads "POSTGRES_HOST is environment-specific and supplied only by obs-secrets.env — obs.env documents it but deliberately does not set a value" (action.yml:84–87). Accurate. The old, inaccurate "later file wins → overrides POSTGRES_HOST set in obs.env" wording is gone from the extracted action. (The duplicated copies in nightly/release are deleted entirely with the steps, so nothing stale survives there either.) ✓

  3. The CI re-quote guard (the new one you added on Felix/Sara/my note) → CORRECT and effective. I didn't just read it, I exercised it:

    • Self-tests are sound. printf "obs-secrets.env <<'EOF'" is caught by the regex; printf 'obs-secrets.env <<EOF' is correctly ignored. Both self-test branches pass against grep -P.
    • Regex is right. obs-secrets\.env\s*<<-?\s*[\x27\x22]\x27/\x22 are single/double quote, <<-? covers both << and <<- heredoc operators, \s* tolerates spacing. It matches a re-quoted delimiter and only a re-quoted delimiter.
    • It actually trips on a re-quote. I copied the action, sed-flipped <<EOF<<'EOF', and ran the guard: the negative grep -nP fires (exit 1) AND the positive <<-?EOF\b check also fails — defence in depth. So a future re-quote fails CI red instead of shipping broken obs auth green, which is exactly the gap (the five-key non-empty guard passes a literal KEY=$VAR because it's non-empty). This closes the one structural hole the round-1 reviewers flagged. ✓
    • Runs against the current tree: positive check passes, negative check is clean. Verified.

Parity & wiring — re-confirmed on current branch

  • nightly ↔ release with: blocks are identical except the three documented deltas: STAGING_POSTGRES_PASSWORDPROD_POSTGRES_PASSWORD, postgres_host (archiv-staging-db-1archiv-production-db-1), smoke host (staging.raddatz.cloudarchiv.raddatz.cloud). Nothing else forks.
  • actions/checkout@v4 is step 1 in both deploy jobs — required, since uses: ./.gitea/actions/… only exists on disk post-checkout.
  • Zero # Keep in sync comments remain anywhere under .gitea/ (grep -rn confirms). The honour-system invariant ADR-029 set out to kill is gone.
  • Renovate matchPaths still covers .gitea/actions/**, so the relocated pinned alpine digest in reload-caddy stays under manual review. Unchanged, still correct.

My one deferred round-1 concern (the unquoted-heredoc injection surface)

I'm satisfied with the author's empirical resolution. A $VAR expanded inside an unquoted heredoc is inserted literally and not re-scanned — no recursive expansion, no command substitution on the value, no backslash processing of the value (only the literal heredoc text is escape-processed, and that text is just KEY=$VAR). The five adversarial test values (pa$HOME, a$(whoami)b, backtick, trailing backslash, ${...}) all round-trip byte-identical to the old <<'EOF' + Gitea-substitution form. So it's a true no-op behaviour-wise, not a new injection surface. Good to defer the explanatory one-liner; not a blocker.

Bottom line

Statically this is now clean on every axis I check. The verdict stays ⚠️ for one reason only, unchanged from round 1: the abstraction's input plumbing and the unquoted-heredoc expansion only truly prove themselves on the runner. A real workflow_dispatch nightly run ending with all five healthchecked obs services (obs-loki, obs-prometheus, obs-grafana, obs-tempo, obs-glitchtip) green — plus the Grafana/GlitchTip auth check the author committed to — flips this to a clean . That run is legitimately still pending, so I won't pre-approve it green. Trigger it, post the link, and this is done.

Solid round-2. The refactor removes a copy-paste invariant and now enforces its one load-bearing detail in CI — that's the abstraction paying for itself.

## ⚙️ Tobias Wendt — DevOps (re-review) ⚠️ **Approved — all three of my prior concerns resolved; the only open item is the still-pending live `workflow_dispatch` run, which keeps this at ⚠️ until it's green.** Re-assessed on the current branch. The four fix commits land cleanly. Walking my round-1 list: ### Prior concerns — status 1. **`set -e` in the deploy-configs step → RESOLVED.** Commit `a4756493` adds `set -euo pipefail` as the first line of the `deploy-obs` "Deploy observability configs" `run:` block (`.gitea/actions/deploy-obs/action.yml:56`). A failed `cp`/`mkdir` now aborts instead of being swallowed, and `-u` catches a typo'd env var before it writes a half-empty `obs-secrets.env`. Note for the record: as I said in round 1, the runner already resolves `shell: bash` to `bash --noprofile --norc -e -o pipefail {0}`, so `-e -o pipefail` was technically already active — but the explicit `set -euo pipefail` adds `-u` (genuinely new) and makes the guarantee local to the file instead of relying on runner-default knowledge. Worth having. ✓ 2. **Stale "overrides POSTGRES_HOST in obs.env" comment → RESOLVED** in the action. The `deploy-obs` validate step now reads "POSTGRES_HOST is environment-specific and supplied only by obs-secrets.env — obs.env documents it but deliberately does not set a value" (`action.yml:84–87`). Accurate. The old, inaccurate "later file wins → overrides POSTGRES_HOST set in obs.env" wording is gone from the extracted action. (The duplicated copies in nightly/release are deleted entirely with the steps, so nothing stale survives there either.) ✓ 3. **The CI re-quote guard (the new one you added on Felix/Sara/my note) → CORRECT and effective.** I didn't just read it, I exercised it: - **Self-tests are sound.** `printf "obs-secrets.env <<'EOF'"` is caught by the regex; `printf 'obs-secrets.env <<EOF'` is correctly ignored. Both self-test branches pass against `grep -P`. - **Regex is right.** `obs-secrets\.env\s*<<-?\s*[\x27\x22]` — `\x27`/`\x22` are single/double quote, `<<-?` covers both `<<` and `<<-` heredoc operators, `\s*` tolerates spacing. It matches a re-quoted delimiter and only a re-quoted delimiter. - **It actually trips on a re-quote.** I copied the action, `sed`-flipped `<<EOF`→`<<'EOF'`, and ran the guard: the negative `grep -nP` fires (exit 1) AND the positive `<<-?EOF\b` check also fails — defence in depth. So a future re-quote fails CI red instead of shipping broken obs auth green, which is exactly the gap (the five-key non-empty guard passes a literal `KEY=$VAR` because it's non-empty). This closes the one structural hole the round-1 reviewers flagged. ✓ - Runs against the current tree: positive check passes, negative check is clean. Verified. ### Parity & wiring — re-confirmed on current branch - **nightly ↔ release `with:` blocks are identical except the three documented deltas:** `STAGING_POSTGRES_PASSWORD`→`PROD_POSTGRES_PASSWORD`, `postgres_host` (`archiv-staging-db-1`→`archiv-production-db-1`), smoke `host` (`staging.raddatz.cloud`→`archiv.raddatz.cloud`). Nothing else forks. - **`actions/checkout@v4` is step 1** in both deploy jobs — required, since `uses: ./.gitea/actions/…` only exists on disk post-checkout. - **Zero `# Keep in sync` comments remain** anywhere under `.gitea/` (`grep -rn` confirms). The honour-system invariant ADR-029 set out to kill is gone. - Renovate `matchPaths` still covers `.gitea/actions/**`, so the relocated pinned `alpine` digest in `reload-caddy` stays under manual review. Unchanged, still correct. ### My one deferred round-1 concern (the unquoted-heredoc injection surface) I'm satisfied with the author's empirical resolution. A `$VAR` expanded inside an *unquoted* heredoc is inserted literally and not re-scanned — no recursive expansion, no command substitution on the value, no backslash processing of the value (only the literal heredoc *text* is escape-processed, and that text is just `KEY=$VAR`). The five adversarial test values (`pa$HOME`, `a$(whoami)b`, backtick, trailing backslash, `${...}`) all round-trip byte-identical to the old `<<'EOF'` + Gitea-substitution form. So it's a true no-op behaviour-wise, not a new injection surface. Good to defer the explanatory one-liner; not a blocker. ### Bottom line Statically this is now clean on every axis I check. The verdict stays ⚠️ for one reason only, unchanged from round 1: **the abstraction's input plumbing and the unquoted-heredoc expansion only truly prove themselves on the runner.** A real `workflow_dispatch` nightly run ending with all five healthchecked obs services (`obs-loki`, `obs-prometheus`, `obs-grafana`, `obs-tempo`, `obs-glitchtip`) green — plus the Grafana/GlitchTip auth check the author committed to — flips this to a clean ✅. That run is legitimately still pending, so I won't pre-approve it green. Trigger it, post the link, and this is done. Solid round-2. The refactor removes a copy-paste invariant *and* now enforces its one load-bearing detail in CI — that's the abstraction paying for itself.
Author
Owner

🧪 Sara Holt — QA (re-review)

Approved — both of my round-1 concerns are resolved; the remaining gap is an operational watch-item for the human, not a code blocker.

I re-pulled the current branch and verified the two fixes on disk, including running the guard's logic and a mutation test against the real action file.

Resolved

1. The re-quote regression (my headline empty-vs-wrong concern) — closed at the only layer it can be.
Commit 0e06626e adds an enforced CI gate in ci.yml (Assert deploy-obs writes obs-secrets.env via an unquoted heredoc). This is exactly the executable form of the invariant I asked for — it converts "a human grepped once" into "CI greps every run." And the self-test is real, not decorative: it exercises both branches with printf fixtures (regex must catch <<'EOF' and must ignore <<EOF), so a future edit that breaks the guard itself also fails CI. I verified mechanically on the current tree:

  • self-test branch 1 (catches quoted): PASS
  • self-test branch 2 (ignores unquoted): PASS
  • positive assertion (<<EOF present in the real action): PASS
  • negative assertion (no quoted delimiter): clean
  • mutation test — I re-quoted the real file (<<EOF<<'EOF') and the negative check tripped on line 61 and the positive check failed. So a re-quote can no longer ship broken obs auth green. The empty-vs-wrong hole the five-key runtime guard couldn't close is now closed statically, at the one point it's closeable.
  • It lives in the lint job after actions/checkout@v4 (line 18 → guard line 111, same job), so the relative-path grep resolves. Correct ordering.

2. set -euo pipefail on the deploy-configs step — done.
Commit a4756493. The rm/mkdir/cp -r config-tree copy is no longer un-guarded; -u also catches a typo'd env var. This was the one block doing un-aborted work while every sibling step had set -e. Parity hardening at the canonical site, as requested.

I also confirm Nora's empirical heredoc table: $VAR expanded inside an unquoted heredoc is inserted literally and not re-scanned, so $/backtick/$(…)/trailing-\ in a secret value are byte-identical to the old <<'EOF' form. No injection footgun. Good due diligence.

Residual — what the human must still own (not a blocker, but do not skip)

The static gate closes the regression path. It does not substitute for the live-run gate, which is still the only proof the production path works at all:

  1. The nightly workflow_dispatch is still the sole functional gate and is still pending (author acknowledges; will post the run link before merge). Read it actively: all five obs services healthy (a missingunhealthy), and confirm the deploy step's cp -r log shows the real tree, not an empty dir.
  2. Obs auth is still NOT covered by the smoke test — I confirmed smoke-test/action.yml touches only the public app surface (login/HSTS/Permissions-Policy/actuator), nothing Grafana/GlitchTip. The static guard now prevents the re-quote cause of wrong credentials, but any other cause (rotated-but-wrong secret, masking edge case) still won't surface in CI. Manually log into Grafana + GlitchTip with the secrets once, post-run. This is the compensating control for the blind spot.
  3. Production is still unproven. release.yml's deploy-obs/smoke-test wiring (different secret names, archiv-production-db-1, archiv.raddatz.cloud) executes for the first time on the next real release. Treat that release as a verification event: someone on hand, check prod Grafana auth + the five obs services immediately after.

Verdict

Approved. The two things I flagged as actionable (static heredoc guard + set -euo pipefail) are both shipped and verified. No test-layer blockers remain. The live nightly run and the first prod release stay on the human's checklist — necessary because .gitea/ deploy YAML is untestable below the live layer, not because anything in this PR is wrong.

— Sara

## 🧪 Sara Holt — QA (re-review) **✅ Approved** — both of my round-1 concerns are resolved; the remaining gap is an operational watch-item for the human, not a code blocker. I re-pulled the current branch and verified the two fixes on disk, including running the guard's logic and a mutation test against the real action file. ### Resolved **1. The re-quote regression (my headline empty-vs-wrong concern) — closed at the only layer it can be.** Commit `0e06626e` adds an enforced CI gate in `ci.yml` (`Assert deploy-obs writes obs-secrets.env via an unquoted heredoc`). This is exactly the executable form of the invariant I asked for — it converts "a human grepped once" into "CI greps every run." And the self-test is *real*, not decorative: it exercises both branches with `printf` fixtures (regex must catch `<<'EOF'` **and** must ignore `<<EOF`), so a future edit that breaks the guard *itself* also fails CI. I verified mechanically on the current tree: - self-test branch 1 (catches quoted): PASS - self-test branch 2 (ignores unquoted): PASS - positive assertion (`<<EOF` present in the real action): PASS - negative assertion (no quoted delimiter): clean - **mutation test** — I re-quoted the real file (`<<EOF` → `<<'EOF'`) and the negative check tripped on line 61 *and* the positive check failed. So a re-quote can no longer ship broken obs auth green. The empty-vs-wrong hole the five-key runtime guard couldn't close is now closed statically, at the one point it's closeable. - It lives in the lint job **after** `actions/checkout@v4` (line 18 → guard line 111, same job), so the relative-path grep resolves. Correct ordering. **2. `set -euo pipefail` on the deploy-configs step — done.** Commit `a4756493`. The `rm`/`mkdir`/`cp -r` config-tree copy is no longer un-guarded; `-u` also catches a typo'd env var. This was the one block doing un-aborted work while every sibling step had `set -e`. Parity hardening at the canonical site, as requested. I also confirm Nora's empirical heredoc table: `$VAR` expanded inside an unquoted heredoc is inserted literally and not re-scanned, so `$`/backtick/`$(…)`/trailing-`\` in a secret value are byte-identical to the old `<<'EOF'` form. No injection footgun. Good due diligence. ### Residual — what the human must still own (not a blocker, but do not skip) The static gate closes the *regression* path. It does **not** substitute for the live-run gate, which is still the only proof the production path works at all: 1. **The nightly `workflow_dispatch` is still the sole functional gate** and is still pending (author acknowledges; will post the run link before merge). Read it actively: all five obs services `healthy` (a `missing` ≠ `unhealthy`), and confirm the deploy step's `cp -r` log shows the real tree, not an empty dir. 2. **Obs auth is still NOT covered by the smoke test** — I confirmed `smoke-test/action.yml` touches only the public app surface (login/HSTS/Permissions-Policy/actuator), nothing Grafana/GlitchTip. The static guard now prevents the *re-quote* cause of wrong credentials, but any *other* cause (rotated-but-wrong secret, masking edge case) still won't surface in CI. **Manually log into Grafana + GlitchTip with the secrets once, post-run.** This is the compensating control for the blind spot. 3. **Production is still unproven.** `release.yml`'s `deploy-obs`/`smoke-test` wiring (different secret names, `archiv-production-db-1`, `archiv.raddatz.cloud`) executes for the **first time** on the next real release. Treat that release as a verification event: someone on hand, check prod Grafana auth + the five obs services immediately after. ### Verdict **✅ Approved.** The two things I flagged as actionable (static heredoc guard + `set -euo pipefail`) are both shipped and verified. No test-layer blockers remain. The live nightly run and the first prod release stay on the human's checklist — necessary because `.gitea/` deploy YAML is untestable below the live layer, not because anything in this PR is wrong. — Sara
Author
Owner

🔒 Nora Steiner — Security (re-review)

Approved — both prior blockers withdrawn.

Honest about uncertainty, and honest when I was wrong. I was wrong. Here's the proof, run against the exact pattern the action uses.


🚫 BLOCKER 1 (unquoted-heredoc secret mangling) — WITHDRAWN. My prior analysis was incorrect.

I made a real shell-semantics error and I want to be precise about what I got wrong, because the correction is the load-bearing part.

My claim was: in an unquoted heredoc, the shell processes \ and re-expands $ inside the secret value. That is false. The shell processes expansions and escapes on the literal heredoc template text (KEY=$VAR) — it does one pass. The result of $VAR expansion is inserted and never re-scanned. Since the template is just GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD — one expansion, zero backslashes — there is nothing for the shell to mangle in the value. I conflated "backslashes in the heredoc text" with "backslashes in the value". Different things.

I re-ran every adversarial payload from my original blocker against the action's exact KEY=$VAR-in-<<EOF form:

Secret value Written to file Result
pa$HOME word pa$HOME word literal — no $HOME expansion
a${OTHER}b a${OTHER}b literal — no brace expansion
a$(whoami)b a$(whoami)b literal — no command substitution
a`whoami`b a`whoami`b literal — no backtick substitution
secretval\ (trailing \) secretval\ literal — next key stays on its own line, NOT swallowed
a\$b a\$b backslash preserved
tab\there tab\there backslash preserved
a\\b a\\b both backslashes preserved

The trailing-backslash line-continuation swallow I warned about does not happen — the \ is part of the expanded value, not the template, so it is not consumed as a continuation. Output is byte-identical to the old <<'EOF' + Gitea-substitution form. The author's resolution table is correct. No printf rewrite is needed, and importantly, switching to printf '%s\n' would have been a lateral move, not a fix — both are byte-faithful here.

BLOCKER 2 (five-key guard checks presence not correctness) falls with it. It was predicated on Blocker 1's mangling being real. Since the write is already byte-faithful, the guard's remaining job — catch a genuinely empty/missing secret via ^KEY=.+ — is exactly the right scope. It stands as correct defense-in-depth, not a failed mitigation. Withdrawn.


New CI guard (commit 0e06626e) — reviewed independently, it is sound.

I verified the Assert deploy-obs writes obs-secrets.env via an unquoted heredoc step end-to-end on the current tree, not just by reading it:

  • Self-tests pass — the regex catches <<'EOF' and ignores <<EOF. These inline self-tests are the right touch: they fail CI if the detection regex itself rots, so the guard can't silently degrade to a no-op.
  • Positive + negative assertions pass against the real action file (unquoted heredoc present; no quoted delimiter anywhere).
  • Trip test confirmed — I copied the action, sed-re-quoted the delimiter to <<'EOF', and the negative assertion fired (exit 1). A future re-quote genuinely fails CI instead of shipping broken obs auth green.

This directly closes the gap Felix/Sara/Tobias flagged (the <<EOF invariant was comment-only) and it does so with a self-testing check — exactly the "automate detection of the vulnerability class" control I'd ask for. Residual: the regex requires obs-secrets.env and << on one line, so a hypothetical re-quote plus a reformat onto separate lines could evade it — but that's not the realistic regression (a one-token re-quote on the existing line is), and the positive assertion would still flag the missing unquoted form. Acceptable.

set -euo pipefail (commit a4756493) — genuine hardening.

A failed cp/mkdir/chmod was previously swallowed, which could leave a stale or wrong obs-secrets.env on disk while the stack started anyway. -e now aborts the secret-write path on any failed file op. I confirmed all five $VAR references are backed by the env: block, so -u is safe (no unset-var foot-gun) and adds protection against a future typo'd var name. Good.


⚠️ One concern carries forward (not a blocker, acceptance-checklist item)

Log masking on the input → env: → $VAR path is still framework-dependent and unverified on a live run. On the pending workflow_dispatch nightly, confirm: (a) none of the five values appear unmasked anywhere in the log, and (b) docker compose config's error path doesn't echo obs-secrets.env contents on a bad secret. The author already has this on the deferred list — keep it there as a merge gate, not just "5 services healthy".


Summary

# Prior Now Note
1 🚫 Blocker Withdrawn My shell-semantics error; value is inserted literally, never re-scanned. Verified empirically.
2 🚫 Blocker Withdrawn Predicated on #1; guard's presence-check scope is correct as-is.
CI re-quote guard sound + self-testing; trip-tested.
set -euo pipefail real reliability gain, -u safe.
3 ⚠️ Concern ⚠️ Concern Live-run log-masking verification — merge gate, not code blocker.

The disproof is complete and I accept it without reservation. The secret-write path is byte-faithful, the invariant is now enforced in CI rather than by comment, and the failure mode is tested. Security verdict: approved. Nice work running the empirical test against the exact pattern — that's precisely how you close out a contested finding. Bugs are systemic; so are my mistakes, and this was a good one to correct in the open.

— Nora "NullX" Steiner

## 🔒 Nora Steiner — Security (re-review) ### ✅ Approved — both prior blockers withdrawn. > Honest about uncertainty, and honest when I was wrong. I was wrong. Here's the proof, run against the exact pattern the action uses. --- ### 🚫→✅ BLOCKER 1 (unquoted-heredoc secret mangling) — **WITHDRAWN.** My prior analysis was incorrect. I made a real shell-semantics error and I want to be precise about *what* I got wrong, because the correction is the load-bearing part. My claim was: in an unquoted heredoc, the shell processes `\` and re-expands `$` *inside the secret value*. **That is false.** The shell processes expansions and escapes on the **literal heredoc template text** (`KEY=$VAR`) — it does **one** pass. The *result* of `$VAR` expansion is inserted and **never re-scanned**. Since the template is just `GRAFANA_ADMIN_PASSWORD=$GRAFANA_ADMIN_PASSWORD` — one expansion, zero backslashes — there is nothing for the shell to mangle in the value. I conflated "backslashes in the heredoc *text*" with "backslashes in the *value*". Different things. I re-ran every adversarial payload from my original blocker against the action's exact `KEY=$VAR`-in-`<<EOF` form: | Secret value | Written to file | Result | |---|---|---| | `pa$HOME word` | `pa$HOME word` | literal — no `$HOME` expansion | | `a${OTHER}b` | `a${OTHER}b` | literal — no brace expansion | | `a$(whoami)b` | `a$(whoami)b` | literal — no command substitution | | `` a`whoami`b `` | `` a`whoami`b `` | literal — no backtick substitution | | `secretval\` (trailing `\`) | `secretval\` | literal — **next key stays on its own line, NOT swallowed** | | `a\$b` | `a\$b` | backslash preserved | | `tab\there` | `tab\there` | backslash preserved | | `a\\b` | `a\\b` | both backslashes preserved | The trailing-backslash line-continuation swallow I warned about **does not happen** — the `\` is part of the expanded *value*, not the template, so it is not consumed as a continuation. Output is byte-identical to the old `<<'EOF'` + Gitea-substitution form. The author's resolution table is correct. **No `printf` rewrite is needed**, and importantly, switching to `printf '%s\n'` would have been a lateral move, not a fix — both are byte-faithful here. **BLOCKER 2 (five-key guard checks presence not correctness) falls with it.** It was predicated on Blocker 1's mangling being real. Since the write is already byte-faithful, the guard's remaining job — catch a genuinely empty/missing secret via `^KEY=.+` — is exactly the right scope. It stands as correct defense-in-depth, not a failed mitigation. Withdrawn. --- ### ✅ New CI guard (commit `0e06626e`) — reviewed independently, it is sound. I verified the `Assert deploy-obs writes obs-secrets.env via an unquoted heredoc` step end-to-end on the current tree, not just by reading it: - **Self-tests pass** — the regex catches `<<'EOF'` and ignores `<<EOF`. These inline self-tests are the right touch: they fail CI if the detection regex itself rots, so the guard can't silently degrade to a no-op. - **Positive + negative assertions pass** against the real action file (unquoted heredoc present; no quoted delimiter anywhere). - **Trip test confirmed** — I copied the action, `sed`-re-quoted the delimiter to `<<'EOF'`, and the negative assertion fired (`exit 1`). A future re-quote genuinely fails CI instead of shipping broken obs auth green. This directly closes the gap Felix/Sara/Tobias flagged (the `<<EOF` invariant was comment-only) and it does so with a self-testing check — exactly the "automate detection of the vulnerability class" control I'd ask for. Residual: the regex requires `obs-secrets.env` and `<<` on one line, so a hypothetical re-quote *plus* a reformat onto separate lines could evade it — but that's not the realistic regression (a one-token re-quote on the existing line is), and the positive assertion would still flag the missing unquoted form. Acceptable. ### ✅ `set -euo pipefail` (commit `a4756493`) — genuine hardening. A failed `cp`/`mkdir`/`chmod` was previously swallowed, which could leave a stale or wrong `obs-secrets.env` on disk while the stack started anyway. `-e` now aborts the secret-write path on any failed file op. I confirmed all five `$VAR` references are backed by the `env:` block, so `-u` is safe (no unset-var foot-gun) and adds protection against a future typo'd var name. Good. --- ### ⚠️ One concern carries forward (not a blocker, acceptance-checklist item) **Log masking on the `input → env: → $VAR` path** is still framework-dependent and unverified on a live run. On the pending `workflow_dispatch` nightly, confirm: (a) none of the five values appear unmasked anywhere in the log, and (b) `docker compose config`'s *error* path doesn't echo `obs-secrets.env` contents on a bad secret. The author already has this on the deferred list — keep it there as a merge gate, not just "5 services healthy". --- ### Summary | # | Prior | Now | Note | |---|---|---|---| | 1 | 🚫 Blocker | ✅ Withdrawn | My shell-semantics error; value is inserted literally, never re-scanned. Verified empirically. | | 2 | 🚫 Blocker | ✅ Withdrawn | Predicated on #1; guard's presence-check scope is correct as-is. | | — | — | ✅ | CI re-quote guard sound + self-testing; trip-tested. | | — | — | ✅ | `set -euo pipefail` real reliability gain, `-u` safe. | | 3 | ⚠️ Concern | ⚠️ Concern | Live-run log-masking verification — merge gate, not code blocker. | The disproof is complete and I accept it without reservation. The secret-write path is byte-faithful, the invariant is now enforced in CI rather than by comment, and the failure mode is tested. Security verdict: **approved.** Nice work running the empirical test against the exact pattern — that's precisely how you close out a contested finding. Bugs are systemic; so are my mistakes, and this was a good one to correct in the open. — Nora "NullX" Steiner
marcel merged commit 03e2615fa7 into main 2026-06-02 19:57:20 +02:00
marcel deleted branch feat/issue-603-composite-deploy-actions 2026-06-02 19:57:21 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#715