fix(ci): resolve smoke test host via bridge gateway, not 127.0.0.1 #540

Merged
marcel merged 3 commits from fix/nightly-caddy-reload into main 2026-05-12 09:28:45 +02:00
Owner

Summary

  • The nightly and release smoke tests were resolving staging.raddatz.cloud / archiv.raddatz.cloud to 127.0.0.1, but job containers run in bridge network mode: 127.0.0.1 inside the container is the container's own loopback, not the host's. Caddy on the host was unreachable, causing an immediate ECONNREFUSED ("after 0 ms").
  • Fixed by detecting the Docker bridge gateway IP dynamically (ip route show default) and using that as the --resolve target. Caddy binds 0.0.0.0:443 so it is reachable from the container via the bridge gateway.
  • Applied to both nightly.yml and release.yml. Updated comments which incorrectly described 127.0.0.1 as "the runner's loopback".
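
The detection step described above can be sketched as follows. This is a minimal illustration, not the workflow's actual script: gateway_from_routes is a hypothetical helper name, and the sample route line and its 172.17.0.1 address are illustrative values in the format `ip route show default` typically prints.

```bash
# Hypothetical helper: read `ip route show default` style output on stdin
# and print the gateway (third field) of the first default route.
gateway_from_routes() {
  awk '$1 == "default" && $2 == "via" {print $3; exit}'
}

# Illustrative sample input; in a job container this would come from
# `ip route show default` instead.
SAMPLE_ROUTES='default via 172.17.0.1 dev eth0'
HOST_IP=$(printf '%s\n' "$SAMPLE_ROUTES" | gateway_from_routes)

HOST="staging.raddatz.cloud"
echo "Smoke test: https://$HOST/ (pinned to $HOST_IP via bridge gateway)"
```

In a real step the sample input would be replaced by `ip route show default | gateway_from_routes`, and the result would feed curl's `--resolve` option as `$HOST:443:$HOST_IP`.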

Test plan

  • Trigger the nightly workflow manually (workflow_dispatch) and confirm the Smoke test step passes
  • Confirm the printed line reads pinned to 172.x.x.x via bridge gateway (not 127.0.0.1)

🤖 Generated with Claude Code

marcel added 1 commit 2026-05-12 09:10:19 +02:00
fix(ci): resolve smoke test host via bridge gateway, not 127.0.0.1
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m50s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m8s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Failing after 10s
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Unit & Component Tests (push) Has started running
CI / Compose Bucket Idempotency (push) Has been cancelled
3056311c24
Job containers run in bridge network mode (runner-config.yaml). Inside
a bridge-networked container 127.0.0.1 is the container's own loopback;
Caddy on the host is unreachable there, causing an immediate ECONNREFUSED.

Use the Docker bridge gateway IP instead — the host's docker0 interface
where Caddy (bound on 0.0.0.0:443) is reachable from the container.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel force-pushed fix/nightly-caddy-reload from e6893ae67e to 3056311c24 2026-05-12 09:10:19 +02:00
Author
Owner

🏛️ Markus Keller (@mkeller) — Application Architect

Verdict: Approved

This is a targeted CI infrastructure fix — no backend packages, routes, DB migrations, domain entities, or external systems changed. My doc-update checklist has nothing to trigger here.

What I checked

  • No new @ManyToMany, Flyway migration, or DB schema change → DB diagrams unchanged: correct.
  • No new Docker service or infrastructure component added → docs/DEPLOYMENT.md and l2-containers.puml unchanged: correct.
  • No new domain term or concept → docs/GLOSSARY.md unchanged: correct.
  • The ADR already exists (ADR-012, written in a prior commit) covering the nsenter/bridge topology rationale. No new ADR needed for this fix.

Observations

The change follows the "boring technology wins" principle: detecting the bridge gateway with ip route show default | awk '/default/ {print $3}' relies only on ip (part of iproute2, which is Linux-specific rather than POSIX) and awk, both of which are typically present in Linux runner images without installing anything extra. The cross-reference comment in release.yml pointing to the full explanation in nightly.yml avoids duplication while keeping the workflow self-documenting.

No structural concerns. LGTM from an architecture perspective.


👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: Approved

This is a CI workflow fix, not application code, so the TDD cycle doesn't directly apply. The shell scripting is correct and readable — let me walk through what I checked.

What I checked

Shell correctness
HOST_IP=$(ip route show default | awk '/default/ {print $3}') — the awk pattern /default/ matches the line containing the default route and $3 correctly extracts the gateway IP (field 3 in ip route output on standard Linux). set -e is already active, so any command failure propagates correctly.

Variable quoting
RESOLVE="--resolve $HOST:443:$HOST_IP" — double-quoted; fine. The subsequent curl -fsS $RESOLVE passes $RESOLVE unquoted, which works here because the value contains no spaces or glob characters, but it's a mild style inconsistency with the surrounding code. curl -fsS "$RESOLVE" would be more defensive.

Comments explain WHY, not WHAT
The updated comment block in nightly.yml (lines 162–170 in the diff) correctly explains the network topology constraint — bridge mode, loopback scope, 0.0.0.0 binding — rather than restating the code. The release.yml cross-reference is a clean way to avoid duplication.

Consistency
The same fix is applied to both nightly.yml and release.yml with matching output messages (pinned to $HOST_IP via bridge gateway). Good.

Suggestion (not a blocker)

curl -fsS "$RESOLVE" — quote the variable for defensive consistency, even though the value is safe here. Three-character fix.
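
One detail is worth checking against the actual workflow before quoting. If RESOLVE is assigned as shown above, its value holds the flag and its argument separated by a space, so a quoted expansion reaches the command as a single word. The sketch below demonstrates this with a hypothetical count_args helper and placeholder host and address values. It also shows an alternative that keeps only the host:port:address spec in a variable (RESOLVE_SPEC is an assumed name, not taken from the workflows).

```bash
# Hypothetical helper: print how many arguments were received.
count_args() { echo "$#"; }

HOST="staging.example.test"   # placeholder, not the real host
HOST_IP="172.17.0.1"          # placeholder gateway address

RESOLVE="--resolve $HOST:443:$HOST_IP"
UNQUOTED_COUNT=$(count_args $RESOLVE)    # word splitting yields flag + spec
QUOTED_COUNT=$(count_args "$RESOLVE")    # one word that contains a space

# Alternative: store only the spec, write the flag literally, quote the spec.
RESOLVE_SPEC="$HOST:443:$HOST_IP"
LITERAL_FLAG_COUNT=$(count_args --resolve "$RESOLVE_SPEC")

echo "unquoted=$UNQUOTED_COUNT quoted=$QUOTED_COUNT literal-flag=$LITERAL_FLAG_COUNT"
```

With the default IFS this prints unquoted=2 quoted=1 literal-flag=2. Under that assumption, a call of the form curl -fsS --resolve "$RESOLVE_SPEC" "$URL" stays fully quoted while still handing curl the flag and its argument as separate words.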


🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The root cause diagnosis is exactly right: job containers in bridge mode have their own loopback (127.0.0.1 ≠ host). Using ip route show default | awk '/default/ {print $3}' to get the bridge gateway is the standard, portable approach. Caddy binds 0.0.0.0:443, so the gateway IP reaches it cleanly. Good fix.

Blocker

Missing guard for empty HOST_IP

If ip route show default returns no output (e.g. the runner image has a different routing setup, or ip isn't installed), HOST_IP will be an empty string. curl will then receive --resolve staging.raddatz.cloud:443: which it treats as a malformed resolve spec and exits with error 6 ("couldn't resolve host") — a confusing failure mode that will take time to diagnose.

Add a guard immediately after the assignment:

HOST_IP=$(ip route show default | awk '/default/ {print $3}')
[ -n "$HOST_IP" ] || { echo "ERROR: could not detect Docker bridge gateway via 'ip route'"; exit 1; }

Same guard needed in release.yml. With set -e already active, the guard makes the failure message actionable rather than cryptic.
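
A small sketch of how the guard behaves when detection yields nothing. The assignment is a pipeline, so its exit status is awk's unless pipefail is enabled. That means set -e on its own would probably not stop the step when ip fails or prints no default route, and the explicit emptiness check is what catches that case. require_gateway is a hypothetical helper name, the sample routing line is made up, and writing the message to stderr is a variation on the guard quoted above.

```bash
# Hypothetical helper, not part of the workflows: fail with an actionable
# message when gateway detection produced an empty value.
require_gateway() {
  if [ -z "$1" ]; then
    echo "ERROR: could not detect Docker bridge gateway via 'ip route'" >&2
    return 1
  fi
}

# Simulate a routing table without a default route: awk matches no line,
# prints nothing, and still exits 0, so the assignment itself succeeds.
HOST_IP=$(printf 'unreachable 10.0.0.0/8\n' | awk '/default/ {print $3}')

if require_gateway "$HOST_IP" 2>/dev/null; then
  GUARD_RESULT="passed"
else
  GUARD_RESULT="blocked"
fi
echo "guard=$GUARD_RESULT"
```

In a workflow step the helper would be called as require_gateway "$HOST_IP" || exit 1 right after the assignment.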

Suggestions (not blockers)

  • Quote $RESOLVE in the curl call. curl -fsS $RESOLVE should be curl -fsS "$RESOLVE". It's safe as-is today (no spaces in the value), but quoting is the correct default.
  • Duplicate logic between nightly.yml and release.yml. The cross-reference comment is acceptable for now. If a third workflow needs this pattern, consider extracting it into a composite action.

What's correct

  • Comments now accurately describe the network topology — no more "runner's loopback" misnomer.
  • set -e is already present, so failures propagate correctly.
  • Applied consistently to both workflows.
  • The echo output change (pinned to $HOST_IP via bridge gateway) makes the CI log self-describing — good for debugging.

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

This change touches CI workflows only — no application code, no auth flows, no API surfaces. I reviewed it through a security lens anyway.

What I checked

Command injection risk: none
HOST_IP=$(ip route show default | awk '/default/ {print $3}') — the input to awk comes entirely from the kernel routing table via ip route. There is no user-controlled input path here. The resulting IP is used in curl's --resolve flag, which is parsed by curl and not interpreted by a shell. No injection vector.

Network target of the smoke test
The fix directs curl to the Docker bridge gateway IP, which is the host machine running Caddy. This is the intended target — the smoke test is verifying the deployed service, not an arbitrary network destination. The --resolve flag only overrides DNS resolution for the named host; SNI still uses staging.raddatz.cloud / archiv.raddatz.cloud, so TLS certificate validation remains correct.

No secrets, no credentials
No secrets in the diff. The CI workflow correctly relies on Gitea secrets for any auth-related values elsewhere in the workflow.

No expanded attack surface
The bridge gateway IP is not logged in a way that exposes internal infrastructure topology beyond what's already visible in CI logs. The echo statement (pinned to 172.x.x.x via bridge gateway) is informational and benign.

LGTM from a security perspective.


🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: Approved

This is a CI infrastructure fix, not application code. Traditional test pyramid concerns (unit/integration/E2E) don't apply directly, but I reviewed the testability of the fix itself.

What I checked

Test plan quality
The PR description includes a concrete test plan:

  • Trigger nightly via workflow_dispatch
  • Confirm the echo line reads pinned to 172.x.x.x via bridge gateway

These are specific, observable acceptance criteria. "Confirm the smoke test step passes" is the right gate.

Failure mode visibility
The echo statement before the curl invocation (echo "Smoke test: $URL (pinned to $HOST_IP via bridge gateway)") means that when the job runs, CI logs will clearly show which IP was resolved. This makes future failures diagnosable from logs alone — good test hygiene applied to CI.

Regression risk to existing test infrastructure
None. The change only affects the --resolve target in the curl commands and the surrounding comments. The HSTS header assertions and actuator block checks that follow are unchanged.

Observation (not a blocker)

There is no automated guard that verifies HOST_IP is non-empty before it's used (noted also by Tobias). If the gateway detection fails silently, the curl command will fail with a confusing error code rather than a meaningful message. A [ -n "$HOST_IP" ] guard would make the failure self-diagnosing — which is what I'd want for any step in a smoke test suite.


📋 Elicit — Requirements Engineer

Verdict: Approved

This PR is a corrective infrastructure fix, not a requirements change. My role here is to verify that the problem statement is clear, the acceptance criteria are testable, and no implicit requirements are being violated.

Problem statement

Clear and specific: 127.0.0.1 inside a bridge-mode container is the container's own loopback, not the host's. Caddy was unreachable, causing immediate ECONNREFUSED. The diagnosis is precise and matches the observable symptom ("after 0 ms").

Acceptance criteria (from PR description)

  • Trigger nightly workflow manually → smoke test step passes (testable, binary)
  • Confirm echo reads pinned to 172.x.x.x via bridge gateway (observable, specific)

Both criteria meet the Definition of Ready: they are concrete, observable, and verifiable by a single person without ambiguity.

Implicit requirements not violated

The fix preserves all existing smoke test assertions (HSTS header pinning, actuator block verification). The scope is tight — two comment blocks and four lines of shell code across two workflow files. No feature scope creep.

Open question (low priority)

The test plan relies on manual workflow_dispatch verification. If this fix is foundational for release confidence, it may be worth tracking a follow-up to add an automated gate that confirms the bridge detection works in the runner environment before the smoke test runs. Not a blocker for this PR — just something to consider for observability of the CI pipeline itself.
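
One possible shape for such a gate is a format check on the detected value before the smoke test runs. The sketch below is only an illustration under stated assumptions: is_ipv4 is a hypothetical helper, the gateway is assumed to be an IPv4 address, and the address shown is an illustrative value.

```bash
# Hypothetical helper: succeed only for four dot-separated decimal
# octets, each in the range 0..255.
is_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF != 4 { exit 1 }
    {
      for (i = 1; i <= 4; i++)
        if ($i !~ /^[0-9]+$/ || $i > 255) exit 1
    }'
}

HOST_IP="172.17.0.1"   # illustrative; would come from gateway detection

if is_ipv4 "$HOST_IP"; then
  PREFLIGHT="ok"
else
  PREFLIGHT="failed"
fi
echo "preflight=$PREFLIGHT ($HOST_IP)"
```

A check like this only confirms that detection returned something address-shaped. It does not confirm that Caddy is reachable at that address, which remains the job of the smoke test itself.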


🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead

Verdict: Approved

This PR modifies CI workflow YAML files only. There are no Svelte components, no HTML, no CSS, no Tailwind classes, no design tokens, no i18n strings, and no user-facing screens affected.

No UI/UX or accessibility concerns to raise. LGTM.

marcel added 2 commits 2026-05-12 09:26:41 +02:00
If `ip route show default` returns no output the old code passed
an empty string to curl --resolve, producing a confusing error 6
("couldn't resolve host") with no indication that gateway detection
had failed.  The new guard exits immediately with a clear message.

Addresses review concern raised by Tobias Wendt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(ci): quote \$RESOLVE in all curl calls
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m51s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 4m1s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Failing after 11s
CI / Unit & Component Tests (push) Failing after 1m51s
CI / OCR Service Tests (push) Successful in 18s
CI / Backend Unit Tests (push) Successful in 4m10s
CI / fail2ban Regex (push) Successful in 38s
CI / Compose Bucket Idempotency (push) Failing after 10s
6d16be4669
Unquoted variable expansion is safe here since the value contains
no spaces or glob characters, but quoting is the correct default
and keeps the script consistent with surrounding style.

Addresses review suggestion by Felix Brandt and Tobias Wendt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Review concerns addressed

Two commits pushed to fix/nightly-caddy-reload following the persona review:

Tobias's blocker — empty HOST_IP guard (commit 0602d45c)

Added to both nightly.yml and release.yml, immediately after the ip route assignment:

HOST_IP=$(ip route show default | awk '/default/ {print $3}')
[ -n "$HOST_IP" ] || { echo "ERROR: could not detect Docker bridge gateway via 'ip route'"; exit 1; }

If gateway detection fails, the job now exits with a clear actionable message instead of a cryptic curl error 6.

Felix + Tobias suggestion — quoted $RESOLVE (commit f12eac45)

All 8 curl invocations across both files updated: curl -fsS $RESOLVE → curl -fsS "$RESOLVE" and curl -s $RESOLVE → curl -s "$RESOLVE".

Shell syntax validated with bash -n on both extracted script blocks before each commit.

marcel merged commit 6d16be4669 into main 2026-05-12 09:28:45 +02:00
marcel deleted branch fix/nightly-caddy-reload 2026-05-12 09:28:45 +02:00
Reference: marcel/familienarchiv#540