fix(compose): mark create-buckets as one-shot for up --wait (#510) #511

Merged
marcel merged 1 commit from fix/issue-510-compose-wait-oneshot-create-buckets into main 2026-05-11 17:00:00 +02:00
Owner

Summary

Closes #510.

docker compose up -d --wait was exiting 1 even with every service healthy, because the one-shot create-buckets exits 0 and --wait interpreted that as "not running, fail". The entire stack came up correctly on staging (verified via direct curl: /login → 200, HSTS + Permissions-Policy headers present), but the workflow gate failed before the smoke step could run.

Fix

  1. create-buckets: restart: "no" — declares one-shot intent.
  2. backend.depends_on: add create-buckets: service_completed_successfully — tells Compose v2.20+ that create-buckets is a one-shot that must complete successfully, so --wait treats exited(0) as the target state. It also closes a latent race: the backend currently starts in parallel with create-buckets, even though it needs the archiv-app policy to be bound before its MinIO client can authenticate. A minimal excerpt of both keys follows below.
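
For reference, a minimal excerpt of what the two paired keys look like in docker-compose.prod.yml (a sketch only; image tags, healthchecks, and the other depends_on entries are omitted):

services:
  create-buckets:
    # One-shot bucket/policy bootstrap. See #510; must not be restarted after a clean exit.
    restart: "no"

  backend:
    depends_on:
      create-buckets:
        # exited(0) is the target state for this dependency; this also gates backend
        # start on the archiv-app policy being bound before its MinIO client authenticates.
        condition: service_completed_successfully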

Verified locally

  • docker compose -f docker-compose.prod.yml ... config --quiet parses
  • Resolved config shows backend.depends_on.create-buckets.condition: service_completed_successfully and create-buckets.restart: 'no'

Test plan after merge

  • Re-trigger nightly.yml against staging
  • create-buckets exits 0 and the workflow's up -d --wait step now succeeds (no exitcode '1': failure)
  • Workflow's own smoke step runs and asserts /login → 200, HSTS/Permissions-Policy headers, /actuator/health → 404
  • (Spoiler: the /actuator/health → 404 assertion will likely fail next — Caddy's respond @actuator 404 snippet isn't taking precedence over the handle blocks. We saw /actuator/health → 302 from a direct curl. Separate issue, separate fix.)
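
For illustration, a hypothetical sketch of the assertions that smoke step makes (the real step lives in nightly.yml; STAGING_URL is an assumed placeholder, not the project's actual variable name):

- name: smoke
  run: |
    set -euo pipefail
    # /login must render, with the hardening headers present
    test "$(curl -s -o /dev/null -w '%{http_code}' "$STAGING_URL/login")" = "200"
    curl -sI "$STAGING_URL/login" | grep -qi '^strict-transport-security:'
    curl -sI "$STAGING_URL/login" | grep -qi '^permissions-policy:'
    # actuator must not be reachable from the outside
    test "$(curl -s -o /dev/null -w '%{http_code}' "$STAGING_URL/actuator/health")" = "404"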

🤖 Generated with Claude Code

marcel added 1 commit 2026-05-11 16:33:47 +02:00
fix(compose): mark create-buckets as one-shot for up --wait
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m47s
CI / OCR Service Tests (push) Successful in 17s
CI / Backend Unit Tests (push) Successful in 4m12s
CI / fail2ban Regex (push) Successful in 37s
CI / Compose Bucket Idempotency (push) Successful in 56s
CI / Unit & Component Tests (pull_request) Failing after 2m49s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m13s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 58s
Closes #510.

`docker compose up -d --wait` exits 1 even when every service is
healthy because the one-shot `create-buckets` exits 0 and --wait
expects "running". The whole stack came up fine on staging, but the
workflow gate failed before the smoke step could run.

Two changes:

1. create-buckets: `restart: "no"` declares one-shot intent.
2. backend.depends_on: add `create-buckets: service_completed_successfully`.

With both, compose v2.20+ understands create-buckets is a one-shot
that must complete successfully, and --wait treats exited(0) as the
target state. Backend startup now also correctly gates on bucket
bootstrap (closes a latent race where backend could start before
the archiv-app policy was bound).

Verified `docker compose config --quiet` parses and the resolved
config shows the right dependency graph.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

This is the textbook fix for the up -d --wait + one-shot pattern. Both halves are required and both are present:

  1. create-buckets.restart: "no" — declares one-shot intent. Without it, compose's reconciler keeps trying to "bring it up": the project's de-facto default is unless-stopped (declared per service), create-buckets declared nothing, and a clean exit therefore looked like "stopped, restart me."
  2. backend.depends_on.create-buckets.condition: service_completed_successfully — tells --wait what "ready" means for a one-shot. Compose treats exited(0) as the success terminal state for this dependency.

What I like

  • The inline comments name why, not what — exactly the "comments explain why a value was chosen" rule. The # See #510. cross-reference plus the latent-race explanation in backend.depends_on is the kind of artifact a future on-call engineer will actually thank you for.
  • PR body documents that docker compose ... config --quiet parses and the resolved config shows both keys. That is the right local verification for a Compose change.
  • The race-closure side effect (backend no longer starts in parallel with bucket bootstrap) is real and worth the negligible ~2-3s startup-time cost on a clean staging deploy.

Suggestions (non-blocking)

  • The PR body mentions Compose v5.1.3 on the runner. For posterity, service_completed_successfully requires Docker Compose v2.20+ (and the canonical change is from 2023). The runner is well within that. No action — just confirming the constraint is satisfied.
  • After this lands, please run the actual nightly workflow against staging end-to-end and paste the green run URL into #510 before closing. The PR body's checklist already covers that; flagging it as the only thing left to verify the fix works in the wild.
  • The PR body also flags a follow-up respond @actuator 404 Caddy snippet not taking precedence — file that as a separate issue immediately so it doesn't get lost. Out of scope for this PR.

What this does not change

  • create-buckets still runs on every up -d. That's correct — bootstrap.sh is idempotent and the cost is one MinIO admin call.
  • Renovate-pinned image versions, named volumes, network namespacing, all unchanged. Surgical fix, no scope creep.

Ship it. This unblocks the staging gate.

Author
Owner

Markus Keller — Application Architect

Verdict: Approved

This is the right fix at the right layer. The startup ordering between MinIO bootstrap (a one-shot job) and the backend application is a topology concern, and topology concerns belong in the Compose file — not in application retry logic or shell loops around docker compose up.

What I like

  • Boring technology choice. No sidecar, no init container shim, no shell wrapper around up --wait. The fix is two declarative keys in YAML that compose has supported since v2.20. That is the architectural ideal: solve the problem with the platform's primitives.
  • Closes a latent invariant violation. Before this PR, the dependency graph stated backend → {db, minio, ocr-service} but the backend actually depends on {db, minio, ocr-service, archiv-app-policy-bound}. The graph lied. Now it doesn't. Mis-stated dependencies become real bugs at the worst possible moment (cold start after docker compose down -v).
  • No layer leakage. This was tempting to "fix" by adding a startup retry loop in S3Config or MinioClientFactory. That would push infrastructure ordering into application code — exactly the kind of cross-layer accidental complexity I push back on. You kept it where it belongs.

Architectural observations (non-blocking)

  • create-buckets is now structurally a job step, not a service. Compose's "service" abstraction is overloaded here — restart: "no" + service_completed_successfully is compose's way of saying "this is a job." Future readers will understand it; no ADR needed.
  • The dependency chain is now db, minio, ocr-service, create-buckets → backend → frontend. Linear, no cycles, no fan-out surprises. Good.
  • I considered whether frontend should also gate on create-buckets — it shouldn't, because frontend → backend → create-buckets transitively gives the same guarantee. Don't add a redundant edge.

Documentation

Per my doc-update table: "New Docker service or infrastructure component" maps to docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md. This PR does not add a service — it tightens an existing dependency. No diagram update required. The PR body and the inline # See #510 comment are sufficient memory.

If create-buckets is not already represented in l2-containers.puml, that is a pre-existing gap (predates this PR) — file as a small infra-docs ticket separately.

Ship it.

Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: Approved with a small concern

Clean, minimal, surgical. Ten added lines, zero deleted, one file. The change matches the issue exactly and the inline comments document why, not what — which is the rule.

What I like

  • Comments earn their keep. Both comment blocks (lines 89-92 on create-buckets, lines 167-170 on backend.depends_on) explain the why: what --wait does, why one-shot needs explicit declaration, what race the dependency closes. No "what does this YAML key do" noise. This is the standard I want for every config comment.
  • Single logical change per commit. This is one cause with two paired symptoms, addressed by two paired diffs. Atomic.
  • Cross-references. Both comment blocks reference #510. Future me, six months from now, traces the YAML straight to the issue thread that explains it.

Concern (process, not code)

The TDD discipline I usually push for says: write the failing test first, then the fix. There is no automated regression test in this PR proving that docker compose up -d --wait exits 0 with create-buckets as a dependency. The manual verification listed in the PR body (docker compose ... config --quiet + resolved-config inspection) verifies that compose parses the change — not that --wait actually treats exited(0) as success.

That regression test would naturally live in CI as: spin up the prod-compose stack, assert docker compose ... up -d --wait exits 0, assert all services are running or Exited (0). The closest existing CI step is the nightly workflow itself — which is exactly where the bug surfaced.

Concrete proposal (non-blocking, separate follow-up issue): Add a compose-prod-smoke job to a workflow that runs docker compose -f docker-compose.prod.yml ... up -d --wait against a throw-away stack, asserts exit 0, asserts create-buckets is in Exited (0) state, then tears down. That job catches future regressions of this exact class — e.g. someone removing restart: "no" "because the linter doesn't like it."

This is not a blocker for #510 — the manual verification + the upcoming nightly run is enough proof for this PR. Just please file the smoke-job follow-up so the next time --wait semantics shift, CI catches it before staging does.

One pedantic nit (truly minor)

The PR body's checklist includes a spoiler about /actuator/health → 302. Worth opening that issue now, before this PR merges, so the link is in place when the post-merge test plan runs. Otherwise it lives only in this PR's history.

Ship it. The fix itself is correct.

Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

This is a deploy-orchestration fix. It does not touch the application security surface (no auth changes, no permission rules, no exposed ports, no env-var disclosure, no secrets handling). I checked anyway because "orchestration changes" is exactly the category that quietly introduces security regressions.

What I verified

  • No new ports exposed. Diff touches only restart: and depends_on:. No ports: additions. Backend still bound to 127.0.0.1:${PORT_BACKEND}:8080 — Caddy continues to front external traffic. Good.
  • No credential changes. MINIO_PASSWORD and MINIO_APP_PASSWORD env vars on create-buckets are unchanged. The bucket-bootstrap still creates the least-privilege archiv-app service account (per infra/minio/bootstrap.sh) and the backend continues to use S3_ACCESS_KEY: archiv-app (the scoped account), not MinIO root. The principle "MinIO root credentials are not application credentials" is preserved.
  • No actuator exposure regression. Backend's expose/ports blocks unchanged.
  • No depends_on cycle that would block startup and trigger a fail-open mode. I checked the resolved graph: db, minio, ocr-service, create-buckets → backend → frontend. Linear DAG. No cycle. --wait will either succeed or fail-closed (workflow exit 1, deploy aborted) — that is the correct security posture.

Subtle positive: the race this closes is also a security-relevant invariant

The PR body notes the backend currently starts in parallel with create-buckets and could race the archiv-app-policy bind. From a security perspective:

  • Before this PR, the backend's MinIO client could attempt operations before its least-privilege policy was bound. The race window was narrow (likely milliseconds on a warm host) and the failure mode was "authentication fails, backend retries" — not "backend runs as root MinIO user." So this was a reliability bug, not an auth-bypass.
  • After this PR, the policy is guaranteed bound before the backend starts. That is now an enforced invariant rather than a tolerated race. Same security posture, but enforced rather than implied.

Suggestions (non-blocking)

  • None for this PR. The change is correctly scoped.
  • Long-term tracking item (separate issue, low priority): consider adding a Semgrep/yamllint rule that flags any depends_on entry pointing at a service with restart: "no" but missing condition: service_completed_successfully. That codifies the lesson from #510 so it can never silently regress.
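
One possible shape for that guard, sketched here with jq over the resolved config rather than Semgrep/yamllint (assumes `docker compose config --format json` is available on the runner):

# Hypothetical CI step: fail when any depends_on points at a restart:"no" one-shot
# without condition: service_completed_successfully.
- name: lint one-shot depends_on conditions
  run: |
    docker compose -f docker-compose.prod.yml config --format json \
      | jq -e '
          .services as $svc
          | [ $svc | to_entries[]
              | .value.depends_on // {} | to_entries[]
              | select(($svc[.key].restart == "no")
                       and (.value.condition != "service_completed_successfully")) ]
          | length == 0'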

Ship it. Nothing to fix here from a security angle.

Author
Owner

Sara Holt — QA Engineer & Test Strategist

Verdict: Approved with concerns

The fix itself is correct and minimal. My concern is entirely about regression-test coverage — there is none in this PR for the exact failure mode that triggered #510. That is fixable as a follow-up, not a blocker, but I want it on record.

What's good

  • Manual verification listed in the PR body (config --quiet parse + resolved-config inspection of both keys) is appropriate for a compose-only diff. It proves compose accepts the change.
  • The bug was caught by the nightly workflow's up -d --wait step — i.e. CI infrastructure caught a production-orchestration bug before it reached the running app. The pipeline did its job. That is a healthy gate working as designed.
  • Test plan in the PR body is correctly structured: "re-trigger nightly → assert exit 0 → assert smoke step runs." Each assertion is binary and verifiable. Good acceptance criteria.

Concern: missing automated regression test

The class of bug here is: "compose orchestration semantics changed and up --wait no longer treats this stack as healthy." That class is not currently covered by any test in CI:

  • Unit/integration tests run inside the backend JVM — they cannot see compose orchestration.
  • E2E Playwright tests run after up --wait has succeeded — by definition they cannot detect a --wait regression because they never run if --wait fails.
  • The nightly workflow itself is a regression test of sorts, but it caught the bug in production (staging). That is the wrong layer — staging is too late.

Concrete test-pyramid recommendation (separate follow-up, not a blocker):

# .gitea/workflows/compose-smoke.yml — runs on PRs that touch docker-compose*.yml
on:
  pull_request:
    paths:
      - "docker-compose*.yml"
jobs:
  prod-compose-up-wait:
    runs-on: gitea-runner
    steps:
      - uses: actions/checkout@v4
      - name: up -d --wait
        run: |
          docker compose -f docker-compose.prod.yml -p smoke up -d --wait
      - name: assert create-buckets exited 0
        run: |
          # --all so the exited one-shot is listed; compose >= v2.21 emits one JSON
          # object per line, older versions emit an array, so handle both.
          docker compose -f docker-compose.prod.yml -p smoke ps --all --format json create-buckets \
            | jq -e 'if type == "array" then .[0] else . end
                     | .State == "exited" and .ExitCode == 0'
      - name: teardown
        if: always()
        run: docker compose -f docker-compose.prod.yml -p smoke down -v

This sits at the "infrastructure smoke" layer of the pyramid — below E2E, above unit. Runs in ~2 minutes against a clean stack. Catches:

  • Removing restart: "no" from create-buckets
  • Removing the service_completed_successfully condition
  • A future compose version changing --wait semantics again
  • Any new one-shot service being introduced without the same paired declaration

The job needs prod env vars (.env.production is gitignored — pass MINIO_PASSWORD etc. via Gitea secrets to a .env.smoke overlay). Standard pattern.
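
A hypothetical sketch of that wiring (secret names mirror the env vars named above; Gitea Actions accepts the same secrets syntax as GitHub Actions):

- name: write .env.smoke from secrets
  run: |
    printf '%s\n' \
      "MINIO_PASSWORD=${{ secrets.MINIO_PASSWORD }}" \
      "MINIO_APP_PASSWORD=${{ secrets.MINIO_APP_PASSWORD }}" \
      > .env.smoke
- name: up -d --wait
  run: docker compose --env-file .env.smoke -f docker-compose.prod.yml -p smoke up -d --wait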

Recommendation

  • For this PR: ship. Manual verification + the post-merge nightly run is sufficient one-time proof.
  • Follow-up issue (please file): "Add prod-compose up --wait smoke job to CI" — labelled devops, sized S. Title format: infra(ci): add docker compose prod up --wait smoke job.

Things I checked and am satisfied with

  • No test files were modified — appropriate, since no application code changed.
  • No flyway migrations, no entity changes, no controller signature changes — so the existing test suite remains correct without modification.
  • docker-compose.ci.yml is unchanged, so CI E2E behavior is unaffected by this PR.

Approved with the follow-up issue as a soft requirement.

Author
Owner

Leonie Voss — UX Design Lead

Verdict: Approved — out of scope for UX

This PR is a Docker Compose orchestration fix. It does not touch any .svelte component, design token, Tailwind class, ARIA attribute, or messages/{de,en,es}.json i18n string. There is nothing for me to evaluate from a UI/UX or accessibility angle.

I checked the diff to confirm scope: the change is two YAML edits in docker-compose.prod.yml. No frontend impact.

Indirect UX observation (positive)

The reason this fix matters from a user-experience angle is that it unblocks the staging deployment gate. A reliable deploy pipeline is what lets the team ship UX improvements quickly. So while there is nothing visual in the PR itself, the fix enables faster delivery of the things that do affect users.

LGTM. Defer to Tobias, Markus, and Sara for the substantive review.

Author
Owner

"Elicit" — Requirements Engineer (Brownfield mode)

Verdict: Approved with minor process notes

This is a brownfield bug fix discovered in flight during staging. The requirements artifact is the linked Gitea issue (#510), and the PR body itself reads as an excellent specification: problem statement, root cause, fix, verification done, test plan to verify after merge. That is the standard I would set for every infrastructure-fix PR.

What's well-specified

  • Clear problem statement in the PR body: "docker compose up -d --wait was exiting 1 even with every service healthy." This is a falsifiable, observable claim — not "deploys feel flaky."
  • Root cause named precisely: one-shot exit(0) being interpreted as "not running, fail" by --wait. The PR identifies why the bug exists, not just what symptom it caused.
  • Acceptance criteria are testable (restated here in Given-When-Then style):
    • Given the staging stack is brought up with docker compose -f docker-compose.prod.yml ... up -d --wait
    • When create-buckets exits with code 0
    • Then up -d --wait exits 0
    • And the nightly workflow's smoke step runs
    • And /login → 200, HSTS/Permissions-Policy headers are present
  • Non-goals explicitly named: the /actuator/health → 302 issue is called out as a separate concern. That is exactly how scope creep is prevented.

Minor process observations (non-blocking)

  1. Make the spoilered follow-up a real issue before merging. The PR body says "Spoiler: the /actuator/health → 404 assertion will likely fail next." That is a known future bug. Per project conventions (specs go into Gitea issues, not markdown files), please file the Caddy actuator-respond issue before merging so the link can be inserted into the PR body. Otherwise that knowledge lives only in this PR's text and may be forgotten.
  2. Definition of Done check for this PR:
    • [x] Problem is one-sentence stateable: "compose up --wait mis-classifies one-shot exit(0) as failure"
    • [x] Acceptance criteria written (test plan in PR body)
    • [x] Verified locally (config parse + resolved-config inspection)
    • [ ] Tested at the smallest supported deployment unit (the staging stack post-merge)
    • [x] Issue #510 linked and will auto-close
    • [n/a] Mobile/accessibility check — N/A for an orchestration fix
  3. Acceptance criterion that's missing from the test plan: "And no service is left in a restart loop afterwards." Compose's --wait historically had edge cases where a one-shot that re-runs would be flagged. Worth eyeballing docker compose ps after the post-merge nightly run; a one-line check is sketched below.
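
A minimal sketch of that post-deploy check (could be appended to the nightly workflow; `--filter status=restarting` is a standard compose ps filter):

- name: assert nothing is restart-looping
  run: |
    test -z "$(docker compose -f docker-compose.prod.yml ps --filter status=restarting -q)"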

Traceability

  • Goal: reliable automated staging deploys
  • Persona: solo developer (Marcel) running CI pipeline
  • JTBD: "When the staging stack comes up healthy, I want up -d --wait to exit 0, so I can let CI gate the deploy without false negatives."
  • Story: This PR
  • AC: PR body's test plan section
  • Linked NFR: Reliability (NFR-REL — deployment automation must be deterministic)

Recommendation

Ship. Two soft asks: (a) file the actuator follow-up issue before merge, (b) add the restart-loop post-deploy check to the test plan checklist.

## "Elicit" — Requirements Engineer (Brownfield mode) **Verdict: Approved with minor process notes** This is a brownfield bug fix discovered in flight during staging. The requirements artifact is the linked Gitea issue (#510), and the PR body itself reads as an excellent specification: problem statement, root cause, fix, verification done, test plan to verify after merge. That is the standard I would set for every infrastructure-fix PR. ### What's well-specified - **Clear problem statement in the PR body**: "`docker compose up -d --wait` was exiting 1 even with every service healthy." This is a falsifiable, observable claim — not "deploys feel flaky." - **Root cause named precisely**: one-shot exit(0) being interpreted as "not running, fail" by `--wait`. The PR identifies *why* the bug exists, not just what symptom it caused. - **Acceptance criteria are testable** (in Given-When-Then-style restated): - Given the staging stack is brought up with `docker compose -f docker-compose.prod.yml ... up -d --wait` - When `create-buckets` exits with code 0 - Then `up -d --wait` exits 0 - And the nightly workflow's smoke step runs - And `/login → 200`, HSTS/Permissions-Policy headers are present - **Non-goals explicitly named**: the `/actuator/health → 302` issue is called out as a separate concern. That is exactly how scope creep is prevented. ### Minor process observations (non-blocking) 1. **Make the spoilered follow-up a real issue before merging.** The PR body says "Spoiler: the `/actuator/health → 404` assertion will likely fail next." That is a known future bug. Per project conventions (specs go into Gitea issues, not markdown files), please file the Caddy actuator-respond issue *before* merging so the link can be inserted into the PR body. Otherwise that knowledge lives only in this PR's text and may be forgotten. 2. **Definition of Done check** for this PR: - [x] Problem is one-sentence stateable: "compose `up --wait` mis-classifies one-shot exit(0) as failure" - [x] Acceptance criteria written (test plan in PR body) - [x] Verified locally (config parse + resolved-config inspection) - [ ] Tested at the smallest supported deployment unit (the staging stack post-merge) - [x] Issue #510 linked and will auto-close - [n/a] Mobile/accessibility check — N/A for an orchestration fix 3. **Acceptance criterion that's missing from the test plan**: "And no service is left in `restart` loop afterwards." Compose's `--wait` historically had edge cases where a one-shot that re-runs would be flagged. Worth eyeballing `docker compose ps` after the post-merge nightly run. ### Traceability - Goal: reliable automated staging deploys - Persona: solo developer (Marcel) running CI pipeline - JTBD: "When the staging stack comes up healthy, I want `up -d --wait` to exit 0, so I can let CI gate the deploy without false negatives." - Story: This PR - AC: PR body's test plan section - Linked NFR: Reliability (NFR-REL — deployment automation must be deterministic) ### Recommendation Ship. Two soft asks: (a) file the actuator follow-up issue *before* merge, (b) add the `restart-loop` post-deploy check to the test plan checklist.
marcel merged commit 5d2888e038 into main 2026-05-11 17:00:00 +02:00
marcel deleted branch fix/issue-510-compose-wait-oneshot-create-buckets 2026-05-11 17:00:00 +02:00

Reference: marcel/familienarchiv#511