fix(compose): mark create-buckets as one-shot for up --wait (#510)
#511
Summary
Closes #510.
`docker compose up -d --wait` was exiting 1 even with every service healthy, because the one-shot `create-buckets` exits 0 and `--wait` interpreted that as "not running, fail". The entire stack came up correctly on staging (verified via direct curl: /login → 200, HSTS + Permissions-Policy headers present), but the workflow gate failed before the smoke step could run.
Fix
- `create-buckets`: `restart: "no"` — declares one-shot intent.
- `backend.depends_on`: add `create-buckets: service_completed_successfully` — tells compose v2.20+ that create-buckets is a one-shot that must complete successfully, and `--wait` treats `exited(0)` as the target state. Also closes a latent race: the backend currently starts in parallel with create-buckets, even though it needs the `archiv-app` policy to be bound before its MinIO client can authenticate.
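In YAML terms the change amounts to roughly the following (a minimal sketch — everything else in each service definition of `docker-compose.prod.yml` is omitted):

```yaml
# Sketch of the two added keys; all other service configuration omitted.
services:
  create-buckets:
    restart: "no"   # one-shot: a clean exit is the desired terminal state, not a failure

  backend:
    depends_on:
      create-buckets:
        # Compose v2.20+ / --wait: exited(0) counts as "ready" for this dependency,
        # and the backend no longer races the archiv-app policy bind.
        condition: service_completed_successfully
```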
Verified locally
- `docker compose -f docker-compose.prod.yml ... config --quiet` parses; the resolved config shows `backend.depends_on.create-buckets.condition: service_completed_successfully` and `create-buckets.restart: 'no'`.
Test plan after merge
- Run `nightly.yml` against staging.
- `create-buckets` exits 0 and the workflow's `up -d --wait` step now succeeds (no `exit code '1': failure`).
- /login → 200, HSTS/Permissions-Policy headers, /actuator/health → 404.
- (The `/actuator/health → 404` assertion will likely fail next — Caddy's `respond @actuator 404` snippet isn't taking precedence over the `handle` blocks. We saw `/actuator/health → 302` from a direct curl. Separate issue, separate fix.)

🤖 Generated with Claude Code
Tobias Wendt — DevOps & Platform Engineer
Verdict: Approved
This is the textbook fix for the `up -d --wait` + one-shot pattern. Both halves are required and both are present:
- `create-buckets.restart: "no"` — declares one-shot intent. Without this, compose's reconciler keeps trying to "bring it up", because `unless-stopped` (the project default would have been `unless-stopped` via individual services, though `create-buckets` had none) plus a clean exit looked like "stopped, restart me."
- `backend.depends_on.create-buckets.condition: service_completed_successfully` — tells `--wait` what "ready" means for a one-shot. Compose treats `exited(0)` as the success terminal state for this dependency.
What I like
- The `# See #510.` cross-reference plus the latent-race explanation in `backend.depends_on` is the kind of artifact a future on-caller will actually thank you for.
- `docker compose ... config --quiet` parses and the resolved config shows both keys. That is the right local verification for a Compose change.
Suggestions (non-blocking)
- `service_completed_successfully` requires Docker Compose v2.20+ (the canonical change is from 2023). The runner is well within that. No action — just confirming the constraint is satisfied.
- The `respond @actuator 404` Caddy snippet not taking precedence — file that as a separate issue immediately so it doesn't get lost. Out of scope for this PR.
What this does not change
`create-buckets` still runs on every `up -d`. That's correct — `bootstrap.sh` is idempotent and the cost is one MinIO admin call.
Ship it. This unblocks the staging gate.
Markus Keller — Application Architect
Verdict: Approved
This is the right fix at the right layer. The startup ordering between MinIO bootstrap (a one-shot job) and the backend application is a topology concern, and topology concerns belong in the Compose file — not in application retry logic or shell loops around `docker compose up`.
What I like
- No wrapper script or retry loop around `up --wait`. The fix is two declarative keys in YAML that compose has supported since v2.20. That is the architectural ideal: solve the problem with the platform's primitives.
- The dependency graph used to say `backend → {db, minio, ocr-service}`, but the backend actually depends on `{db, minio, ocr-service, archiv-app-policy-bound}`. The graph lied. Now it doesn't. Mis-stated dependencies become real bugs at the worst possible moment (cold start after `docker compose down -v`).
- No retry logic added to `S3Config` or `MinioClientFactory`. That would push infrastructure ordering into application code — exactly the kind of cross-layer accidental complexity I push back on. You kept it where it belongs.
Architectural observations (non-blocking)
- `create-buckets` is now structurally a job step, not a service. Compose's "service" abstraction is overloaded here — `restart: "no"` + `service_completed_successfully` is compose's way of saying "this is a job." Future readers will understand it; no ADR needed.
- The resulting startup graph is `db, minio, ocr-service, create-buckets → backend → frontend`. Linear, no cycles, no fan-out surprises. Good.
- One could ask whether `frontend` should also gate on `create-buckets` — it shouldn't, because `frontend → backend → create-buckets` transitively gives the same guarantee. Don't add a redundant edge.
Documentation
Per my doc-update table: "New Docker service or infrastructure component" maps to `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`. This PR does not add a service — it tightens an existing dependency. No diagram update required. The PR body and the inline `# See #510` comment are sufficient memory.
If `create-buckets` is not already represented in `l2-containers.puml`, that is a pre-existing gap (predates this PR) — file it as a small infra-docs ticket separately.
Ship it.
Felix Brandt — Senior Fullstack Developer
Verdict: Approved with a small concern
Clean, minimal, surgical. Ten added lines, zero deleted, one file. The change matches the issue exactly and the inline comments document why, not what — which is the rule.
What I like
- The inline comments (on `create-buckets`, lines 167-170 on `backend.depends_on`) explain the why: what `--wait` does, why a one-shot needs explicit declaration, what race the dependency closes. No "what does this YAML key do" noise. This is the standard I want for every config comment.
- The cross-reference to #510. Future me, six months from now, traces the YAML straight to the issue thread that explains it.
Concern (process, not code)
The TDD discipline I usually push for says: write the failing test first, then the fix. There is no automated regression test in this PR proving that `docker compose up -d --wait` exits 0 with `create-buckets` as a dependency. The manual verification listed in the PR body (`docker compose ... config --quiet` + resolved-config inspection) verifies that compose parses the change — not that `--wait` actually treats `exited(0)` as success.
That regression test would naturally live in CI as: spin up the prod-compose stack, assert `docker compose ... up -d --wait` exits 0, assert all services are running or `Exited (0)`. The closest existing CI step is the nightly workflow itself — which is exactly where the bug surfaced.
Concrete proposal (non-blocking, separate follow-up issue): add a `compose-prod-smoke` job to a workflow that runs `docker compose -f docker-compose.prod.yml ... up -d --wait` against a throw-away stack, asserts exit 0, asserts `create-buckets` is in the `Exited (0)` state, then tears down. That job catches future regressions of this exact class — e.g. someone removing `restart: "no"` "because the linter doesn't like it."
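A minimal sketch of what that job could look like, assuming Gitea Actions syntax; the workflow path, the `.env.smoke` file, and the exact compose flags are illustrative, not the project's actual setup:

```yaml
# Hypothetical .gitea/workflows/compose-prod-smoke.yml — illustrative only.
name: compose-prod-smoke
on:
  pull_request:
    paths:
      - "docker-compose.prod.yml"

jobs:
  compose-prod-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes a .env.smoke file with the required MINIO_* values exists;
      # secret plumbing is omitted here.
      - name: Bring the prod stack up and require --wait to succeed
        run: docker compose -f docker-compose.prod.yml --env-file .env.smoke up -d --wait
      - name: Assert create-buckets finished as a one-shot
        run: |
          # The one-shot must be stopped with exit code 0, not restarting.
          docker compose -f docker-compose.prod.yml --env-file .env.smoke ps -a create-buckets \
            | grep -iE 'exited \(0\)'
      - name: Tear down
        if: always()
        run: docker compose -f docker-compose.prod.yml --env-file .env.smoke down -v
```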
This is not a blocker for #510 — the manual verification plus the upcoming nightly run is enough proof for this PR. Just please file the smoke-job follow-up so the next time `--wait` semantics shift, CI catches it before staging does.
One pedantic nit (truly minor)
The PR body's checklist includes a spoiler about `/actuator/health → 302`. Worth opening that issue now, before this PR merges, so the link is in place when the post-merge test plan runs. Otherwise it lives only in this PR's history.
Ship it. The fix itself is correct.
Nora "NullX" Steiner — Application Security Engineer
Verdict: Approved
This is a deploy-orchestration fix. It does not touch the application security surface (no auth changes, no permission rules, no exposed ports, no env-var disclosure, no secrets handling). I checked anyway because "orchestration changes" is exactly the category that quietly introduces security regressions.
What I verified
- The diff only touches `restart:` and `depends_on:`. No `ports:` additions. Backend still bound to `127.0.0.1:${PORT_BACKEND}:8080` — Caddy continues to front external traffic. Good.
- The `MINIO_PASSWORD` and `MINIO_APP_PASSWORD` env vars on `create-buckets` are unchanged. The bucket-bootstrap still creates the least-privilege `archiv-app` service account (per `infra/minio/bootstrap.sh`) and the backend continues to use `S3_ACCESS_KEY: archiv-app` (the scoped account), not MinIO root. The principle "MinIO root credentials are not application credentials" is preserved.
- `expose`/`ports` blocks unchanged.
- The dependency chain is `db, minio, ocr-service, create-buckets → backend → frontend`. Linear DAG. No cycle.
- `--wait` will either succeed or fail-closed (workflow exit 1, deploy aborted) — that is the correct security posture.
Subtle positive: the race this closes is also a security-relevant invariant
The PR body notes the backend currently starts in parallel with `create-buckets` and could race the `archiv-app` policy bind. From a security perspective: that policy bind is part of the least-privilege setup, and making it a hard ordering prerequisite means the backend can no longer start against MinIO before the scoped policy exists.
Suggestions (non-blocking)
- Consider a CI lint/check for any `depends_on` entry pointing at a service with `restart: "no"` but missing `condition: service_completed_successfully`. That codifies the lesson from #510 so it can never silently regress (a narrower version is sketched below).
Ship it. Nothing to fix here from a security angle.
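For concreteness, a narrower version of that guard, pinned to the exact pair from #510 (a sketch of a CI step, assuming mikefarah's `yq` v4 is available on the runner; step name and placement are illustrative):

```yaml
# Hypothetical CI step: fail fast if the #510 one-shot declaration ever regresses.
- name: Guard the create-buckets one-shot contract
  run: |
    test "$(yq '.services."create-buckets".restart' docker-compose.prod.yml)" = "no"
    test "$(yq '.services.backend.depends_on."create-buckets".condition' docker-compose.prod.yml)" \
      = "service_completed_successfully"
```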
Sara Holt — QA Engineer & Test Strategist
Verdict: Approved with concerns
The fix itself is correct and minimal. My concern is entirely about regression-test coverage — there is none in this PR for the exact failure mode that triggered #510. That is fixable as a follow-up, not a blocker, but I want it on record.
What's good
- The manual verification (`config --quiet` parse + resolved-config inspection of both keys) is appropriate for a compose-only diff. It proves compose accepts the change.
- The bug was caught by the workflow's `up -d --wait` step — i.e. CI infrastructure caught a production-orchestration bug before it reached the running app. The pipeline did its job. That is a healthy gate working as designed.
Concern: missing automated regression test
The class of bug here is: "compose orchestration semantics changed and `up --wait` no longer treats this stack as healthy." That class is not currently covered by any test in CI: the existing smoke and E2E steps only run after `up --wait` has succeeded — by definition they cannot detect a `--wait` regression because they never run if `--wait` fails.
Concrete test-pyramid recommendation (separate follow-up, not a blocker): the prod-compose `up --wait` smoke job described above.
This sits at the "infrastructure smoke" layer of the pyramid — below E2E, above unit. Runs in ~2 minutes against a clean stack. Catches:
- someone removing `restart: "no"` from `create-buckets`
- someone removing the `service_completed_successfully` condition
- compose changing `--wait` semantics again
The job needs prod env vars (`.env.production` is gitignored — pass `MINIO_PASSWORD` etc. via Gitea secrets to a `.env.smoke` overlay). Standard pattern.
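One way that could look, as a sketch of a step to drop into the smoke job's `steps:` list (the `.env.smoke` name follows the note above; the step structure and secret names are assumptions, not the project's actual configuration):

```yaml
# Hypothetical step in the smoke job: materialise a throw-away env file from Gitea
# secrets, since the real .env.production is gitignored and never reaches the runner.
- name: Write .env.smoke from secrets
  run: |
    {
      echo "MINIO_PASSWORD=${{ secrets.MINIO_PASSWORD }}"
      echo "MINIO_APP_PASSWORD=${{ secrets.MINIO_APP_PASSWORD }}"
      # add whatever else docker-compose.prod.yml requires, e.g. PORT_BACKEND
    } > .env.smoke
```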
Recommendation
File one follow-up issue — "add a docker compose prod `up --wait` smoke job to CI" — labelled `devops`, sized S. Title format: `infra(ci): add docker compose prod up --wait smoke job`.
Things I checked and am satisfied with
- `docker-compose.ci.yml` is unchanged, so CI E2E behavior is unaffected by this PR.
Approved with the follow-up issue as a soft requirement.
Leonie Voss — UX Design Lead
Verdict: Approved — out of scope for UX
This PR is a Docker Compose orchestration fix. It does not touch any `.svelte` component, design token, Tailwind class, ARIA attribute, or `messages/{de,en,es}.json` i18n string. There is nothing for me to evaluate from a UI/UX or accessibility angle.
I checked the diff to confirm scope: the change is two YAML edits in `docker-compose.prod.yml`. No frontend impact.
Indirect UX observation (positive)
The reason this fix matters from a user-experience angle is that it unblocks the staging deployment gate. A reliable deploy pipeline is what lets the team ship UX improvements quickly. So while there is nothing visual in the PR itself, the fix enables faster delivery of the things that do affect users.
LGTM. Defer to Tobias, Markus, and Sara for the substantive review.
"Elicit" — Requirements Engineer (Brownfield mode)
Verdict: Approved with minor process notes
This is a brownfield bug fix discovered in flight during staging. The requirements artifact is the linked Gitea issue (#510), and the PR body itself reads as an excellent specification: problem statement, root cause, fix, verification done, test plan to verify after merge. That is the standard I would set for every infrastructure-fix PR.
What's well-specified
- "`docker compose up -d --wait` was exiting 1 even with every service healthy." This is a falsifiable, observable claim — not "deploys feel flaky."
- The root cause is pinned to the semantics of `--wait`. The PR identifies why the bug exists, not just what symptom it caused.
- The post-merge test plan is concrete: run `docker compose -f docker-compose.prod.yml ... up -d --wait`, `create-buckets` exits with code 0, `up -d --wait` exits 0, /login → 200, HSTS/Permissions-Policy headers are present.
- The known `/actuator/health → 302` issue is called out as a separate concern. That is exactly how scope creep is prevented.
Minor process observations (non-blocking)
- The PR body already flags that the `/actuator/health → 404` assertion "will likely fail next." That is a known future bug. Per project conventions (specs go into Gitea issues, not markdown files), please file the Caddy actuator-respond issue before merging so the link can be inserted into the PR body. Otherwise that knowledge lives only in this PR's text and may be forgotten.
- #510 is framed as "`up --wait` mis-classifies one-shot exit(0) as failure", but the test plan has no explicit check that `create-buckets` does not enter a `restart` loop afterwards. Compose's `--wait` historically had edge cases where a one-shot that re-runs would be flagged. Worth eyeballing `docker compose ps` after the post-merge nightly run.
Traceability
The fix traces back to a single implicit requirement: "I need `up -d --wait` to exit 0, so I can let CI gate the deploy without false negatives."
Ship. Two soft asks: (a) file the actuator follow-up issue before merge, (b) add the `restart`-loop post-deploy check to the test plan checklist.