fix(ci): re-enable Testcontainers Ryuk to stop the backend fork shutdown hang (#848) #849

Merged
marcel merged 2 commits from devops/issue-848-fork-exit-timeout into main 2026-06-15 20:53:59 +02:00

Fixes #848.

Symptom

CI Backend Unit Tests goes red despite all tests passing: after the last test, the fork hangs at JVM shutdown and Surefire reports "There was a timeout in the fork" → BUILD FAILURE.

Root cause (corrected after investigation)

My first theory (slow shutdown needs a bigger timeout) was wrong — raising forkedProcessExitTimeoutInSeconds 30→120 only delayed the kill by ~90s (total time 12:35 → 14:04), proving an indefinite hang, not slowness.
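The arithmetic behind that conclusion can be checked quickly. The sketch below assumes the two totals quoted above are MM:SS job durations; `to_seconds` is a hypothetical helper name used only for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'MM:SS' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Job durations and timeout values quoted in the description above.
before = to_seconds("12:35")
after = to_seconds("14:04")

run_delta = after - before   # how much longer the job ran
timeout_delta = 120 - 30     # how much the exit timeout was raised

print(run_delta, timeout_delta)
```

The job ran longer by almost exactly the amount the timeout was raised, which is what a fork that never exits on its own would look like; a merely slow shutdown would be expected to finish somewhere inside the larger window.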

The real cause is Testcontainers teardown with Ryuk disabled:

  • The job set TESTCONTAINERS_RYUK_DISABLED: "true" (carry-over from the old NAS runner).
  • With Ryuk off, containers are reaped by the in-JVM JVMHookResourceReaper at shutdown. That reaper crashes (NotFoundException) and leaks containers run-over-run.
  • The run boots ~30 Spring contexts, each with its own Postgres container (PostgresContainerConfig is a per-context @Bean), so ~30 Postgres containers are torn down in-JVM at shutdown.
  • As leaks accumulate on the runner, per-run teardown degrades until the fork hangs at shutdown → fork timeout. The server had 21 orphaned postgres:16-alpine/minio containers up to 5 weeks old; manually killing them is what restored CI before (a recurring pattern).

Environment confirmed via ssh root@raddatz.cloud: CI now runs on a root server with Docker 29.4.3 (8 CPU, 62 GB, socket access) — so the original reason to disable Ryuk no longer applies, and Docker is not slow.

Change

  1. Re-enable Ryuk (remove TESTCONTAINERS_RYUK_DISABLED). Ryuk reaps each run's containers out-of-process after the JVM exits, so they never accumulate. This automates the manual "kill all testcontainers" cleanup.
  2. Keep forkedProcessExitTimeoutInSeconds=120 as a harmless backstop.
  3. Drop the stale "NAS runner" comment on DOCKER_API_VERSION.

Operational: the 21 leaked containers were already removed from the server (by org.testcontainers=true label; real services untouched), giving immediate relief.
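For reference, a label-based cleanup of this kind could look roughly like the sketch below. It is not necessarily the exact command that was run on the server. It assumes the `docker` CLI is on the PATH, and the function names (`list_command`, `remove_command`, `cleanup`) are hypothetical. Only the label `org.testcontainers=true` comes from the description above.

```python
import subprocess

LABEL = "org.testcontainers=true"  # label named in the description above


def list_command(label: str = LABEL) -> list[str]:
    # `docker ps -a -q --filter label=...` prints the IDs of all containers
    # (running or stopped) that carry the label, one per line.
    return ["docker", "ps", "-a", "-q", "--filter", f"label={label}"]


def remove_command(container_ids: list[str]) -> list[str]:
    # `docker rm -f` force-stops and removes the given containers.
    return ["docker", "rm", "-f", *container_ids]


def cleanup(run=subprocess.run, dry_run: bool = True) -> list[str]:
    """Return the labelled container IDs; remove them unless dry_run is set."""
    result = run(list_command(), capture_output=True, text=True, check=True)
    ids = result.stdout.split()
    if ids and not dry_run:
        run(remove_command(ids), check=True)
    return ids


if __name__ == "__main__":
    # Dry run by default: only report what would be removed.
    print(cleanup(dry_run=True))
```

Because the filter matches only the Testcontainers label, unrelated services on the same host are not selected. Running it with `dry_run=False` while a CI job is in progress would also remove that job's containers, so it is only safe on an idle runner.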

Validation

Validated by this PR's CI run on the real runner (watching it). If Ryuk can't start in the runner's docker-outside-docker setup, the integration tests fail fast and I revert — fallback is a singleton Postgres container.

🤖 Generated with Claude Code

marcel added 1 commit 2026-06-15 16:39:21 +02:00
fix(ci): give the backend test fork 120s to shut down
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 4m56s
CI / OCR Service Tests (pull_request) Successful in 46s
CI / Backend Unit Tests (pull_request) Failing after 16m17s
CI / fail2ban Regex (pull_request) Failing after 2m5s
CI / Semgrep Security Scan (pull_request) Successful in 43s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m7s
SDD Gate / RTM Check (pull_request) Successful in 35s
SDD Gate / Contract Validate (pull_request) Successful in 49s
SDD Gate / Constitution Impact (pull_request) Successful in 32s
2ad9d23ad9
All 2327 tests pass, but the build went red: after the suite finishes,
Surefire calls System.exit(0) and the single reused fork then closes ~32
cached Spring contexts at once — each tearing down a Testcontainers Postgres
+ HikariCP pool — which overruns Surefire's 30s default post-exit grace.
Surefire force-kills the fork and reports a fork timeout (BUILD FAILURE with
0 failures). The session-cleanup InterruptedException and Testcontainers
reaper NotFoundException in the log are symptoms of that contended teardown.

Set the previously-unset forkedProcessExitTimeoutInSeconds to 120s. This is a
different knob from forkedProcessTimeoutInSeconds (total/inactivity), already
600s, which is why the earlier ceiling bumps never addressed this failure.

Phase B of #848; the durable fix (singleton Testcontainers Postgres +
disabling the Spring Session JDBC cleanup scheduler in tests) follows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
marcel added 1 commit 2026-06-15 20:26:27 +02:00
fix(ci): re-enable Testcontainers Ryuk to stop the shutdown hang
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m22s
CI / OCR Service Tests (pull_request) Successful in 54s
CI / Backend Unit Tests (pull_request) Successful in 10m55s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 24s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m9s
SDD Gate / RTM Check (pull_request) Successful in 16s
SDD Gate / Contract Validate (pull_request) Successful in 23s
SDD Gate / Constitution Impact (pull_request) Successful in 17s
0f9e8c75cc
The backend job set TESTCONTAINERS_RYUK_DISABLED=true, a carry-over from the
old NAS runner. With Ryuk off, Testcontainers tears down containers via the
in-JVM JVMHookResourceReaper at shutdown; that reaper crashes (NotFoundException)
and leaks containers run-over-run. As leaked postgres:16-alpine containers pile
up on the runner, the per-run teardown of ~30 per-context containers degrades
until the fork hangs at JVM shutdown and Surefire reports "There was a timeout
in the fork" — even though all tests pass. (The server had 21 such leaks, up to
5 weeks old; manually killing them was what restored CI before.)

CI now runs on a root server with modern Docker (29.4.3, socket access), so the
original reason to disable Ryuk no longer applies. Re-enabling it reaps each
run's containers out-of-process after the JVM exits, so they never accumulate.

Also drops the stale "NAS runner" comment on DOCKER_API_VERSION.

Fixes #848.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
marcel changed title from fix(ci): give the backend test fork 120s to shut down (#848) to fix(ci): re-enable Testcontainers Ryuk to stop the backend fork shutdown hang (#848) 2026-06-15 20:26:51 +02:00
marcel merged commit 273a97046a into main 2026-06-15 20:53:59 +02:00

Reference: marcel/familienarchiv#849