fix(ci): re-enable Testcontainers Ryuk to stop the backend fork shutdown hang (#848) #849

Merged

marcel merged 2 commits from devops/issue-848-fork-exit-timeout into main

2026-06-15 20:53:59 +02:00


Author

SHA1

Message

Date

Marcel

0f9e8c75cc

fix(ci): re-enable Testcontainers Ryuk to stop the shutdown hang

Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m22s
CI / OCR Service Tests (pull_request) Successful in 54s
CI / Backend Unit Tests (pull_request) Successful in 10m55s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 24s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m9s
SDD Gate / RTM Check (pull_request) Successful in 16s
SDD Gate / Contract Validate (pull_request) Successful in 23s
SDD Gate / Constitution Impact (pull_request) Successful in 17s

The backend job set TESTCONTAINERS_RYUK_DISABLED=true, a carry-over from the
old NAS runner. With Ryuk off, Testcontainers tears down containers via the
in-JVM JVMHookResourceReaper at shutdown; that reaper crashes with a
NotFoundException and leaks containers that accumulate from run to run. As
leaked postgres:16-alpine containers pile up on the runner, the per-run
teardown of ~30 per-context containers slows down until the fork hangs at JVM
shutdown and Surefire reports "There was a timeout in the fork", even though
all tests pass. (The server had 21 such leaked containers, up to 5 weeks old;
killing them manually was what restored CI before.)

CI now runs on a root server with modern Docker (29.4.3, socket access), so the
original reason to disable Ryuk no longer applies. Re-enabling it reaps each
run's containers out-of-process after the JVM exits, so they never accumulate.

Also drops the stale "NAS runner" comment on DOCKER_API_VERSION.

Fixes #848.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 20:25:58 +02:00

Marcel

2ad9d23ad9

fix(ci): give the backend test fork 120s to shut down

Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 4m56s
CI / OCR Service Tests (pull_request) Successful in 46s
CI / Backend Unit Tests (pull_request) Failing after 16m17s
CI / fail2ban Regex (pull_request) Failing after 2m5s
CI / Semgrep Security Scan (pull_request) Successful in 43s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m7s
SDD Gate / RTM Check (pull_request) Successful in 35s
SDD Gate / Contract Validate (pull_request) Successful in 49s
SDD Gate / Constitution Impact (pull_request) Successful in 32s

All 2327 tests pass, but the build went red: after the suite finishes,
Surefire calls System.exit(0) and the single reused fork then closes ~32
cached Spring contexts at once, each tearing down a Testcontainers Postgres
container and a HikariCP pool, which overruns Surefire's 30s default post-exit
grace period. Surefire force-kills the fork and reports a fork timeout (BUILD
FAILURE with 0 failures). The session-cleanup InterruptedException and the
Testcontainers reaper NotFoundException in the log are symptoms of that
contended teardown.

Set the previously-unset forkedProcessExitTimeoutInSeconds to 120s. This is a
different knob from forkedProcessTimeoutInSeconds (total/inactivity), already
600s, which is why the earlier ceiling bumps never addressed this failure.
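
For orientation, the two Surefire timeouts described above might sit side by side roughly as follows. This is a sketch, not the actual diff: the parameter names and the 600s and 120s values come from the commit message, while the surrounding plugin block and its placement in the backend pom.xml are assumptions.

```xml
<!-- Sketch only: the plugin block layout and its location in pom.xml are assumed. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Limit on the forked JVM's run as a whole; already 600s before this commit. -->
    <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
    <!-- Grace period for the fork to finish shutting down after exit is requested.
         Surefire's default is 30s; this commit sets it to 120s. -->
    <forkedProcessExitTimeoutInSeconds>120</forkedProcessExitTimeoutInSeconds>
  </configuration>
</plugin>
```

Raising only the first value leaves the shutdown grace period at its default, which matches the observation that the earlier ceiling bumps had no effect on this failure.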

Phase B of #848; the durable fix (singleton Testcontainers Postgres +
disabling the Spring Session JDBC cleanup scheduler in tests) follows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:38:31 +02:00
