Compare commits


2 Commits

Author SHA1 Message Date
Marcel
0f9e8c75cc fix(ci): re-enable Testcontainers Ryuk to stop the shutdown hang
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m22s
CI / OCR Service Tests (pull_request) Successful in 54s
CI / Backend Unit Tests (pull_request) Successful in 10m55s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 24s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m9s
SDD Gate / RTM Check (pull_request) Successful in 16s
SDD Gate / Contract Validate (pull_request) Successful in 23s
SDD Gate / Constitution Impact (pull_request) Successful in 17s
The backend job set TESTCONTAINERS_RYUK_DISABLED=true, a carry-over from the
old NAS runner. With Ryuk off, Testcontainers tears down containers via the
in-JVM JVMHookResourceReaper at shutdown; that reaper crashes (NotFoundException)
and leaks containers run-over-run. As leaked postgres:16-alpine containers pile
up on the runner, the per-run teardown of ~30 per-context containers degrades
until the fork hangs at JVM shutdown and Surefire reports "There was a timeout
in the fork" — even though all tests pass. (The server had 21 such leaks, up to
5 weeks old; manually killing them was what restored CI before.)

CI now runs on a root server with modern Docker (29.4.3, socket access), so the
original reason to disable Ryuk no longer applies. Re-enabling it reaps each
run's containers out-of-process after the JVM exits, so they never accumulate.

Also drops the stale "NAS runner" comment on DOCKER_API_VERSION.

Fixes #848.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:25:58 +02:00
Marcel
2ad9d23ad9 fix(ci): give the backend test fork 120s to shut down
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 4m56s
CI / OCR Service Tests (pull_request) Successful in 46s
CI / Backend Unit Tests (pull_request) Failing after 16m17s
CI / fail2ban Regex (pull_request) Failing after 2m5s
CI / Semgrep Security Scan (pull_request) Successful in 43s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m7s
SDD Gate / RTM Check (pull_request) Successful in 35s
SDD Gate / Contract Validate (pull_request) Successful in 49s
SDD Gate / Constitution Impact (pull_request) Successful in 32s
All 2327 tests pass, but the build went red: after the suite finishes,
Surefire calls System.exit(0) and the single reused fork then closes ~32
cached Spring contexts at once — each tearing down a Testcontainers Postgres
+ HikariCP pool — which overruns Surefire's 30s default post-exit grace.
Surefire force-kills the fork and reports a fork timeout (BUILD FAILURE with
0 failures). The session-cleanup InterruptedException and Testcontainers
reaper NotFoundException in the log are symptoms of that contended teardown.

Set the previously-unset forkedProcessExitTimeoutInSeconds to 120s. This is a
different knob from forkedProcessTimeoutInSeconds (total/inactivity), already
600s, which is why the earlier ceiling bumps never addressed this failure.
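
The exit grace described above can be modelled with a plain JDK sketch: shutdown work gets a fixed deadline after the tests are done, and is force-stopped if it overruns. This is an illustrative analogy only, not Surefire's implementation; the class name, method name, and the scaled-down millisecond timings are assumptions, and the teardown is serialised here for simplicity.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a post-exit grace period; not Surefire code. */
final class ExitGraceSketch {

    /**
     * Queues one teardown task per context on a single worker, then waits at
     * most graceMillis for all of them. Returns true if teardown finished in
     * time, false if it had to be force-stopped (the "fork timeout" case).
     */
    static boolean shutsDownWithinGrace(int contexts, long teardownMillis, long graceMillis) {
        ExecutorService teardown = Executors.newSingleThreadExecutor();
        for (int i = 0; i < contexts; i++) {
            teardown.submit(() -> {
                try {
                    Thread.sleep(teardownMillis); // stand-in for closing one context
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        teardown.shutdown(); // accept no new work, let queued teardown run
        boolean finished;
        try {
            finished = teardown.awaitTermination(graceMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false;
        }
        if (!finished) {
            teardown.shutdownNow(); // analogous to the fork being killed
        }
        return finished;
    }
}
```

With the same amount of teardown work, a short grace fails and a longer one succeeds, which is why raising the overall timeout did not help: the overall timeout is never the limit that is hit.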

Phase B of #848; the durable fix (singleton Testcontainers Postgres +
disabling the Spring Session JDBC cleanup scheduler in tests) follows.
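
A minimal sketch of the singleton idea behind the planned durable fix, assuming a lazily initialised holder shared by every test context. FakePostgres is a hypothetical stand-in, not the Testcontainers API, and the JDBC URL is a placeholder; the follow-up commit may well be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of sharing one container across all test contexts. */
final class SingletonContainerSketch {

    /** Stand-in for a database container; counts how often one is started. */
    static final class FakePostgres {
        static final AtomicInteger STARTS = new AtomicInteger();

        FakePostgres() {
            STARTS.incrementAndGet(); // a real container would start here
        }

        String jdbcUrl() {
            return "jdbc:postgresql://localhost:5432/test"; // placeholder value
        }
    }

    /** Initialization-on-demand holder: the JVM creates INSTANCE once, on first access. */
    private static final class Holder {
        static final FakePostgres INSTANCE = new FakePostgres();
    }

    /** Every caller, from any context, receives the same instance. */
    static FakePostgres shared() {
        return Holder.INSTANCE;
    }

    private SingletonContainerSketch() {
    }
}
```

With one shared instance there is a single container to tear down at JVM shutdown instead of one per cached context, which is the property the durable fix relies on.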

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:38:31 +02:00
2 changed files with 13 additions and 2 deletions

@@ -229,9 +229,14 @@ jobs:
     name: Backend Unit Tests
     runs-on: ubuntu-latest
     env:
-      DOCKER_API_VERSION: "1.43" # NAS runner runs Docker 24.x (max API 1.43); Testcontainers 2.x defaults to 1.44
+      # CI runs against the root-server Docker daemon (29.x). This API pin is a harmless
+      # carry-over from the old NAS runner (Docker 24.x, max API 1.43); safe to drop later.
+      DOCKER_API_VERSION: "1.43"
       DOCKER_HOST: unix:///var/run/docker.sock
-      TESTCONTAINERS_RYUK_DISABLED: "true"
+      # Ryuk (Testcontainers' out-of-process reaper) is intentionally LEFT ENABLED so it
+      # removes each run's containers after the JVM exits. Disabling it forced the in-JVM
+      # reaper, which hung at JVM shutdown and leaked Postgres containers run-over-run until
+      # the daemon degraded and the fork timed out at teardown — see #848.
     steps:
       - uses: actions/checkout@v4

@@ -369,6 +369,12 @@
     <artifactId>maven-surefire-plugin</artifactId>
     <configuration>
         <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
+        <!-- Grace period after the test JVM calls System.exit(0). The 30s default is too
+             short: the single reused fork closes ~32 cached Spring contexts at shutdown,
+             each tearing down a Testcontainers Postgres + HikariCP pool, which overruns 30s
+             and makes Surefire kill the fork (BUILD FAILURE despite 0 test failures). This is
+             a different knob from forkedProcessTimeoutInSeconds above. See issue #848. -->
+        <forkedProcessExitTimeoutInSeconds>120</forkedProcessExitTimeoutInSeconds>
         <systemPropertyVariables>
             <junit.jupiter.execution.timeout.default>90 s</junit.jupiter.execution.timeout.default>
         </systemPropertyVariables>