devops(ci): backend test build fails at JVM shutdown — forkedProcessExitTimeoutInSeconds (default 30s) exceeded #848
Symptom
The CI step `Run backend tests` (`./mvnw clean verify`) goes red despite every test passing.

This is not a test failure. After the suite finishes, Surefire calls `System.exit(0)`; the forked JVM does not terminate within Surefire's post-exit grace period and Surefire force-kills the fork, reporting it as a fork timeout.
Root cause
The grace period is `maven-surefire-plugin`'s `forkedProcessExitTimeoutInSeconds`, which defaults to 30s and is not set in `backend/pom.xml`. The whole suite runs in one reused fork (`forkCount=1`, `reuseForks=true`), so at JVM exit Spring's TestContext cache shutdown hook closes every cached context (default cap 32) at once. The expensive part is `PostgresContainerConfig`, whose `PostgreSQLContainer` is a non-static, per-context `@Bean` (`backend/src/test/java/org/raddatz/familienarchiv/PostgresContainerConfig.java`). With `@ServiceConnection`, each distinct context configuration starts and must stop its own Docker postgres container. Stopping many containers and closing many HikariCP pools simultaneously at shutdown overruns 30s.

- `cleanUpExpiredSessions` fires mid-shutdown and is `Interrupted during connection acquisition` as its pool closes (`CannotCreateTransactionException`).
- `NotFoundException: No such container` appears because the container is already gone.
Why earlier timeout bumps didn't catch this
History shows `forkedProcessTimeoutInSeconds` was added (120) then raised to 600 ("suite takes ~4 min"). That is the total/inactivity fork timeout; the suite ran in 12:35, well inside it. This failure is the different `forkedProcessExitTimeoutInSeconds` knob (the 30s post-exit grace), which has never been configured.

It is timing/threshold-dependent (it rides on Docker container-stop latency on the runner), so it presents as intermittent: "all green, build red."
Plan
Phase B, immediate unblock (this PR):
- Set `forkedProcessExitTimeoutInSeconds` (e.g. `120`) in the Surefire `<configuration>` in `backend/pom.xml`. This is the exact knob for this failure; it gives legitimate shutdown enough headroom.

Phase A, durable root-cause reduction (follow-up PR):
- Convert `PostgresContainerConfig` to a single shared static Postgres container started once for the whole run (one container instead of dozens). This collapses shutdown cost and speeds startup.
- Set `spring.session.jdbc.cleanup-cron: "-"` in `application-test.yaml`. This removes the task contending on the closing pool and its noisy traces.
Acceptance
- `Tests run: …, Failures: 0, Errors: 0` and `BUILD SUCCESS`.
- No `Surefire is going to kill self fork JVM` message in the log.

marcel referenced this issue 2026-06-15 16:39:20 +02:00
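The acceptance criteria above can be checked mechanically against a captured build log. The following is a minimal sketch, not part of the project: the class name `AcceptanceCheck`, the method `accepted`, and the sample log strings are hypothetical, and it assumes the whole Maven log is available as one string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: checks a captured Maven log against the acceptance criteria. */
final class AcceptanceCheck {

    // Surefire summary lines look like "Tests run: 10, Failures: 0, Errors: 0, Skipped: 0".
    private static final Pattern SUMMARY =
            Pattern.compile("Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+)");

    // Message quoted in this issue for the force-kill case.
    private static final String KILL_MARKER = "Surefire is going to kill self fork JVM";

    static boolean accepted(String log) {
        boolean sawCleanSummary = false;
        Matcher m = SUMMARY.matcher(log);
        while (m.find()) {
            // Per-class result lines match the same pattern, so every match must be clean.
            if (!m.group(2).equals("0") || !m.group(3).equals("0")) {
                return false;
            }
            sawCleanSummary = true;
        }
        return sawCleanSummary
                && log.contains("BUILD SUCCESS")
                && !log.contains(KILL_MARKER);
    }

    public static void main(String[] args) {
        // Made-up sample logs, only to exercise the check.
        String green = "Tests run: 3, Failures: 0, Errors: 0, Skipped: 0\nBUILD SUCCESS\n";
        String killed = "Tests run: 3, Failures: 0, Errors: 0, Skipped: 0\n"
                + KILL_MARKER + "\nBUILD FAILURE\n";
        System.out.println(accepted(green));  // true
        System.out.println(accepted(killed)); // false
    }
}
```

The second sample mirrors the "all green, build red" case: the test summary is clean, yet the log is rejected because of the kill message and the missing `BUILD SUCCESS`.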
Root cause corrected after investigation
The original "slow shutdown → raise the Surefire timeout" theory was wrong. Raising `forkedProcessExitTimeoutInSeconds` 30→120 only pushed the kill ~90s later (total 12:35 → 14:04): the fork was still force-killed at the new limit, which points to an indefinite hang, not slowness.

Actual cause: Testcontainers teardown with Ryuk disabled.
- CI sets `TESTCONTAINERS_RYUK_DISABLED: "true"` (carry-over from the old NAS runner).
- With Ryuk off, `JVMHookResourceReaper` removes containers at shutdown; it crashes with `NotFoundException` and leaks containers run-over-run.
- The per-context containers (`PostgresContainerConfig` is a per-context `@Bean`) are torn down in-JVM at shutdown. As leaks pile up, teardown degrades until the fork hangs → `There was a timeout in the fork` (all tests still pass).
Evidence
- Runner host (`ssh root@raddatz.cloud`): Docker 29.4.3, 8 CPU, 62 GB; not slow.
- Leaked `postgres:16-alpine` / `minio` test containers up to 5 weeks old (matched by `org.testcontainers=true`). This matches the maintainer's recurring experience: manually killing all testcontainers restores CI for a while, then it degrades again.
Fix (PR #849)
- Re-enable Ryuk so that leftover containers are reaped.
- Keep `forkedProcessExitTimeoutInSeconds=120` as a backstop.

The earlier "Phase A" singleton-container idea is held as the fallback if Ryuk can't run in the runner's docker-outside-docker setup.
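To make the singleton-container fallback concrete, the sketch below models only the lifecycle difference between a per-context container and one shared static container. It deliberately uses a stand-in class instead of the Testcontainers API: `FakeContainer`, `PerContextConfig`, `SharedConfig`, and `SharedContainerSketch` are all hypothetical names, and wiring a real `PostgreSQLContainer` to `@ServiceConnection` is not shown.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical model: counts how many distinct containers N test contexts end up with. */
final class SharedContainerSketch {

    static int containersWithPerContextBean(int contexts) {
        Set<FakeContainer> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (int i = 0; i < contexts; i++) {
            seen.add(new PerContextConfig().container()); // a new container per context
        }
        return seen.size();
    }

    static int containersWithSharedStatic(int contexts) {
        Set<FakeContainer> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (int i = 0; i < contexts; i++) {
            seen.add(new SharedConfig().container()); // always the same instance
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(containersWithPerContextBean(32)); // 32
        System.out.println(containersWithSharedStatic(32));   // 1
    }
}

/** Stand-in for a Docker-backed container. This is not the Testcontainers API. */
final class FakeContainer { }

/** Per-context style: every configuration instance owns its own container. */
final class PerContextConfig {
    private final FakeContainer container = new FakeContainer();

    FakeContainer container() {
        return container;
    }
}

/** Shared style: one static container, created once when the class is initialized. */
final class SharedConfig {
    private static final FakeContainer CONTAINER = new FakeContainer();

    FakeContainer container() {
        return CONTAINER;
    }
}
```

With the shared style there is a single container to stop at JVM exit, no matter how many contexts are cached, which is the shutdown-cost reduction the Phase A idea aims for.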
✅ Validated on the real runner
CI run #2321 (commit `0f9e8c75`), Backend Unit Tests:
- `BUILD SUCCESS`, `Tests run: 2325, Failures: 0, Errors: 0`.
- No `timeout in the fork`, no `kill self fork JVM`.
- The `Ryuk has been disabled` warning is gone.
- Zero `org.testcontainers=true` containers: Ryuk reaped everything; nothing leaked.

The leak-accumulation cycle is broken. PR #849 is ready to merge.
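The "nothing leaked" check can be repeated after any run by counting containers that carry the `org.testcontainers=true` label. The following is a rough sketch, assuming the `docker` CLI is on the `PATH` of the machine where it runs; the class name `LeakedContainerCheck` and the method `countIds` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper: counts containers labelled org.testcontainers=true. */
final class LeakedContainerCheck {

    // Lists the IDs of all containers, running or stopped, that carry the label.
    static final List<String> COMMAND = List.of(
            "docker", "ps", "--all", "--quiet",
            "--filter", "label=org.testcontainers=true");

    /** Counts container IDs in `docker ps --quiet` output (one ID per non-blank line). */
    static long countIds(String dockerPsOutput) {
        return dockerPsOutput.lines().filter(line -> !line.isBlank()).count();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(COMMAND).redirectErrorStream(true).start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // With stderr merged into stdout, the output holds the error message here.
            System.err.println("docker ps failed: " + output);
            System.exit(exitCode);
        }
        System.out.println("Leftover Testcontainers containers: " + countIds(output));
    }
}
```

A count above zero once no test run is in progress would suggest that the leak described in the root-cause analysis has returned.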
(Unrelated: the same run's frontend "Unit & Component Tests" job failed independently — not caused or fixed by this change.)