devops(ci): backend test build fails at JVM shutdown — forkedProcessExitTimeoutInSeconds (default 30s) exceeded #848

Closed
opened 2026-06-15 16:35:55 +02:00 by marcel · 2 comments
Owner

Symptom

CI Run backend tests (./mvnw clean verify) goes red despite every test passing:

Tests run: 2327, Failures: 0, Errors: 0, Skipped: 0
[ERROR] Surefire is going to kill self fork JVM. The exit has elapsed 30 seconds after System.exit(0).
[ERROR] ... There was a timeout in the fork -> BUILD FAILURE
Total time:  12:35 min

This is not a test failure. After the suite finishes, Surefire calls System.exit(0); the forked JVM does not terminate within Surefire's post-exit grace period and Surefire force-kills the fork, reporting it as a fork timeout.

Root cause

The grace period is maven-surefire-plugin's forkedProcessExitTimeoutInSeconds, which defaults to 30s and is not set in backend/pom.xml. The whole suite runs in one reused fork (forkCount=1, reuseForks=true), so at JVM exit Spring's TestContext cache shutdown hook closes every cached context (default cap 32) at once. The expensive part:

  • 52 test classes import PostgresContainerConfig, whose PostgreSQLContainer is a non-static, per-context @Bean (backend/src/test/java/org/raddatz/familienarchiv/PostgresContainerConfig.java). With @ServiceConnection, each distinct context configuration starts and must stop its own Docker postgres container. Stopping many containers + closing many HikariCP pools simultaneously at shutdown overruns 30s.
  • The stack traces in the log are symptoms of that contended teardown, not separate bugs:
    • Spring Session JDBC cleanUpExpiredSessions fires mid-shutdown and is Interrupted during connection acquisition as its pool closes (CannotCreateTransactionException).
    • The Testcontainers reaper throws NotFoundException: No such container because the container is already gone.

Why earlier timeout bumps didn't catch this

History shows forkedProcessTimeoutInSeconds was added (120) then raised to 600 ("suite takes ~4 min"). That is the total/inactivity fork timeout — the suite ran in 12:35, well inside it. This failure is the different forkedProcessExitTimeoutInSeconds knob (the 30s post-exit grace), which has never been configured.

It is timing/threshold-dependent (rides on Docker container-stop latency on the runner), so it presents as intermittent — "all green, build red."

Plan

Phase B — immediate unblock (this PR):

  • Set forkedProcessExitTimeoutInSeconds (e.g. 120) in the Surefire <configuration> in backend/pom.xml. This is the exact knob for this failure; it gives legitimate shutdown enough headroom.

Phase A — durable root-cause reduction (follow-up PR):

  • Convert PostgresContainerConfig to a single shared static Postgres container started once for the whole run (one container instead of dozens) — collapses shutdown cost and speeds startup.
  • Disable the Spring Session JDBC cleanup scheduler in tests: spring.session.jdbc.cleanup-cron: "-" in application-test.yaml — removes the task contending on the closing pool and its noisy traces.

Acceptance

  • The CI backend job passes with Tests run: …, Failures: 0, Errors: 0 and BUILD SUCCESS.
  • No Surefire is going to kill self fork JVM message in the log.
## Symptom CI `Run backend tests` (`./mvnw clean verify`) goes **red despite every test passing**: ``` Tests run: 2327, Failures: 0, Errors: 0, Skipped: 0 [ERROR] Surefire is going to kill self fork JVM. The exit has elapsed 30 seconds after System.exit(0). [ERROR] ... There was a timeout in the fork -> BUILD FAILURE Total time: 12:35 min ``` This is **not a test failure**. After the suite finishes, Surefire calls `System.exit(0)`; the forked JVM does not terminate within Surefire's post-exit grace period and Surefire force-kills the fork, reporting it as a fork timeout. ## Root cause The grace period is `maven-surefire-plugin`'s **`forkedProcessExitTimeoutInSeconds`**, which **defaults to 30s and is not set** in `backend/pom.xml`. The whole suite runs in **one reused fork** (`forkCount=1`, `reuseForks=true`), so at JVM exit Spring's TestContext cache shutdown hook closes every cached context (default cap 32) at once. The expensive part: - **52 test classes import `PostgresContainerConfig`**, whose `PostgreSQLContainer` is a **non-static, per-context `@Bean`** (`backend/src/test/java/org/raddatz/familienarchiv/PostgresContainerConfig.java`). With `@ServiceConnection`, each *distinct* context configuration starts and must stop its own Docker postgres container. Stopping many containers + closing many HikariCP pools simultaneously at shutdown overruns 30s. - The stack traces in the log are *symptoms* of that contended teardown, not separate bugs: - Spring Session JDBC `cleanUpExpiredSessions` fires mid-shutdown and is `Interrupted during connection acquisition` as its pool closes (`CannotCreateTransactionException`). - The Testcontainers reaper throws `NotFoundException: No such container` because the container is already gone. ### Why earlier timeout bumps didn't catch this History shows `forkedProcessTimeoutInSeconds` was added (120) then raised to 600 ("suite takes ~4 min"). That is the *total/inactivity* fork timeout — the suite ran in 12:35, well inside it. This failure is the **different** `forkedProcessExitTimeoutInSeconds` knob (the 30s *post-exit* grace), which has never been configured. It is timing/threshold-dependent (rides on Docker container-stop latency on the runner), so it presents as **intermittent** — "all green, build red." ## Plan **Phase B — immediate unblock (this PR):** - Set `forkedProcessExitTimeoutInSeconds` (e.g. `120`) in the Surefire `<configuration>` in `backend/pom.xml`. This is the exact knob for this failure; it gives legitimate shutdown enough headroom. **Phase A — durable root-cause reduction (follow-up PR):** - Convert `PostgresContainerConfig` to a single **shared static** Postgres container started once for the whole run (one container instead of dozens) — collapses shutdown cost and speeds startup. - Disable the Spring Session JDBC cleanup scheduler in tests: `spring.session.jdbc.cleanup-cron: "-"` in `application-test.yaml` — removes the task contending on the closing pool and its noisy traces. ## Acceptance - The CI backend job passes with `Tests run: …, Failures: 0, Errors: 0` **and** `BUILD SUCCESS`. - No `Surefire is going to kill self fork JVM` message in the log.
marcel added the P1-highdevops labels 2026-06-15 16:36:04 +02:00
Author
Owner

Root cause corrected after investigation

The original "slow shutdown → raise the Surefire timeout" theory was wrong. Raising forkedProcessExitTimeoutInSeconds 30→120 only pushed the kill ~90s later (total 12:35 → 14:04) — an indefinite hang, not slowness.

Actual cause: Testcontainers teardown with Ryuk disabled.

  • The backend job set TESTCONTAINERS_RYUK_DISABLED: "true" (carry-over from the old NAS runner).
  • With Ryuk off, the in-JVM JVMHookResourceReaper removes containers at shutdown; it crashes with NotFoundException and leaks containers run-over-run.
  • ~30 per-context Postgres containers (PostgresContainerConfig is a per-context @Bean) are torn down in-JVM at shutdown. As leaks pile up, teardown degrades until the fork hangs → There was a timeout in the fork (all tests still pass).

Evidence

  • CI now runs on a root server (ssh root@raddatz.cloud): Docker 29.4.3, 8 CPU, 62 GB — not slow.
  • The server had 21 orphaned postgres:16-alpine / minio test containers up to 5 weeks old (matched by org.testcontainers=true). This matches the maintainer's recurring experience: manually killing all testcontainers restores CI for a while, then it degrades again.
  • Failing run #2319: last test at 15:25:45, then 38s of silence, then fork timeout — no clean shutdown logs.

Fix (PR #849)

  1. Re-enable Ryuk → reaps each run's containers out-of-process after the JVM exits, so they never accumulate.
  2. Keep forkedProcessExitTimeoutInSeconds=120 as a backstop.
  3. Removed the 21 leaked containers from the server for immediate relief.

The earlier "Phase A" singleton-container idea is held as the fallback if Ryuk can't run in the runner's docker-outside-docker setup.

## Root cause corrected after investigation The original "slow shutdown → raise the Surefire timeout" theory was **wrong**. Raising `forkedProcessExitTimeoutInSeconds` 30→120 only pushed the kill ~90s later (total 12:35 → 14:04) — an *indefinite hang*, not slowness. **Actual cause: Testcontainers teardown with Ryuk disabled.** - The backend job set `TESTCONTAINERS_RYUK_DISABLED: "true"` (carry-over from the old NAS runner). - With Ryuk off, the **in-JVM `JVMHookResourceReaper`** removes containers at shutdown; it crashes with `NotFoundException` and leaks containers run-over-run. - ~30 per-context Postgres containers (`PostgresContainerConfig` is a per-context `@Bean`) are torn down in-JVM at shutdown. As leaks pile up, teardown degrades until the fork hangs → `There was a timeout in the fork` (all tests still pass). **Evidence** - CI now runs on a root server (`ssh root@raddatz.cloud`): Docker **29.4.3**, 8 CPU, 62 GB — *not* slow. - The server had **21 orphaned `postgres:16-alpine` / `minio` test containers up to 5 weeks old** (matched by `org.testcontainers=true`). This matches the maintainer's recurring experience: manually killing all testcontainers restores CI for a while, then it degrades again. - Failing run #2319: last test at 15:25:45, then 38s of silence, then fork timeout — no clean shutdown logs. **Fix (PR #849)** 1. Re-enable Ryuk → reaps each run's containers out-of-process after the JVM exits, so they never accumulate. 2. Keep `forkedProcessExitTimeoutInSeconds=120` as a backstop. 3. Removed the 21 leaked containers from the server for immediate relief. The earlier "Phase A" singleton-container idea is held as the fallback if Ryuk can't run in the runner's docker-outside-docker setup.
Author
Owner

Validated on the real runner

CI run #2321 (commit 0f9e8c75), Backend Unit Tests:

  • BUILD SUCCESS, Tests run: 2325, Failures: 0, Errors: 0
  • Total time 09:45 (down from the failing 14:04); shutdown completed cleanly — no timeout in the fork, no kill self fork JVM.
  • The Ryuk has been disabled warning is gone.
  • Server check after the run: 0 org.testcontainers=true containers — Ryuk reaped everything; nothing leaked.

The leak-accumulation cycle is broken. PR #849 is ready to merge.

(Unrelated: the same run's frontend "Unit & Component Tests" job failed independently — not caused or fixed by this change.)

## ✅ Validated on the real runner CI run #2321 (commit `0f9e8c75`), **Backend Unit Tests**: - `BUILD SUCCESS`, `Tests run: 2325, Failures: 0, Errors: 0` - Total time **09:45** (down from the failing 14:04); shutdown completed cleanly — no `timeout in the fork`, no `kill self fork JVM`. - The `Ryuk has been disabled` warning is gone. - **Server check after the run: 0 `org.testcontainers=true` containers** — Ryuk reaped everything; nothing leaked. The leak-accumulation cycle is broken. PR #849 is ready to merge. (Unrelated: the same run's frontend "Unit & Component Tests" job failed independently — not caused or fixed by this change.)
Sign in to join this conversation.
No Label P1-high devops
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#848