From 273a97046a215321219f742062ba36167cdb3a63 Mon Sep 17 00:00:00 2001
From: marcel
Date: Mon, 15 Jun 2026 20:53:58 +0200
Subject: [PATCH] fix(ci): re-enable Testcontainers Ryuk to stop the backend fork shutdown hang (#848) (#849)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes #848.

## Symptom

CI `Backend Unit Tests` goes red despite **all tests passing**: after the last test, the fork hangs at JVM shutdown and Surefire reports `There was a timeout in the fork` → `BUILD FAILURE`.

## Root cause (corrected after investigation)

My first theory (slow shutdown needs a bigger timeout) was **wrong**: raising `forkedProcessExitTimeoutInSeconds` 30→120 only delayed the kill by ~90s (total time 12:35 → 14:04), proving an *indefinite* hang, not slowness.

The real cause is **Testcontainers teardown with Ryuk disabled**:

- The job set `TESTCONTAINERS_RYUK_DISABLED: "true"` (carry-over from the old NAS runner).
- With Ryuk off, containers are reaped by the **in-JVM `JVMHookResourceReaper`** at shutdown. That reaper crashes (`NotFoundException`) and **leaks containers run-over-run**.
- The run boots ~30 per-context Spring contexts (`PostgresContainerConfig` is a per-context `@Bean`), so ~30 Postgres containers are torn down in-JVM at shutdown.
- As leaks accumulate on the runner, per-run teardown degrades until the fork hangs at shutdown → fork timeout.

**The server had 21 orphaned `postgres:16-alpine`/`minio` containers up to 5 weeks old**; manually killing them is what restored CI before (a recurring pattern).

Environment confirmed via `ssh root@raddatz.cloud`: CI now runs on a root server with **Docker 29.4.3** (8 CPU, 62 GB, socket access), so the original reason to disable Ryuk no longer applies, and Docker is *not* slow.

## Change

1. **Re-enable Ryuk** (remove `TESTCONTAINERS_RYUK_DISABLED`): Ryuk reaps each run's containers out-of-process after the JVM exits, so they never accumulate.
   Automates the manual "kill all testcontainers."
2. Keep `forkedProcessExitTimeoutInSeconds=120` as a harmless backstop.
3. Drop the stale "NAS runner" comment on `DOCKER_API_VERSION`.

Operational: the 21 leaked containers were already removed from the server (by `org.testcontainers=true` label; real services untouched), giving immediate relief.

## Validation

Validated by this PR's CI run on the real runner (watching it). If Ryuk can't start in the runner's docker-outside-docker setup, the integration tests fail fast and I revert; the fallback is a singleton Postgres container.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Marcel
Reviewed-on: https://git.raddatz.cloud/marcel/familienarchiv/pulls/849
---
 .gitea/workflows/ci.yml | 9 +++++++--
 backend/pom.xml         | 6 ++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/.gitea/workflows/ci.yml b/.gitea/workflows/ci.yml
index 46e38aa1..75c1efe1 100644
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -229,9 +229,14 @@ jobs:
     name: Backend Unit Tests
     runs-on: ubuntu-latest
     env:
-      DOCKER_API_VERSION: "1.43" # NAS runner runs Docker 24.x (max API 1.43); Testcontainers 2.x defaults to 1.44
+      # CI runs against the root-server Docker daemon (29.x). This API pin is a harmless
+      # carry-over from the old NAS runner (Docker 24.x, max API 1.43); safe to drop later.
+      DOCKER_API_VERSION: "1.43"
       DOCKER_HOST: unix:///var/run/docker.sock
-      TESTCONTAINERS_RYUK_DISABLED: "true"
+      # Ryuk (Testcontainers' out-of-process reaper) is intentionally LEFT ENABLED so it
+      # removes each run's containers after the JVM exits. Disabling it forced the in-JVM
+      # reaper, which hung at JVM shutdown and leaked Postgres containers run-over-run until
+      # the daemon degraded and the fork timed out at teardown — see #848.
     steps:
       - uses: actions/checkout@v4
diff --git a/backend/pom.xml b/backend/pom.xml
index b01f2362..20e34b91 100644
--- a/backend/pom.xml
+++ b/backend/pom.xml
@@ -369,6 +369,12 @@ maven-surefire-plugin 600 + + 120 90 s