When the node_modules cache hits, npm ci is skipped and the prepare
lifecycle (svelte-kit sync) never runs. frontend/tsconfig.json extends
.svelte-kit/tsconfig.json which only exists after svelte-kit sync —
so ESLint fails at tsconfig resolution on every cache-warm run.
Adding an unconditional svelte-kit sync step after Paraglide compile
and before Lint ensures .svelte-kit/tsconfig.json is always present
regardless of cache state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker Compose interpolates all variables in the full file even when
only a subset of services is requested. The backend service uses
IMPORT_HOST_DIR with :? (hard-required), causing the idempotency job
to abort before any container starts. A dummy path satisfies the parser;
the backend service is never started in this job so the path need not exist.
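In plain shell, the `:?` behaviour the parser enforces looks like this (the variable name comes from this message; the dummy path is illustrative):

```shell
unset IMPORT_HOST_DIR
# Plain-shell analogue of compose's :? (hard-required) interpolation
if (: "${IMPORT_HOST_DIR:?import dir required}") 2>/dev/null; then
  unset_result="started"
else
  unset_result="aborted"        # compose refuses to start: same behaviour
fi
export IMPORT_HOST_DIR=/nonexistent/dummy   # dummy path; never mounted in this job
if (: "${IMPORT_HOST_DIR:?import dir required}") 2>/dev/null; then
  dummy_result="satisfied"
fi
echo "$unset_result $dummy_result"
```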
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous self-test proved the regex catches @v5 (positive case).
This adds a negative case proving @v3 is NOT flagged — guards against
a false-positive that would break every CI run permanently.
Suggested by Sara Holt in review of PR #558.
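The positive/negative pair can be sketched with a hypothetical version of the pattern (the real regex lives in the workflow):

```shell
# Illustrative guard pattern: flag artifact actions at v4 or later
pattern='uses:.*actions/(upload|download)-artifact@v[4-9]'
if echo 'uses: actions/upload-artifact@v5' | grep -Eq "$pattern"; then pos="flagged"; fi
if echo 'uses: actions/upload-artifact@v3' | grep -Eq "$pattern"; then neg="flagged"; else neg="clean"; fi
echo "$pos $neg"
```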
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reverts the re-regression introduced in 410b91e2. Gitea Actions
(act_runner) does not implement the v4 artifact protocol — jobs report
failure even when all tests pass. Pins all three call sites back to @v3
and adds load-bearing inline comments pointing to ADR-014 / #557.
This commit makes the grep guard added in the previous commit GREEN.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a repo-invariant check in the same 'Assert' block as the ADR-012
birpc guard. Anchored to YAML `uses:` lines so the inline self-test
fixture does not false-positive. Fails with an actionable error
referencing ADR-014 / #557.
Guard is intentionally RED at this commit — the three v4 call sites
are downgraded in the next commit.
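Why the anchoring matters, sketched with an illustrative pattern (not the exact one in the workflow):

```shell
# Anchoring to YAML `uses:` lines means only real call sites match;
# a fixture string quoting the same text is ignored
pattern='^[[:space:]]*uses:[[:space:]].*artifact@v4'
if echo '      uses: actions/upload-artifact@v4' | grep -Eq "$pattern"; then real="flagged"; fi
if echo '      fixture: "upload-artifact@v4"' | grep -Eq "$pattern"; then fixture="flagged"; else fixture="ignored"; fi
echo "$real $fixture"
```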
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verification mechanism for the 20-run acceptance criterion of issue #553.
Triggered manually via workflow_dispatch, runs the full coverage suite 20×
in parallel against a single SHA, asserts zero `[birpc] rpc is closed`
lines in every cell.
A single dispatch runs all 20 cells in parallel (roughly one main job's
wall-clock), yielding a deterministic signal for the teardown race. Cheaper
than 20 sequential push events, and it tests the same property the AC names.
Closes the verification gap raised by Tobias and Elicit in the issue
discussion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pdfjs-dist literal grep added in 9260866f only caught one named
trigger of the birpc teardown race; the underlying mechanism (ADR 012 /
#553) is any async vi.mock factory whose body performs `await import(...)`.
Add a second PCRE-multiline grep matching that shape. Scoped to
**/*.{spec,test}.ts under frontend/src/, excluding __meta__ (which holds
the fixture strings exercising the meta-test). Defence in depth: this
pairs with the ESLint rule (which catches the pattern at edit time) and
the in-suite meta-test (which catches it when the tests run).
Verified locally with real GNU grep against a planted synthetic offender.
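The multiline shape can be approximated like this (illustrative pattern and fixture, not the workflow's exact regex; GNU grep -Pz treats the input as a single buffer so the match may span lines):

```shell
# Banned shape: async vi.mock factory whose body awaits a dynamic import
pattern='vi\.mock\([^)]*async[\s\S]*?await import\('
offender="vi.mock('pdfjs-dist', async () => {
  const real = await import('pdfjs-dist');
});"
clean="vi.mock('pdfjs-dist', () => ({ getDocument: () => null }));"
if printf '%s' "$offender" | grep -Pzq "$pattern"; then hit="caught"; fi
if printf '%s' "$clean" | grep -Pzq "$pattern"; then miss="caught"; else miss="ignored"; fi
echo "$hit $miss"
```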
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a static grep step that runs after Lint and before the test suite.
Fails in ~1 s if any file under frontend/src/ contains the banned
vi.mock('pdfjs-dist' pattern, catching the regression before Playwright
spins up. Belt-and-suspenders with the ESLint rule (ADR 012).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a comment above the assertion step so a future developer diagnosing
a birpc-related failure in `npm test` knows where to find the diagnostic.
Addresses Sara Holt + Tobias Wendt round-4 observation on PR #536.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
-F (fixed string) matches the literal pattern [birpc] rpc is closed
without relying on BRE bracket escaping, making the intent explicit
and immune to accidental regex interpretation.
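A quick demonstration of the difference (the fingerprint string is the real one from the guard):

```shell
fingerprint='[birpc] rpc is closed'
# -F treats the brackets literally: only the exact fingerprint matches
if echo "$fingerprint" | grep -qF "$fingerprint"; then literal="match"; fi
# As a BRE, [birpc] is a character class, so e.g. "b rpc is closed" over-matches
if echo 'b rpc is closed' | grep -q "$fingerprint"; then bre="over-match"; fi
if echo 'b rpc is closed' | grep -qF "$fingerprint"; then fixed="matched"; else fixed="rejected"; fi
echo "$literal $bre $fixed"
```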
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The birpc guard step writes to /tmp/coverage-test-<run_id>.log and exits 1
when a race is detected. Without this file in the artifact, the evidence
disappears when the runner tears down, leaving only the exit code visible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add explicit set -eo pipefail so the npm run test:coverage exit code
propagates through the pipe (instead of tee's always-0 exit)
- Scope log file to github.run_id to prevent stale-log false positives
on retried steps sharing the same runner /tmp
- Tighten grep pattern to \[birpc\] rpc is closed to avoid matching
unrelated log lines that happen to contain "rpc is closed"
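The pipefail point in isolation:

```shell
# Without pipefail the pipeline's status is tee's (always 0),
# so a failing test run would pass silently
no_pipefail=0
( set +o pipefail; sh -c 'exit 1' | tee /dev/null ) || no_pipefail=$?
with_pipefail=0
( set -o pipefail; sh -c 'exit 1' | tee /dev/null ) || with_pipefail=$?
echo "no_pipefail=$no_pipefail with_pipefail=$with_pipefail"
```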
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- removes unreachable `; exit ${PIPESTATUS[0]}` — already covered by pipefail (Tobias)
- adds explicit `shell: bash` to both new steps for clarity (Tobias)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures npm run test:coverage output with tee and adds an always-run step
that greps for the teardown-race fingerprint. Any future regression where a
vi.mock factory races with birpc teardown will now surface as an explicit CI
failure rather than a silent exit-1 after all tests report green (#535).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`ip route` (iproute2) is not installed in the Gitea runner container,
causing the smoke test step to exit 127. /proc/net/route is a kernel
virtual file that is always present on Linux; awk decodes the
little-endian hex gateway field to dotted-decimal without any external
binary dependency.
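The decode itself, shown here as a pure-bash equivalent against a sample gateway field (0101A8C0 is 192.168.1.1 stored little-endian; the workflow uses awk):

```shell
# /proc/net/route stores the gateway as little-endian hex;
# read the byte pairs in reverse to get dotted-decimal
hex=0101A8C0
gw="$(( 16#${hex:6:2} )).$(( 16#${hex:4:2} )).$(( 16#${hex:2:2} )).$(( 16#${hex:0:2} ))"
echo "$gw"
```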
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unquoted variable expansion is safe here since the value contains
no spaces or glob characters, but quoting is the correct default
and keeps the script consistent with surrounding style.
Addresses review suggestion by Felix Brandt and Tobias Wendt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If `ip route show default` returns no output the old code passed
an empty string to curl --resolve, producing a confusing error 6
("couldn't resolve host") with no indication that gateway detection
had failed. The new guard exits immediately with a clear message.
Addresses review concern raised by Tobias Wendt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Job containers run in bridge network mode (runner-config.yaml). Inside
a bridge-networked container 127.0.0.1 is the container's own loopback;
Caddy on the host is unreachable there, causing an immediate ECONNREFUSED.
Use the Docker bridge gateway IP instead — the host's docker0 interface
where Caddy (bound on 0.0.0.0:443) is reachable from the container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same gap as nightly.yml: production deploys also need Caddy to reload
the updated Caddyfile before the smoke test validates the public surface.
Uses the same nsenter pattern introduced in the previous commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`sudo systemctl reload caddy` does not work from inside a DooD job
container: `systemctl` is absent from Ubuntu container images and
container processes cannot reach the host systemd without entering its
namespaces. Replace with `docker run --privileged --pid=host ubuntu:22.04
nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy`, which uses
the already-mounted Docker socket to spin up a privileged sibling
container that enters the host PID namespace via nsenter. Tested live on
the Hetzner VPS. No sudoers entry required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `sudo systemctl reload caddy` step between the docker compose
deploy and the smoke test. This ensures any committed Caddyfile changes
are applied before the public surface is verified.
Previously the workflow had no mechanism to push Caddyfile changes to
the running host daemon. A Caddyfile edit would land in the repo but
Caddy would keep serving the previous config, causing the smoke test to
catch a stale header or still-proxied /actuator route rather than the
intended current config.
This step also surfaces the root cause of today's port-443 failure
explicitly: if Caddy is not running, the step fails with a clear service
error rather than a misleading "Failed to connect to port 443" from curl.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test:coverage step runs the full suite under Istanbul; running
`npm test` first executes every test twice for no extra signal.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sara flagged that a future "compose cleanup" PR could silently drop the
backend volumes block and CI would happily pass while mass import on
staging silently broke. Adds a pre-build step that renders the staging
compose config and fails the deploy if `target: /import` or
`read_only: true` is missing.
Local verification of the guard:
- Volumes block removed → `grep -q 'target: /import'` exits 1 → step fails
- Volumes block present → both greps match → step passes
Addresses Sara's review on #526.
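The guard's shape, against an illustrative rendered fragment (the real step greps `docker compose config` output; the source path here is made up):

```shell
# Both markers must survive in the rendered compose config
rendered='    volumes:
      - type: bind
        source: /srv/import
        target: /import
        read_only: true'
if echo "$rendered" | grep -q 'target: /import'; then t="present"; fi
if echo "$rendered" | grep -q 'read_only: true'; then r="present"; fi
echo "$t $r"
```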
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the staging change. The host directory does not yet exist on
the production server — first production release that consumes this
will create an empty bind source via Docker's auto-create behaviour;
mass import then reports "no spreadsheet found" until an operator
pre-stages a payload there.
Addresses Tobias's review on #526.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The compose file now requires IMPORT_HOST_DIR or refuses to start
(#526). Without this line the next nightly deploy would fail with a
clear interpolation error, and it should not fail at all: the staging
import payload already lives at this host path (rsync'd in #526).
Addresses Tobias's review on #526.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses Sara's review request on #515.
Without this gate, a future regression that turns prerender.crawl
back on (or adds a new prerender entry whose nav links into
protected routes) would silently bake /, /documents, /persons etc.
to "redirect-to-login" HTML and re-introduce #514.
Verified the script catches the current broken build state:
$ find build/prerendered ... -not -path 'hilfe/*' ...
build/prerendered/{index,documents,persons,geschichten,stammbaum}.html
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #508.
Our gitea-runner advertises labels ubuntu-latest / ubuntu-24.04 /
ubuntu-22.04. `runs-on: self-hosted` never matches → dispatched
deploy jobs sit in the queue forever. The runner is still
genuinely self-hosted (DooD socket, joined to gitea_gitea net,
single-tenant per ADR-011) — the `self-hosted` token was just an
unconfirmed assumption about the label name.
Unblocks #497 / #499 first deploy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #503.
Debian's fail2ban package ships defaults-debian.conf with
`[DEFAULT] backend = systemd`. Without an explicit override, our
familienarchiv-auth jail inherits the systemd backend at runtime,
reads from journald, and never inspects /var/log/caddy/access.log.
A live login brute-force would not be banned.
Add `backend = polling` to the jail and a CI step that links the jail
into /etc/fail2ban/ and asserts `fail2ban-client -d` resolves it to
the polling backend, not the inherited systemd backend.
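A minimal sketch of the jail override (filter name and logpath come from this message; the remaining keys are assumptions):

```ini
[familienarchiv-auth]
enabled = true
# Override defaults-debian.conf's inherited `backend = systemd` so the jail
# reads the Caddy access log instead of journald
backend = polling
logpath = /var/log/caddy/access.log
```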
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `if: always()` conditional on the env-file cleanup step in both
deploy workflows is what makes the ADR-011 single-tenant runner trust
model safe: secrets land on disk before each deploy and are wiped
unconditionally afterwards. A future workflow refactor that drops
`if: always()` would silently leave plaintext secrets on the runner
on any failed deploy.
The ADR documents this; the workflow file did not. Adds a prominent
inline comment so the next reader of the YAML sees the constraint
without having to cross-reference ADR-011. No behaviour change — both
workflows still parse. Addresses @nora's round-2 suggestion on PR
#499 — "linchpin of the ADR-011 trust model".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `Permissions-Policy: camera=(), microphone=(), geolocation=()` to
the shared (security_headers) snippet, so both archiv vhosts and the
git vhost deny browser APIs the app does not use. Reduces blast radius
of an XSS landing in a privileged origin.
The deploy smoke steps in nightly.yml and release.yml gain a matching
assertion against the canonical header value, so a future Caddyfile
edit that drops or loosens the header (e.g. `camera=(self)`) fails the
deploy instead of regressing silently.
`caddy validate` against caddy:2 passes; both workflow YAMLs parse.
Addresses @nora's round-2 suggestion on PR #499 — "lower-impact than
CSP but nearly free".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the presence-only `grep -qi strict-transport-security` smoke
assertion in both nightly.yml and release.yml with a value-pinning
regex that requires `max-age=31536000`, `includeSubDomains`, and
`preload`. A future Caddyfile edit that drops any of those three
parts now fails the deploy smoke step instead of passing silently.
Verified locally that the new pattern matches the preload-eligible
value and rejects three degraded forms (short max-age, missing
includeSubDomains, missing preload). Addresses @sara's round-2 note
on PR #499 — "presence check, not value check".
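An illustrative version of the value-pinning check (the workflow's exact pattern may differ):

```shell
# All three parts of the preload-eligible value must be present
pattern='strict-transport-security:.*max-age=31536000.*includeSubDomains.*preload'
good='Strict-Transport-Security: max-age=31536000; includeSubDomains; preload'
bad='Strict-Transport-Security: max-age=300; includeSubDomains'
if echo "$good" | grep -Eiq "$pattern"; then full="pass"; fi
if echo "$bad" | grep -Eiq "$pattern"; then degraded="pass"; else degraded="fail"; fi
echo "$full $degraded"
```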
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke step previously curled the public hostname unconditionally,
which routes the runner's request via DNS → router → back into the same
host. Many SOHO routers do not implement hairpin NAT (or do so only after
a firmware update), so the deploy may pass on day one and silently fail
on day 90.
--resolve "<host>:443:127.0.0.1" pins the hostname to the runner's
loopback while keeping SNI on the public name (so the cert validates
correctly and the Caddy vhost block matches). The smoke test now
verifies that the Caddy-on-the-same-host is serving the right
hostname end-to-end, with no router dependency.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without --pull, the host's Docker layer cache wins: if a CVE drops in
node:20.19.0-alpine3.21 / postgres:16-alpine and the vendor re-publishes
the same tag, the runner keeps serving the cached layer until the cache
is manually cleared — a silent supply-chain blind spot.
Adding --pull to both `compose build` invocations costs a single
re-pull per run and shortens the base-image patch lag from "next host
prune" to "next nightly".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The filter only watched /api/auth/login 401 — leaving the forgot-password
endpoint open to:
- email enumeration (slow brute-force probing which addresses exist)
- password-reset brute-force against accounts whose addresses leak
Widens the failregex to /api/auth/(login|forgot-password) and adds 429 to
the status alternation so a future in-app rate-limiter response is also
caught by the jail (defense in depth).
CI assertions extended to cover both new dimensions plus a negative case
on an unrelated 401 endpoint (/api/documents) — pins that the widening
did not over-match.
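The widened shape, demonstrated with grep -E (the real failregex uses fail2ban's <HOST> tag; the JSON samples are illustrative):

```shell
# Widened endpoint alternation plus 429, with a negative case on an
# unrelated 401 endpoint to show it does not over-match
pattern='"uri":"/api/auth/(login|forgot-password)".*"status":(401|429)'
hit='{"remote_ip":"203.0.113.7","uri":"/api/auth/forgot-password","status":429}'
miss='{"remote_ip":"203.0.113.7","uri":"/api/documents","status":401}'
if echo "$hit" | grep -Eq "$pattern"; then widened="caught"; fi
if echo "$miss" | grep -Eq "$pattern"; then negative="caught"; else negative="ignored"; fi
echo "$widened $negative"
```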
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The create-buckets service in docker-compose.prod.yml runs on every
`docker compose up` (one-shot, restart=no). A re-deploy that fails
because the user/bucket/policy already exists would block the whole
nightly/release pipeline — and the only way to find out today is to
run a second deploy.
This job runs the bootstrap twice against a throwaway minio stack and
asserts both invocations exit 0. Caught at PR time, not at the third
nightly deploy at 02:00.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Caddy 2.x emits JSON access logs; the failregex in
infra/fail2ban/filter.d/familienarchiv-auth.conf depends on the
"remote_ip" → "uri" → "status" key order being stable. A future Caddy
upgrade that reorders fields would break the jail silently (regex no
longer matches → fail2ban returns 0 hits → host stops banning
brute-force, discovered only at the next incident).
This job pins the contract: a sample /api/auth/login 401 line must
match (1 hit) and a /api/auth/login 200 line must not (0 hits).
Catches a regression at PR time instead of in production.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two deploy workflows make two non-obvious assumptions that future
maintainers should not have to rediscover by reading the diff:
1. Single-tenant self-hosted runner — the .env.* file lands on disk
during the deploy and is cleaned up unconditionally. Multi-tenant
usage would require switching to stdin-piped env input.
2. Host docker layer cache is authoritative — there is no
actions/cache directive; a host-level `docker system prune` will
cold-start the next build.
Both notes added as block comments at the top of each workflow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the nightly.yml smoke step against archiv.raddatz.cloud. Catches
the same four failure modes (Caddy not reloaded, DNS missing, HSTS
dropped, /actuator block bypassed) on the prod path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Healthchecks prove containers are healthy on the docker network; they
do not prove the public URL is reachable, HSTS still fires, or
/actuator is still blocked at the edge. Add a post-deploy smoke step
to nightly.yml that:
1. GETs https://staging.raddatz.cloud/login (frontend reachable)
2. asserts the response includes the Strict-Transport-Security header
3. asserts /actuator/health returns 404 (defense-in-depth verified)
Failure aborts the workflow before the env-file cleanup step. The
cleanup step still runs because it is `if: always()`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fires on `v*` tag push. Tags the built images with the git tag so
rollbacks are a one-liner (TAG=<previous> docker compose ... up -d).
`up -d --wait` blocks until every service healthcheck reports
healthy; a bad release fails the workflow rather than crash-looping
silently. The .env.production file containing all Gitea secrets is
removed in `if: always()` after the deploy step.
Refs #497.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs daily at 02:00 (and on workflow_dispatch). Builds the prod
compose stack with BuildKit, writes a transient .env.staging from
Gitea secrets, then `docker compose up -d --wait` so the job fails
loudly if any service's healthcheck never reports healthy.
The --profile staging flag starts the mailpit catcher in place of
a real SMTP relay; no production SMTP credentials touch the staging
environment.
The .env.staging file is cleaned up in `if: always()` to avoid
leaving secrets in the runner workspace between runs.
Refs #497.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs test:coverage (server v8 + client Istanbul) after tests, hard-gates
on both 80% branch thresholds, and uploads coverage/ as an artifact.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DOCKER_HOST makes the socket explicit rather than relying on runner
config propagation; TESTCONTAINERS_RYUK_DISABLED=true avoids Ryuk
watchdog start failures in nested container environments.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
date-buckets.spec.ts midnight tests pass timezone-aware dates (+02:00)
that are 22:00 UTC on the prior day; setHours(0,0,0,0) operates in the
local TZ, so the computed bucket depends on the runner's timezone.
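The failure mode in one line of GNU date:

```shell
# Midnight at +02:00 is 22:00 UTC on the previous calendar day
utc=$(TZ=UTC date -d '2024-06-01 00:00:00 +02:00' '+%F %T')
echo "$utc"
```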
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The prerender fix only prevents regression if the build is actually run in
CI. Without this gate, a future prerendered route that becomes unreachable
behind auth would fail silently until someone runs the build manually.
Fits after the test step in the existing unit-tests job — no new job needed
since node_modules is already cached for the Playwright container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 503/403 auth tests for the /train-sender endpoint, matching the pattern
already used for /train and /segtrain. Also surface test_sender_registry.py
in CI (it needs no ML stack) and add pytest-asyncio to the install step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>