Commit Graph

93 Commits

Author SHA1 Message Date
Marcel
c43f45a472 Merge branch 'fix/issue-601-obs-stack-permanent'
Some checks failed
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
2026-05-16 10:19:59 +02:00
Marcel
55ccd5f3c0 ci(obs): replace rsync with rm+cp in deploy step
rsync is not present in the act_runner job container image. rm -rf +
cp -r gives identical semantics (including removal of deleted files)
using only coreutils, which are always available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 10:18:42 +02:00
0bb0a314ad ci(obs): add obs-glitchtip to health assertion loop (now has /_health/ healthcheck)
Some checks are pending
CI / Unit & Component Tests (pull_request) Waiting to run
CI / OCR Service Tests (pull_request) Waiting to run
CI / Backend Unit Tests (pull_request) Waiting to run
CI / fail2ban Regex (pull_request) Waiting to run
CI / Compose Bucket Idempotency (pull_request) Waiting to run
2026-05-16 09:36:37 +02:00
b194b565f6 ci(obs): reference #603 in keep-in-sync comments; add obs-glitchtip to health assertion
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
2026-05-16 09:35:43 +02:00
Marcel
6720a5aeb2 chore(obs): improve deploy maintainability from review feedback
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m45s
CI / OCR Service Tests (pull_request) Successful in 47s
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
- Move POSTGRES_USER to obs.env (non-secret, constant across envs)
- Replace cp -r with rsync -a --delete so removed config files are
  purged from /opt/familienarchiv on next deploy instead of lingering
- Document --env-file ordering contract in validate + start steps:
  obs.env first (defaults), obs-secrets.env second (wins on dupes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:20:08 +02:00
Marcel
25062be657 ci(obs): quote heredoc delimiter in release obs-secrets.env write
Same fix as nightly.yml: prevents shell expansion of '$' in secret
values after Gitea renders them. Keep in sync with nightly.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:04:12 +02:00
Marcel
9662ff5f8c ci(obs): quote heredoc delimiter in nightly obs-secrets.env write
Prevents shell from expanding '$' in Gitea-rendered secret values.
Without the quote, a password like 'P@$s5w0rd' has '$s5w0rd' silently
expanded to '' — writing a truncated value to obs-secrets.env.
'<<'EOF'' suppresses shell expansion; Gitea's '${{ }}' template
rendering already ran before the shell sees the script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:03:46 +02:00
Marcel
f5c7be932b ci(obs): document POSTGRES_HOST derivation from Compose project name
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m38s
CI / OCR Service Tests (pull_request) Successful in 45s
CI / Backend Unit Tests (pull_request) Failing after 10m48s
CI / fail2ban Regex (pull_request) Successful in 2m51s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m16s
The container names archiv-staging-db-1 and archiv-production-db-1 are
derived from the Compose project name + service name. A project rename
silently breaks the obs stack DB connection. Add a comment at the point
of definition so the dependency is obvious when someone changes it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:54:17 +02:00
Marcel
dec0001bd1 ci(obs): chmod 600 obs-secrets.env after creation in both workflows
The heredoc creates the file with default umask permissions (644 —
world-readable). Setting 600 immediately after creation prevents other
processes on the host from reading the Grafana, GlitchTip, and Postgres
credentials. Defence-in-depth for the single-tenant VPS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:53:49 +02:00
Marcel
f628ab6435 ci(obs): add validate + health assertion steps to release.yml
nightly.yml had two observability gates that release.yml lacked:
- "Validate observability compose config" (docker compose config --quiet)
  catches missing env vars and YAML errors before any containers start
- "Assert observability stack health" checks obs-loki/prometheus/grafana/tempo
  are healthy after up --wait, covering services without healthcheck directives

Mirrors the nightly.yml steps verbatim so the production deploy path is at
least as well-verified as the nightly staging path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:53:18 +02:00
Marcel
53cf1837b2 fix(obs): set POSTGRES_HOST per environment — staging/prod use compose auto-names not archive-db
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 2m58s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 2m39s
CI / fail2ban Regex (pull_request) Successful in 40s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:21:53 +02:00
Marcel
1ae4bfe325 ci(obs): GitOps obs env split in release — deploy to /opt/familienarchiv/, secrets fresh from Gitea
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:19:12 +02:00
Marcel
c5139851b8 ci(obs): GitOps obs env split in nightly — obs.env in git, secrets fresh from Gitea
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:18:38 +02:00
Marcel
79735e23e0 ci(obs): assert obs-loki/prometheus/grafana/tempo are healthy after stack up
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 2m58s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 2m36s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:48 +02:00
Marcel
df37113d38 ci(obs): add compose config dry-run before obs stack up to catch .env substitution errors
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:17 +02:00
Marcel
7e52494880 fix(ci): deploy obs configs to /opt/familienarchiv/ before starting stack
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m4s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m42s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
The observability stack's bind-mount sources pointed to workspace-relative
paths. When CI wiped the workspace between runs, containers kept running but
their config files disappeared — causing Docker to auto-create directories
at the missing paths and crash the services on next restart.

Fix: mount /opt/familienarchiv/ into CI job containers via runner-config.yaml,
then copy infra/observability/ and docker-compose.observability.yml there before
docker compose up. Compose runs from the permanent path, so bind mounts resolve
to stable host paths that survive workspace wipes.

Docker Compose reads /opt/familienarchiv/.env automatically (no --env-file flag),
which is managed on the server and persists between CI runs.

Closes #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 21:59:23 +02:00
Marcel
56c3e51657 fix(ci): replace overlay2 sync with workspace bind mount for DooD
runner-config.yaml: correct path to /srv/gitea-workspace (VPS, not Synology).
docker-compose.observability.yml: revert 5 bind mounts to plain relative paths;
  OBS_CONFIG_DIR variable is no longer needed.
nightly.yml / release.yml: remove OBS_CONFIG_DIR env injection and the
  "Sync observability configs to host" step from both workflows.

With workdir_parent=/srv/gitea-workspace and an identical host<->container
bind mount, $(pwd) inside job containers resolves to a real host path the
daemon can find — no privileged container, no overlay2 inspection, no nsenter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:36:55 +02:00
Marcel
1fc47888d5 fix(ci): sync observability configs to host before docker compose up (#598)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m26s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m40s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
DooD runner only shares /var/run/docker.sock — no workspace directory is
mapped to the host daemon. Relative bind mounts in
docker-compose.observability.yml resolved to paths that didn't exist on
the host; Docker auto-created directories in their place, causing
'not a directory' mount failures for all five config files.

Fix:
- docker-compose.observability.yml: replace hardcoded ./infra/observability/
  prefix with ${OBS_CONFIG_DIR:-./infra/observability} so the path is
  configurable while remaining backwards-compatible for local use.
- nightly.yml / release.yml: add a 'Sync observability configs to host'
  step that finds the job container's overlay2 MergedDir (the container's
  full filesystem as seen from the host mount namespace), then uses the
  existing nsenter/alpine pattern to cp the config tree into a stable host
  path (/srv/familienarchiv-{staging,production}/obs-configs).
  OBS_CONFIG_DIR is injected into the env file so Compose picks it up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:02:53 +02:00
Marcel
fed427dc4a fix(infra): set OTEL_EXPORTER_OTLP_ENDPOINT in docker-compose.prod.yml
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
The endpoint belongs in the compose file (hardcoded to the in-network
Tempo service) rather than per-environment workflow files. This covers
both staging (nightly.yml) and production (release.yml) with a single
change and removes the duplicate from the nightly env-file block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:43:23 +02:00
Marcel
cf78ab2f8e fix(staging): correct backend healthcheck port and OTel endpoint
Some checks failed
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (pull_request) Has been cancelled
Two bugs introduced when the management port was split from the app port:

1. Backend healthcheck hit localhost:8080/actuator/health (app port) —
   actuator is on management.server.port=8081, so every probe got a 404
   from the main MVC dispatcher, marking the container permanently unhealthy.
   Fix: change the probe to localhost:8081.

2. OTEL_EXPORTER_OTLP_ENDPOINT was not set in .env.staging, so the exporter
   fell back to http://localhost:4317 (the CI-safe default) instead of
   http://tempo:4317 (the in-network Tempo service). Fix: inject the correct
   endpoint in the nightly env-file generation step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:37:15 +02:00
Marcel
c8883d0e40 fix(ci): isolate compose-idempotency network from archiv-net collisions
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m40s
CI / OCR Service Tests (pull_request) Successful in 34s
CI / Backend Unit Tests (pull_request) Successful in 7m8s
CI / fail2ban Regex (pull_request) Successful in 1m58s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m41s
CI / Unit & Component Tests (push) Successful in 5m37s
CI / OCR Service Tests (push) Successful in 28s
CI / Backend Unit Tests (push) Successful in 6m59s
CI / fail2ban Regex (push) Successful in 1m59s
CI / Compose Bucket Idempotency (push) Successful in 1m44s
The name: archiv-net declaration (needed so docker-compose.observability.yml
can join the network as external: true) caused the compose-idempotency CI job
to collide with any archiv-net left on the runner from staging or a previous
run. mc would resolve 'minio' to the wrong container and fail with a signature
mismatch.

Make the network name interpolable via COMPOSE_NETWORK_NAME (default: archiv-net
so production/staging behaviour is unchanged). Inject COMPOSE_NETWORK_NAME=
test-idem-archiv-net into the stub env file so the idempotency test always
gets a fully isolated network.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:33:07 +02:00
Marcel
ada3a3ccaf devops(ci): add --remove-orphans to observability stack deploy steps
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m27s
CI / OCR Service Tests (pull_request) Successful in 34s
CI / Backend Unit Tests (pull_request) Successful in 7m13s
CI / fail2ban Regex (pull_request) Successful in 1m51s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m47s
CI / Unit & Component Tests (push) Successful in 5m45s
CI / OCR Service Tests (push) Successful in 36s
CI / Backend Unit Tests (push) Successful in 7m12s
CI / fail2ban Regex (push) Successful in 1m54s
CI / Compose Bucket Idempotency (push) Successful in 1m41s
Both nightly and release workflows were missing --remove-orphans on the
observability compose up, while the main app deploy step already had it.
Without it, containers removed from docker-compose.observability.yml
linger as unnamed orphans until manually pruned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:55:28 +02:00
Marcel
4a7349543a devops(ci): wire SENTRY_DSN into staging and production env files
Adds SENTRY_DSN as an optional secret (empty by default) so it can be
set after GlitchTip first-run without requiring another code change.
Backend reads it via application.yaml; empty value keeps Sentry disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:45:07 +02:00
Marcel
f15e004645 devops(ci): add --wait to observability stack startup
Prometheus, Loki, Tempo, and Grafana all define healthchecks in
docker-compose.observability.yml. Without --wait, the step exits 0
as soon as containers are created, masking startup failures silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:44:16 +02:00
Marcel
d7d225af77 devops(observability): wire observability stack into nightly and release deploys
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m32s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m3s
CI / fail2ban Regex (pull_request) Successful in 1m55s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s
- docker-compose.prod.yml: add `name: archiv-net` so the network has a
  stable Docker name regardless of compose project name (-p flag).
  Both staging and production share the same host-level network, which
  is correct since the observability stack is a single shared instance.

- nightly.yml / release.yml: add observability env vars (POSTGRES_USER,
  PORT_GRAFANA=3003, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090,
  GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, GLITCHTIP_DOMAIN) to the
  env file, then `docker compose -f docker-compose.observability.yml up -d`
  after the app deploy step. PORT_GRAFANA=3003 avoids collision with
  staging frontend on 3001.

  Requires two new Gitea secrets: GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:22:37 +02:00
Marcel
ebdb36b7d0 devops(ci): upload surefire XML reports as CI artifact
Captures all 102 test results independent of log verbosity.
if: always() ensures reports are available on failure — exactly
when they're needed most.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 14:25:37 +02:00
Marcel
5646e739c2 fix(ci): run svelte-kit sync before lint to fix cache-hit tsconfig miss
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m8s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m25s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
CI / Unit & Component Tests (push) Successful in 3m7s
CI / OCR Service Tests (push) Successful in 17s
CI / Backend Unit Tests (push) Successful in 4m15s
CI / fail2ban Regex (push) Successful in 39s
CI / Compose Bucket Idempotency (push) Successful in 58s
When the node_modules cache hits, npm ci is skipped and the prepare
lifecycle (svelte-kit sync) never runs. frontend/tsconfig.json extends
.svelte-kit/tsconfig.json which only exists after svelte-kit sync —
so ESLint fails at tsconfig resolution on every cache-warm run.

Adding an unconditional svelte-kit sync step after Paraglide compile
and before Lint ensures .svelte-kit/tsconfig.json is always present
regardless of cache state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 12:07:15 +02:00
Marcel
bbbdf8cd09 ci: restrict push trigger to main — eliminate duplicate runs on feature branches
Some checks failed
CI / Unit & Component Tests (push) Failing after 1m5s
CI / OCR Service Tests (push) Successful in 17s
CI / Backend Unit Tests (push) Successful in 4m27s
CI / fail2ban Regex (push) Successful in 40s
CI / Compose Bucket Idempotency (push) Successful in 58s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 11:12:24 +02:00
Marcel
3de0d2f0fe fix(ci): add IMPORT_HOST_DIR stub to compose-idempotency env file
Some checks failed
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
Docker Compose interpolates all variables in the full file even when
only a subset of services is requested. The backend service uses
IMPORT_HOST_DIR with :? (hard-required), causing the idempotency job
to abort before any container starts. A dummy path satisfies the parser;
the backend service is never started in this job so the path need not exist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 10:58:38 +02:00
Marcel
0abbc147e2 ci(unit-tests): add negative self-test case to upload-artifact guard
Some checks failed
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
The previous self-test proved the regex catches @v5 (positive case).
This adds a negative case proving @v3 is NOT flagged — guards against
a false-positive that would break every CI run permanently.

Suggested by Sara Holt in review of PR #558.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 10:58:19 +02:00
Marcel
fa46492759 ci(workflows): downgrade upload-artifact v4 → v3 — Gitea act_runner limitation (ADR-014)
Reverts the re-regression introduced in 410b91e2. Gitea Actions
(act_runner) does not implement the v4 artifact protocol — jobs report
failure even when all tests pass. Pins all three call sites back to @v3
and adds load-bearing inline comments pointing to ADR-014 / #557.

This commit makes the grep guard added in the previous commit GREEN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 10:58:19 +02:00
Marcel
3965541879 ci(unit-tests): add grep guard for (upload|download)-artifact@v4+
Adds a repo-invariant check in the same 'Assert' block as the ADR-012
birpc guard. Anchored to YAML `uses:` lines so the inline self-test
fixture does not false-positive. Fails with an actionable error
referencing ADR-014 / #557.

Guard is intentionally RED at this commit — the three v4 call sites
are downgraded in the next commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 10:58:19 +02:00
Marcel
c820884765 ci(coverage-flake-probe): add workflow_dispatch matrix job (20 parallel runs)
Verification mechanism for the 20-run acceptance criterion of issue #553.
Triggered manually via workflow_dispatch, runs the full coverage suite 20×
in parallel against a single SHA, asserts zero `[birpc] rpc is closed`
lines in every cell.

One fire, parallel cost (~one main-job's wall-clock), deterministic signal
for the teardown race. Cheaper than 20 sequential push events and tests
the same property the AC names.

Closes the verification gap raised by Tobias and Elicit in the issue
discussion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:12:04 +02:00
Marcel
67cd56acc7 ci(unit-tests): extend grep guard to async vi.mock with dynamic import
The pdfjs-dist literal grep added in 9260866f only caught one named
trigger of the birpc teardown race; the underlying mechanism (ADR 012 /
#553) is any async vi.mock factory whose body performs `await import(...)`.

Add a second PCRE-multiline grep matching that shape. Scoped to
**/*.{spec,test}.ts under frontend/src/, excluding __meta__ (which holds
the fixture strings exercising the meta-test). Defence in depth pairs with
the ESLint rule (saves at edit time) and the in-suite meta-test (catches
when tests run).

Verified locally with real GNU grep against a planted synthetic offender.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:11:09 +02:00
Marcel
9260866f47 ci(unit-tests): add early grep check for banned vi.mock pdfjs-dist pattern
Some checks failed
CI / Unit & Component Tests (push) Failing after 1m47s
CI / OCR Service Tests (push) Successful in 16s
CI / Backend Unit Tests (push) Successful in 4m11s
CI / fail2ban Regex (push) Successful in 38s
CI / Compose Bucket Idempotency (push) Failing after 11s
Adds a static grep step that runs after Lint and before the test suite.
Fails in ~1 s if any file under frontend/src/ contains the banned
vi.mock('pdfjs-dist' pattern, catching the regression before Playwright
spins up. Belt-and-suspenders with the ESLint rule (ADR 012).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 12:32:23 +02:00
Marcel
1ead1f293f ci(coverage): document that birpc guard covers coverage run only
Adds a comment above the assertion step so a future developer diagnosing
a birpc-related failure in `npm test` knows where to find the diagnostic.

Addresses Sara Holt + Tobias Wendt round-4 observation on PR #536.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
729f5c66d6 ci(coverage): use grep -F for birpc guard to avoid BRE escaping
-F (fixed string) matches the literal pattern [birpc] rpc is closed
without relying on BRE bracket escaping, making the intent explicit
and immune to accidental regex interpretation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
d40f477397 ci(coverage): include coverage log in artifact upload
The birpc guard step writes to /tmp/coverage-test-<run_id>.log and exits 1
when a race is detected. Without this file in the artifact, the evidence
disappears when the runner tears down — only the exit code remained visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
cf78957476 ci(coverage): harden coverage guard step
- Add explicit set -eo pipefail so npm test:coverage exit code
  propagates through the pipe (not just tee's always-0 exit)
- Scope log file to github.run_id to prevent stale-log false positives
  on retried steps sharing the same runner /tmp
- Tighten grep pattern to \[birpc\] rpc is closed to avoid matching
  unrelated log lines that happen to contain "rpc is closed"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
3594204214 ci(coverage): simplify coverage step and pin shell to bash
- removes unreachable `; exit ${PIPESTATUS[0]}` — already covered by pipefail (Tobias)
- adds explicit `shell: bash` to both new steps for clarity (Tobias)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
538adb43a9 ci(guard): fail unit-tests job if [birpc] rpc is closed appears in coverage run
Captures npm run test:coverage output with tee and adds an always-run step
that greps for the teardown-race fingerprint. Any future regression where a
vi.mock factory races with birpc teardown will now surface as an explicit CI
failure rather than a silent exit-1 after all tests report green (#535).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:57:28 +02:00
Marcel
9c26c00eee fix(ci): replace iproute2 ip with /proc/net/route for gateway detection
Some checks failed
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
`ip route` (iproute2) is not installed in the Gitea runner container,
causing the smoke test step to exit 127. /proc/net/route is a kernel
virtual file that is always present on Linux; awk decodes the
little-endian hex gateway field to dotted-decimal without any external
binary dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:50:56 +02:00
Marcel
6d16be4669 fix(ci): quote \$RESOLVE in all curl calls
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m51s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 4m1s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Failing after 11s
CI / Unit & Component Tests (push) Failing after 1m51s
CI / OCR Service Tests (push) Successful in 18s
CI / Backend Unit Tests (push) Successful in 4m10s
CI / fail2ban Regex (push) Successful in 38s
CI / Compose Bucket Idempotency (push) Failing after 10s
Unquoted variable expansion is safe here since the value contains
no spaces or glob characters, but quoting is the correct default
and keeps the script consistent with surrounding style.

Addresses review suggestion by Felix Brandt and Tobias Wendt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:26:35 +02:00
Marcel
f1032865f3 fix(ci): guard against empty HOST_IP in smoke test
If `ip route show default` returns no output the old code passed
an empty string to curl --resolve, producing a confusing error 6
("couldn't resolve host") with no indication that gateway detection
had failed.  The new guard exits immediately with a clear message.

Addresses review concern raised by Tobias Wendt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:26:35 +02:00
Marcel
3056311c24 fix(ci): resolve smoke test host via bridge gateway, not 127.0.0.1
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1m50s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m8s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Failing after 10s
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Unit & Component Tests (push) Has started running
CI / Compose Bucket Idempotency (push) Has been cancelled
Job containers run in bridge network mode (runner-config.yaml). Inside
a bridge-networked container 127.0.0.1 is the container's own loopback;
Caddy on the host is unreachable there, causing an immediate ECONNREFUSED.

Use the Docker bridge gateway IP instead — the host's docker0 interface
where Caddy (bound on 0.0.0.0:443) is reachable from the container.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 09:10:17 +02:00
Marcel
544b96bc9e fix(ci): pin Reload Caddy to alpine:3.21 digest, add reload-vs-restart rationale
- Switch ubuntu:22.04 (floating, ~70 MB) to alpine:3.21 pinned by sha256
  digest (~5 MB); util-linux installed at run time via apk add
- Add explicit comment explaining why `reload` not `restart`: SIGHUP
  re-reads config in-process without dropping TLS connections

Addresses Tobias + Nora blocker from PR review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 07:42:28 +02:00
Marcel
d29169eb39 fix(ci): add Caddy reload step to release workflow
Same gap as nightly.yml: production deploys also need Caddy to reload
the updated Caddyfile before the smoke test validates the public surface.
Uses the same nsenter pattern introduced in the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 07:42:28 +02:00
Marcel
d750d5cee2 fix(ci): reload Caddy via nsenter, not sudo systemctl
`sudo systemctl reload caddy` does not work from inside a DooD job
container: `systemctl` is absent from Ubuntu container images and
container processes cannot reach the host systemd without entering its
namespaces. Replace with `docker run --privileged --pid=host ubuntu:22.04
nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy`, which uses
the already-mounted Docker socket to spin up a privileged sibling
container that enters the host PID namespace via nsenter. Tested live on
the Hetzner VPS. No sudoers entry required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 07:42:28 +02:00
Marcel
90f52eae41 ci(nightly): reload Caddy before smoke test
Adds a `sudo systemctl reload caddy` step between the docker compose
deploy and the smoke test. This ensures any committed Caddyfile changes
are applied before the public surface is verified.

Previously the workflow had no mechanism to push Caddyfile changes to
the running host daemon. A Caddyfile edit would land in the repo but
Caddy would keep serving the previous config, causing the smoke test to
catch a stale header or still-proxied /actuator route rather than the
intended current config.

This step also surfaces the root cause of today's port-443 failure
explicitly: if Caddy is not running, the step fails with a clear service
error rather than a misleading "Failed to connect to port 443" from curl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 07:42:28 +02:00
Marcel
e42c7b04c1 ci: drop redundant npm test step, coverage run covers it
The test:coverage step runs the full suite under Istanbul; running
`npm test` first executes every test twice for no extra signal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 21:50:28 +02:00