Quoting RESOLVE as a string and expanding with "$RESOLVE" passes the
flag and its value as a single token to curl; curl rejects the whole
string as an unknown option (exit 2). Switching to a Bash array and
"${RESOLVE[@]}" ensures the two words are always passed as separate
arguments regardless of quoting context.
Also aligns release.yml gateway detection with nightly.yml: replaces
`ip route` (requires iproute2) with /proc/net/route (always available
in the job container, no extra package needed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rsync is not present in the act_runner job container image. rm -rf +
cp -r gives identical semantics (including removal of deleted files)
using only coreutils, which are always available.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move POSTGRES_USER to obs.env (non-secret, constant across envs)
- Replace cp -r with rsync -a --delete so removed config files are
purged from /opt/familienarchiv on next deploy instead of lingering
- Document --env-file ordering contract in validate + start steps:
obs.env first (defaults), obs-secrets.env second (wins on dupes)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents shell from expanding '$' in Gitea-rendered secret values.
Without the quote, a password like 'P@$s5w0rd' has '$s5w0rd' silently
expanded to '' — writing a truncated value to obs-secrets.env.
'<<'EOF'' suppresses shell expansion; Gitea's '${{ }}' template
rendering already ran before the shell sees the script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The container names archiv-staging-db-1 and archiv-production-db-1 are
derived from the Compose project name + service name. A project rename
silently breaks the obs stack DB connection. Add a comment at the point
of definition so the dependency is obvious when someone changes it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The heredoc creates the file with default umask permissions (644 —
world-readable). Setting 600 immediately after creation prevents other
processes on the host from reading the Grafana, GlitchTip, and Postgres
credentials. Defence-in-depth for the single-tenant VPS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The observability stack's bind-mount sources pointed to workspace-relative
paths. When CI wiped the workspace between runs, containers kept running but
their config files disappeared — causing Docker to auto-create directories
at the missing paths and crash the services on next restart.
Fix: mount /opt/familienarchiv/ into CI job containers via runner-config.yaml,
then copy infra/observability/ and docker-compose.observability.yml there before
docker compose up. Compose runs from the permanent path, so bind mounts resolve
to stable host paths that survive workspace wipes.
Docker Compose reads /opt/familienarchiv/.env automatically (no --env-file flag),
which is managed on the server and persists between CI runs.
Closes#601
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
runner-config.yaml: correct path to /srv/gitea-workspace (VPS, not Synology).
docker-compose.observability.yml: revert 5 bind mounts to plain relative paths;
OBS_CONFIG_DIR variable is no longer needed.
nightly.yml / release.yml: remove OBS_CONFIG_DIR env injection and the
"Sync observability configs to host" step from both workflows.
With workdir_parent=/srv/gitea-workspace and an identical host<->container
bind mount, $(pwd) inside job containers resolves to a real host path the
daemon can find — no privileged container, no overlay2 inspection, no nsenter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DooD runner only shares /var/run/docker.sock — no workspace directory is
mapped to the host daemon. Relative bind mounts in
docker-compose.observability.yml resolved to paths that didn't exist on
the host; Docker auto-created directories in their place, causing
'not a directory' mount failures for all five config files.
Fix:
- docker-compose.observability.yml: replace hardcoded ./infra/observability/
prefix with ${OBS_CONFIG_DIR:-./infra/observability} so the path is
configurable while remaining backwards-compatible for local use.
- nightly.yml / release.yml: add a 'Sync observability configs to host'
step that finds the job container's overlay2 MergedDir (the container's
full filesystem as seen from the host mount namespace), then uses the
existing nsenter/alpine pattern to cp the config tree into a stable host
path (/srv/familienarchiv-{staging,production}/obs-configs).
OBS_CONFIG_DIR is injected into the env file so Compose picks it up.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The endpoint belongs in the compose file (hardcoded to the in-network
Tempo service) rather than per-environment workflow files. This covers
both staging (nightly.yml) and production (release.yml) with a single
change and removes the duplicate from the nightly env-file block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs introduced when the management port was split from the app port:
1. Backend healthcheck hit localhost:8080/actuator/health (app port) —
actuator is on management.server.port=8081, so every probe got a 404
from the main MVC dispatcher, marking the container permanently unhealthy.
Fix: change the probe to localhost:8081.
2. OTEL_EXPORTER_OTLP_ENDPOINT was not set in .env.staging, so the exporter
fell back to http://localhost:4317 (the CI-safe default) instead of
http://tempo:4317 (the in-network Tempo service). Fix: inject the correct
endpoint in the nightly env-file generation step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nightly and release workflows were missing --remove-orphans on the
observability compose up, while the main app deploy step already had it.
Without it, containers removed from docker-compose.observability.yml
linger as unnamed orphans until manually pruned.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds SENTRY_DSN as an optional secret (empty by default) so it can be
set after GlitchTip first-run without requiring another code change.
Backend reads it via application.yaml; empty value keeps Sentry disabled.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prometheus, Loki, Tempo, and Grafana all define healthchecks in
docker-compose.observability.yml. Without --wait, the step exits 0
as soon as containers are created, masking startup failures silently.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docker-compose.prod.yml: add `name: archiv-net` so the network has a
stable Docker name regardless of compose project name (-p flag).
Both staging and production share the same host-level network, which
is correct since the observability stack is a single shared instance.
- nightly.yml / release.yml: add observability env vars (POSTGRES_USER,
PORT_GRAFANA=3003, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090,
GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, GLITCHTIP_DOMAIN) to the
env file, then `docker compose -f docker-compose.observability.yml up -d`
after the app deploy step. PORT_GRAFANA=3003 avoids collision with
staging frontend on 3001.
Requires two new Gitea secrets: GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`ip route` (iproute2) is not installed in the Gitea runner container,
causing the smoke test step to exit 127. /proc/net/route is a kernel
virtual file that is always present on Linux; awk decodes the
little-endian hex gateway field to dotted-decimal without any external
binary dependency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unquoted variable expansion is safe here since the value contains
no spaces or glob characters, but quoting is the correct default
and keeps the script consistent with surrounding style.
Addresses review suggestion by Felix Brandt and Tobias Wendt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If `ip route show default` returns no output the old code passed
an empty string to curl --resolve, producing a confusing error 6
("couldn't resolve host") with no indication that gateway detection
had failed. The new guard exits immediately with a clear message.
Addresses review concern raised by Tobias Wendt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Job containers run in bridge network mode (runner-config.yaml). Inside
a bridge-networked container 127.0.0.1 is the container's own loopback;
Caddy on the host is unreachable there, causing an immediate ECONNREFUSED.
Use the Docker bridge gateway IP instead — the host's docker0 interface
where Caddy (bound on 0.0.0.0:443) is reachable from the container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`sudo systemctl reload caddy` does not work from inside a DooD job
container: `systemctl` is absent from Ubuntu container images and
container processes cannot reach the host systemd without entering its
namespaces. Replace with `docker run --privileged --pid=host ubuntu:22.04
nsenter -t 1 -m -u -n -p -i -- /bin/systemctl reload caddy`, which uses
the already-mounted Docker socket to spin up a privileged sibling
container that enters the host PID namespace via nsenter. Tested live on
the Hetzner VPS. No sudoers entry required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `sudo systemctl reload caddy` step between the docker compose
deploy and the smoke test. This ensures any committed Caddyfile changes
are applied before the public surface is verified.
Previously the workflow had no mechanism to push Caddyfile changes to
the running host daemon. A Caddyfile edit would land in the repo but
Caddy would keep serving the previous config, causing the smoke test to
catch a stale header or still-proxied /actuator route rather than the
intended current config.
This step also surfaces the root cause of today's port-443 failure
explicitly: if Caddy is not running, the step fails with a clear service
error rather than a misleading "Failed to connect to port 443" from curl.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sara flagged that a future "compose cleanup" PR could silently drop the
backend volumes block and CI would happily pass while mass import on
staging silently broke. Adds a pre-build step that renders the staging
compose config and fails the deploy if `target: /import` or
`read_only: true` is missing.
Local verification of the guard:
- Volumes block removed → `grep -q 'target: /import'` exits 1 → step fails
- Volumes block present → both greps match → step passes
Addresses Sara's review on #526.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The compose file now requires IMPORT_HOST_DIR or refuses to start
(#526). Without this line the next nightly deploy would fail with a
clear interpolation error, but it should not fail — the staging
import payload already lives at this host path (rsync'd in #526).
Addresses Tobias's review on #526.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes#508.
Our gitea-runner advertises labels ubuntu-latest / ubuntu-24.04 /
ubuntu-22.04. `runs-on: self-hosted` never matches → dispatched
deploy jobs sit in the queue forever. The runner is still
genuinely self-hosted (DooD socket, joined to gitea_gitea net,
single-tenant per ADR-011) — the `self-hosted` token was just an
unconfirmed assumption about the label name.
Unblocks #497 / #499 first deploy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `if: always()` conditional on the env-file cleanup step in both
deploy workflows is what makes the ADR-011 single-tenant runner trust
model safe: secrets land on disk before each deploy and are wiped
unconditionally afterwards. A future workflow refactor that drops
`if: always()` would silently leave plaintext secrets on the runner
on any failed deploy.
The ADR documents this; the workflow file did not. Adds a prominent
inline comment so the next reader of the YAML sees the constraint
without having to cross-reference ADR-011. No behaviour change — both
workflows still parse. Addresses @nora's round-2 suggestion on PR
#499 — "linchpin of the ADR-011 trust model".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `Permissions-Policy: camera=(), microphone=(), geolocation=()` to
the shared (security_headers) snippet, so both archiv vhosts and the
git vhost deny browser APIs the app does not use. Reduces blast radius
of an XSS landing in a privileged origin.
The deploy smoke steps in nightly.yml and release.yml gain a matching
assertion against the canonical header value, so a future Caddyfile
edit that drops or loosens the header (e.g. `camera=(self)`) fails the
deploy instead of regressing silently.
`caddy validate` against caddy:2 passes; both workflow YAMLs parse.
Addresses @nora's round-2 suggestion on PR #499 — "lower-impact than
CSP but nearly free".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the presence-only `grep -qi strict-transport-security` smoke
assertion in both nightly.yml and release.yml with a value-pinning
regex that requires `max-age=31536000`, `includeSubDomains`, and
`preload`. A future Caddyfile edit that drops any of those three
parts now fails the deploy smoke step instead of passing silently.
Verified locally that the new pattern matches the preload-eligible
value and rejects three degraded forms (short max-age, missing
includeSubDomains, missing preload). Addresses @sara's round-2 note
on PR #499 — "presence check, not value check".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke step previously curled the public hostname unconditionally,
which routes the runner's request via DNS → router → back into the same
host. Many SOHO routers do not implement hairpin NAT (or do so only after
a firmware update), so the deploy may pass on day one and silently fail
on day 90.
--resolve "<host>:443:127.0.0.1" pins the hostname to the runner's
loopback while keeping SNI on the public name (so the cert validates
correctly and the Caddy vhost block matches). The smoke test now
verifies that the Caddy-on-the-same-host is serving the right
hostname end-to-end, with no router dependency.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without --pull, the host's Docker layer cache wins: if a CVE drops in
node:20.19.0-alpine3.21 / postgres:16-alpine and the vendor re-publishes
the same tag, the runner keeps serving the cached layer until the cache
is manually cleared — a silent supply-chain blind spot.
Adding --pull to both `compose build` invocations costs a single
re-pull per run and lifts the base-image patch lag from "next host
prune" to "next nightly".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two deploy workflows make two non-obvious assumptions that future
maintainers should not have to rediscover by reading the diff:
1. Single-tenant self-hosted runner — the .env.* file lands on disk
during the deploy and is cleaned up unconditionally. Multi-tenant
usage would require switching to stdin-piped env input.
2. Host docker layer cache is authoritative — there is no
actions/cache directive; a host-level `docker system prune` will
cold-start the next build.
Both notes added as block comments at the top of each workflow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Healthchecks prove containers are healthy on the docker network; they
do not prove the public URL is reachable, HSTS still fires, or
/actuator is still blocked at the edge. Add a post-deploy smoke step
to nightly.yml that:
1. GETs https://staging.raddatz.cloud/login (frontend reachable)
2. asserts the response includes the Strict-Transport-Security header
3. asserts /actuator/health returns 404 (defense-in-depth verified)
Failure aborts the workflow before the env-file cleanup step. The
cleanup step still runs because it is `if: always()`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Runs daily at 02:00 (and on workflow_dispatch). Builds the prod
compose stack with BuildKit, writes a transient .env.staging from
Gitea secrets, then `docker compose up -d --wait` so the job fails
loudly if any service's healthcheck never reports healthy.
The --profile staging flag starts the mailpit catcher in place of
a real SMTP relay; no production SMTP credentials touch the staging
environment.
The .env.staging file is cleaned up in `if: always()` to avoid
leaving secrets in the runner workspace between runs.
Refs #497.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>