Compare commits

..

7 Commits

Author SHA1 Message Date
Marcel
79735e23e0 ci(obs): assert obs-loki/prometheus/grafana/tempo are healthy after stack up
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 2m58s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 2m36s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:48 +02:00
Marcel
df37113d38 ci(obs): add compose config dry-run before obs stack up to catch .env substitution errors
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:17 +02:00
Marcel
c7d2eeb3f0 docs(ci): harden runner-config.yaml security comment for /opt/familienarchiv/ write access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:00:44 +02:00
Marcel
4e94d85d7e docs(adr): add ADR-016 for obs stack co-location and CI-push config sync
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:00:07 +02:00
Marcel
dec6b8139b docs(c4): update l2-containers obs boundary to show /opt/familienarchiv/ permanent path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:59:11 +02:00
Marcel
7b7d0c92a8 docs(obs): update DEPLOYMENT.md with /opt/familienarchiv/ ops section, env keys, runner restart
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:58:42 +02:00
Marcel
448c3cdcdb docs(obs): update .env.example for PORT_GRAFANA 3003, POSTGRES_HOST, $$ escaping
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:57:31 +02:00
6 changed files with 152 additions and 14 deletions

View File

@@ -29,16 +29,17 @@ OCR_TRAINING_TOKEN=change-me-in-production
# --- Observability ---
# Optional stack — start with: docker compose -f docker-compose.observability.yml up -d
# Requires the main stack to already be running (docker compose up -d creates archiv-net).
# In production the stack is managed from /opt/familienarchiv/ (see docs/DEPLOYMENT.md §4).
# Ports for host access
PORT_GRAFANA=3001
PORT_GRAFANA=3003
PORT_GLITCHTIP=3002
PORT_PROMETHEUS=9090
# Grafana admin password — change this before exposing Grafana beyond localhost
GRAFANA_ADMIN_PASSWORD=changeme
# GlitchTip domain — production: use https://grafana.raddatz.cloud (must match Caddy vhost)
# GlitchTip domain — production: use https://glitchtip.archiv.raddatz.cloud (must match Caddy vhost)
GLITCHTIP_DOMAIN=http://localhost:3002
# GlitchTip secret key — Django SECRET_KEY equivalent, used to sign sessions and tokens.
@@ -47,6 +48,15 @@ GLITCHTIP_DOMAIN=http://localhost:3002
# Generate with: python3 -c "import secrets; print(secrets.token_hex(50))"
GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret
# PostgreSQL hostname for GlitchTip's db-init job and workers.
# Override when only the staging stack is running (container name differs from archive-db).
# Default (archive-db) is correct for production with the full stack up.
POSTGRES_HOST=archive-db
# $$ escaping note: passwords in /opt/familienarchiv/.env that contain a literal '$' must
# use '$$' so Docker Compose does not expand them as variable references.
# Example: a password 'p@$$word' should be written as 'p@$$$$word' in the .env file.
# Error reporting DSNs — leave empty to disable the SDK (safe default).
# SENTRY_DSN: backend (Spring Boot) — used by the GlitchTip/Sentry Java SDK
SENTRY_DSN=

View File

@@ -143,6 +143,16 @@ jobs:
cp -r infra/observability /opt/familienarchiv/infra/
cp docker-compose.observability.yml /opt/familienarchiv/
- name: Validate observability compose config
# Dry-run: resolves all variable substitutions from /opt/familienarchiv/.env
# and reports any missing required keys before containers start. Catches
# truncated passwords (missing $$ escaping), undefined variables, and YAML
# errors in config files updated by the previous step.
run: |
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
config --quiet
- name: Start observability stack
# Runs from /opt/familienarchiv/ so bind mounts resolve to stable
# host paths that survive workspace wipes between nightly runs.
@@ -152,6 +162,25 @@ jobs:
-f /opt/familienarchiv/docker-compose.observability.yml \
up -d --wait --remove-orphans
- name: Assert observability stack health
# docker compose up --wait covers services WITH healthcheck directives only.
# obs-promtail, obs-cadvisor, obs-node-exporter, and obs-glitchtip-worker have
# no healthcheck — they are considered "started" as soon as the process runs.
# This step explicitly asserts the four healthchecked critical services are
# healthy before the smoke test proceeds.
run: |
set -e
unhealthy=""
for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do
status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")
if [ "$status" != "healthy" ]; then
echo "::error::$svc is not healthy (status: $status)"
unhealthy="$unhealthy $svc"
fi
done
[ -z "$unhealthy" ] || exit 1
echo "All critical observability services are healthy"
- name: Reload Caddy
# Apply any committed Caddyfile changes before smoke-testing the
# public surface. Without this step, a Caddyfile edit lands in the

View File

@@ -43,7 +43,7 @@ graph TD
- SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
- The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
- Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
- An optional observability stack (Prometheus, Node Exporter, cAdvisor) runs as a separate compose file: `docker compose -f docker-compose.observability.yml up -d`. It joins `archiv-net` and scrapes the backend's management port (`:8081`). Configuration lives under `infra/observability/`.
- An optional observability stack (Prometheus, Node Exporter, cAdvisor, Loki, Tempo, Grafana, GlitchTip) runs as a separate compose file. Configuration lives under `infra/observability/`. In production and CI, the stack is managed from `/opt/familienarchiv/` (CI copies it there on every nightly run) so bind mounts survive workspace wipes — see §4 for the ops procedure.
### OCR memory requirements
@@ -142,7 +142,8 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `PORT_PROMETHEUS` | Host port for the Prometheus UI (bound to `127.0.0.1` only) | `9090` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3001` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3003` | — | — |
| `POSTGRES_HOST` | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and `archive-db` is not resolvable by that name. | `archive-db` | — | — |
| `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
| `PORT_GLITCHTIP` | Host port for the GlitchTip UI (bound to `127.0.0.1` only) | `3002` | — | — |
| `GLITCHTIP_DOMAIN` | Public-facing base URL for GlitchTip (used in email links and CORS) | `http://localhost:3002` | YES (prod) | — |
@@ -202,6 +203,18 @@ mkdir -p /srv/gitea-workspace
# volumes:
# - /srv/gitea-workspace:/srv/gitea-workspace
# See runner-config.yaml (workdir_parent + valid_volumes + options) and ADR-015.
# Observability config permanent directory — the nightly CI job copies
# docker-compose.observability.yml and infra/observability/ here on every run.
# The obs stack is always started from this path, not from the workspace.
# See ADR-016 for why this directory is used instead of a server-pull approach.
mkdir -p /opt/familienarchiv/infra
# ⚠ IMPORTANT: after any change to runner-config.yaml (valid_volumes, options, workdir_parent),
# restart the Gitea Act runner on the host for the new config to take effect:
# systemctl restart gitea-runner
# Until restarted, job containers are spawned with the old config and any new bind mounts
# (e.g. /opt/familienarchiv) will not be available inside job steps.
```
### 3.2 DNS records
@@ -284,13 +297,43 @@ docker compose logs --tail=200 <service>
### Observability stack
An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`. Start it after the main stack is up (which creates `archiv-net`):
An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`.
#### Dev — start from the workspace
```bash
docker compose up -d # creates archiv-net
docker compose -f docker-compose.observability.yml up -d
```
#### Production — managed from `/opt/familienarchiv/`
The nightly CI job copies `docker-compose.observability.yml` and `infra/observability/` to `/opt/familienarchiv/` on every run, then starts the stack from there. Bind mounts in the compose file resolve to `/opt/familienarchiv/infra/observability/…` on the host, which survives workspace wipes between CI runs (see ADR-016).
The obs stack reads secrets from `/opt/familienarchiv/.env` (Docker Compose auto-reads this file when launched from that directory). This file is managed by the operator — CI does **not** write or delete it.
**Required keys in `/opt/familienarchiv/.env`:**
| Key | Example / notes |
|---|---|
| `GRAFANA_ADMIN_PASSWORD` | Strong unique password |
| `GLITCHTIP_SECRET_KEY` | `python3 -c "import secrets; print(secrets.token_hex(32))"` |
| `GLITCHTIP_DOMAIN` | `https://glitchtip.archiv.raddatz.cloud` — must match the Caddy vhost |
| `POSTGRES_USER` | Must match the `archiv` user set in `.env.staging` / `.env.production` |
| `POSTGRES_PASSWORD` | Must match the running PostgreSQL container's password |
| `PORT_GRAFANA` | `3003` (staging default; 3001 was used by staging frontend) |
| `PORT_GLITCHTIP` | `3002` |
| `PORT_PROMETHEUS` | `9090` |
| `SENTRY_DSN` | Set after GlitchTip first-run; leave empty to disable |
**`$$` escaping rule:** passwords that contain a literal `$` must use `$$` in this file so Docker Compose does not expand them as variable references. Example: a password `p@$word` must be written as `p@$$word`. Failure to escape produces a silently truncated password — Grafana or GlitchTip will start but reject logins.
To start or restart the obs stack manually on the server:
```bash
docker compose -f /opt/familienarchiv/docker-compose.observability.yml up -d --wait --remove-orphans
```
Current services:
| Service | Image | Purpose |
@@ -311,7 +354,7 @@ Current services:
| Item | Value |
|---|---|
| URL | `http://localhost:3001` (or `http://localhost:$PORT_GRAFANA`) |
| URL | `http://localhost:3003` (or `http://localhost:$PORT_GRAFANA`) |
| Username | `admin` |
| Password | `$GRAFANA_ADMIN_PASSWORD` (default: `changeme`**change before exposing to a network**) |
@@ -341,7 +384,7 @@ docker exec obs-loki wget -qO- \
**Prefer `compose_service` over `container_name` in LogQL queries**`container_name` differs between dev (`archive-backend`) and prod (`archiv-production-backend-1`), while `compose_service` is stable (`backend`, `db`, `minio`, etc.).
Prometheus port `9090` and Grafana port `3001` are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
#### GlitchTip

View File

@@ -0,0 +1,52 @@
# ADR-016: Observability stack co-location at `/opt/familienarchiv/` with CI-push config sync
## Status
Accepted
## Context
Issue #601 established that the observability stack must survive Gitea CI workspace wipes between nightly runs. When the nightly job completes, act_runner deletes the job workspace. Any Docker container that bind-mounts a config file from a workspace path (`/srv/gitea-workspace/…/infra/observability/prometheus/prometheus.yml`) then references a path that no longer exists on the host. On the next nightly run, Docker Compose either auto-creates an empty directory in its place (causing the container to fail to start because a file mount receives a directory) or finds a stale file from a previous run if the workspace happened to land at the same path.
ADR-015 solved the workspace bind-mount resolution problem: job workspaces are stored at `/srv/gitea-workspace` so `$(pwd)` inside the job container maps to a real host path. But it did not address persistence: the workspace is still wiped after the job, so bind mounts from workspace-relative paths remain fragile across runs.
### Decision drivers
1. Bind-mount sources must point to a host path that persists indefinitely, not to a path that disappears after each CI run.
2. Config files must reflect the committed state of the repo after every nightly run (no manual sync steps).
3. Secrets must not be written to the workspace or to any path managed by CI; they must survive independently of deployments.
4. The solution must not introduce new infrastructure dependencies (no SSH access from CI, no external registry, no additional server-side daemon).
### Alternatives considered
**A: Server-pull model** — a systemd timer or cron job on the server does `git pull` from the repo into `/opt/familienarchiv/` and then runs `docker compose up`. Rejected because: (1) requires git credentials on the server and a registered deploy key, (2) adds a second deployment mechanism that diverges from the CI-push model used for the main app stack, (3) timing coupling — the server pull must complete before CI's health checks run, requiring polling or a webhook.
**B: Separate directory (e.g. `/opt/obs/`)** — keeps obs configs isolated from the app stack. Rejected because: (1) the main app compose files are already in `/opt/familienarchiv/` (managed the same way), and (2) GlitchTip shares the `archive-db` PostgreSQL instance and `archiv-net` Docker network — it is architecturally part of the same deployment unit, not a separate one. Co-location reflects the actual coupling.
**C: Named Docker configs (Swarm)** — Docker Swarm supports first-class config objects that persist in the cluster. Rejected because the project does not use Swarm and introducing it solely for config persistence is a disproportionate dependency.
## Decision
The observability stack is co-located with the main application deployment at `/opt/familienarchiv/`:
- `docker-compose.observability.yml``/opt/familienarchiv/docker-compose.observability.yml`
- `infra/observability/``/opt/familienarchiv/infra/observability/`
The nightly CI job (`nightly.yml`) copies these files from the workspace checkout to `/opt/familienarchiv/` using `cp -r` on every run (CI-push model). Containers always read config from the permanent location; a workspace wipe has no effect on running containers.
Secrets are stored in `/opt/familienarchiv/.env` on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from `/opt/familienarchiv/`. The required key inventory is documented in `docs/DEPLOYMENT.md §4`.
The CI runner mounts `/opt/familienarchiv` as a bind mount into job containers (see `runner-config.yaml`). This requires a one-time `mkdir -p /opt/familienarchiv/infra` on the server and a runner restart after updating `runner-config.yaml` (see ADR-015 and `docs/DEPLOYMENT.md §3.1`).
## Consequences
**Positive:**
- Bind-mount sources survive workspace wipes by definition — they are on a persistent host path.
- Config is always in sync with the repo after each nightly run.
- No new infrastructure dependencies; the CI-push model mirrors how the main app stack is deployed.
- Secrets (`/opt/familienarchiv/.env`) are decoupled from CI — a deployment cannot accidentally overwrite them.
**Negative:**
- `cp -r` does not remove deleted files; a config file removed from the repo persists in `/opt/familienarchiv/infra/observability/` until manually deleted. Acceptable for this project's change frequency. A `rsync -a --delete` would give a clean mirror if this becomes a problem.
- Mounting `/opt/familienarchiv/` into CI job containers expands the blast radius of a compromised workflow step — a malicious step could overwrite app compose files and Caddy config. Acceptable because the runner is single-tenant (trusted code only). See `runner-config.yaml` security comment.
- Runner must be restarted (`systemctl restart gitea-runner`) after any change to `runner-config.yaml` for the new mount to take effect.

View File

@@ -17,7 +17,7 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
Container(mc, "Bucket / Service-Account Init", "MinIO Client (mc)", "One-shot container on startup. Idempotent: creates the archive bucket, the archiv-app service account, and attaches the readwrite policy.")
}
System_Boundary(observability, "Observability Stack (docker-compose.observability.yml)") {
System_Boundary(observability, "Observability Stack (/opt/familienarchiv/docker-compose.observability.yml)") {
Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor. Retention: 30 days.")
Container(node_exporter, "Node Exporter", "prom/node-exporter:v1.9.0", "Host-level CPU, memory, disk, and network metrics.")
Container(cadvisor, "cAdvisor", "gcr.io/cadvisor/cadvisor:v0.52.1", "Per-container resource metrics.")

View File

@@ -17,12 +17,16 @@ container:
- "/srv/gitea-workspace"
- "/opt/familienarchiv"
# appended to `docker run` when the runner spawns a job container
# SECURITY: Mounting the Docker socket grants job containers root-equivalent
# access to the host Docker daemon. Acceptable here because only trusted code
# from this private repo runs on this runner. Do NOT use on a runner that
# accepts untrusted PRs from external contributors.
# /opt/familienarchiv is mounted so the nightly job can deploy observability
# configs to the permanent location without needing ssh or nsenter.
# SECURITY WARNING: This mount configuration grants CI job containers:
# 1. Root-equivalent access to the host Docker daemon (via the socket).
# 2. Read/write access to /opt/familienarchiv/ — including the main app's
# compose files, Caddy config, and observability configs. A malicious
# workflow step could overwrite any file in that directory.
# Both are acceptable ONLY because this runner is single-tenant: it executes
# code exclusively from this private repo with a fixed set of trusted authors.
# WARNING: Do NOT add this runner to any repo with external contributors or
# untrusted PRs — the blast radius includes the entire production deployment.
# See ADR-016 for the reasoning behind the /opt/familienarchiv mount.
options: "-v /var/run/docker.sock:/var/run/docker.sock -v /srv/gitea-workspace:/srv/gitea-workspace -v /opt/familienarchiv:/opt/familienarchiv"
# keep network mode default (bridge) — Testcontainers handles its own networking
force_pull: false