Compare commits
7 Commits
7e52494880...79735e23e0
| Author | SHA1 | Date | |
|---|---|---|---|
| | 79735e23e0 | | |
| | df37113d38 | | |
| | c7d2eeb3f0 | | |
| | 4e94d85d7e | | |
| | dec6b8139b | | |
| | 7b7d0c92a8 | | |
| | 448c3cdcdb | | |
14
.env.example
@@ -29,16 +29,17 @@ OCR_TRAINING_TOKEN=change-me-in-production
 # --- Observability ---
 # Optional stack — start with: docker compose -f docker-compose.observability.yml up -d
 # Requires the main stack to already be running (docker compose up -d creates archiv-net).
+# In production the stack is managed from /opt/familienarchiv/ (see docs/DEPLOYMENT.md §4).
 
 # Ports for host access
-PORT_GRAFANA=3001
+PORT_GRAFANA=3003
 PORT_GLITCHTIP=3002
 PORT_PROMETHEUS=9090
 
 # Grafana admin password — change this before exposing Grafana beyond localhost
 GRAFANA_ADMIN_PASSWORD=changeme
 
-# GlitchTip domain — production: use https://grafana.raddatz.cloud (must match Caddy vhost)
+# GlitchTip domain — production: use https://glitchtip.archiv.raddatz.cloud (must match Caddy vhost)
 GLITCHTIP_DOMAIN=http://localhost:3002
 
 # GlitchTip secret key — Django SECRET_KEY equivalent, used to sign sessions and tokens.
@@ -47,6 +48,15 @@ GLITCHTIP_DOMAIN=http://localhost:3002
 # Generate with: python3 -c "import secrets; print(secrets.token_hex(50))"
 GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret
 
+# PostgreSQL hostname for GlitchTip's db-init job and workers.
+# Override when only the staging stack is running (container name differs from archive-db).
+# Default (archive-db) is correct for production with the full stack up.
+POSTGRES_HOST=archive-db
+
+# $$ escaping note: passwords in /opt/familienarchiv/.env that contain a literal '$' must
+# use '$$' so Docker Compose does not expand them as variable references.
+# Example: a password 'p@$$word' should be written as 'p@$$$$word' in the .env file.
+
 # Error reporting DSNs — leave empty to disable the SDK (safe default).
 # SENTRY_DSN: backend (Spring Boot) — used by the GlitchTip/Sentry Java SDK
 SENTRY_DSN=
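The `$$` escaping note in `.env.example` is easy to get wrong by hand. A minimal sketch of a helper that applies the rule exactly as the note states it (double every literal `$`) could look like the following. The function name `escape_compose_dollars` is hypothetical and not part of the repository.

```bash
#!/bin/sh
# Hypothetical helper (not part of the repository): double every literal '$'
# in a value, following the escaping rule described in .env.example.
escape_compose_dollars() {
  # In the sed pattern, \$ matches a literal dollar sign; in the replacement,
  # $$ is emitted as two literal dollar signs.
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}

# Single quotes keep the shell itself from expanding the '$'.
escape_compose_dollars 'p@$word'    # prints p@$$word
```

The output is meant to be pasted as the value of a key in `/opt/familienarchiv/.env`. Pass the password in single quotes, otherwise the shell expands `$word` before the helper ever sees it.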
@@ -143,6 +143,16 @@ jobs:
           cp -r infra/observability /opt/familienarchiv/infra/
           cp docker-compose.observability.yml /opt/familienarchiv/
 
+      - name: Validate observability compose config
+        # Dry-run: resolves all variable substitutions from /opt/familienarchiv/.env
+        # and reports any missing required keys before containers start. Catches
+        # truncated passwords (missing $$ escaping), undefined variables, and YAML
+        # errors in config files updated by the previous step.
+        run: |
+          docker compose \
+            -f /opt/familienarchiv/docker-compose.observability.yml \
+            config --quiet
+
       - name: Start observability stack
         # Runs from /opt/familienarchiv/ so bind mounts resolve to stable
         # host paths that survive workspace wipes between nightly runs.
@@ -152,6 +162,25 @@ jobs:
             -f /opt/familienarchiv/docker-compose.observability.yml \
             up -d --wait --remove-orphans
 
+      - name: Assert observability stack health
+        # docker compose up --wait covers services WITH healthcheck directives only.
+        # obs-promtail, obs-cadvisor, obs-node-exporter, and obs-glitchtip-worker have
+        # no healthcheck — they are considered "started" as soon as the process runs.
+        # This step explicitly asserts the four healthchecked critical services are
+        # healthy before the smoke test proceeds.
+        run: |
+          set -e
+          unhealthy=""
+          for svc in obs-loki obs-prometheus obs-grafana obs-tempo; do
+            status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")
+            if [ "$status" != "healthy" ]; then
+              echo "::error::$svc is not healthy (status: $status)"
+              unhealthy="$unhealthy $svc"
+            fi
+          done
+          [ -z "$unhealthy" ] || exit 1
+          echo "All critical observability services are healthy"
+
       - name: Reload Caddy
         # Apply any committed Caddyfile changes before smoke-testing the
         # public surface. Without this step, a Caddyfile edit lands in the
@@ -43,7 +43,7 @@ graph TD
 - SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
 - The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
 - Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
-- An optional observability stack (Prometheus, Node Exporter, cAdvisor) runs as a separate compose file: `docker compose -f docker-compose.observability.yml up -d`. It joins `archiv-net` and scrapes the backend's management port (`:8081`). Configuration lives under `infra/observability/`.
+- An optional observability stack (Prometheus, Node Exporter, cAdvisor, Loki, Tempo, Grafana, GlitchTip) runs as a separate compose file. Configuration lives under `infra/observability/`. In production and CI, the stack is managed from `/opt/familienarchiv/` (CI copies it there on every nightly run) so bind mounts survive workspace wipes — see §4 for the ops procedure.
 
 ### OCR memory requirements
 
@@ -142,7 +142,8 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | Variable | Purpose | Default | Required? | Sensitive? |
 |---|---|---|---|---|
 | `PORT_PROMETHEUS` | Host port for the Prometheus UI (bound to `127.0.0.1` only) | `9090` | — | — |
-| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3001` | — | — |
+| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3003` | — | — |
+| `POSTGRES_HOST` | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and `archive-db` is not resolvable by that name. | `archive-db` | — | — |
 | `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
 | `PORT_GLITCHTIP` | Host port for the GlitchTip UI (bound to `127.0.0.1` only) | `3002` | — | — |
 | `GLITCHTIP_DOMAIN` | Public-facing base URL for GlitchTip (used in email links and CORS) | `http://localhost:3002` | YES (prod) | — |
@@ -202,6 +203,18 @@ mkdir -p /srv/gitea-workspace
 # volumes:
 # - /srv/gitea-workspace:/srv/gitea-workspace
 # See runner-config.yaml (workdir_parent + valid_volumes + options) and ADR-015.
+
+# Observability config permanent directory — the nightly CI job copies
+# docker-compose.observability.yml and infra/observability/ here on every run.
+# The obs stack is always started from this path, not from the workspace.
+# See ADR-016 for why this directory is used instead of a server-pull approach.
+mkdir -p /opt/familienarchiv/infra
+
+# ⚠ IMPORTANT: after any change to runner-config.yaml (valid_volumes, options, workdir_parent),
+# restart the Gitea Act runner on the host for the new config to take effect:
+# systemctl restart gitea-runner
+# Until restarted, job containers are spawned with the old config and any new bind mounts
+# (e.g. /opt/familienarchiv) will not be available inside job steps.
 ```
 
 ### 3.2 DNS records
@@ -284,13 +297,43 @@ docker compose logs --tail=200 <service>
 
 ### Observability stack
 
-An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`. Start it after the main stack is up (which creates `archiv-net`):
+An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`.
+
+#### Dev — start from the workspace
 
 ```bash
 docker compose up -d # creates archiv-net
 docker compose -f docker-compose.observability.yml up -d
 ```
 
+#### Production — managed from `/opt/familienarchiv/`
+
+The nightly CI job copies `docker-compose.observability.yml` and `infra/observability/` to `/opt/familienarchiv/` on every run, then starts the stack from there. Bind mounts in the compose file resolve to `/opt/familienarchiv/infra/observability/…` on the host, which survives workspace wipes between CI runs (see ADR-016).
+
+The obs stack reads secrets from `/opt/familienarchiv/.env` (Docker Compose auto-reads this file when launched from that directory). This file is managed by the operator — CI does **not** write or delete it.
+
+**Required keys in `/opt/familienarchiv/.env`:**
+
+| Key | Example / notes |
+|---|---|
+| `GRAFANA_ADMIN_PASSWORD` | Strong unique password |
+| `GLITCHTIP_SECRET_KEY` | `python3 -c "import secrets; print(secrets.token_hex(32))"` |
+| `GLITCHTIP_DOMAIN` | `https://glitchtip.archiv.raddatz.cloud` — must match the Caddy vhost |
+| `POSTGRES_USER` | Must match the `archiv` user set in `.env.staging` / `.env.production` |
+| `POSTGRES_PASSWORD` | Must match the running PostgreSQL container's password |
+| `PORT_GRAFANA` | `3003` (staging default; 3001 was used by staging frontend) |
+| `PORT_GLITCHTIP` | `3002` |
+| `PORT_PROMETHEUS` | `9090` |
+| `SENTRY_DSN` | Set after GlitchTip first-run; leave empty to disable |
+
+**`$$` escaping rule:** passwords that contain a literal `$` must use `$$` in this file so Docker Compose does not expand them as variable references. Example: a password `p@$word` must be written as `p@$$word`. Failure to escape produces a silently truncated password — Grafana or GlitchTip will start but reject logins.
+
+To start or restart the obs stack manually on the server:
+
+```bash
+docker compose -f /opt/familienarchiv/docker-compose.observability.yml up -d --wait --remove-orphans
+```
+
 Current services:
 
 | Service | Image | Purpose |
@@ -311,7 +354,7 @@ Current services:
 
 | Item | Value |
 |---|---|
-| URL | `http://localhost:3001` (or `http://localhost:$PORT_GRAFANA`) |
+| URL | `http://localhost:3003` (or `http://localhost:$PORT_GRAFANA`) |
 | Username | `admin` |
 | Password | `$GRAFANA_ADMIN_PASSWORD` (default: `changeme` — **change before exposing to a network**) |
 
@@ -341,7 +384,7 @@ docker exec obs-loki wget -qO- \
 
 **Prefer `compose_service` over `container_name` in LogQL queries** — `container_name` differs between dev (`archive-backend`) and prod (`archiv-production-backend-1`), while `compose_service` is stable (`backend`, `db`, `minio`, etc.).
 
-Prometheus port `9090` and Grafana port `3001` are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
+Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
 
 #### GlitchTip
 
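The required-keys table for `/opt/familienarchiv/.env` above lends itself to a small pre-flight check on the server. The following is only a sketch: `missing_env_keys` is a hypothetical helper, not something the repository ships, and it complements rather than replaces the `docker compose config --quiet` validation step in the nightly job.

```bash
#!/bin/sh
# Hypothetical helper (not part of the repository): print every key that is
# absent from, or set to an empty value in, the given env file.
missing_env_keys() {
  env_file=$1
  shift
  for key in "$@"; do
    # A key counts as present only if a line starts with KEY= followed by
    # at least one character.
    grep -Eq "^${key}=.+" "$env_file" || printf '%s\n' "$key"
  done
}

# Example call on the server (SENTRY_DSN is left out because the table
# allows it to stay empty):
#   missing_env_keys /opt/familienarchiv/.env \
#     GRAFANA_ADMIN_PASSWORD GLITCHTIP_SECRET_KEY GLITCHTIP_DOMAIN \
#     POSTGRES_USER POSTGRES_PASSWORD PORT_GRAFANA PORT_GLITCHTIP PORT_PROMETHEUS
```

An empty output means every listed key has a value. The check only looks at plain `KEY=value` lines, so it does not validate the values themselves (for example, an unescaped `$` would still pass).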
52
docs/adr/016-obs-stack-co-location-ci-push.md
Normal file
@@ -0,0 +1,52 @@
+# ADR-016: Observability stack co-location at `/opt/familienarchiv/` with CI-push config sync
+
+## Status
+
+Accepted
+
+## Context
+
+Issue #601 established that the observability stack must survive Gitea CI workspace wipes between nightly runs. When the nightly job completes, act_runner deletes the job workspace. Any Docker container that bind-mounts a config file from a workspace path (`/srv/gitea-workspace/…/infra/observability/prometheus/prometheus.yml`) then references a path that no longer exists on the host. On the next nightly run, Docker Compose either auto-creates an empty directory in its place (causing the container to fail to start because a file mount receives a directory) or finds a stale file from a previous run if the workspace happened to land at the same path.
+
+ADR-015 solved the workspace bind-mount resolution problem: job workspaces are stored at `/srv/gitea-workspace` so `$(pwd)` inside the job container maps to a real host path. But it did not address persistence: the workspace is still wiped after the job, so bind mounts from workspace-relative paths remain fragile across runs.
+
+### Decision drivers
+
+1. Bind-mount sources must point to a host path that persists indefinitely, not to a path that disappears after each CI run.
+2. Config files must reflect the committed state of the repo after every nightly run (no manual sync steps).
+3. Secrets must not be written to the workspace or to any path managed by CI; they must survive independently of deployments.
+4. The solution must not introduce new infrastructure dependencies (no SSH access from CI, no external registry, no additional server-side daemon).
+
+### Alternatives considered
+
+**A: Server-pull model** — a systemd timer or cron job on the server does `git pull` from the repo into `/opt/familienarchiv/` and then runs `docker compose up`. Rejected because: (1) requires git credentials on the server and a registered deploy key, (2) adds a second deployment mechanism that diverges from the CI-push model used for the main app stack, (3) timing coupling — the server pull must complete before CI's health checks run, requiring polling or a webhook.
+
+**B: Separate directory (e.g. `/opt/obs/`)** — keeps obs configs isolated from the app stack. Rejected because: (1) the main app compose files are already in `/opt/familienarchiv/` (managed the same way), and (2) GlitchTip shares the `archive-db` PostgreSQL instance and `archiv-net` Docker network — it is architecturally part of the same deployment unit, not a separate one. Co-location reflects the actual coupling.
+
+**C: Named Docker configs (Swarm)** — Docker Swarm supports first-class config objects that persist in the cluster. Rejected because the project does not use Swarm and introducing it solely for config persistence is a disproportionate dependency.
+
+## Decision
+
+The observability stack is co-located with the main application deployment at `/opt/familienarchiv/`:
+
+- `docker-compose.observability.yml` → `/opt/familienarchiv/docker-compose.observability.yml`
+- `infra/observability/` → `/opt/familienarchiv/infra/observability/`
+
+The nightly CI job (`nightly.yml`) copies these files from the workspace checkout to `/opt/familienarchiv/` using `cp -r` on every run (CI-push model). Containers always read config from the permanent location; a workspace wipe has no effect on running containers.
+
+Secrets are stored in `/opt/familienarchiv/.env` on the server. This file is managed by the operator — CI does not write or delete it. Docker Compose auto-reads it when started from `/opt/familienarchiv/`. The required key inventory is documented in `docs/DEPLOYMENT.md §4`.
+
+The CI runner mounts `/opt/familienarchiv` as a bind mount into job containers (see `runner-config.yaml`). This requires a one-time `mkdir -p /opt/familienarchiv/infra` on the server and a runner restart after updating `runner-config.yaml` (see ADR-015 and `docs/DEPLOYMENT.md §3.1`).
+
+## Consequences
+
+**Positive:**
+- Bind-mount sources survive workspace wipes by definition — they are on a persistent host path.
+- Config is always in sync with the repo after each nightly run.
+- No new infrastructure dependencies; the CI-push model mirrors how the main app stack is deployed.
+- Secrets (`/opt/familienarchiv/.env`) are decoupled from CI — a deployment cannot accidentally overwrite them.
+
+**Negative:**
+- `cp -r` does not remove deleted files; a config file removed from the repo persists in `/opt/familienarchiv/infra/observability/` until manually deleted. Acceptable for this project's change frequency. A `rsync -a --delete` would give a clean mirror if this becomes a problem.
+- Mounting `/opt/familienarchiv/` into CI job containers expands the blast radius of a compromised workflow step — a malicious step could overwrite app compose files and Caddy config. Acceptable because the runner is single-tenant (trusted code only). See `runner-config.yaml` security comment.
+- Runner must be restarted (`systemctl restart gitea-runner`) after any change to `runner-config.yaml` for the new mount to take effect.
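The first negative consequence in ADR-016 (files deleted from the repo linger after `cp -r`) can be illustrated with a small sketch. Where `rsync` is installed, the ADR's own suggestion, `rsync -a --delete`, is the simpler route. The version below is an alternative that uses only `cp` and `find`, for hosts or job containers without `rsync`. The function name `mirror_dir` is hypothetical and not part of the repository.

```bash
#!/bin/sh
# Hypothetical helper (not part of the repository): copy the contents of src
# into dest like `cp -r`, then delete regular files under dest that no longer
# exist under src.
mirror_dir() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  # "$src"/. copies the directory contents, including dotfiles.
  cp -r "$src"/. "$dest"/
  # List files relative to dest; the loop runs in the caller's working
  # directory, so a relative src path still resolves correctly.
  ( cd "$dest" && find . -type f ) | while IFS= read -r rel; do
    [ -e "$src/$rel" ] || rm -f "$dest/$rel"
  done
}

# Example call from the workspace checkout:
#   mirror_dir infra/observability /opt/familienarchiv/infra/observability
```

Two caveats apply to this sketch. It should only ever target the `infra/observability/` subdirectory: pointing it (or `rsync --delete`) at `/opt/familienarchiv/` itself would remove the operator-managed `.env`, which has no counterpart in the repo. It also leaves empty directories behind and does not handle file names containing newlines.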
@@ -17,7 +17,7 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
   Container(mc, "Bucket / Service-Account Init", "MinIO Client (mc)", "One-shot container on startup. Idempotent: creates the archive bucket, the archiv-app service account, and attaches the readwrite policy.")
 }
 
-System_Boundary(observability, "Observability Stack (docker-compose.observability.yml)") {
+System_Boundary(observability, "Observability Stack (/opt/familienarchiv/docker-compose.observability.yml)") {
   Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor. Retention: 30 days.")
   Container(node_exporter, "Node Exporter", "prom/node-exporter:v1.9.0", "Host-level CPU, memory, disk, and network metrics.")
   Container(cadvisor, "cAdvisor", "gcr.io/cadvisor/cadvisor:v0.52.1", "Per-container resource metrics.")
@@ -17,12 +17,16 @@ container:
     - "/srv/gitea-workspace"
     - "/opt/familienarchiv"
   # appended to `docker run` when the runner spawns a job container
-  # SECURITY: Mounting the Docker socket grants job containers root-equivalent
-  # access to the host Docker daemon. Acceptable here because only trusted code
-  # from this private repo runs on this runner. Do NOT use on a runner that
-  # accepts untrusted PRs from external contributors.
-  # /opt/familienarchiv is mounted so the nightly job can deploy observability
-  # configs to the permanent location without needing ssh or nsenter.
+  # SECURITY WARNING: This mount configuration grants CI job containers:
+  # 1. Root-equivalent access to the host Docker daemon (via the socket).
+  # 2. Read/write access to /opt/familienarchiv/ — including the main app's
+  # compose files, Caddy config, and observability configs. A malicious
+  # workflow step could overwrite any file in that directory.
+  # Both are acceptable ONLY because this runner is single-tenant: it executes
+  # code exclusively from this private repo with a fixed set of trusted authors.
+  # WARNING: Do NOT add this runner to any repo with external contributors or
+  # untrusted PRs — the blast radius includes the entire production deployment.
+  # See ADR-016 for the reasoning behind the /opt/familienarchiv mount.
   options: "-v /var/run/docker.sock:/var/run/docker.sock -v /srv/gitea-workspace:/srv/gitea-workspace -v /opt/familienarchiv:/opt/familienarchiv"
   # keep network mode default (bridge) — Testcontainers handles its own networking
   force_pull: false