
Familienarchiv — Deployment Reference

If the app is down right now → jump to §4 Logs.

This doc is the Day-1 checklist and operational reference. It links to the canonical infrastructure docs in docs/infrastructure/ rather than duplicating them.

Audience: operator bringing up a fresh instance, or Successor-X debugging a live incident.

Ownership: project owner. Update this file in any PR that changes the container topology, env vars, or backup procedure.

Table of Contents

  1. Deployment topology
  2. Environment variables
  3. Bootstrap from scratch
  4. Logs + observability
  5. Backup + recovery
  6. Common operational tasks
  7. Known limitations

1. Deployment topology

```mermaid
graph TD
    Browser -->|HTTPS| Caddy["Caddy (TLS termination)"]
    Caddy -->|HTTP :3000| Frontend["Web Frontend\nSvelteKit Node adapter"]
    Caddy -->|HTTP :8080| Backend["API Backend\nSpring Boot / Jetty :8080"]
    Backend -->|JDBC :5432| DB[(PostgreSQL 16)]
    Backend -->|S3 API :9000| MinIO[(MinIO)]
    Backend -->|HTTP :8000 internal| OCR["OCR Service\nPython FastAPI"]
    OCR -->|presigned URL| MinIO
    Caddy -->|SSE proxy_pass| Backend
```

Key facts:

  • Caddy terminates TLS and reverse-proxies to frontend (:3000) and backend (:8080). The Caddyfile is committed at infra/caddy/Caddyfile and is installed on the host as /etc/caddy/Caddyfile (symlink).
  • The host binds all docker-published ports to 127.0.0.1 only; Caddy is the sole external entry point.
  • The OCR service has no published port — reachable only on the internal Docker network from the backend.
  • SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
  • The Caddyfile responds 404 on /actuator/* (defense in depth; see the Caddyfile sketch after this list). Internal monitoring scrapes the backend on the docker network, not through Caddy.
  • Production and staging cohabit on the same host via docker compose project names: archiv-production (ports 8080/3000) and archiv-staging (ports 8081/3001).
  • An optional observability stack (Prometheus, Node Exporter, cAdvisor, Loki, Tempo, Grafana, GlitchTip) runs as a separate compose file. Configuration lives under infra/observability/. In production and CI, the stack is managed from /opt/familienarchiv/ (CI copies it there on every nightly run) so bind mounts survive workspace wipes — see §4 for the ops procedure.
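For orientation, a Caddyfile with the properties listed above might take roughly the following shape. This is an illustrative sketch only (the /api path split and handler layout are assumptions); the committed infra/caddy/Caddyfile is authoritative.

archiv.raddatz.cloud {
    # Defense in depth: actuator endpoints are never served externally
    @actuator path /actuator/*
    respond @actuator 404

    # API and SSE traffic goes straight to the backend on loopback
    handle /api/* {
        reverse_proxy 127.0.0.1:8080
    }

    # Everything else is served by the SvelteKit frontend
    handle {
        reverse_proxy 127.0.0.1:3000
    }
}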

OCR memory requirements

The OCR service requires significant RAM for model loading. The dev compose sets mem_limit: 12g.

| Production target | RAM | Recommended OCR limit | Notes |
|---|---|---|---|
| Hetzner CX42 | 16 GB | 12 GB | Recommended for OCR-enabled production |
| Hetzner CX32 | 8 GB | 6 GB | Accept reduced batch sizes and slower throughput |
| Hetzner CX22 | 4 GB | n/a | Disable the OCR service (profiles: [ocr]); run OCR on demand only |

A CX32 cannot honour the default mem_limit: 12g — set the OCR_MEM_LIMIT=6g env var (in .env.production / .env.staging, or as a Gitea secret consumed by the workflow) before deploying on a CX32. The prod compose interpolates this var with a 12g default.
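The interpolation looks roughly like this in docker-compose.prod.yml (illustrative fragment; the committed compose file is authoritative):

services:
  ocr-service:
    mem_limit: ${OCR_MEM_LIMIT:-12g}   # resolves to 12g unless OCR_MEM_LIMIT is set (e.g. 6g on a CX32)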

Dev vs production differences

| Concern | Dev (docker-compose.yml) | Prod (docker-compose.prod.yml) |
|---|---|---|
| MinIO image tag | minio/minio:latest | Pinned minio/minio:RELEASE.… |
| Data persistence | Bind mounts ./data/postgres, ./data/minio | Named Docker volumes (postgres-data, minio-data) |
| MinIO credentials for backend | Root user/password | Service account archiv-app with bucket-scoped rights |
| Bucket creation | create-buckets helper | Same helper, plus service-account bootstrap on every up |
| Spring profile | dev,e2e (Swagger + e2e overrides) | unset — base application.yaml is production-ready |
| Mail | Mailpit (local catcher) | Real SMTP (production) / Mailpit via profiles: [staging] (staging) |
| Frontend image | Dev server, target: development, port 5173 | Node adapter, target: production, port 3000 |
| Host port binding | All published | Bound to 127.0.0.1 only; Caddy is the front door |
| Deploy method | docker compose up -d (manual) | Gitea Actions: nightly.yml (staging, cron) and release.yml (production, on v* tag) — both use up -d --wait |

Full prod compose: docker-compose.prod.yml. Workflow files: .gitea/workflows/nightly.yml, .gitea/workflows/release.yml.


2. Environment variables

All vars are set in .env at the repo root (copy from .env.example). The backend resolves them via application.yaml (see the sketch after the backend table below); the Docker Compose file wires them into each container.

Any var found in docker-compose.yml or application*.yaml that is not in this table is a blocking review comment on any PR that changes those files.

Backend

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| SPRING_DATASOURCE_URL | PostgreSQL JDBC URL | | YES | |
| SPRING_DATASOURCE_USERNAME | DB username | | YES | |
| SPRING_DATASOURCE_PASSWORD | DB password | | YES | YES |
| S3_ENDPOINT | MinIO / OBS endpoint URL | | YES | |
| S3_ACCESS_KEY | MinIO access key (use service account, not root, in prod) | | YES | YES |
| S3_SECRET_KEY | MinIO secret key | | YES | YES |
| S3_BUCKET_NAME | Target bucket name | | YES | |
| S3_REGION | S3 region string | us-east-1 | YES | |
| APP_ADMIN_USERNAME | Bootstrap admin username (⚠ not in .env.example) | admin | YES | |
| APP_ADMIN_PASSWORD | Bootstrap admin password (⚠ ships as admin123) | admin123 | YES | YES |
| APP_BASE_URL | Public-facing URL for email links | http://localhost:3000 | YES (prod) | |
| APP_OCR_BASE_URL | Internal URL of the OCR service | | YES | |
| APP_OCR_TRAINING_TOKEN | Secret token for OCR training endpoints | | YES (prod) | YES |
| IMPORT_HOST_DIR | Absolute host path holding the ODS spreadsheet + PDFs for the /admin/system mass-import card. Mounted read-only at /import inside the backend (compose-only — backend reads via app.import.dir). Compose refuses to start when unset, so staging and prod cannot accidentally share the source. Convention: /srv/familienarchiv-staging/import and /srv/familienarchiv-production/import | | YES (prod compose) | |
| MAIL_HOST | SMTP host | mailpit (dev) | YES (prod) | |
| MAIL_PORT | SMTP port | 1025 (dev) | YES (prod) | |
| MAIL_USERNAME | SMTP username | | YES (prod) | YES |
| MAIL_PASSWORD | SMTP password | | YES (prod) | YES |
| APP_MAIL_FROM | From address for outbound mail | noreply@familienarchiv.local | YES (prod) | |
| MAIL_SMTP_AUTH | SMTP auth enabled | false (dev) | YES (prod) | |
| MAIL_STARTTLS_ENABLE | STARTTLS enabled | false (dev) | YES (prod) | |
| SPRING_PROFILES_ACTIVE | Spring profile | dev,e2e (compose) | YES | |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP gRPC endpoint for distributed traces (Tempo). Set to http://tempo:4317 via compose. | http://localhost:4317 | | |
| MANAGEMENT_TRACING_SAMPLING_PROBABILITY | Micrometer tracing sample rate; overridden to 0.0 in test profile. | 0.1 (compose) / 1.0 (dev) | | |
| SENTRY_DSN | GlitchTip / Sentry DSN for backend error reporting. Leave empty to disable the SDK. Set after GlitchTip first-run (§4). | | | YES |
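Two mechanisms resolve these. Spring Boot's relaxed binding maps names like SPRING_DATASOURCE_URL onto spring.datasource.url automatically; the app-specific vars are referenced as placeholders in application.yaml. A minimal sketch of the placeholder pattern (the exact property paths besides app.import.dir are assumptions):

app:
  base-url: ${APP_BASE_URL:http://localhost:3000}   # default mirrors the table above
  ocr:
    base-url: ${APP_OCR_BASE_URL}
  import:
    dir: /import   # fixed container path; IMPORT_HOST_DIR picks the host side of the mount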

PostgreSQL container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| POSTGRES_USER | DB superuser | archive_user | YES | |
| POSTGRES_PASSWORD | DB password | change-me | YES | YES |
| POSTGRES_DB | Database name | family_archive_db | YES | |

MinIO container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| MINIO_ROOT_USER | MinIO root username (dev compose only — prod compose hardcodes archiv) | minio_admin | YES (dev) | |
| MINIO_ROOT_PASSWORD / MINIO_PASSWORD | MinIO root password. Used only by the mc admin bootstrap in prod, never by the backend. | change-me | YES | YES |
| MINIO_APP_PASSWORD | Password for the archiv-app service account that the backend uses. Bucket-scoped via readwrite policy on familienarchiv. Bootstrapped by create-buckets. | | YES (prod) | YES |
| MINIO_DEFAULT_BUCKETS | Bucket name (dev compose only — prod compose hardcodes familienarchiv) | archive-documents | YES (dev) | |
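The create-buckets prod bootstrap follows the usual mc pattern. Roughly (an illustrative sketch; the committed helper script is authoritative, and it scopes the account's rights to the familienarchiv bucket rather than granting server-wide readwrite):

# Register the MinIO endpoint under an alias, using the hardcoded root user
mc alias set local http://minio:9000 archiv "$MINIO_ROOT_PASSWORD"
# Create the bucket if missing, then the service account the backend uses
mc mb --ignore-existing local/familienarchiv
mc admin user add local archiv-app "$MINIO_APP_PASSWORD"
mc admin policy attach local readwrite --user archiv-app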

OCR service

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| TRAINING_TOKEN | Guards /train and /segtrain endpoints (accepts file uploads) | | YES (prod) | YES |
| ALLOWED_PDF_HOSTS | SSRF protection — comma-separated list of allowed PDF source hosts. Do not widen to * | minio,localhost,127.0.0.1 | YES | |
| KRAKEN_MODEL_PATH | Directory containing Kraken HTR models (populated by download-kraken-models.sh) | /app/models/ | | |
| BLLA_MODEL_PATH | Kraken baseline layout analysis model path | /app/models/blla.mlmodel | | |
| OCR_MEM_LIMIT | Container memory cap for ocr-service in docker-compose.prod.yml. Set to 6g on CX32 hosts; leave unset on CX42+ to use the 12g default | 12g (prod compose default) | | |

Observability stack (docker-compose.observability.yml)

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| PORT_PROMETHEUS | Host port for the Prometheus UI (bound to 127.0.0.1 only) | 9090 | | |
| PORT_GRAFANA | Host port for the Grafana UI (bound to 127.0.0.1 only) | 3003 | | |
| POSTGRES_HOST | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and archive-db is not resolvable by that name. | archive-db | | |
| GRAFANA_ADMIN_PASSWORD | Grafana admin user password | changeme | YES (prod) | YES |
| PORT_GLITCHTIP | Host port for the GlitchTip UI (bound to 127.0.0.1 only) | 3002 | | |
| GLITCHTIP_DOMAIN | Public-facing base URL for GlitchTip (used in email links and CORS) | http://localhost:3002 | YES (prod) | |
| GLITCHTIP_SECRET_KEY | Django secret key for GlitchTip — generate with python3 -c "import secrets; print(secrets.token_hex(32))" | | YES | YES |

3. Bootstrap from scratch

Production and staging deploy via Gitea Actions (release.yml on v* tag, nightly.yml on cron). The server itself only needs to host Caddy, Docker, and the runner — the workflows handle the rest.

3.1 Server one-time setup

# Base hardening
ufw default deny incoming && ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 443/tcp && ufw enable
# /etc/ssh/sshd_config: PasswordAuthentication no, PermitRootLogin no

# Install Caddy 2 (https://caddyserver.com/docs/install#debian-ubuntu-raspbian)
apt install caddy

# Use the Caddyfile from the repo (replace path with the runner's clone target)
# CI DEPENDENCY: the nightly and release workflows run `systemctl reload caddy` to
# pick up committed Caddyfile changes. They find the file via this symlink — if it
# is absent or points elsewhere, the reload succeeds but serves stale config.
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy

# fail2ban — protect /api/auth/login from credential stuffing.
# Jail watches the Caddy JSON access log for 401 responses on
# /api/auth/login. The jail (maxretry=10 / findtime=10m / bantime=30m)
# and filter are committed under infra/fail2ban/ — symlink them in:
apt install fail2ban
ln -sf /opt/familienarchiv/infra/fail2ban/jail.d/familienarchiv.conf \
       /etc/fail2ban/jail.d/familienarchiv.conf
ln -sf /opt/familienarchiv/infra/fail2ban/filter.d/familienarchiv-auth.conf \
       /etc/fail2ban/filter.d/familienarchiv-auth.conf
systemctl reload fail2ban
# Verify after first deploy with:
#   fail2ban-client status familienarchiv-auth
#   fail2ban-regex /var/log/caddy/access.log familienarchiv-auth
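# For orientation, the committed jail has roughly this shape (illustrative
# sketch only; infra/fail2ban/jail.d/familienarchiv.conf is authoritative):
#   [familienarchiv-auth]
#   enabled   = true
#   port      = http,https
#   filter    = familienarchiv-auth
#   logpath   = /var/log/caddy/access.log
#   maxretry  = 10
#   findtime  = 10m
#   bantime   = 30m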

# Tailscale — used by the backup pipeline to reach heim-nas (follow-up issue)
curl -fsSL https://tailscale.com/install.sh | sh && tailscale up

# Self-hosted Gitea runner — register against the repo with a runner token.
# This runner is assumed single-tenant: the deploy workflows write .env.*
# files to disk during execution (cleaned up unconditionally on completion).
# A multi-tenant runner would need to switch to stdin-piped env files.
# (See https://docs.gitea.com/usage/actions/quickstart for the register step.)

# Runner workspace directory — required for DooD bind-mount resolution (ADR-015).
# act_runner stores job workspaces here so that docker compose bind mounts resolve
# to real host paths. The path must be identical on the host and inside job containers.
mkdir -p /srv/gitea-workspace
# Observability config permanent directory — the nightly CI job copies
# docker-compose.observability.yml and infra/observability/ here on every run.
# The obs stack is always started from this path, not from the workspace.
# See ADR-016 for why this directory is used instead of a server-pull approach.
mkdir -p /opt/familienarchiv/infra
# Both paths must also appear in the runner service volumes in ~/docker/gitea/compose.yaml:
#   volumes:
#     - /srv/gitea-workspace:/srv/gitea-workspace
# /opt/familienarchiv does NOT need to be in the runner container's volumes — job
# containers are spawned by the host daemon directly (DooD), so the host path is
# accessible to them as long as runner-config.yaml lists it in valid_volumes + options.
# See runner-config.yaml (workdir_parent + valid_volumes + options) and ADR-015/016.

# ⚠ IMPORTANT: after any change to runner-config.yaml (valid_volumes, options, workdir_parent),
# restart the Gitea Act runner for the new config to take effect:
#   docker restart gitea-runner
# Until restarted, job containers are spawned with the old config and any new bind mounts
# (e.g. /opt/familienarchiv) will not be available inside job steps.
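For reference, the relevant runner-config.yaml fragment has roughly this shape (an illustrative sketch under the assumptions above; the committed runner-config.yaml is authoritative):

container:
  workdir_parent: /srv/gitea-workspace      # job workspaces resolve to real host paths (ADR-015)
  valid_volumes:
    - /srv/gitea-workspace
    - /opt/familienarchiv
  options: "-v /opt/familienarchiv:/opt/familienarchiv"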

3.2 DNS records

archiv.raddatz.cloud   A   <server IP>
staging.raddatz.cloud  A   <server IP>
git.raddatz.cloud      A   <server IP>

3.3 Gitea secrets (Repo → Settings → Actions → Secrets)

| Secret | Used by | Notes |
|---|---|---|
| PROD_POSTGRES_PASSWORD | release.yml | strong unique password |
| PROD_MINIO_PASSWORD | release.yml | MinIO root password; used only at bootstrap |
| PROD_MINIO_APP_PASSWORD | release.yml | application service-account password |
| PROD_OCR_TRAINING_TOKEN | release.yml | python3 -c "import secrets; print(secrets.token_hex(32))" |
| PROD_APP_ADMIN_USERNAME | release.yml | e.g. admin@archiv.raddatz.cloud |
| PROD_APP_ADMIN_PASSWORD | release.yml | ⚠ locked permanently on first deploy — see §3.5 |
| STAGING_POSTGRES_PASSWORD | nightly.yml | different from prod |
| STAGING_MINIO_PASSWORD | nightly.yml | different from prod |
| STAGING_MINIO_APP_PASSWORD | nightly.yml | different from prod |
| STAGING_OCR_TRAINING_TOKEN | nightly.yml | different from prod |
| STAGING_APP_ADMIN_USERNAME | nightly.yml | e.g. admin@staging.raddatz.cloud |
| STAGING_APP_ADMIN_PASSWORD | nightly.yml | locked on first staging deploy |
| MAIL_HOST | release.yml | SMTP relay hostname (prod only) |
| MAIL_PORT | release.yml | typically 587 |
| MAIL_USERNAME | release.yml | SMTP user |
| MAIL_PASSWORD | release.yml | SMTP password |
| GRAFANA_ADMIN_PASSWORD | both | Grafana admin login — generate a strong password |
| GLITCHTIP_SECRET_KEY | both | Django secret key — openssl rand -hex 32 |
| SENTRY_DSN | both | GlitchTip project DSN — set after first-run (§4); leave empty to keep Sentry disabled |

3.4 First deploy

# 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
#    Expected: docker compose up -d --wait succeeds for archiv-staging, then
#    the workflow's "Smoke test deployed environment" step asserts:
#      - https://staging.raddatz.cloud/login returns 200
#      - HSTS header is present
#      - /actuator/health returns 404 (defense-in-depth check)
# 2. (Optional) Re-verify manually
curl -I https://staging.raddatz.cloud/
#    Expected: 200 (login page) with HSTS + X-Content-Type-Options headers
# 3. When staging looks healthy, push a v* tag to trigger release.yml
git tag v1.0.0 && git push origin v1.0.0

3.5 ⚠ Admin password is locked on first deploy

UserDataInitializer creates the admin user only if the email does not exist. The first successful deploy persists the admin password to the database. Changing PROD_APP_ADMIN_PASSWORD in Gitea secrets after that point has no effect — the secret is only consulted when the row is missing.

Before the first deploy: rotate PROD_APP_ADMIN_PASSWORD to a strong value. After the first deploy: change the admin password via the in-app account settings, not via the Gitea secret.


4. Logs + observability

First-response commands

# Stream backend logs (most useful first)
docker compose logs --follow --tail=100 backend

# Stream all services
docker compose logs --follow

# Single snapshot
docker compose logs --tail=200 <service>
# services: frontend, backend, db, minio, ocr-service

Log locations

  • Backend application log: stdout (captured by Docker). Access inside the container at /app/logs/ via docker exec.
  • Spring Actuator health: http://localhost:8080/actuator/health (internal only in prod — port 8081 for Prometheus scraping)
  • Prometheus scraping: management port 8081, path /actuator/prometheus. Internal only; Caddy blocks /actuator/* externally (see the check below).
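Because Caddy blocks /actuator/* externally, check these endpoints from inside the network, e.g. via the backend container. A sketch, assuming curl is available in the backend image (substitute wget otherwise, and add -p archiv-production in prod):

docker compose exec backend curl -s http://localhost:8080/actuator/health
# Prometheus-format metrics on the management port:
docker compose exec backend curl -s http://localhost:8081/actuator/prometheus | head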

Observability stack

An observability stack is available via docker-compose.observability.yml. Configuration lives under infra/observability/.

Dev — start from the workspace

docker compose up -d                                    # creates archiv-net
docker compose -f docker-compose.observability.yml up -d

Why the obs stack is managed differently from the main app stack

The main app stack (docker-compose.prod.yml) has no config-file bind mounts — its containers read config from env vars and image defaults. The workspace is wiped after each CI run but that does not affect running containers, because they hold no references to workspace paths.

The obs stack is different: prometheus.yml, tempo.yml, Loki config, Grafana provisioning files, and Promtail config are all bind-mounted from the host filesystem into their containers. If those source paths disappear (workspace wipe), the running containers keep working, even across restarts, until the next docker compose up; at that point Docker tries to re-resolve the bind-mount source and fails because the workspace path no longer exists.

The fix is to keep the obs compose file and config tree at a permanent path that CI copies to on every run but which survives between runs: /opt/familienarchiv/ (see ADR-016).

Production — managed from /opt/familienarchiv/

Every CI run (nightly + release) copies docker-compose.observability.yml and infra/observability/ to /opt/familienarchiv/ before starting the stack. Bind mounts then resolve to /opt/familienarchiv/infra/observability/… — a stable path that outlasts any workspace wipe.

Environment variables follow the same two-source model as the main stack:

| Source | What it contains | Managed by |
|---|---|---|
| infra/observability/obs.env | All non-secret config (ports, URLs, hostnames) | Git — reviewed in PRs |
| /opt/familienarchiv/obs-secrets.env | Passwords and secret keys only | CI — written fresh from Gitea secrets on every deploy |

Both files are passed explicitly via --env-file to the compose command, so there is no implicit auto-read .env and no operator-managed file to keep in sync.

Non-secret config (infra/observability/obs.env):

| Key | Value | Notes |
|---|---|---|
| PORT_GRAFANA | 3003 | Avoids collision with staging frontend on port 3001 |
| PORT_GLITCHTIP | 3002 | |
| PORT_PROMETHEUS | 9090 | |
| GF_SERVER_ROOT_URL | https://grafana.archiv.raddatz.cloud | Required for alert email links and OAuth redirects |
| GLITCHTIP_DOMAIN | https://glitchtip.archiv.raddatz.cloud | Must match the Caddy vhost |
| POSTGRES_HOST | archive-db | Override if only the staging stack is running |

Secret keys (set in Gitea secrets, injected by CI into obs-secrets.env):

| Gitea secret | Notes |
|---|---|
| GRAFANA_ADMIN_PASSWORD | Strong unique password; shared by nightly and release |
| GLITCHTIP_SECRET_KEY | openssl rand -hex 32; shared by nightly and release |
| STAGING_POSTGRES_PASSWORD / PROD_POSTGRES_PASSWORD | Must match the running PostgreSQL container |

To start or restart the obs stack manually on the server (after CI has run at least once):

docker compose \
  -f /opt/familienarchiv/docker-compose.observability.yml \
  --env-file /opt/familienarchiv/infra/observability/obs.env \
  --env-file /opt/familienarchiv/obs-secrets.env \
  up -d --wait --remove-orphans

Note (manual ops only): CI clears the destination with rm -rf before copying, so deleted files are removed automatically on the next run. If you copy manually with cp -r without first removing the directory, stale files from deleted configs will persist until cleaned up:

rm /opt/familienarchiv/infra/observability/<path-to-removed-file>

Current services:

| Service | Image | Purpose |
|---|---|---|
| obs-prometheus | prom/prometheus:v3.4.0 | Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor |
| obs-node-exporter | prom/node-exporter:v1.9.0 | Host-level CPU / memory / disk / network metrics |
| obs-cadvisor | gcr.io/cadvisor/cadvisor:v0.52.1 | Per-container resource metrics |
| obs-loki | grafana/loki:3.4.2 | Log aggregation — receives log streams from Promtail. Port 3100 is expose-only (not host-bound). |
| obs-promtail | grafana/promtail:3.4.2 | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with container_name, compose_service, and compose_project labels |
| obs-tempo | grafana/tempo:2.7.2 | Distributed trace storage — OTLP gRPC receiver on port 4317, OTLP HTTP on port 4318 (both archiv-net-internal). Grafana queries traces on port 3200 (obs-net-internal). All ports are expose-only (not host-bound). |
| obs-grafana | grafana/grafana-oss:11.6.1 | Unified observability UI — metrics dashboards, log exploration, trace viewer. Bound to 127.0.0.1:${PORT_GRAFANA:-3003} on the host. |
| obs-glitchtip | glitchtip/glitchtip:6.1.6 | Sentry-compatible error tracker. Receives frontend + backend error events, groups by fingerprint, provides issue UI with stack traces. Bound to 127.0.0.1:${PORT_GLITCHTIP:-3002}. |
| obs-glitchtip-worker | glitchtip/glitchtip:6.1.6 | Celery + beat worker — processes async GlitchTip tasks (event ingestion, notifications, cleanup). |
| obs-redis | redis:7-alpine | Celery task broker for GlitchTip. Internal to obs-net; no host port exposed. |
| obs-glitchtip-db-init | postgres:16-alpine | One-shot init container. Creates the glitchtip database on the existing archive-db PostgreSQL instance if it does not already exist. Runs at stack startup; exits cleanly once done. |

Grafana

| Item | Value |
|---|---|
| URL | http://localhost:3003 (or http://localhost:$PORT_GRAFANA) |
| Username | admin |
| Password | $GRAFANA_ADMIN_PASSWORD (default: changeme; change before exposing to a network) |

Datasources are auto-provisioned on first start (Prometheus, Loki, Tempo — no manual setup required). Three dashboards are pre-loaded:

| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host CPU, memory, disk, network |
| Spring Boot Observability | 17175 | JVM metrics, HTTP latency, error rate |
| Loki Logs | 13639 | Log exploration and filtering |

Tempo traces are accessible via Grafana Explore → Tempo datasource, and linked from Loki logs via the traceId derived field.

Loki quick checks (after ~60 s, run from inside the obs-loki container):

# Loki health
docker exec obs-loki wget -qO- http://localhost:3100/ready

# List labels
docker exec obs-loki wget -qO- 'http://localhost:3100/loki/api/v1/labels'

# Query logs by service (stable across dev and prod environments)
docker exec obs-loki wget -qO- \
  'http://localhost:3100/loki/api/v1/query_range?query=%7Bcompose_service%3D%22backend%22%7D&limit=5'

Prefer compose_service over container_name in LogQL queries: container_name differs between dev (archive-backend) and prod (archiv-production-backend-1), while compose_service is stable (backend, db, minio, etc.).

Prometheus port 9090 and Grafana port 3003 (default; configurable via PORT_GRAFANA) are bound to 127.0.0.1 on the host. No other observability ports are host-bound.

GlitchTip

Item Value
URL http://localhost:3002 (or http://localhost:$PORT_GLITCHTIP)

Required env vars — set in .env before first start:

GLITCHTIP_SECRET_KEY=$(python3 -c "import secrets; print(secrets.token_hex(32))")
GLITCHTIP_DOMAIN=http://localhost:3002   # change to your public URL in prod
PORT_GLITCHTIP=3002                      # optional, defaults to 3002

Database: GlitchTip shares the existing archive-db PostgreSQL instance. The obs-glitchtip-db-init one-shot container creates a dedicated glitchtip database on first stack start — no manual step required.

First-run steps (one-time, after docker compose -f docker-compose.observability.yml up -d):

# 1. Create the Django superuser (interactive)
docker exec -it obs-glitchtip ./manage.py createsuperuser

# 2. Open the GlitchTip UI and log in
open http://localhost:3002

# 3. Create an organisation (e.g. "Familienarchiv")
# 4. Create two projects:
#    - "familienarchiv-frontend"  (platform: JavaScript / SvelteKit)
#    - "familienarchiv-backend"   (platform: Java / Spring Boot)
# 5. Copy each project's DSN from Settings → Projects → <project> → Client Keys
# 6. Wire the DSNs into the app via env vars: the backend consumes SENTRY_DSN (§2);
#    frontend wiring is a separate issue
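On the backend side, the DSN typically flows from the SENTRY_DSN variable (§2) into the Sentry SDK's configuration. A plausible application.yaml shape, assuming the Sentry Spring Boot starter's conventional property names (the empty default keeps the SDK disabled):

sentry:
  dsn: ${SENTRY_DSN:}                            # empty DSN disables error reporting
  environment: ${SPRING_PROFILES_ACTIVE:dev}     # tags events with the active profile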

5. Backup + recovery

Current state — no automated backup

No automated backup is configured. Manual procedure for a point-in-time backup:

# PostgreSQL dump
docker exec archive-db pg_dump -U ${POSTGRES_USER} ${POSTGRES_DB} > backup-$(date +%Y%m%d).sql

# MinIO data (bind-mounted in dev)
# Copy ./data/minio/ to external storage
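A minimal way to capture the MinIO bind mount, assuming writes are quiesced first (stop the stack or pause uploads so the copy is consistent):

tar czf minio-backup-$(date +%Y%m%d).tar.gz -C ./data minio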

Restoration:

# Restore Postgres
docker exec -i archive-db psql -U ${POSTGRES_USER} ${POSTGRES_DB} < backup-YYYYMMDD.sql

Planned — phase 5 of Production v1 milestone

Automated backup (nightly pg_dump + MinIO mc mirror over Tailscale to heim-nas) is a follow-up issue. Until that ships: manual backups are the only recovery option.

Rollback

Each release tag corresponds to a docker image tag on the host daemon (built via DooD; no registry). Rolling back to a previous tag is one command:

TAG=v1.0.0 docker compose \
  -f docker-compose.prod.yml \
  -p archiv-production \
  --env-file /opt/familienarchiv/.env.production \
  up -d --wait --remove-orphans

If the rollback target image is no longer present on the host (host disk pruned, etc.), re-trigger release.yml for that tag from Gitea Actions UI — it rebuilds and redeploys.

Flyway migrations are not auto-rolled-back. If a release contained a destructive migration (drop column, rename table), a tag rollback brings the app back to a previous version, but the data shape has already changed. For breaking schema changes, prefer a forward-only fix.
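Before deciding, check which migrations the current schema already carries. Flyway records them in its default history table (a sketch, assuming the table name was not customised):

docker exec archive-db psql -U ${POSTGRES_USER} ${POSTGRES_DB} \
  -c "SELECT installed_rank, version, description, success
      FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 5;"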


6. Common operational tasks

Reset dev database (truncates data, keeps schema)

bash scripts/reset-db.sh

Truncates all data but does not drop the schema or re-run Flyway. Use for E2E test resets, not full reinstalls. ⚠️ Script hardcodes DB_USER=archive_user and DB_NAME=family_archive_db — if you customised these in .env, edit the script accordingly.

Rebuild frontend container (clears node_modules volume)

bash scripts/rebuild-frontend.sh

Assumes the Docker Compose volume is named familienarchiv_frontend_node_modules. If your project directory is not named familienarchiv, edit line 16 of the script.

Download Kraken OCR models

bash scripts/download-kraken-models.sh

Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.

Trigger a mass import (Excel/ODS)

Dev: drop the ODS spreadsheet + PDFs into ./import/ at the repo root — the dev compose bind-mounts it to /import automatically.

Staging/production:

  1. Pre-stage the payload on the host. Convention: /srv/familienarchiv-staging/import/ or /srv/familienarchiv-production/import/.
    rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
    
  2. Make sure IMPORT_HOST_DIR=<host-path> is set in .env.staging / .env.production (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
  3. Redeploy the stack so the bind mount picks up — or, if the mount is already in place, skip to step 4.
  4. Call POST /api/admin/trigger-import (requires ADMIN permission), or click the "Import starten" button on /admin/system (curl sketch after this list).
  5. The import runs asynchronously — poll GET /api/admin/import-status, watch /admin/system, or tail the backend logs.
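A sketch of steps 4 and 5 from the command line. The endpoint paths come from the list above; the cookie-based auth shown is an assumption, so adjust to however your admin session is actually established:

# Trigger the import (step 4)
curl -i -X POST https://staging.raddatz.cloud/api/admin/trigger-import \
     -H "Cookie: SESSION=<admin-session-cookie>"
# Poll until done (step 5)
curl -s https://staging.raddatz.cloud/api/admin/import-status \
     -H "Cookie: SESSION=<admin-session-cookie>"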

7. Known limitations

| Limitation | Reason | Reference |
|---|---|---|
| Single-node OCR service | The two required OCR engines (Surya + Kraken) exist only in the Python ecosystem; horizontal scaling would require a job queue not currently implemented | ADR-001 |
| No multi-tenancy | Designed as a single-family private archive; all authenticated users share the same document space | Deliberate scope decision (family-only product frame) |
| No multi-region | Single PostgreSQL + MinIO instance; no replication or failover | Deliberate scope decision |
| Max upload size | 50 MB per file (500 MB per request for multi-file) | Configurable in application.yaml (spring.servlet.multipart) |
| No automated backup | Phase 5 of Production v1 milestone is not yet implemented | See §5 above |