familienarchiv/docs/DEPLOYMENT.md

<!-- Last reviewed: 2026-05-05 — reviewed at every milestone close -->

# Familienarchiv — Deployment Reference

> **If the app is down right now → jump to [§4 Logs](#4-logs--observability).**

This doc is the Day-1 checklist and operational reference. It links to the canonical infrastructure docs in `docs/infrastructure/` rather than duplicating them.

**Audience:** operator bringing up a fresh instance, or Successor-X debugging a live incident.

**Ownership:** project owner. Update this file in any PR that changes the container topology, env vars, or backup procedure.

## Table of Contents

1. [Deployment topology](#1-deployment-topology)
2. [Environment variables](#2-environment-variables)
3. [Bootstrap from scratch](#3-bootstrap-from-scratch)
4. [Logs + observability](#4-logs--observability)
5. [Backup + recovery](#5-backup--recovery)
6. [Common operational tasks](#6-common-operational-tasks)
7. [Known limitations](#7-known-limitations)

---

## 1. Deployment topology

```mermaid
graph TD
    Browser -->|HTTPS| Caddy["Caddy (TLS termination)"]
    Caddy -->|HTTP :3000| Frontend["Web Frontend\nSvelteKit Node adapter"]
    Caddy -->|HTTP :8080| Backend["API Backend\nSpring Boot / Jetty :8080"]
    Backend -->|JDBC :5432| DB[(PostgreSQL 16)]
    Backend -->|S3 API :9000| MinIO[(MinIO)]
    Backend -->|HTTP :8000 internal| OCR["OCR Service\nPython FastAPI"]
    OCR -->|presigned URL| MinIO
    Caddy -->|SSE proxy_pass| Backend
```

**Key facts:**
- Caddy terminates TLS and reverse-proxies to frontend (`:3000`) and backend (`:8080`). The Caddyfile is committed at [`infra/caddy/Caddyfile`](../infra/caddy/Caddyfile) and is installed on the host as `/etc/caddy/Caddyfile` (symlink).
- The host binds all docker-published ports to `127.0.0.1` only; Caddy is the sole external entry point.
- The OCR service has **no published port** — reachable only on the internal Docker network from the backend.
- SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
- The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
- Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
- An optional observability stack (Prometheus, Node Exporter, cAdvisor, Loki, Tempo, Grafana, GlitchTip) runs as a separate compose file. Configuration lives under `infra/observability/`. In production and CI, the stack is managed from `/opt/familienarchiv/` (CI copies it there on every nightly run) so bind mounts survive workspace wipes — see §4 for the ops procedure.

### OCR memory requirements

The OCR service requires significant RAM for model loading. The dev compose sets `mem_limit: 12g`.

| Production target | RAM | Recommended OCR limit | Notes |
|---|---|---|---|
| Hetzner CX42 | 16 GB | 12 GB | Recommended for OCR-enabled production |
| Hetzner CX32 | 8 GB | 6 GB | Accept reduced batch sizes and slower throughput |
| Hetzner CX22 | 4 GB | — | Disable the OCR service (`profiles: [ocr]`); run OCR on demand only |

A CX32 cannot honour the default `mem_limit: 12g` — set the `OCR_MEM_LIMIT=6g` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow) before deploying on a CX32. The prod compose interpolates this var with a 12g default.

### Dev vs production differences

| Concern | Dev (`docker-compose.yml`) | Prod (`docker-compose.prod.yml`) |
|---|---|---|
| MinIO image tag | `minio/minio:latest` | Pinned `minio/minio:RELEASE.…` |
| Data persistence | Bind mounts `./data/postgres`, `./data/minio` | Named Docker volumes (`postgres-data`, `minio-data`) |
| MinIO credentials for backend | Root user/password | Service account `archiv-app` with bucket-scoped rights |
| Bucket creation | `create-buckets` helper | Same helper, plus service-account bootstrap on every up |
| Spring profile | `dev,e2e` (Swagger + e2e overrides) | unset — base `application.yaml` is production-ready |
| Mail | Mailpit (local catcher) | Real SMTP (production) / Mailpit via `profiles: [staging]` (staging) |
| Frontend image | Dev server, `target: development`, port 5173 | Node adapter, `target: production`, port 3000 |
| Host port binding | All published | Bound to `127.0.0.1` only; Caddy is the front door |
| Deploy method | `docker compose up -d` (manual) | Gitea Actions: `nightly.yml` (staging, cron) and `release.yml` (production, on `v*` tag) — both use `up -d --wait` |

Full prod compose: [`docker-compose.prod.yml`](../docker-compose.prod.yml). Workflow files: [`.gitea/workflows/nightly.yml`](../.gitea/workflows/nightly.yml), [`.gitea/workflows/release.yml`](../.gitea/workflows/release.yml).

---

## 2. Environment variables

All vars are set in `.env` at the repo root (copy from `.env.example`). The backend resolves them via `application.yaml`; the Docker Compose file wires them into each container.

**Any var found in `docker-compose.yml` or `application*.yaml` that is not in this table is a blocking review comment on any PR that changes those files.**

### Backend

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `SPRING_DATASOURCE_URL` | PostgreSQL JDBC URL | — | YES | — |
| `SPRING_DATASOURCE_USERNAME` | DB username | — | YES | — |
| `SPRING_DATASOURCE_PASSWORD` | DB password | — | YES | YES |
| `S3_ENDPOINT` | MinIO / OBS endpoint URL | — | YES | — |
| `S3_ACCESS_KEY` | MinIO access key (use service account, not root in prod) | — | YES | YES |
| `S3_SECRET_KEY` | MinIO secret key | — | YES | YES |
| `S3_BUCKET_NAME` | Target bucket name | — | YES | — |
| `S3_REGION` | S3 region string | `us-east-1` | YES | — |
| `APP_ADMIN_USERNAME` | Bootstrap admin username (⚠ not in .env.example) | `admin` | YES | — |
| `APP_ADMIN_PASSWORD` | Bootstrap admin password (⚠ ships as `admin123`) | `admin123` | YES | YES |
| `APP_BASE_URL` | Public-facing URL for email links | `http://localhost:3000` | YES (prod) | — |
| `APP_OCR_BASE_URL` | Internal URL of the OCR service | — | YES | — |
| `APP_OCR_TRAINING_TOKEN` | Secret token for OCR training endpoints | — | YES (prod) | YES |
| `IMPORT_HOST_DIR` | Absolute host path holding the ODS spreadsheet + PDFs for the `/admin/system` mass-import card. Mounted read-only at `/import` inside the backend (compose-only — backend reads via `app.import.dir`). Compose refuses to start when unset, so staging and prod cannot accidentally share the source. Convention: `/srv/familienarchiv-staging/import` and `/srv/familienarchiv-production/import` | — | YES (prod compose) | — |
| `MAIL_HOST` | SMTP host | `mailpit` (dev) | YES (prod) | — |
| `MAIL_PORT` | SMTP port | `1025` (dev) | YES (prod) | — |
| `MAIL_USERNAME` | SMTP username | — | YES (prod) | YES |
| `MAIL_PASSWORD` | SMTP password | — | YES (prod) | YES |
| `APP_MAIL_FROM` | From address for outbound mail | `noreply@familienarchiv.local` | YES (prod) | — |
| `MAIL_SMTP_AUTH` | SMTP auth enabled | `false` (dev) | YES (prod) | — |
| `MAIL_STARTTLS_ENABLE` | STARTTLS enabled | `false` (dev) | YES (prod) | — |
| `SPRING_PROFILES_ACTIVE` | Spring profile | `dev,e2e` (compose) | YES | — |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP gRPC endpoint for distributed traces (Tempo). Set to `http://tempo:4317` via compose. | `http://localhost:4317` | — | — |
| `MANAGEMENT_TRACING_SAMPLING_PROBABILITY` | Micrometer tracing sample rate; overridden to `0.0` in test profile. | `0.1` (compose) / `1.0` (dev) | — | — |

### PostgreSQL container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `POSTGRES_USER` | DB superuser | `archive_user` | YES | — |
| `POSTGRES_PASSWORD` | DB password | `change-me` | YES | YES |
| `POSTGRES_DB` | Database name | `family_archive_db` | YES | — |

### MinIO container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `MINIO_ROOT_USER` | MinIO root username (dev compose only — prod compose hardcodes `archiv`) | `minio_admin` | YES (dev) | — |
| `MINIO_ROOT_PASSWORD` / `MINIO_PASSWORD` | MinIO root password. **Used only by the `mc admin` bootstrap in prod, never by the backend.** | `change-me` | YES | YES |
| `MINIO_APP_PASSWORD` | Password for the `archiv-app` service account that the backend uses. Bucket-scoped via `readwrite` policy on `familienarchiv`. Bootstrapped by `create-buckets`. | — | YES (prod) | YES |
| `MINIO_DEFAULT_BUCKETS` | Bucket name (dev compose only — prod compose hardcodes `familienarchiv`) | `archive-documents` | YES (dev) | — |

### OCR service

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `TRAINING_TOKEN` | Guards `/train` and `/segtrain` endpoints (accepts file uploads) | — | YES (prod) | YES |
| `ALLOWED_PDF_HOSTS` | SSRF protection — comma-separated list of allowed PDF source hosts. **Do not widen to `*`** | `minio,localhost,127.0.0.1` | YES | — |
| `KRAKEN_MODEL_PATH` | Directory containing Kraken HTR models (populated by `download-kraken-models.sh`) | `/app/models/` | — | — |
| `BLLA_MODEL_PATH` | Kraken baseline layout analysis model path | `/app/models/blla.mlmodel` | — | — |
| `OCR_MEM_LIMIT` | Container memory cap for ocr-service in `docker-compose.prod.yml`. Set to `6g` on CX32 hosts; leave unset on CX42+ to use the 12g default | `12g` (prod compose default) | — | — |

### Observability stack (`docker-compose.observability.yml`)

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `PORT_PROMETHEUS` | Host port for the Prometheus UI (bound to `127.0.0.1` only) | `9090` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3003` | — | — |
| `POSTGRES_HOST` | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and `archive-db` is not resolvable by that name. | `archive-db` | — | — |
| `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
| `PORT_GLITCHTIP` | Host port for the GlitchTip UI (bound to `127.0.0.1` only) | `3002` | — | — |
| `GLITCHTIP_DOMAIN` | Public-facing base URL for GlitchTip (used in email links and CORS) | `http://localhost:3002` | YES (prod) | — |
| `GLITCHTIP_SECRET_KEY` | Django secret key for GlitchTip — generate with `python3 -c "import secrets; print(secrets.token_hex(32))"` | — | YES | YES |

---

## 3. Bootstrap from scratch

Production and staging deploy via Gitea Actions (`release.yml` on `v*` tag, `nightly.yml` on cron). The server itself only needs to host Caddy, Docker, and the runner — the workflows handle the rest.

### 3.1 Server one-time setup

```bash
# Base hardening
ufw default deny incoming && ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 443/tcp && ufw enable
# /etc/ssh/sshd_config: PasswordAuthentication no, PermitRootLogin no

# Install Caddy 2 (https://caddyserver.com/docs/install#debian-ubuntu-raspbian)
apt install caddy

# Use the Caddyfile from the repo (replace path with the runner's clone target)
# CI DEPENDENCY: the nightly and release workflows run `systemctl reload caddy` to
# pick up committed Caddyfile changes. They find the file via this symlink — if it
# is absent or points elsewhere, the reload succeeds but serves stale config.
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy

# fail2ban — protect /api/auth/login from credential stuffing.
# Jail watches the Caddy JSON access log for 401 responses on
# /api/auth/login. The jail (maxretry=10 / findtime=10m / bantime=30m)
# and filter are committed under infra/fail2ban/ — symlink them in:
apt install fail2ban
ln -sf /opt/familienarchiv/infra/fail2ban/jail.d/familienarchiv.conf \
       /etc/fail2ban/jail.d/familienarchiv.conf
ln -sf /opt/familienarchiv/infra/fail2ban/filter.d/familienarchiv-auth.conf \
       /etc/fail2ban/filter.d/familienarchiv-auth.conf
systemctl reload fail2ban
# Verify after first deploy with:
#   fail2ban-client status familienarchiv-auth
#   fail2ban-regex /var/log/caddy/access.log familienarchiv-auth

# Tailscale — used by the backup pipeline to reach heim-nas (follow-up issue)
curl -fsSL https://tailscale.com/install.sh | sh && tailscale up

# Self-hosted Gitea runner — register against the repo with a runner token.
# This runner is assumed single-tenant: the deploy workflows write .env.*
# files to disk during execution (cleaned up unconditionally on completion).
# A multi-tenant runner would need to switch to stdin-piped env files.
# (See https://docs.gitea.com/usage/actions/quickstart for the register step.)

# Runner workspace directory — required for DooD bind-mount resolution (ADR-015).
# act_runner stores job workspaces here so that docker compose bind mounts resolve
# to real host paths. The path must be identical on the host and inside job containers.
mkdir -p /srv/gitea-workspace
# Also add this volume line to the runner service in ~/docker/gitea/compose.yaml:
#   volumes:
#     - /srv/gitea-workspace:/srv/gitea-workspace
# See runner-config.yaml (workdir_parent + valid_volumes + options) and ADR-015.

# Observability config permanent directory — the nightly CI job copies
# docker-compose.observability.yml and infra/observability/ here on every run.
# The obs stack is always started from this path, not from the workspace.
# See ADR-016 for why this directory is used instead of a server-pull approach.
mkdir -p /opt/familienarchiv/infra

# ⚠ IMPORTANT: after any change to runner-config.yaml (valid_volumes, options, workdir_parent),
# restart the Gitea Act runner on the host for the new config to take effect:
#   systemctl restart gitea-runner
# Until restarted, job containers are spawned with the old config and any new bind mounts
# (e.g. /opt/familienarchiv) will not be available inside job steps.
```

### 3.2 DNS records

```
archiv.raddatz.cloud   A   <server IP>
staging.raddatz.cloud  A   <server IP>
git.raddatz.cloud      A   <server IP>
```

### 3.3 Gitea secrets (Repo → Settings → Actions → Secrets)

| Secret | Used by | Notes |
|---|---|---|
| `PROD_POSTGRES_PASSWORD` | release.yml | strong unique password |
| `PROD_MINIO_PASSWORD` | release.yml | MinIO root password; used only at bootstrap |
| `PROD_MINIO_APP_PASSWORD` | release.yml | application service-account password |
| `PROD_OCR_TRAINING_TOKEN` | release.yml | `python3 -c "import secrets; print(secrets.token_hex(32))"` |
| `PROD_APP_ADMIN_USERNAME` | release.yml | e.g. `admin@archiv.raddatz.cloud` |
| `PROD_APP_ADMIN_PASSWORD` | release.yml | **⚠ locked permanently on first deploy** — see §3.5 |
| `STAGING_POSTGRES_PASSWORD` | nightly.yml | different from prod |
| `STAGING_MINIO_PASSWORD` | nightly.yml | different from prod |
| `STAGING_MINIO_APP_PASSWORD` | nightly.yml | different from prod |
| `STAGING_OCR_TRAINING_TOKEN` | nightly.yml | different from prod |
| `STAGING_APP_ADMIN_USERNAME` | nightly.yml | e.g. `admin@staging.raddatz.cloud` |
| `STAGING_APP_ADMIN_PASSWORD` | nightly.yml | locked on first staging deploy |
| `MAIL_HOST` | release.yml | SMTP relay hostname (prod only) |
| `MAIL_PORT` | release.yml | typically `587` |
| `MAIL_USERNAME` | release.yml | SMTP user |
| `MAIL_PASSWORD` | release.yml | SMTP password |
| `GRAFANA_ADMIN_PASSWORD` | both | Grafana `admin` login — generate a strong password |
| `GLITCHTIP_SECRET_KEY` | both | Django secret key — `openssl rand -hex 32` |
| `SENTRY_DSN` | both | GlitchTip project DSN — set after first-run (§4); leave empty to keep Sentry disabled |

### 3.4 First deploy

```bash
# 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
#    Expected: docker compose up -d --wait succeeds for archiv-staging, then
#    the workflow's "Smoke test deployed environment" step asserts:
#      - https://staging.raddatz.cloud/login returns 200
#      - HSTS header is present
#      - /actuator/health returns 404 (defense-in-depth check)
# 2. (Optional) Re-verify manually
curl -I https://staging.raddatz.cloud/
#    Expected: 200 (login page) with HSTS + X-Content-Type-Options headers
# 3. When staging looks healthy, push a v* tag to trigger release.yml
git tag v1.0.0 && git push origin v1.0.0
```

### 3.5 ⚠ Admin password is locked on first deploy

`UserDataInitializer` creates the admin user **only if the email does not exist**. The first successful deploy persists the admin password to the database. Changing `PROD_APP_ADMIN_PASSWORD` in Gitea secrets after that point has **no effect** — the secret is only consulted when the row is missing.

Before the first deploy: rotate `PROD_APP_ADMIN_PASSWORD` to a strong value. After the first deploy: change the admin password via the in-app account settings, not via the Gitea secret.

---

## 4. Logs + observability

### First-response commands

```bash
# Stream backend logs (most useful first)
docker compose logs --follow --tail=100 backend

# Stream all services
docker compose logs --follow

# Single snapshot
docker compose logs --tail=200 <service>
# services: frontend, backend, db, minio, ocr-service
```

### Log locations

- **Backend application log**: stdout (captured by Docker). Access inside the container at `/app/logs/` via `docker exec`.
- **Spring Actuator health**: `http://localhost:8080/actuator/health` (internal only in prod — port 8081 for Prometheus scraping)
- **Prometheus scraping**: management port 8081, path `/actuator/prometheus`. Internal only; Caddy blocks `/actuator/*` externally.

### Observability stack

An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`.

#### Dev — start from the workspace

```bash
docker compose up -d                                    # creates archiv-net
docker compose -f docker-compose.observability.yml up -d
```

#### Why the obs stack is managed differently from the main app stack

The main app stack (`docker-compose.prod.yml`) has no config-file bind mounts — its containers read config from env vars and image defaults. The workspace is wiped after each CI run but that does not affect running containers, because they hold no references to workspace paths.

The obs stack is different: `prometheus.yml`, `tempo.yml`, Loki config, Grafana provisioning files, and Promtail config are all bind-mounted from the host filesystem into their containers. If those source paths disappear (workspace wipe), the containers can restart fine until a `docker compose up` is run again — at that point Docker tries to re-resolve the bind-mount source and fails because the workspace path no longer exists.

The fix is to keep the obs compose file and config tree at a **permanent path** that CI copies to on every run but which survives between runs: `/opt/familienarchiv/` (see ADR-016).

#### Production — managed from `/opt/familienarchiv/`

Every CI run (nightly + release) copies `docker-compose.observability.yml` and `infra/observability/` to `/opt/familienarchiv/` before starting the stack. Bind mounts then resolve to `/opt/familienarchiv/infra/observability/…` — a stable path that outlasts any workspace wipe.

**Environment variables** follow the same two-source model as the main stack:

| Source | What it contains | Managed by |
|---|---|---|
| `infra/observability/obs.env` | All non-secret config (ports, URLs, hostnames) | Git — reviewed in PRs |
| `/opt/familienarchiv/obs-secrets.env` | Passwords and secret keys only | CI — written fresh from Gitea secrets on every deploy |

Both files are passed explicitly via `--env-file` to the compose command, so there is no implicit auto-read `.env` and no operator-managed file to keep in sync.

**Non-secret config** (`infra/observability/obs.env`):

| Key | Value | Notes |
|---|---|---|
| `PORT_GRAFANA` | `3003` | Avoids collision with staging frontend on port 3001 |
| `PORT_GLITCHTIP` | `3002` | |
| `PORT_PROMETHEUS` | `9090` | |
| `GF_SERVER_ROOT_URL` | `https://grafana.archiv.raddatz.cloud` | Required for alert email links and OAuth redirects |
| `GLITCHTIP_DOMAIN` | `https://glitchtip.archiv.raddatz.cloud` | Must match the Caddy vhost |
| `POSTGRES_HOST` | `archive-db` | Override if only the staging stack is running |

**Secret keys** (set in Gitea secrets, injected by CI into `obs-secrets.env`):

| Gitea secret | Notes |
|---|---|
| `GRAFANA_ADMIN_PASSWORD` | Strong unique password; shared by nightly and release |
| `GLITCHTIP_SECRET_KEY` | `openssl rand -hex 32`; shared by nightly and release |
| `STAGING_POSTGRES_PASSWORD` / `PROD_POSTGRES_PASSWORD` | Must match the running PostgreSQL container |

To start or restart the obs stack manually on the server (after CI has run at least once):

```bash
docker compose \
  -f /opt/familienarchiv/docker-compose.observability.yml \
  --env-file /opt/familienarchiv/infra/observability/obs.env \
  --env-file /opt/familienarchiv/obs-secrets.env \
  up -d --wait --remove-orphans
```

Current services:

| Service | Image | Purpose |
|---|---|---|
| `obs-prometheus` | `prom/prometheus:v3.4.0` | Scrapes metrics from backend management port 8081 (`/actuator/prometheus`), node-exporter, and cAdvisor |
| `obs-node-exporter` | `prom/node-exporter:v1.9.0` | Host-level CPU / memory / disk / network metrics |
| `obs-cadvisor` | `gcr.io/cadvisor/cadvisor:v0.52.1` | Per-container resource metrics |
| `obs-loki` | `grafana/loki:3.4.2` | Log aggregation — receives log streams from Promtail. Port 3100 is `expose`-only (not host-bound). |
| `obs-promtail` | `grafana/promtail:3.4.2` | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with `container_name`, `compose_service`, and `compose_project` labels |
| `obs-tempo` | `grafana/tempo:2.7.2` | Distributed trace storage — OTLP gRPC receiver on port 4317, OTLP HTTP on port 4318 (both `archiv-net`-internal). Grafana queries traces on port 3200 (`obs-net`-internal). All ports are `expose`-only (not host-bound). |
| `obs-grafana` | `grafana/grafana-oss:11.6.1` | Unified observability UI — metrics dashboards, log exploration, trace viewer. Bound to `127.0.0.1:${PORT_GRAFANA:-3001}` on the host. |
| `obs-glitchtip` | `glitchtip/glitchtip:v4` | Sentry-compatible error tracker. Receives frontend + backend error events, groups by fingerprint, provides issue UI with stack traces. Bound to `127.0.0.1:${PORT_GLITCHTIP:-3002}`. |
| `obs-glitchtip-worker` | `glitchtip/glitchtip:v4` | Celery + beat worker — processes async GlitchTip tasks (event ingestion, notifications, cleanup). |
| `obs-redis` | `redis:7-alpine` | Celery task broker for GlitchTip. Internal to `obs-net`; no host port exposed. |
| `obs-glitchtip-db-init` | `postgres:16-alpine` | One-shot init container. Creates the `glitchtip` database on the existing `archive-db` PostgreSQL instance if it does not already exist. Runs at stack startup; exits cleanly once done. |

#### Grafana

| Item | Value |
|---|---|
| URL | `http://localhost:3003` (or `http://localhost:$PORT_GRAFANA`) |
| Username | `admin` |
| Password | `$GRAFANA_ADMIN_PASSWORD` (default: `changeme` — **change before exposing to a network**) |

Datasources are auto-provisioned on first start (Prometheus, Loki, Tempo — no manual setup required). Three dashboards are pre-loaded:

| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host CPU, memory, disk, network |
| Spring Boot Observability | 17175 | JVM metrics, HTTP latency, error rate |
| Loki Logs | 13639 | Log exploration and filtering |

Tempo traces are accessible via Grafana Explore → Tempo datasource, and linked from Loki logs via the `traceId` derived field.

**Loki quick checks** (after ~60 s, run from inside the `obs-loki` container):

```bash
# Loki health
docker exec obs-loki wget -qO- http://localhost:3100/ready

# List labels
docker exec obs-loki wget -qO- 'http://localhost:3100/loki/api/v1/labels'

# Query logs by service (stable across dev and prod environments)
docker exec obs-loki wget -qO- \
  'http://localhost:3100/loki/api/v1/query_range?query=%7Bcompose_service%3D%22backend%22%7D&limit=5'
```

**Prefer `compose_service` over `container_name` in LogQL queries** — `container_name` differs between dev (`archive-backend`) and prod (`archiv-production-backend-1`), while `compose_service` is stable (`backend`, `db`, `minio`, etc.).

Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.

#### GlitchTip

| Item | Value |
|---|---|
| URL | `http://localhost:3002` (or `http://localhost:$PORT_GLITCHTIP`) |

**Required env vars** — set in `.env` before first start:

```bash
GLITCHTIP_SECRET_KEY=$(python3 -c "import secrets; print(secrets.token_hex(32))")
GLITCHTIP_DOMAIN=http://localhost:3002   # change to your public URL in prod
PORT_GLITCHTIP=3002                      # optional, defaults to 3002
```

**Database:** GlitchTip shares the existing `archive-db` PostgreSQL instance. The `obs-glitchtip-db-init` one-shot container creates a dedicated `glitchtip` database on first stack start — no manual step required.

**First-run steps** (one-time, after `docker compose -f docker-compose.observability.yml up -d`):

```bash
# 1. Create the Django superuser (interactive)
docker exec -it obs-glitchtip ./manage.py createsuperuser

# 2. Open the GlitchTip UI and log in
open http://localhost:3002

# 3. Create an organisation (e.g. "Familienarchiv")
# 4. Create two projects:
#    - "familienarchiv-frontend"  (platform: JavaScript / SvelteKit)
#    - "familienarchiv-backend"   (platform: Java / Spring Boot)
# 5. Copy each project's DSN from Settings → Projects → <project> → Client Keys
# 6. Wire the DSNs into the backend and frontend via env vars (separate issue)
```

---

## 5. Backup + recovery

### Current state — no automated backup

No automated backup is configured. Manual procedure for a point-in-time backup:

```bash
# PostgreSQL dump
docker exec archive-db pg_dump -U ${POSTGRES_USER} ${POSTGRES_DB} > backup-$(date +%Y%m%d).sql

# MinIO data (bind-mounted in dev)
# Copy ./data/minio/ to external storage
```

Restoration:
```bash
# Restore Postgres
docker exec -i archive-db psql -U ${POSTGRES_USER} ${POSTGRES_DB} < backup-YYYYMMDD.sql
```

### Planned — phase 5 of Production v1 milestone

Automated backup (nightly `pg_dump` + MinIO `mc mirror` over Tailscale to `heim-nas`) is a follow-up issue. Until that ships: **manual backups are the only recovery option.**

### Rollback

Each release tag corresponds to a docker image tag on the host daemon (built via DooD; no registry). Rolling back to a previous tag is one command:

```bash
TAG=v1.0.0 docker compose \
  -f docker-compose.prod.yml \
  -p archiv-production \
  --env-file /opt/familienarchiv/.env.production \
  up -d --wait --remove-orphans
```

If the rollback target image is no longer present on the host (host disk pruned, etc.), re-trigger `release.yml` for that tag from Gitea Actions UI — it rebuilds and redeploys.

**Flyway migrations are not auto-rolled-back.** If a release contained a destructive migration (drop column, rename table), a tag rollback brings the schema back to a previous app version but the data shape has already changed. For breaking schema changes, prefer a forward-only fix.

---

## 6. Common operational tasks

### Reset dev database (truncates data, keeps schema)

```bash
bash scripts/reset-db.sh
```

> Truncates all data but does **not** drop the schema or re-run Flyway. Use for E2E test resets, not full reinstalls.
> ⚠️ Script hardcodes `DB_USER=archive_user` and `DB_NAME=family_archive_db` — if you customised these in `.env`, edit the script accordingly.

### Rebuild frontend container (clears node_modules volume)

```bash
bash scripts/rebuild-frontend.sh
```

> Assumes the Docker Compose volume is named `familienarchiv_frontend_node_modules`. If your project directory is not named `familienarchiv`, edit line 16 of the script.

### Download Kraken OCR models

```bash
bash scripts/download-kraken-models.sh
```

> Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.

### Trigger a mass import (Excel/ODS)

**Dev:** drop the ODS spreadsheet + PDFs into `./import/` at the repo root — the dev compose bind-mounts it to `/import` automatically.

**Staging/production:**

1. Pre-stage the payload on the host. Convention: `/srv/familienarchiv-staging/import/` or `/srv/familienarchiv-production/import/`.
   ```bash
   rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
   ```
2. Make sure `IMPORT_HOST_DIR=<host-path>` is set in `.env.staging` / `.env.production` (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
3. Redeploy the stack so the bind mount picks up — or, if the mount is already in place, skip to step 4.
4. Call `POST /api/admin/trigger-import` (requires `ADMIN` permission), or click the "Import starten" button on `/admin/system`.
5. The import runs asynchronously — poll `GET /api/admin/import-status`, watch `/admin/system`, or tail the backend logs.

---

## 7. Known limitations

| Limitation | Reason | Reference |
|---|---|---|
| **Single-node OCR service** | The two required OCR engines (Surya + Kraken) exist only in the Python ecosystem; horizontal scaling would require a job queue not currently implemented | [ADR-001](adr/001-ocr-python-microservice.md) |
| **No multi-tenancy** | Designed as a single-family private archive; all authenticated users share the same document space | Deliberate scope decision (family-only product frame) |
| **No multi-region** | Single PostgreSQL + MinIO instance; no replication or failover | Deliberate scope decision |
| **Max upload size** | 50 MB per file (500 MB per request for multi-file) | Configurable in `application.yaml` (`spring.servlet.multipart`) |
| **No automated backup** | Phase 5 of Production v1 milestone is not yet implemented | See §5 above |