
Familienarchiv — Deployment Reference

If the app is down right now → jump to §4 Logs.

This doc is the Day-1 checklist and operational reference. It links to the canonical infrastructure docs in docs/infrastructure/ rather than duplicating them.

Audience: operator bringing up a fresh instance, or Successor-X debugging a live incident.

Ownership: project owner. Update this file in any PR that changes the container topology, env vars, or backup procedure.

Table of Contents

  1. Deployment topology
  2. Environment variables
  3. Bootstrap from scratch
  4. Logs + observability
  5. Backup + recovery
  6. Common operational tasks
  7. Known limitations

1. Deployment topology

graph TD
    Browser -->|HTTPS| Caddy["Caddy (TLS termination)"]
    Caddy -->|HTTP :3000| Frontend["Web Frontend\nSvelteKit Node adapter"]
    Caddy -->|HTTP :8080| Backend["API Backend\nSpring Boot / Jetty :8080"]
    Backend -->|JDBC :5432| DB[(PostgreSQL 16)]
    Backend -->|S3 API :9000| MinIO[(MinIO)]
    Backend -->|HTTP :8000 internal| OCR["OCR Service\nPython FastAPI"]
    OCR -->|presigned URL| MinIO
    Caddy -->|SSE proxy_pass| Backend

Key facts:

  • Caddy terminates TLS and reverse-proxies to frontend (:3000) and backend (:8080). The Caddyfile is committed at infra/caddy/Caddyfile and is installed on the host as /etc/caddy/Caddyfile (symlink).
  • The host binds all docker-published ports to 127.0.0.1 only; Caddy is the sole external entry point.
  • The OCR service has no published port — reachable only on the internal Docker network from the backend.
  • SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
  • The Caddyfile responds 404 on /actuator/* (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
  • Production and staging cohabit on the same host via docker compose project names: archiv-production (ports 8080/3000) and archiv-staging (ports 8081/3001).
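
Because both stacks share one host and one compose file, any manual docker compose invocation should pin the project name and env file explicitly. A minimal sketch (the staging env-file path is an assumption; the production path matches the rollback example in §5):

# Inspect each stack without touching the other:
docker compose -p archiv-production -f docker-compose.prod.yml \
  --env-file /opt/familienarchiv/.env.production ps
docker compose -p archiv-staging -f docker-compose.prod.yml \
  --env-file /opt/familienarchiv/.env.staging ps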

OCR memory requirements

The OCR service requires significant RAM for model loading. The dev compose sets mem_limit: 12g.

| Production target | RAM | Recommended OCR limit | Notes |
|---|---|---|---|
| Hetzner CX42 | 16 GB | 12 GB | Recommended for OCR-enabled production |
| Hetzner CX32 | 8 GB | 6 GB | Accept reduced batch sizes and slower throughput |
| Hetzner CX22 | 4 GB | n/a | Disable the OCR service (profiles: [ocr]); run OCR on demand only |

A CX32 cannot honour the default mem_limit: 12g — set OCR_MEM_LIMIT=6g (in .env.production / .env.staging, or as a Gitea secret consumed by the workflow) before deploying. The prod compose interpolates this variable with a 12g default.
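
To check what a given host will actually get, render the interpolated config before deploying. A quick sanity check, assuming the standard file layout from §5:

# Render the effective prod config and inspect the OCR memory cap:
OCR_MEM_LIMIT=6g docker compose -f docker-compose.prod.yml \
  --env-file /opt/familienarchiv/.env.production config | grep mem_limit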

Dev vs production differences

| Concern | Dev (docker-compose.yml) | Prod (docker-compose.prod.yml) |
|---|---|---|
| MinIO image tag | minio/minio:latest | Pinned minio/minio:RELEASE.… |
| Data persistence | Bind mounts ./data/postgres, ./data/minio | Named Docker volumes (postgres-data, minio-data) |
| MinIO credentials for backend | Root user/password | Service account archiv-app with bucket-scoped rights |
| Bucket creation | create-buckets helper | Same helper, plus service-account bootstrap on every up |
| Spring profile | dev,e2e (Swagger + e2e overrides) | unset — base application.yaml is production-ready |
| Mail | Mailpit (local catcher) | Real SMTP (production) / Mailpit via profiles: [staging] (staging) |
| Frontend image | Dev server, target: development, port 5173 | Node adapter, target: production, port 3000 |
| Host port binding | All published | Bound to 127.0.0.1 only; Caddy is the front door |
| Deploy method | docker compose up -d (manual) | Gitea Actions: nightly.yml (staging, cron) and release.yml (production, on v* tag) — both use up -d --wait |

Full prod compose: docker-compose.prod.yml. Workflow files: .gitea/workflows/nightly.yml, .gitea/workflows/release.yml.


2. Environment variables

All vars are set in .env at the repo root (copy from .env.example). The backend resolves them via application.yaml; the Docker Compose file wires them into each container.

Any var found in docker-compose.yml or application*.yaml that is not in this table is a blocking review comment on any PR that changes those files.

Backend

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| SPRING_DATASOURCE_URL | PostgreSQL JDBC URL | | YES | |
| SPRING_DATASOURCE_USERNAME | DB username | | YES | |
| SPRING_DATASOURCE_PASSWORD | DB password | | YES | YES |
| S3_ENDPOINT | MinIO / OBS endpoint URL | | YES | |
| S3_ACCESS_KEY | MinIO access key (use service account, not root in prod) | | YES | YES |
| S3_SECRET_KEY | MinIO secret key | | YES | YES |
| S3_BUCKET_NAME | Target bucket name | | YES | |
| S3_REGION | S3 region string | us-east-1 | YES | |
| APP_ADMIN_USERNAME | Bootstrap admin username (⚠ not in .env.example) | admin | YES | |
| APP_ADMIN_PASSWORD | Bootstrap admin password (⚠ ships as admin123) | admin123 | YES | YES |
| APP_BASE_URL | Public-facing URL for email links | http://localhost:3000 | YES (prod) | |
| APP_OCR_BASE_URL | Internal URL of the OCR service | | YES | |
| APP_OCR_TRAINING_TOKEN | Secret token for OCR training endpoints | | YES (prod) | YES |
| IMPORT_HOST_DIR | Absolute host path holding the ODS spreadsheet + PDFs for the /admin/system mass-import card. Mounted read-only at /import inside the backend (compose-only — backend reads via app.import.dir). Compose refuses to start when unset, so staging and prod cannot accidentally share the source. Convention: /srv/familienarchiv-staging/import and /srv/familienarchiv-production/import | | YES (prod compose) | |
| MAIL_HOST | SMTP host | mailpit (dev) | YES (prod) | |
| MAIL_PORT | SMTP port | 1025 (dev) | YES (prod) | |
| MAIL_USERNAME | SMTP username | | YES (prod) | YES |
| MAIL_PASSWORD | SMTP password | | YES (prod) | YES |
| APP_MAIL_FROM | From address for outbound mail | noreply@familienarchiv.local | YES (prod) | |
| MAIL_SMTP_AUTH | SMTP auth enabled | false (dev) | YES (prod) | |
| MAIL_STARTTLS_ENABLE | STARTTLS enabled | false (dev) | YES (prod) | |
| SPRING_PROFILES_ACTIVE | Spring profile | dev,e2e (compose) | YES | |

PostgreSQL container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| POSTGRES_USER | DB superuser | archive_user | YES | |
| POSTGRES_PASSWORD | DB password | change-me | YES | YES |
| POSTGRES_DB | Database name | family_archive_db | YES | |

MinIO container

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| MINIO_ROOT_USER | MinIO root username (dev compose only — prod compose hardcodes archiv) | minio_admin | YES (dev) | |
| MINIO_ROOT_PASSWORD / MINIO_PASSWORD | MinIO root password. Used only by the mc admin bootstrap in prod, never by the backend. | change-me | YES | YES |
| MINIO_APP_PASSWORD | Password for the archiv-app service account that the backend uses. Bucket-scoped via readwrite policy on familienarchiv. Bootstrapped by create-buckets. | | YES (prod) | YES |
| MINIO_DEFAULT_BUCKETS | Bucket name (dev compose only — prod compose hardcodes familienarchiv) | archive-documents | YES (dev) | |
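
For orientation, the service-account bootstrap performed by create-buckets plausibly boils down to an mc sequence like the one below. This is a sketch, not the committed helper: the alias name is invented, and the builtin readwrite policy stands in for whatever bucket-scoped policy the helper actually applies.

# Sketch only — see the create-buckets helper for the real sequence:
mc alias set local http://minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
mc mb --ignore-existing local/familienarchiv
mc admin user add local archiv-app "$MINIO_APP_PASSWORD"
mc admin policy attach local readwrite --user archiv-app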

OCR service

| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| TRAINING_TOKEN | Guards /train and /segtrain endpoints (accepts file uploads) | | YES (prod) | YES |
| ALLOWED_PDF_HOSTS | SSRF protection — comma-separated list of allowed PDF source hosts. Do not widen to * | minio,localhost,127.0.0.1 | YES | |
| KRAKEN_MODEL_PATH | Directory containing Kraken HTR models (populated by download-kraken-models.sh) | /app/models/ | | |
| BLLA_MODEL_PATH | Kraken baseline layout analysis model path | /app/models/blla.mlmodel | | |
| OCR_MEM_LIMIT | Container memory cap for ocr-service in docker-compose.prod.yml. Set to 6g on CX32 hosts; leave unset on CX42+ to use the 12g default | 12g (prod compose default) | | |

3. Bootstrap from scratch

Production and staging deploy via Gitea Actions (release.yml on v* tag, nightly.yml on cron). The server itself only needs to host Caddy, Docker, and the runner — the workflows handle the rest.

3.1 Server one-time setup

# Base hardening
ufw default deny incoming && ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 443/tcp && ufw enable
# /etc/ssh/sshd_config: PasswordAuthentication no, PermitRootLogin no

# Install Caddy 2 (https://caddyserver.com/docs/install#debian-ubuntu-raspbian)
apt install caddy

# Use the Caddyfile from the repo (replace path with the runner's clone target)
# CI DEPENDENCY: the nightly and release workflows run `systemctl reload caddy` to
# pick up committed Caddyfile changes. They find the file via this symlink — if it
# is absent or points elsewhere, the reload succeeds but serves stale config.
ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
systemctl reload caddy
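
# Optional sanity check: confirm the symlinked Caddyfile parses before the
# first workflow-driven reload (caddy validate ships with Caddy 2):
caddy validate --config /etc/caddy/Caddyfile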

# fail2ban — protect /api/auth/login from credential stuffing.
# Jail watches the Caddy JSON access log for 401 responses on
# /api/auth/login. The jail (maxretry=10 / findtime=10m / bantime=30m)
# and filter are committed under infra/fail2ban/ — symlink them in:
apt install fail2ban
ln -sf /opt/familienarchiv/infra/fail2ban/jail.d/familienarchiv.conf \
       /etc/fail2ban/jail.d/familienarchiv.conf
ln -sf /opt/familienarchiv/infra/fail2ban/filter.d/familienarchiv-auth.conf \
       /etc/fail2ban/filter.d/familienarchiv-auth.conf
systemctl reload fail2ban
# Verify after first deploy with:
#   fail2ban-client status familienarchiv-auth
#   fail2ban-regex /var/log/caddy/access.log familienarchiv-auth

# Tailscale — used by the backup pipeline to reach heim-nas (follow-up issue)
curl -fsSL https://tailscale.com/install.sh | sh && tailscale up

# Self-hosted Gitea runner — register against the repo with a runner token.
# This runner is assumed single-tenant: the deploy workflows write .env.*
# files to disk during execution (cleaned up unconditionally on completion).
# A multi-tenant runner would need to switch to stdin-piped env files.
# (See https://docs.gitea.com/usage/actions/quickstart for the register step.)

3.2 DNS records

archiv.raddatz.cloud   A   <server IP>
staging.raddatz.cloud  A   <server IP>
git.raddatz.cloud      A   <server IP>
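
Caddy can only obtain certificates once these records resolve. A quick pre-deploy check:

# All three should print the server IP:
dig +short archiv.raddatz.cloud staging.raddatz.cloud git.raddatz.cloud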

3.3 Gitea secrets (Repo → Settings → Actions → Secrets)

| Secret | Used by | Notes |
|---|---|---|
| PROD_POSTGRES_PASSWORD | release.yml | strong unique password |
| PROD_MINIO_PASSWORD | release.yml | MinIO root password; used only at bootstrap |
| PROD_MINIO_APP_PASSWORD | release.yml | application service-account password |
| PROD_OCR_TRAINING_TOKEN | release.yml | python3 -c "import secrets; print(secrets.token_hex(32))" |
| PROD_APP_ADMIN_USERNAME | release.yml | e.g. admin@archiv.raddatz.cloud |
| PROD_APP_ADMIN_PASSWORD | release.yml | ⚠ locked permanently on first deploy — see §3.5 |
| STAGING_POSTGRES_PASSWORD | nightly.yml | different from prod |
| STAGING_MINIO_PASSWORD | nightly.yml | different from prod |
| STAGING_MINIO_APP_PASSWORD | nightly.yml | different from prod |
| STAGING_OCR_TRAINING_TOKEN | nightly.yml | different from prod |
| STAGING_APP_ADMIN_USERNAME | nightly.yml | e.g. admin@staging.raddatz.cloud |
| STAGING_APP_ADMIN_PASSWORD | nightly.yml | locked on first staging deploy |
| MAIL_HOST | release.yml | SMTP relay hostname (prod only) |
| MAIL_PORT | release.yml | typically 587 |
| MAIL_USERNAME | release.yml | SMTP user |
| MAIL_PASSWORD | release.yml | SMTP password |
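
For the "strong unique password" entries, any CSPRNG output will do; two interchangeable one-liners:

openssl rand -base64 32
python3 -c "import secrets; print(secrets.token_urlsafe(32))"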

3.4 First deploy

# 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
#    Expected: docker compose up -d --wait succeeds for archiv-staging, then
#    the workflow's "Smoke test deployed environment" step asserts:
#      - https://staging.raddatz.cloud/login returns 200
#      - HSTS header is present
#      - /actuator/health returns 404 (defense-in-depth check)
# 2. (Optional) Re-verify manually
curl -I https://staging.raddatz.cloud/
#    Expected: 200 (login page) with HSTS + X-Content-Type-Options headers
# 3. When staging looks healthy, push a v* tag to trigger release.yml
git tag v1.0.0 && git push origin v1.0.0
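
If the workflow output is ambiguous, the smoke-test assertions from step 1 can be replayed by hand. A rough equivalent of what the step checks:

# 200 on the login page, HSTS present, actuator blocked:
curl -s -o /dev/null -w '%{http_code}\n' https://staging.raddatz.cloud/login
curl -sI https://staging.raddatz.cloud/login | grep -i strict-transport-security
curl -s -o /dev/null -w '%{http_code}\n' https://staging.raddatz.cloud/actuator/health   # expect 404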

3.5 ⚠ Admin password is locked on first deploy

UserDataInitializer creates the admin user only if the email does not exist. The first successful deploy persists the admin password to the database. Changing PROD_APP_ADMIN_PASSWORD in Gitea secrets after that point has no effect — the secret is only consulted when the row is missing.

Before the first deploy: rotate PROD_APP_ADMIN_PASSWORD to a strong value. After the first deploy: change the admin password via the in-app account settings, not via the Gitea secret.
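
When unsure whether the bootstrap has already run, inspect the database directly. The table and column names below are guesses, not the real schema; verify them against the Flyway migrations first:

# Hypothetical schema — confirm names against the migrations before relying on this:
docker exec archive-db psql -U "$POSTGRES_USER" "$POSTGRES_DB" \
  -c "SELECT id, email, created_at FROM users LIMIT 5;"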


4. Logs + observability

First-response commands

# Stream backend logs (most useful first)
docker compose logs --follow --tail=100 backend

# Stream all services
docker compose logs --follow

# Single snapshot
docker compose logs --tail=200 <service>
# services: frontend, backend, db, minio, ocr-service
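
Caddy runs as a host service rather than a container, so its logs are not in docker compose logs. The access-log path below is the one the fail2ban jail watches (§3.1):

# Caddy service log + JSON access log:
journalctl -u caddy --since "1 hour ago"
tail -f /var/log/caddy/access.log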

Log locations

  • Backend application log: stdout (captured by Docker). Access inside the container at /app/logs/ via docker exec.
  • Spring Actuator health: http://localhost:8080/actuator/health (internal only in prod — port 8081 for Prometheus scraping)
  • Prometheus scraping: management port 8081, path /actuator/prometheus. Internal only; Caddy blocks /actuator/* externally.
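
To hit the management port from inside the Docker network, something like the following works, with caveats: the container name is an assumption, and the image may ship wget instead of curl:

# Health via the internal management port, bypassing Caddy:
docker exec archive-backend curl -s http://localhost:8081/actuator/health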

Future observability

Phase 7 of the Production v1 milestone adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet.


5. Backup + recovery

Current state — no automated backup

No automated backup is configured. Manual procedure for a point-in-time backup:

# PostgreSQL dump
docker exec archive-db pg_dump -U ${POSTGRES_USER} ${POSTGRES_DB} > backup-$(date +%Y%m%d).sql

# MinIO data (bind-mounted in dev)
# Copy ./data/minio/ to external storage
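
In production MinIO uses a named volume, so the bind-mount copy above does not apply. One hedged alternative, assuming the MinIO port is published on loopback as described in §1, using the backend's service-account credentials:

# Mirror the bucket to local disk with mc:
mc alias set archiv-local http://127.0.0.1:9000 "$S3_ACCESS_KEY" "$S3_SECRET_KEY"
mc mirror archiv-local/familienarchiv ./minio-backup-$(date +%Y%m%d)/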

Restoration:

# Restore Postgres
docker exec -i archive-db psql -U ${POSTGRES_USER} ${POSTGRES_DB} < backup-YYYYMMDD.sql

Planned — phase 5 of Production v1 milestone

Automated backup (nightly pg_dump + MinIO mc mirror over Tailscale to heim-nas) is a follow-up issue. Until that ships: manual backups are the only recovery option.

Rollback

Each release tag corresponds to a Docker image tag on the host daemon (built via DooD, i.e. Docker-outside-of-Docker; no registry is involved). Rolling back to a previous tag is one command:

TAG=v1.0.0 docker compose \
  -f docker-compose.prod.yml \
  -p archiv-production \
  --env-file /opt/familienarchiv/.env.production \
  up -d --wait --remove-orphans

If the rollback target image is no longer present on the host (host disk pruned, etc.), re-trigger release.yml for that tag from Gitea Actions UI — it rebuilds and redeploys.

Flyway migrations are not auto-rolled-back. If a release contained a destructive migration (drop column, rename table), a tag rollback brings the schema back to a previous app version but the data shape has already changed. For breaking schema changes, prefer a forward-only fix.


6. Common operational tasks

Reset dev database (truncates data, keeps schema)

bash scripts/reset-db.sh

Truncates all data but does not drop the schema or re-run Flyway. Use for E2E test resets, not full reinstalls. ⚠️ Script hardcodes DB_USER=archive_user and DB_NAME=family_archive_db — if you customised these in .env, edit the script accordingly.

Rebuild frontend container (clears node_modules volume)

bash scripts/rebuild-frontend.sh

Assumes the Docker Compose volume is named familienarchiv_frontend_node_modules. If your project directory is not named familienarchiv, edit line 16 of the script.
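
To confirm the actual volume name before editing the script:

docker volume ls | grep node_modules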

Download Kraken OCR models

bash scripts/download-kraken-models.sh

Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.

Trigger a mass import (Excel/ODS)

Dev: drop the ODS spreadsheet + PDFs into ./import/ at the repo root — the dev compose bind-mounts it to /import automatically.

Staging/production:

  1. Pre-stage the payload on the host. Convention: /srv/familienarchiv-staging/import/ or /srv/familienarchiv-production/import/.
    rsync -avh --progress ./import/ user@host:/srv/familienarchiv-staging/import/
    
  2. Make sure IMPORT_HOST_DIR=<host-path> is set in .env.staging / .env.production (the nightly/release workflows already write this — see §3). Compose refuses to start without it.
  3. Redeploy the stack so the bind mount takes effect — or, if the mount is already in place, skip to step 4.
  4. Call POST /api/admin/trigger-import (requires ADMIN permission), or click the "Import starten" button on /admin/system.
  5. The import runs asynchronously — poll GET /api/admin/import-status, watch /admin/system, or tail the backend logs.
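
A CLI sketch of steps 4 and 5, assuming cookie-based session auth (how $SESSION_COOKIE is obtained depends on the login flow; the endpoints are the ones named above):

# Trigger the import, then poll the status endpoint:
curl -X POST -H "Cookie: $SESSION_COOKIE" https://archiv.raddatz.cloud/api/admin/trigger-import
curl -s -H "Cookie: $SESSION_COOKIE" https://archiv.raddatz.cloud/api/admin/import-status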

7. Known limitations

| Limitation | Reason | Reference |
|---|---|---|
| Single-node OCR service | The two required OCR engines (Surya + Kraken) exist only in the Python ecosystem; horizontal scaling would require a job queue not currently implemented | ADR-001 |
| No multi-tenancy | Designed as a single-family private archive; all authenticated users share the same document space | Deliberate scope decision (family-only product frame) |
| No multi-region | Single PostgreSQL + MinIO instance; no replication or failover | Deliberate scope decision |
| Max upload size | 50 MB per file (500 MB per request for multi-file) | Configurable in application.yaml (spring.servlet.multipart) |
| No automated backup | Phase 5 of Production v1 milestone is not yet implemented | See §5 above |
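
The upload limits above map to standard Spring Boot multipart properties. As a pointer (verify the exact keys in this project's application.yaml):

# application.yaml keys (values per the table above):
#   spring.servlet.multipart.max-file-size: 50MB
#   spring.servlet.multipart.max-request-size: 500MB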