docs(deployment): document fail2ban symlink, OCR_MEM_LIMIT, smoke test

Updates DEPLOYMENT.md to match the infra changes in this PR:

- §1 OCR memory: point operators at the new OCR_MEM_LIMIT env var instead of telling them to edit "the prod overlay".
- §2 OCR env vars: add OCR_MEM_LIMIT to the table.
- §3.1 server setup: replace fail2ban prose with concrete `ln -sf` commands referencing the committed jail/filter. Document the single-tenant runner assumption near the runner-registration step.
- §3.4 first deploy: describe the new automated smoke test step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 12:07:59 +02:00
parent 83565c6bb5
commit ba5bd9cb11
1 changed file with 25 additions and 9 deletions
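The workflow's "Smoke test deployed environment" step is documented in §3.4 but is not part of this diff. As a rough illustration only, the three assertions it is described as making could be sketched in shell as below. The function names (`http_status`, `has_hsts`, `smoke_test`) and the `BASE_URL` variable are hypothetical, and it is an assumption that `/actuator/health` is probed on the same staging host as `/login`.

```shell
#!/bin/sh
# Hypothetical sketch of the documented smoke-test assertions.
# This is NOT the workflow's actual step; names and structure are assumptions.

BASE_URL="${BASE_URL:-https://staging.raddatz.cloud}"

# Print only the HTTP status code for the given URL (000 if the request fails).
http_status() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Succeed if the response headers read from stdin contain an HSTS header.
has_hsts() {
    grep -qi '^strict-transport-security:'
}

# Run the three checks in order; return non-zero on the first failure.
smoke_test() {
    [ "$(http_status "$BASE_URL/login")" = "200" ] \
        || { echo "FAIL: /login did not return 200" >&2; return 1; }

    curl -sI --max-time 10 "$BASE_URL/login" | has_hsts \
        || { echo "FAIL: HSTS header missing" >&2; return 1; }

    # Defense-in-depth: the actuator endpoint should not be reachable publicly.
    [ "$(http_status "$BASE_URL/actuator/health")" = "404" ] \
        || { echo "FAIL: /actuator/health did not return 404" >&2; return 1; }

    echo "smoke test passed"
}
```

After sourcing the file, calling `smoke_test` runs the checks against `BASE_URL`; pointing `BASE_URL` at another environment reuses the same assertions there.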
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -54,7 +54,7 @@ The OCR service requires significant RAM for model loading. The dev compose sets
 | Hetzner CX32 | 8 GB | 6 GB | Accept reduced batch sizes and slower throughput |
 | Hetzner CX22 | 4 GB | — | Disable the OCR service (`profiles: [ocr]`); run OCR on demand only |

-A CX32 cannot honour a `mem_limit: 12g` — set it to `6g` in the prod overlay or use CX42.
+A CX32 cannot honour the default `mem_limit: 12g` — set the `OCR_MEM_LIMIT=6g` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow) before deploying on a CX32. The prod compose interpolates this var with a 12g default.

 ### Dev vs production differences

@@ -131,6 +131,7 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `ALLOWED_PDF_HOSTS` | SSRF protection — comma-separated list of allowed PDF source hosts. **Do not widen to `*`** | `minio,localhost,127.0.0.1` | YES | — |
 | `KRAKEN_MODEL_PATH` | Directory containing Kraken HTR models (populated by `download-kraken-models.sh`) | `/app/models/` | — | — |
 | `BLLA_MODEL_PATH` | Kraken baseline layout analysis model path | `/app/models/blla.mlmodel` | — | — |
+| `OCR_MEM_LIMIT` | Container memory cap for ocr-service in `docker-compose.prod.yml`. Set to `6g` on CX32 hosts; leave unset on CX42+ to use the 12g default | `12g` (prod compose default) | — | — |

 ---

@@ -152,17 +153,28 @@ apt install caddy
 ln -sf /opt/familienarchiv/infra/caddy/Caddyfile /etc/caddy/Caddyfile
 systemctl reload caddy

-# fail2ban — protect /api/auth/login from credential stuffing
-# Jail watches Caddy access log for 401 responses on /api/auth/login.
-#   maxretry=10  findtime=10m  bantime=30m
+# fail2ban — protect /api/auth/login from credential stuffing.
+# Jail watches the Caddy JSON access log for 401 responses on
+# /api/auth/login. The jail (maxretry=10 / findtime=10m / bantime=30m)
+# and filter are committed under infra/fail2ban/ — symlink them in:
 apt install fail2ban
-# Drop the jail definition under /etc/fail2ban/jail.d/familienarchiv.conf
+ln -sf /opt/familienarchiv/infra/fail2ban/jail.d/familienarchiv.conf \
+       /etc/fail2ban/jail.d/familienarchiv.conf
+ln -sf /opt/familienarchiv/infra/fail2ban/filter.d/familienarchiv-auth.conf \
+       /etc/fail2ban/filter.d/familienarchiv-auth.conf
+systemctl reload fail2ban
+# Verify after first deploy with:
+#   fail2ban-client status familienarchiv-auth
+#   fail2ban-regex /var/log/caddy/access.log familienarchiv-auth

 # Tailscale — used by the backup pipeline to reach heim-nas (follow-up issue)
 curl -fsSL https://tailscale.com/install.sh | sh && tailscale up

-# Self-hosted Gitea runner — register against the repo with a runner token
-# (see https://docs.gitea.com/usage/actions/quickstart for the register step)
+# Self-hosted Gitea runner — register against the repo with a runner token.
+# This runner is assumed single-tenant: the deploy workflows write .env.*
+# files to disk during execution (cleaned up unconditionally on completion).
+# A multi-tenant runner would need to switch to stdin-piped env files.
+# (See https://docs.gitea.com/usage/actions/quickstart for the register step.)
 ```

 ### 3.2 DNS records
@@ -198,8 +210,12 @@ git.raddatz.cloud      A   <server IP>

 ```bash
 # 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
-#    Expected: docker compose up -d --wait succeeds for archiv-staging
-# 2. Verify TLS + reverse proxy
+#    Expected: docker compose up -d --wait succeeds for archiv-staging, then
+#    the workflow's "Smoke test deployed environment" step asserts:
+#      - https://staging.raddatz.cloud/login returns 200
+#      - HSTS header is present
+#      - /actuator/health returns 404 (defense-in-depth check)
+# 2. (Optional) Re-verify manually
 curl -I https://staging.raddatz.cloud/
 #    Expected: 200 (login page) with HSTS + X-Content-Type-Options headers
 # 3. When staging looks healthy, push a v* tag to trigger release.yml