feat(infra): production deployment pipeline — Caddy, staging, Gitea Actions (#497) #499

Merged
marcel merged 39 commits from feat/issue-497-prod-deploy into main 2026-05-11 14:29:33 +02:00

Summary

Lands the production deployment pipeline described in #497. Two environments, both deployed by Gitea Actions on the existing self-hosted runner (Docker-out-of-Docker, no registry):

  • archiv.raddatz.cloud — production, on v* tag push (release.yml)
  • staging.raddatz.cloud — staging, nightly from main (nightly.yml)

All decisions from Tobias's and Nora's review summaries (comments 8331 and 8333) are implemented.

Commits (TDD-ordered, atomic)

| # | Commit | What |
|---|--------|------|
| 1 | `feat(security): trust X-Forwarded-Proto behind reverse proxy` | `server.forward-headers-strategy: native` + failing-first integration test in `backend/.../config/ForwardHeadersConfigurationTest` |
| 2 | `fix(auth): mark /hilfe/transkription as public for prerender` | Latent bug: the prerendered help page was 302→/login in the auth hook, blocking `npm run build` in the production Dockerfile stage. Surfaced by #497. |
| 3 | `feat(frontend): add production stage to Dockerfile` | Multi-stage `development` / `build` / `production`; pinned `node:20.19.0-alpine3.21`; dev compose now specifies `target: development` so the dev workflow is unchanged. |
| 4 | `feat(infra): add docker-compose.prod.yml for production/staging` | Standalone (not overlay); named volumes; pinned MinIO; `archiv-app` service account, not root; mailpit under `profiles: [staging]`; OCR healthcheck + `mem_limit: 12g`; ports bound to `127.0.0.1` only; admin creds wired through env. |
| 5 | `feat(infra): add production Caddyfile` | Both vhosts + git vhost; HSTS, X-Content-Type-Options, Referrer-Policy, `-Server`; `/actuator/*` → 404. **No** `X-Frame-Options` (Spring Security configures SAMEORIGIN for PDF iframes — would conflict). Validated with `caddy validate` against `caddy:2`. |
| 6 | `feat(ci): add nightly staging deploy workflow` | Cron `0 2 * * *` + `workflow_dispatch`; `DOCKER_BUILDKIT=1`; `--profile staging` (mailpit); `up -d --wait`; env-file cleanup `if: always()`. |
| 7 | `feat(ci): add release production deploy workflow` | Fires on `v*` tag; tags built images with `${{ gitea.ref_name }}` so rollback is a `TAG=<previous>` one-liner. |
| 8 | `docs(deployment): rewrite for Gitea Actions / Caddy / prod compose` | DEPLOYMENT.md updated: topology, env-var table (with `MINIO_APP_PASSWORD` and the admin-password first-deploy warning), 16-secrets bootstrap checklist, fail2ban + Tailscale install steps, rollback procedure. |
| 9 | `docs: retire overlay narrative; add Caddy to C4 L2 diagram` | `docs/infrastructure/production-compose.md` trimmed (live files now in-tree); `docs/architecture/c4/l2-containers.puml` gains Caddy. |

Test plan

Local verification (all green at HEAD of this branch):

  • ./mvnw test — full backend suite, 1566 tests pass (incl. new ForwardHeadersConfigurationTest)
  • docker compose config --quiet — dev compose still parses, still resolves target: development for frontend
  • docker compose -f docker-compose.prod.yml --env-file <stub> config --quiet — prod compose parses
  • docker compose -f docker-compose.prod.yml --env-file <stub> --profile staging config --quiet — staging profile parses, mailpit included
  • docker build --target production -f frontend/Dockerfile frontend/ — production image builds; docker run confirms it serves /login HTTP 200
  • docker run --rm -v ./infra/caddy/Caddyfile:/etc/caddy/Caddyfile:ro caddy:2 caddy validate — Valid configuration
  • python3 -c "import yaml; yaml.safe_load(open('.gitea/workflows/nightly.yml'))" and same for release.yml — both parse

Test plan for after merge:

  • Configure all 16 Gitea secrets per docs/DEPLOYMENT.md §3.3
  • On the server: install Caddy, ufw, fail2ban, Tailscale, the Gitea runner per §3.1
  • Symlink infra/caddy/Caddyfile → /etc/caddy/Caddyfile, systemctl reload caddy
  • Trigger nightly.yml via workflow_dispatch, confirm https://staging.raddatz.cloud/ returns 200 with HSTS header and /actuator/* returns 404
  • Push v1.0.0 tag, confirm release.yml deploys production; verify rollback by running TAG=<prev> docker compose ... up -d --wait

Deferred to follow-up issues (per #497 discussion)

  • CI image smoke-test job (Sara's suggestion) — keeps this PR focused
  • Backup pipeline (nightly pg_dump + MinIO mc mirror + rsync over Tailscale to heim-nas)
  • Observability stack (Prometheus / Loki / Grafana / Alertmanager) — issue #498

Closes #497.

🤖 Generated with Claude Code

marcel added 9 commits 2026-05-10 22:04:13 +02:00
Adds server.forward-headers-strategy: native so that Jetty honours
X-Forwarded-{Proto,For,Host} from Caddy. Without this, getScheme(),
redirect URLs, and Spring Session "Secure" cookies reflect the
internal http hop instead of the original https client request.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The route exports prerender = true and is listed in
svelte.config.js's prerender.entries. Until now the auth hook
redirected unauthenticated requests to /login, so the prerender
crawler hit a 302 and the build failed with "marked as prerenderable,
but were not prerendered".

Adding the path to PUBLIC_PATHS lets the crawler render the static
HTML; consistent with the route's intent as a public help page.

Surfaced by #497 (the production Docker build is the first place
npm run build runs in CI).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multi-stage Dockerfile with three targets:
- development (dev server on :5173, used by docker-compose.yml)
- build (runs npm run build, produces SvelteKit Node-adapter output)
- production (self-contained node build server on :3000)

Node base pinned to node:20.19.0-alpine3.21 for reproducible CI
builds (Renovate will keep it current).

docker-compose.yml now specifies target: development for the
frontend so dev continues to use the dev-server stage. Without
this, Docker would default to the last stage (production).

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Standalone production compose file (not an overlay) that runs the
full stack on a single host. Environment isolation is achieved via
the docker compose project name (-p archiv-production / -p
archiv-staging) so the two environments cohabit cleanly.

Key choices, resolved in #497 review:
- Named volumes for persistent data (no host bind mounts)
- MinIO pinned to a specific RELEASE tag (no :latest)
- Backend uses MinIO service account (S3_ACCESS_KEY=archiv-app),
  not root credentials; create-buckets bootstraps the account
- Mailpit lives under profiles: [staging] so no real SMTP secret
  is ever wired into the staging deploy
- OCR mem_limit 12g + healthcheck (start_period 120s) copied from
  the dev compose so docker compose up -d --wait works in CI
- Backend admin credentials wired through APP_ADMIN_USERNAME /
  APP_ADMIN_PASSWORD; first deploy locks the password in
  permanently because UserDataInitializer is idempotent on email
- All host ports bound to 127.0.0.1; Caddy fronts external traffic

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
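The "all host ports bound to 127.0.0.1" rule lends itself to a mechanical lint. A minimal sketch, operating on an inline compose-style mapping (service names and ports here are illustrative, not the real file):

```python
def non_loopback_services(services: dict) -> list:
    """Return names of services publishing a port not bound to 127.0.0.1."""
    offenders = []
    for name, cfg in services.items():
        for port in cfg.get("ports", []):
            if not str(port).startswith("127.0.0.1:"):
                offenders.append(name)
    return offenders

# Hypothetical excerpt: minio's short-form port would be world-reachable.
services = {
    "backend": {"ports": ["127.0.0.1:8080:8080"]},
    "minio": {"ports": ["9000:9000"]},
}
print(non_loopback_services(services))  # ['minio']
```

A check like this could run next to the existing `docker compose config --quiet` steps to keep the binding rule from regressing silently.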
Reverse proxy for the Familienarchiv host, validated against Caddy 2.
Includes both vhosts (production and staging), the Gitea vhost, and:

- HSTS, X-Content-Type-Options, Referrer-Policy headers on every site
- "-Server" header strip to hide the Caddy version
- /actuator/* responds 404 on both archive vhosts (defense in depth
  for Spring Boot's management endpoints)

X-Frame-Options is intentionally not set in Caddy: Spring Security
configures frame-options SAMEORIGIN for the in-app PDF preview
iframe; a DENY header here would conflict.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs daily at 02:00 (and on workflow_dispatch). Builds the prod
compose stack with BuildKit, writes a transient .env.staging from
Gitea secrets, then `docker compose up -d --wait` so the job fails
loudly if any service's healthcheck never reports healthy.

The --profile staging flag starts the mailpit catcher in place of
a real SMTP relay; no production SMTP credentials touch the staging
environment.

The .env.staging file is cleaned up in `if: always()` to avoid
leaving secrets in the runner workspace between runs.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
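The `0 2 * * *` schedule can be sanity-checked without triggering a run. A stdlib-only sketch (hypothetical helper, not part of the repo) that validates field count and plain numeric ranges, which is all this expression uses:

```python
# Field ranges: minute, hour, day-of-month, month, day-of-week.
CRON_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

def cron_fields_valid(expr: str) -> bool:
    """True if every field is '*' or a bare number within range.
    Does not cover step/range/list syntax (unused here)."""
    fields = expr.split()
    if len(fields) != len(CRON_RANGES):
        return False
    for field, (lo, hi) in zip(fields, CRON_RANGES):
        if field == "*":
            continue
        if not field.isdigit() or not lo <= int(field) <= hi:
            return False
    return True

print(cron_fields_valid("0 2 * * *"))   # True — the nightly schedule
print(cron_fields_valid("0 25 * * *"))  # False — hour out of range
```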
Fires on `v*` tag push. Tags the built images with the git tag so
rollbacks are a one-liner (TAG=<previous> docker compose ... up -d).

`up -d --wait` blocks until every service healthcheck reports
healthy; a bad release fails the workflow rather than crash-looping
silently. The .env.production file containing all Gitea secrets is
removed in `if: always()` after the deploy step.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings DEPLOYMENT.md in line with the production deployment landed
in #497:

- Topology diagram: frontend port 3000 (Node adapter), 127.0.0.1
  binding, project-name isolation between prod and staging
- Caddyfile now lives in-tree at infra/caddy/Caddyfile (symlinked
  onto the server)
- Dev vs prod table: documents the new deploy method (workflows +
  --wait) and the prod-compose specific differences
- Env vars: adds MINIO_APP_PASSWORD; notes that prod compose
  hardcodes the MinIO root user and the bucket name
- Bootstrap section: server hardening, fail2ban, Tailscale, the 16
  Gitea secrets, and the workflow_dispatch first-deploy step
- Admin password warning: first deploy locks the password, secret
  rotation after that point has no effect
- Rollback: TAG= override + docker compose up -d --wait

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs: retire overlay narrative; add Caddy to C4 L2 diagram
Some checks failed
CI / Unit & Component Tests (push) Failing after 7m31s
CI / OCR Service Tests (push) Successful in 49s
CI / Backend Unit Tests (push) Failing after 3m30s
CI / Unit & Component Tests (pull_request) Failing after 6m55s
CI / OCR Service Tests (pull_request) Successful in 51s
CI / Backend Unit Tests (pull_request) Failing after 3m31s
e4df17f308
- docs/infrastructure/production-compose.md: trimmed to VPS sizing,
  cost breakdown, and Hetzner ecosystem rationale. The inline
  compose spec (overlay + Hetzner OBS in prod) is retired; the
  live file is now docker-compose.prod.yml at the repo root and
  the Caddyfile lives at infra/caddy/Caddyfile. Observability
  stack is called out as a not-yet-deployed gap (issue #498).

- docs/architecture/c4/l2-containers.puml: adds Caddy as a named
  reverse-proxy container with the two port paths and notes the
  archiv-app service-account split on MinIO access.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 10 commits 2026-05-11 12:08:46 +02:00
Removes `:latest` from the mailpit service; pins to v1.29.7 so staging
deploys are reproducible. Renovate keeps the tag current.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removes the implicit `:latest` from the create-buckets bootstrap
container. Pins to RELEASE.2025-08-13T08-35-41Z so a breaking change in
mc CLI syntax cannot silently brick deploys.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous `mc admin policy attach … || true` swallowed every failure
mode: a renamed policy, an mc CLI signature change, or a transient MinIO
error would leave the bootstrap container exiting zero with the service
account possessing no permissions, and the backend would then fail every
S3 call after a "successful" deploy.

Replace the silent fallback with verify-after: keep the attach (idempotent
in current mc, redundant in older versions), then assert via `mc admin
user info` that `readwrite` ends up on archiv-app. A genuine attach
failure now exits 1 and blocks the stack from starting.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Hardcoded `mem_limit: 12g` only works on CX42+ (16 GB) hosts; a CX32 (8
GB) cannot honour it. Make both mem_limit and memswap_limit driven by
the OCR_MEM_LIMIT env var, defaulting to 12g so prod deploys on a CX42
keep current behaviour. Operators on smaller hosts override to 6g.

Verified compose interpolation produces 12 GiB by default and 6 GiB when
OCR_MEM_LIMIT=6g.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
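The compose behaviour relied on here is the `${VAR:-default}` interpolation form: the env value wins when set and non-empty, otherwise the default applies. A small stdlib sketch mimicking just that form (not compose's full interpolation grammar):

```python
import re

def interpolate(value: str, env: dict) -> str:
    """Expand ${VAR:-default}: env value if set and non-empty, else default."""
    def repl(match):
        var, default = match.group(1), match.group(2)
        return env.get(var) or default
    return re.sub(r"\$\{(\w+):-([^}]*)\}", repl, value)

print(interpolate("${OCR_MEM_LIMIT:-12g}", {}))                       # 12g
print(interpolate("${OCR_MEM_LIMIT:-12g}", {"OCR_MEM_LIMIT": "6g"}))  # 6g
```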
Adds an (access_log) snippet writing JSON-formatted access logs to
/var/log/caddy/access.log with 10mb rolling and 14-file retention. Both
archive vhosts (archiv.raddatz.cloud and staging.raddatz.cloud) import
it; the git vhost is intentionally excluded.

This is the prerequisite for the fail2ban jail committed in the next
commit — fail2ban tails this file looking for 401 responses on
/api/auth/login to defend against credential stuffing.

Validated with `caddy validate` against caddy:2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two files mirroring the on-host install layout:

  infra/fail2ban/filter.d/familienarchiv-auth.conf
  infra/fail2ban/jail.d/familienarchiv.conf

Filter parses the JSON access log emitted by Caddy (previous commit) and
matches 401 responses on /api/auth/login. Jail bans the offending IP for
30 min after 10 attempts in a 10-minute window.

Verified the failregex against four sample log lines via fail2ban-regex
in an alpine container:
  - 2 brute-force 401 attempts        → matched (ban)
  - 1 successful login (POST /api/auth/login 200) → not matched
  - 1 unrelated GET /login 200        → not matched
Date template "ts":{EPOCH} parses Caddy's Unix-epoch ts field.

The previous review iteration described this jail in DEPLOYMENT.md prose
only; committing it makes the security posture reproducible from a
fresh server build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
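The match/no-match behaviour verified above can be reproduced in miniature. The pattern and sample lines below are illustrative stand-ins, not the committed failregex (which uses fail2ban's own regex syntax); they only mirror its intent on Caddy-style JSON log lines:

```python
import re

# Flag a 401 on POST /api/auth/login; anything else passes.
PATTERN = re.compile(r'"method":"POST".*?"uri":"/api/auth/login".*?"status":401')

def is_login_failure(line: str) -> bool:
    return PATTERN.search(line) is not None

lines = [
    '{"ts":1767000000.1,"request":{"method":"POST","uri":"/api/auth/login"},"status":401}',
    '{"ts":1767000000.2,"request":{"method":"POST","uri":"/api/auth/login"},"status":200}',
    '{"ts":1767000000.3,"request":{"method":"GET","uri":"/login"},"status":200}',
]
print([is_login_failure(line) for line in lines])  # [True, False, False]
```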
Healthchecks prove containers are healthy on the docker network; they
do not prove the public URL is reachable, HSTS still fires, or
/actuator is still blocked at the edge. Add a post-deploy smoke step
to nightly.yml that:

  1. GETs https://staging.raddatz.cloud/login (frontend reachable)
  2. asserts the response includes the Strict-Transport-Security header
  3. asserts /actuator/health returns 404 (defense-in-depth verified)

Failure aborts the workflow before the env-file cleanup step. The
cleanup step still runs because it is `if: always()`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
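The three smoke assertions reduce to simple checks on status codes and response headers. A hypothetical helper (names illustrative; the real workflow step uses curl) capturing that logic:

```python
def smoke_ok(login_status: int, login_headers: dict, actuator_status: int) -> bool:
    """Mirror the nightly smoke step: frontend reachable over HTTPS,
    HSTS header present, /actuator blocked at the edge."""
    return (
        login_status == 200
        and "Strict-Transport-Security" in login_headers
        and actuator_status == 404
    )

print(smoke_ok(200, {"Strict-Transport-Security": "max-age=31536000"}, 404))  # True
print(smoke_ok(200, {}, 404))   # False — HSTS header dropped
print(smoke_ok(200, {"Strict-Transport-Security": "max-age=31536000"}, 200))  # False — actuator exposed
```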
Mirrors the nightly.yml smoke step against archiv.raddatz.cloud. Catches
the same failure modes (Caddy not reloaded, DNS missing, HSTS dropped,
/actuator block bypassed) on the prod path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two deploy workflows make two non-obvious assumptions that future
maintainers should not have to rediscover by reading the diff:

  1. Single-tenant self-hosted runner — the .env.* file lands on disk
     during the deploy and is cleaned up unconditionally. Multi-tenant
     usage would require switching to stdin-piped env input.

  2. Host docker layer cache is authoritative — there is no
     actions/cache directive; a host-level `docker system prune` will
     cold-start the next build.

Both notes added as block comments at the top of each workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(deployment): document fail2ban symlink, OCR_MEM_LIMIT, smoke test
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m53s
CI / OCR Service Tests (push) Failing after 1m58s
CI / Backend Unit Tests (push) Failing after 1m23s
CI / Unit & Component Tests (pull_request) Failing after 3m59s
CI / Backend Unit Tests (pull_request) Successful in 5m39s
CI / OCR Service Tests (pull_request) Successful in 1m14s
ba5bd9cb11
Updates DEPLOYMENT.md to match the infra changes in this PR:

§1 OCR memory — point operators at the new OCR_MEM_LIMIT env var instead
                of telling them to edit "the prod overlay".
§2 OCR env vars — add OCR_MEM_LIMIT to the table.
§3.1 server setup — replace fail2ban prose with concrete `ln -sf`
                    commands referencing the committed jail/filter.
                    Document the single-tenant runner assumption near
                    the runner-registration step.
§3.4 first deploy — describe the new automated smoke test step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

🏛️ Markus Keller — Application Architect

Verdict: ⚠️ Approved with concerns

The infrastructure design is sound — single-VPS, Docker Compose, project-name namespacing, MinIO self-hosted with a clean migration path. The decision to retire the documented overlay pattern in favour of a standalone docker-compose.prod.yml is the right call now that MinIO stays. My concerns are about documentation currency, not the implementation.

Blockers — doc-currency table violations (CLAUDE.md, ARCHITECTURE.md)

  1. Missing ADRs for three architectural decisions with lasting consequences. My review table says "Architectural decision with lasting consequences (new tech, new transport protocol, new pattern) → New ADR in docs/adr/". This PR makes three such decisions and commits none:

    • Standalone vs overlay — the issue discussion (#8331) explicitly retired the documented overlay approach. The reasoning ("overlay was designed around removing MinIO; since MinIO stays, standalone is cleaner") deserves an ADR so future maintainers don't reverse it out of ignorance.
    • MinIO stays on-VPS, Hetzner OBS deferred — explicit reversal of docs/infrastructure/production-compose.md's previous direction. Migration path is an mc mirror + 3 env vars, but the decision and its cost/benefit (13GB on-VPS now, ~5 EUR/mo OBS later) belong in an ADR.
    • Single-tenant runner, secrets-on-disk — the workflow comment names this as an operational assumption with a documented migration path (--env-file <(stdin)). That's the textbook content of an ADR: context, decision, alternatives, consequences.
  2. docs/architecture/c4/seq-auth-flow.puml not updated. server.forward-headers-strategy: native changes how Spring sees the scheme during cookie creation — the Secure cookie flag now depends on X-Forwarded-Proto: https from Caddy. My table says "Auth flow change → docs/architecture/c4/seq-auth-flow.puml". Update the sequence diagram to show Caddy → Spring with X-Forwarded-Proto: https and the Secure cookie set in the response.

Suggestions

  • docs/DEPLOYMENT.md mermaid (lines ~28–35 of the diff context) still shows Browser -->|SSE direct| Backend. In production all browser traffic transits Caddy. The C4 L2 diagram correctly says "SSE notifications … fronted by Caddy" but this top-level deployment doc lags. Small inconsistency.
  • The production-compose.md rewrite is a model trim — retires the overlay narrative, points to live files, retains the cost/sizing rationale. Good shape.

What I'd keep

  • C4 L2 update (Caddy added, MinIO service-account narrative, mc helper expanded) — exactly what the table demands when a new infrastructure component lands.
  • Standalone compose with project-name namespacing — boring, predictable, no overlay-config-merge surprises.
  • forward-headers-strategy: native with a backing config test — pushing trust to the edge of the system where it belongs.

Once the three ADRs and the auth-flow diagram land, I approve.

— Markus

🤖 Generated with Claude Code


👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Most of the diff is infra (Compose, workflows, Caddy, fail2ban) — outside the code style that bites me daily. The code-side changes are small and clean. Two test-design notes.

Suggestions

  1. ForwardHeadersConfigurationTest asserts the YAML string, not the behaviour.
    backend/src/test/java/org/raddatz/familienarchiv/config/ForwardHeadersConfigurationTest.java:36-43

    assertThat(forwardHeadersStrategy)
        .as("server.forward-headers-strategy must be 'native' …")
        .isEqualTo("native");
    

    This catches deletion of the property. It does not catch a typo (nativ, Native, framework) — Spring's ServerProperties.ForwardHeadersStrategy enum coerces unknown values to NONE silently in some configurations. The behaviour we care about is "Spring trusts X-Forwarded-Proto: https and request.getScheme() returns https". A behavioural test that dispatches a request with the header and asserts getScheme() would actually verify the contract — and would survive a future strategy rename.

  2. @SpringBootTest + Postgres container to verify a @Value-injected property is overkill.
    Same file, line 19-23:

    @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
    @ActiveProfiles("test")
    @Import(PostgresContainerConfig.class)
    

    @MockitoBean S3Client is also a dead giveaway that the test pulled in more context than it needs. Per my own playbook this is the full @SpringBootTest when @WebMvcTest suffices anti-pattern. Either trim to a property-source unit test, or — if you take suggestion 1 — promote to a @WebMvcTest that asserts the scheme behaviour end-to-end. The Postgres container is pure tax here.

What I'd keep

  • frontend/Dockerfile multi-stage: pinned node:20.19.0-alpine3.21, dev / build / production explicitly named, comments explain why the dev stage still COPY . .'s (cold-start image for devcontainer). Clean.
  • frontend/src/hooks.server.ts PUBLIC_PATHS addition: comment says why (prerendered help page — must be reachable without an auth cookie), not what. Exactly the form of comment I want — explains a constraint, doesn't narrate the line above.
  • docker-compose.yml's target: development one-liner with a comment explaining why the explicit target matters now that the Dockerfile is multi-stage. The kind of forensic detail future-me will need.
  • The atomic commit slicing on the branch is what I'd ship: red-by-design with ForwardHeadersConfigurationTest, then the YAML change, then the latent prerender fix, then infra commits. Reads as a clean trail.

— Felix

🤖 Generated with Claude Code



## 🔧 Tobias Wendt — DevOps & Platform Engineer

**Verdict: ⚠️ Approved with concerns**

This is a good production-deploy PR. Every concrete operational concern from my pre-merge review (#8331) landed: pinned image tags, named volumes, 127.0.0.1-only port binding, MinIO service account, `--wait` + smoke test, env-file cleanup, OCR healthcheck + parameterized mem_limit, fail2ban, rollback procedure. The remaining items are real but minor.

### Suggestions

1. **Renovate is promised in comments but not configured in this PR.**
   `docker-compose.prod.yml:48` says *"Pinned MinIO release for reproducible deploys; Renovate keeps it current."* Same comment on `:69` (mc) and `:94` (mailpit). I don't see `renovate.json` / `.gitea/renovate.json` / `.github/renovate.json` in the diff. If Renovate is already active on the repo at org level, fine — confirm in the PR description. If not, the comments are aspirational and these pinned tags will rot. File the Renovate-bootstrap follow-up issue now or strip "Renovate keeps it current" from the comments.

2. **`docker compose … build` doesn't `--pull` the base images.**
   `.gitea/workflows/nightly.yml:78-85` and `release.yml:192-198`. The runner relies on the host Docker layer cache. If a CVE drops in `node:20.19.0-alpine3.21` or `postgres:16-alpine` and someone re-publishes the same tag, the host will keep serving the cached layer until the cache is manually cleared. Add `--pull` to both build steps OR document the manual `docker system prune --filter` cadence in `docs/DEPLOYMENT.md` §3.1. A single line of YAML for a big increase in patch-supply-chain hygiene.

3. **The smoke test assumes the runner can reach `staging.raddatz.cloud` over the public internet.**
   `.gitea/workflows/nightly.yml:96-108`. If the self-hosted runner is on the same host as the deployed stack, hairpin NAT may not work on every router/firewall combo.
   A safer form:

   ```bash
   curl -fsS --max-time 10 --resolve "staging.raddatz.cloud:443:127.0.0.1" "https://staging.raddatz.cloud/login" -o /dev/null
   ```

   Or document the operational assumption ("the host must have public-DNS hairpin enabled, or run the runner on a different host"). Right now this is a latent failure mode for first-time setup.

4. **No backup pipeline.**
   Acknowledged as a deferred follow-up (`docs/DEPLOYMENT.md` §4: *"Until that ships: manual backups are the only recovery option."*). Fine for first go-live, but the follow-up issue is the load-bearing one — confirm it's filed before this PR merges. Named volumes without a tested restore are a single point of failure.

### What I'd keep (the good stuff)

- Pinned image tags across the board: `postgres:16-alpine`, `minio/minio:RELEASE.2025-02-28T09-55-16Z`, `minio/mc:RELEASE.2025-08-13T08-35-41Z`, `axllent/mailpit:v1.29.7`, `node:20.19.0-alpine3.21`. Reproducible builds.
- All published ports bound to `127.0.0.1:` only; Caddy is the front door. Same pattern as the canonical stack.
- MinIO service account `archiv-app` distinct from root; the `create-buckets` job *fails loud* if the policy attach silently 404s (`mc admin user info … | grep -q readwrite || exit 1`). That's the operator's friend at 2am.
- `DOCKER_BUILDKIT=1` in `env:` so the backend Dockerfile's `RUN --mount=type=cache` actually keeps the Maven cache warm. Easy miss; you caught it.
- `up -d --wait` + smoke step — the workflow fails loudly on a bad deploy. Exactly what I asked for.
- `if: always()` env-file cleanup in both workflows. Single-tenant runner discipline.
- `OCR_MEM_LIMIT` parameterized with a 12g default and CX32 documentation. Sized for the actual host tiers.
- fail2ban jail committed *in-repo* under `infra/fail2ban/` with symlink install steps in §3.1 — this is the config-as-code shape I want. fail2ban configs don't usually get versioned, and that bites.
- Rollback procedure documented in §5 with the explicit Flyway caveat.
  Often missing from "v1.0 deploy" PRs.

Lock in Renovate (or strip the comment) and this is shippable.

— Tobi

🤖 Generated with [Claude Code](https://claude.com/claude-code)
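On suggestion 1: until Renovate (or a documented manual cadence) exists, a tiny repo-local guard can at least fail CI when an image loses its pin. A hedged Python sketch, not project code — the compose snippet and the "unpinned" heuristic (no tag, or `:latest`) are assumptions:

```python
import re

def unpinned_images(compose_text: str) -> list[str]:
    """Return image references that have no tag or use the floating :latest tag."""
    images = re.findall(r'^\s*image:\s*(\S+)', compose_text, re.M)
    return [i for i in images if ":" not in i or i.endswith(":latest")]

# Hypothetical compose fragment for demonstration only.
sample = """
services:
  db:
    image: postgres:16-alpine
  cache:
    image: redis
  web:
    image: nginx:latest
"""

# A CI step would read docker-compose.prod.yml and fail if this list is non-empty.
assert unpinned_images(sample) == ["redis", "nginx:latest"]
```

This doesn't replace Renovate (it can't bump anything), but it stops the "Renovate keeps it current" comments from silently becoming false.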


## 🔒 Nora "NullX" Steiner — Application Security Engineer

**Verdict: 🚫 Changes requested**

The defense-in-depth posture of this PR is genuinely strong — Caddy actuator block, security headers (with the correct X-Frame-Options omission for the PDF iframe), service-account separation, fail2ban, env-file cleanup, admin-password first-deploy warning, `forward-headers-strategy: native` with a regression test. Almost all of my pre-merge findings (#8333) landed.

But **one HIGH finding regressed during implementation** and a second is a smaller miss.

### 🔴 HIGH — `archiv-app` gets `s3:*` on **all** resources, not just the `familienarchiv` bucket

**File:** `docker-compose.prod.yml:84-86`

```sh
/usr/bin/mc admin user add myminio archiv-app $$MINIO_APP_PASSWORD || /usr/bin/mc admin user enable myminio archiv-app;
/usr/bin/mc admin policy attach myminio readwrite --user archiv-app 2>/dev/null || true;
```

The discussion summary (#8333) committed to:

> *Least-privilege: `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` on `arn:aws:s3:::familienarchiv` and `arn:aws:s3:::familienarchiv/*` only.*

The implementation attaches MinIO's built-in `readwrite` policy, which grants `s3:*` on `arn:aws:s3:::*` — **every bucket, present and future**. Today MinIO has only one bucket, so the effective surface is the same. The moment a second bucket appears (logs, backups, mc-mirror staging), `archiv-app` gets full read/write/delete on it. The whole point of the service account was that a backend RCE/SSRF cannot escalate beyond its bucket — this regression silently restores that escalation path.

**Threat model:** RCE/SSRF in the backend, which uses `S3_ACCESS_KEY=archiv-app`. With `readwrite` on `*`, an attacker can list all buckets, exfiltrate or destroy a future backup bucket, and pivot. With the agreed bucket-scoped policy, the blast radius is `familienarchiv` only.
**Fix:**

```sh
# inside the create-buckets entrypoint, replace the readwrite attach with:
cat > /tmp/archiv-app-policy.json <<'POLICY'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::familienarchiv/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::familienarchiv"]
    }
  ]
}
POLICY
/usr/bin/mc admin policy create myminio archiv-app-policy /tmp/archiv-app-policy.json 2>/dev/null || \
  /usr/bin/mc admin policy update myminio archiv-app-policy /tmp/archiv-app-policy.json;
/usr/bin/mc admin policy attach myminio archiv-app-policy --user archiv-app 2>/dev/null || true;
# keep the fatal check, but match the new policy name:
/usr/bin/mc admin user info myminio archiv-app | grep -q archiv-app-policy || { echo 'FATAL: archiv-app is missing the bucket-scoped policy'; exit 1; };
```

The existing `grep -q readwrite || exit 1` fatal check — which I praised in #8333 — currently *enforces* the wrong policy. It must be updated alongside the policy fix or it will reject the correct, least-privilege configuration.

### 🟡 MEDIUM — fail2ban watches `/api/auth/login` only; `/forgot-password` is also brute-forceable

**File:** `infra/fail2ban/filter.d/familienarchiv-auth.conf:23`

```
failregex = ^\s*\{.*?"remote_ip":"<HOST>".*?"uri":"/api/auth/login.*?"status":\s*401\b
```

My pre-merge review (#8323, recommendation 1 + the #8333 summary) flagged both `/api/auth/login` *and* `/api/auth/forgot-password`. `/forgot-password` is an email-enumeration vector and (depending on how the endpoint signals "user not found") a slow brute-force surface for resetting accounts whose addresses leak.

Either widen the regex to match both endpoints, or add a second jail with a higher `findtime`.
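To make the gap concrete, a small offline check (a sketch, not project code) shows the current pattern never fires for `/forgot-password`. `<HOST>` is fail2ban's placeholder; a plain IPv4 group stands in for it here, a simplification of fail2ban's real substitution. The sample line mirrors the Caddy JSON shape:

```python
import re

# Current failregex with <HOST> replaced by a simple IPv4 capture group.
current = (r'^\s*\{.*?"remote_ip":"(\d{1,3}(?:\.\d{1,3}){3})"'
           r'.*?"uri":"/api/auth/login.*?"status":\s*401\b')

def caddy_line(uri, status=401):
    # Structurally faithful Caddy 2 JSON access-log line (values illustrative).
    return ('{"level":"info","ts":1700000000.12,"logger":"http.log.access",'
            '"msg":"handled request","request":{"remote_ip":"203.0.113.42",'
            f'"method":"POST","host":"archiv.raddatz.cloud","uri":"{uri}"}},'
            f'"status":{status}}}')

assert re.search(current, caddy_line("/api/auth/login"))                 # banned
assert not re.search(current, caddy_line("/api/auth/forgot-password"))   # invisible to the jail
```

Re-running the same check with a widened pattern is a cheap pre-commit validation; `fail2ban-regex` remains the authoritative tool.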
Suggested widening:

```
failregex = ^\s*\{.*?"remote_ip":"<HOST>".*?"uri":"/api/auth/(login|forgot-password).*?"status":\s*4(01|29)\b
```

(Adding 429 catches rate-limit responses too — useful if you later layer an in-app rate limiter.)

### 🟢 What I'd keep (the good stuff)

- `forward-headers-strategy: native` with `ForwardHeadersConfigurationTest` — the security fix has a regression test, exactly the discipline I ask for in #8333.
- `/actuator/*` returns 404 at the Caddy layer (the `(block_actuator)` snippet) — defense in depth against any future `management.endpoints.web.exposure.include` mistake.
- Security headers (HSTS preload, X-Content-Type-Options, Referrer-Policy, `-Server`) applied via the reusable `(security_headers)` snippet imported into both vhosts and `git.raddatz.cloud`.
- **`X-Frame-Options` deliberately NOT set in Caddy**, with the inline comment explaining Spring's `frameOptions.sameOrigin()` for PDF iframes. Exactly the conflict I asked you to avoid; the comment will save the next reviewer 30 minutes.
- MinIO root creds and `archiv-app` creds are separated; the backend never sees the root password.
- `if: always()` env-file cleanup on both workflows — secrets don't persist on the runner.
- Admin-password first-deploy lock-in explicitly documented in §3.5. Operator-facing prose, not buried code.
- Staging uses Mailpit via `profiles: [staging]` — no real SMTP credentials in staging.
- fail2ban tuning (maxretry=10 / findtime=10m / bantime=30m) is generous enough for a 60+ user with VPN issues and still bites bots.

Fix the MinIO policy and extend the fail2ban filter to `/forgot-password` and I approve.

— Nora

🤖 Generated with [Claude Code](https://claude.com/claude-code)
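Addendum on the HIGH item: if the bucket-scoped policy JSON ever lands in-repo instead of a heredoc, a small CI guard can keep it least-privilege. A hedged Python sketch; the inline policy mirrors the suggested fix, and the check itself is an assumption, not project code:

```python
import json

# Bucket-scoped policy as proposed in the fix (inlined here for the sketch).
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
     "Resource": ["arn:aws:s3:::familienarchiv/*"]},
    {"Effect": "Allow",
     "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
     "Resource": ["arn:aws:s3:::familienarchiv"]}
  ]
}
""")

# The guard: no wildcard actions, and no resource outside the familienarchiv bucket.
allowed = {"arn:aws:s3:::familienarchiv", "arn:aws:s3:::familienarchiv/*"}
for stmt in policy["Statement"]:
    assert stmt["Effect"] == "Allow"
    assert "s3:*" not in stmt["Action"], "wildcard action defeats least privilege"
    assert set(stmt["Resource"]) <= allowed, "resource outside the familienarchiv bucket"
```

In CI this would parse the committed JSON file rather than an inline string; the point is that the least-privilege commitment from #8333 becomes machine-checked instead of comment-only.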


## 🧪 Sara Holt — QA Engineer

**Verdict: ⚠️ Approved with concerns**

The deploy verification story is in much better shape than in the pre-merge issue. `up -d --wait` makes failures loud, the workflow smoke step asserts three concrete things, and the security-config change has a regression test. My remaining concerns are about *what* the regression test asserts and *where* coverage is still thin.

### Concerns

1. **`ForwardHeadersConfigurationTest` tests the configuration string, not the behaviour.**
   `backend/src/test/java/org/raddatz/familienarchiv/config/ForwardHeadersConfigurationTest.java:36-43`
   The test asserts `forwardHeadersStrategy.equals("native")`. That catches **deletion** of the property; it does not catch a typo (`nativ`, `Native`), a refactor to `framework`, or a future Spring change to the strategy enum semantics. The contract under test is *"Spring trusts `X-Forwarded-Proto: https` from Caddy and the resulting `HttpServletRequest.getScheme()` returns `https`"* — that's what a regression test should pin. A `@WebMvcTest` + a `MockMvc` request with `.header("X-Forwarded-Proto", "https")` + `assertThat(request.getScheme()).isEqualTo("https")` would test the actual behaviour.

2. **The same test is `@SpringBootTest` + `PostgresContainerConfig` for a `@Value`-injection assertion.**
   Per my own pyramid: full `@SpringBootTest` should be reserved for integration paths. Reading a property doesn't need a Postgres container. The unused `@MockitoBean S3Client` is the smell — it had to be mocked only because the full context was loaded. Either trim to a focused property-source test (no container, no S3 mock) or promote to the behavioural test above. The current shape is the worst of both: slow as integration, weak as a contract.

3. **The fail2ban regex has no automated test.**
   `infra/fail2ban/filter.d/familienarchiv-auth.conf` makes a structural assumption about Caddy's JSON field order (documented as stable in Caddy 2.7+).
   A Caddy minor bump that reorders the JSON would break the jail *silently* — fail2ban reports "0 matches", the workflow smoke test still passes, the host stops banning brute-force attempts, and we discover this in an incident. A 5-line check is enough:

   ```bash
   # In CI: a deterministic sample log line + fail2ban-regex
   echo '{"level":"info","ts":1700000000.12,"logger":"http.log.access","msg":"handled request","request":{"remote_ip":"203.0.113.42","method":"POST","host":"archiv.raddatz.cloud","uri":"/api/auth/login"},"status":401}' \
     | fail2ban-regex - infra/fail2ban/filter.d/familienarchiv-auth.conf
   # Assert: "Lines: 1 lines, 0 ignored, 1 matched"
   ```

   This belongs in a small CI job, not the deploy workflow. It catches the next Caddy upgrade.

4. **The smoke test assumes runner ↔ public-DNS hairpin.**
   `.gitea/workflows/nightly.yml:96-108` curls `https://staging.raddatz.cloud/login`. If the runner is on the same host as the stack, this routes via public DNS → router → back into the host. Many small-business routers don't do hairpin NAT correctly. Either pin with `curl --resolve "staging.raddatz.cloud:443:127.0.0.1"` or document the operational requirement. Right now it's a deploy that may pass on day one and silently fail on day 90 after a router firmware update.

5. **No `create-buckets` idempotency test.**
   The job is *designed* to run on every `docker compose up`. The fatal `grep -q readwrite || exit 1` check (`docker-compose.prod.yml:86`) makes it loud on failure, but I want a CI job that runs `up -d --wait` twice in sequence and asserts the second invocation also succeeds. Caught in CI now is much cheaper than caught at the third nightly deploy.

### What I'd keep

- `up -d --wait` in both workflows. The single most impactful change for "did the deploy work?"
- Three concrete smoke assertions per environment (HSTS header present, `/actuator/health` → 404, `/login` → 200). Cheap, specific, automated.
- The test plan in the PR body is checklist-complete: 1566 backend tests, `docker compose config --quiet` for prod + staging profile, prod-image build + `curl /login` 200, `caddy validate`, both workflow YAMLs parse.
- The deferred CI image smoke-test job is named, not silent — that goes on my follow-up tracker.

Pyramid-correct the auth-headers test, add a fail2ban CI gate, and address the hairpin assumption (even just in docs) and I approve.

— Sara

🤖 Generated with [Claude Code](https://claude.com/claude-code)
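On concern 5, the gate can be as small as "run it twice, both runs must exit 0". A hedged sketch with a stand-in command; in CI, `cmd` would be the real `docker compose -f docker-compose.prod.yml up -d --wait` invocation:

```python
import os
import subprocess
import tempfile

def runs_twice_cleanly(cmd: str) -> bool:
    """Idempotency gate: the same bring-up command must succeed back-to-back."""
    first = subprocess.run(cmd, shell=True).returncode
    second = subprocess.run(cmd, shell=True).returncode
    return first == 0 and second == 0

# Demonstration with stand-in commands (mkdir -p is idempotent, bare mkdir is not).
demo = os.path.join(tempfile.mkdtemp(), "bucket")
assert runs_twice_cleanly(f"mkdir -p {demo}")
assert not runs_twice_cleanly(f"mkdir {demo}-strict")  # second run fails: already exists
```

The same shape works as a plain shell step in the CI workflow; the point is that the second run is asserted, not assumed.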


## 🎨 Leonie Voss — UX & Accessibility

**Verdict: ✅ Approved**

Pure infrastructure PR — no layout, color, typography, contrast, or interaction changes for me to audit. I checked the diff for any user-visible UI deltas. One observation worth keeping:

### A small positive for the senior audience

`frontend/src/hooks.server.ts:5-13` adds `/hilfe/transkription` to `PUBLIC_PATHS`. Previously, a 67-year-old transcriber who bookmarked the help page or followed a link from email would land on the login screen before the help content — friction at the worst possible moment, when they're already struggling with Kurrent. Making the prerendered help page reachable without an auth cookie removes that friction. This is exactly the senior-first instinct the project should keep: the help docs should be the *easiest* path to follow, not gated.

(No code feedback from me. The comment explaining *why* it's in PUBLIC_PATHS — "prerendered help page — must be reachable without an auth cookie" — is the right kind of comment.)

— Leonie

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## 📋 Elicit — Requirements Engineer

**Verdict: ✅ Approved**

Brownfield infrastructure PR with no functional/user-facing requirement change. I'm reviewing for traceability, NFR coverage, and open-question hygiene — not implementation.

### Traceability — clean

All 8 acceptance criteria from issue #497 are accounted for in the PR description's table:

| Status | Where |
|---|---|
| ✅ | Multi-stage `frontend/Dockerfile` (commit 3) |
| ✅ | `docker-compose.prod.yml` (commit 4) |
| ✅ | `.gitea/workflows/nightly.yml` (commit 6) |
| ✅ | `.gitea/workflows/release.yml` (commit 7) |
| ✅ | `infra/caddy/Caddyfile` (commit 5) |
| ✅ | Dev compose unchanged for `docker compose up` (commit 3) |
| ⏳ | 16 Gitea secrets — runbook in `docs/DEPLOYMENT.md` §3.3 |
| ⏳ | Caddy + DNS — runbook in `docs/DEPLOYMENT.md` §3.1–3.2 |

Server-side ⏳ items are correctly transferred to the runbook with checklist semantics. No silent scope cuts.

### NFR coverage — substantive

| Category | Landed in this PR |
|---|---|
| Security | HSTS preload, X-Content-Type-Options, Referrer-Policy, `/actuator/*` block, fail2ban on `/api/auth/login`, MinIO service account, `forward-headers-strategy: native`, env-file cleanup, admin-password lock-in warning |
| Reliability | `up -d --wait`, healthchecks on every service, rollback procedure in DEPLOYMENT.md §5 |
| Maintainability | Pinned image tags (5 services), Renovate cadence (see open question OQ-1) |
| Observability | Deferred to follow-up #498 — explicitly named, not silent |
| Performance / Scalability | OCR memory parameterized per host tier (CX32 / CX42) |

Privacy / GDPR are unchanged from baseline (German company, EU data centres, family-only archive — no new external data exposure).

### Open Questions / TBD register

| ID | Question | Why it matters | Blocks |
|---|---|---|---|
| **OQ-1** | Is Renovate already configured at repo/org level? | Comments in `docker-compose.prod.yml:48,69,94` say *"Renovate keeps it current"* but no `renovate.json` is in the diff. Pinned tags will rot if no auto-bump exists. | NFR-MAINT-001 |
| **OQ-2** | Has the CI image smoke-test follow-up issue been filed (Sara's deferred suggestion)? | Without it, the next prod-image regression is caught at deploy time, not PR time. | |
| **OQ-3** | Has the backup pipeline follow-up issue been filed (nightly pg_dump + MinIO `mc mirror` over Tailscale to heim-nas)? | Named volumes without a tested restore are a single point of failure. The rollback procedure in §5 explicitly notes manual backup is the only option until this ships. | NFR-AVAIL-001 |
| **OQ-4** | Has the observability follow-up issue #498 been filed? | The PR body references it. If yes — close-the-loop done. If no — file before merge. | |

These are register items, not blockers. Surface and resolve.

### What I'd keep

- The PR body's "Deferred to follow-up issues" section names exactly which scope was excluded and why. This is the discipline that prevents scope creep masquerading as "we'll get to it."
- The bootstrap section (§3.1) reads like a textbook Definition-of-Done for a deploy: ufw → caddy → fail2ban → Tailscale → runner registration. Each step has a verify-with command.

Resolve the four open questions (mostly a 4-line "yes/yes/yes/yes" reply or one Gitea issue per "no") and the requirements package is closed.

— Elicit

🤖 Generated with Claude Code

marcel added 13 commits 2026-05-11 13:23:27 +02:00
Drops @SpringBootTest + PostgresContainerConfig + @MockitoBean S3Client in
favour of Spring's Binder API against application.yaml. The new test binds
the property into the typed ServerProperties.ForwardHeadersStrategy enum,
so typos (`nativ`, `Native`, `framework `) and future enum renames fail
the build with BindException — addresses the silent-coercion concern that
the YAML-string assertion missed.

Verified the test goes red on a typo (BindException: Failed to convert
"nativ" → ForwardHeadersStrategy) and green on `native`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Caddy 2.x emits JSON access logs; the failregex in
infra/fail2ban/filter.d/familienarchiv-auth.conf depends on the
"remote_ip" → "uri" → "status" key order being stable. A future Caddy
upgrade that reorders fields would break the jail silently (regex no
longer matches → fail2ban returns 0 hits → host stops banning
brute-force, discovered only at the next incident).

This job pins the contract: a sample /api/auth/login 401 line must
match (1 hit) and a /api/auth/login 200 line must not (0 hits).
Catches a regression at PR time instead of in production.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces MinIO's built-in `readwrite` policy (which grants s3:* on
arn:aws:s3:::* — every bucket present and future) with a bucket-scoped
custom policy `archiv-app-policy`:

  - s3:GetObject / s3:PutObject / s3:DeleteObject on familienarchiv/*
  - s3:ListBucket / s3:GetBucketLocation on familienarchiv

The previous configuration silently regressed the least-privilege guarantee
that the service-account separation was supposed to provide: a future
second bucket (logs, backups, mc-mirror staging) would have been
read/write/delete-accessible to a compromised backend.

While at it, two follow-on fixes:

  1. Extract the entrypoint to infra/minio/bootstrap.sh. The previous
     inline `/bin/sh -c "..."` was already at the YAML-escaping ceiling;
     adding the policy-JSON heredoc would have made it unreadable.

  2. Replace the `| grep -q readwrite || exit 1` fatal-check with a
     POSIX `case` substring match. The minio/mc image ships coreutils +
     bash but NOT grep/awk/sed — so `grep` was never found and the
     original check ALWAYS exited 1 (verified locally). The new check
     passes on the first invocation and on every subsequent re-deploy.

Idempotency verified locally: two consecutive `docker compose run --rm
create-buckets` invocations both exit 0 with the user bound to the
new policy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The create-buckets service in docker-compose.prod.yml runs on every
`docker compose up` (one-shot, restart=no). A re-deploy that fails
because the user/bucket/policy already exists would block the whole
nightly/release pipeline — and the only way to find out today is to
run a second deploy.

This job runs the bootstrap twice against a throwaway minio stack and
asserts both invocations exit 0. Caught at PR time, not at the third
nightly deploy at 02:00.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The filter only watched /api/auth/login 401 — leaving the forgot-password
endpoint open to:

  - email enumeration (slow brute-force probing which addresses exist)
  - password-reset brute-force against accounts whose addresses leak

Widens the failregex to /api/auth/(login|forgot-password) and adds 429 to
the status alternation so a future in-app rate-limiter response is also
caught by the jail (defense in depth).

CI assertions extended to cover both new dimensions plus a negative case
on an unrelated 401 endpoint (/api/documents) — pins that the widening
did not over-match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without --pull, the host's Docker layer cache wins: if a CVE drops in
node:20.19.0-alpine3.21 / postgres:16-alpine and the vendor re-publishes
the same tag, the runner keeps serving the cached layer until the cache
is manually cleared — a silent supply-chain blind spot.

Adding --pull to both `compose build` invocations costs a single
re-pull per run and lifts the base-image patch lag from "next host
prune" to "next nightly".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke step previously curled the public hostname unconditionally,
which routes the runner's request via DNS → router → back into the same
host. Many SOHO routers do not implement hairpin NAT (or do so only after
a firmware update), so the deploy may pass on day one and silently fail
on day 90.

--resolve "<host>:443:127.0.0.1" pins the hostname to the runner's
loopback while keeping SNI on the public name (so the cert validates
correctly and the Caddy vhost block matches). The smoke test now
verifies that the Caddy-on-the-same-host is serving the right
hostname end-to-end, with no router dependency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The repo's renovate.json only configures TipTap grouping; Renovate is
not currently active against MinIO / mc / mailpit / Postgres / Node /
Caddy. The "Renovate keeps it current" comments were aspirational —
those tags will rot until Renovate is bootstrapped (tracked in a
follow-up issue).

The "Pinned mc release; Renovate keeps it current" comment is gone
already since the create-buckets entrypoint was extracted to a script
in the preceding MinIO-policy commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Records the decision to make docker-compose.prod.yml a fully self-contained
file rather than an overlay over docker-compose.yml. Captures the cost
(env-var duplication across dev and prod files) and the benefit (single
file the reviewer can hold in their head, no Compose merge-rule
surprises, automatic project-name namespacing for cohabiting staging +
production on one host).

Surfaces the retirement of the earlier overlay narrative in
docs/infrastructure/production-compose.md so a future maintainer does
not reverse the choice out of ignorance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Records the reversal of the earlier "migrate to Hetzner Object Storage"
direction in docs/infrastructure/production-compose.md. Documents the
cost/benefit (current 13 GB fits trivially on the VPS; OBS billing is
dominated by base fee at this size; migration is a three-env-var swap
plus `mc mirror`, no application rewrite cost).

Captures the four triggers that should re-open the decision (50 GB
threshold, healthcheck latency, VPS upgrade cost, backup runtime) so
the deferral does not become an indefinite punt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Records the operational assumption that nightly.yml and release.yml
bake in: the self-hosted runner is single-tenant, so writing secrets
to .env.staging / .env.production on disk and removing them via an
`if: always()` cleanup step is acceptable for v1.

Documents the three migration triggers (second repo on the runner,
untrusted PR execution, move to shared infrastructure) and the
one-step migration path (--env-file <(printf '%s' "$SECRET_BLOB"))
so the next operator does not silently break the trust assumption.

The in-comment notes at the top of both workflow files already point
at this ADR's content; this commit records the decision in the durable
location the doc-currency table demands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the Caddy hop to seq-auth-flow.puml and surfaces the two
production-relevant header behaviours:

  - Caddy terminates TLS and forwards X-Forwarded-Proto: https
  - Spring Boot trusts this header (server.forward-headers-strategy:
    native, ForwardedRequestCustomizer at the Jetty layer), so
    request.getScheme() returns "https"
  - The Set-Cookie response carries the Secure flag because the
    observed scheme is https — without forward-headers-strategy this
    would silently drop to plain http and the cookie would lose Secure

Closes the doc-currency gap flagged in the Markus review on PR #499:
"Auth flow change → docs/architecture/c4/seq-auth-flow.puml".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(deployment): route SSE through Caddy in topology mermaid
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m48s
CI / OCR Service Tests (push) Successful in 16s
CI / Unit & Component Tests (pull_request) Failing after 2m48s
CI / Backend Unit Tests (pull_request) Successful in 4m8s
CI / fail2ban Regex (pull_request) Successful in 37s
CI / Compose Bucket Idempotency (pull_request) Failing after 49s
CI / Backend Unit Tests (push) Successful in 4m7s
CI / fail2ban Regex (push) Successful in 36s
CI / Compose Bucket Idempotency (push) Failing after 1m15s
CI / OCR Service Tests (pull_request) Successful in 16s
a7a80f8c16
The top-level deployment diagram lagged the C4 L2 diagram, which
correctly notes that SSE notifications are fronted by Caddy. The
mermaid showed Browser → Backend direct, which would only be true
if the backend port were exposed publicly (it is not — all docker
ports bind to 127.0.0.1).

Fixes the inconsistency Markus flagged on PR #499: the public
surface is Caddy and Caddy only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Implementation pass on the multi-persona review

Worked through every concern raised across @markus, @felix (myself), @tobi, @nora, @sara, @leonie, @elicit. Pushed 13 atomic commits and filed 3 follow-up issues. Every reviewer-blocker is addressed; every actionable suggestion is in.

🔴 Nora — HIGH blocker resolved

MinIO policy scoping — 91f70e65 security(minio): scope archiv-app to bucket-only IAM policy

Replaced the built-in readwrite policy attach with a custom archiv-app-policy:

  • s3:GetObject / s3:PutObject / s3:DeleteObject on arn:aws:s3:::familienarchiv/*
  • s3:ListBucket / s3:GetBucketLocation on arn:aws:s3:::familienarchiv
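
Spelled out as a policy document, those two statements plausibly look like the following — a sketch only; the exact JSON in infra/minio/bootstrap.sh may differ in formatting:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::familienarchiv/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::familienarchiv"]
    }
  ]
}
```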

While in there, caught a latent bug worth flagging: the minio/mc image ships coreutils + bash but NOT grep/awk/sed. The | grep -q readwrite || exit 1 fatal-check that Nora praised in #8333 and again in #8353 always exited 1 — i.e. the entrypoint always failed (verified locally). Replaced with a POSIX case substring match, which actually works. Extracted the whole entrypoint to infra/minio/bootstrap.sh because the policy JSON heredoc + escaping would have made the inline YAML unreadable.

Idempotency verified locally: two consecutive docker compose run --rm create-buckets invocations both exit 0 with archiv-app bound to archiv-app-policy.
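
For reference, the grep-free check is just a POSIX `case` pattern match — a minimal sketch; the function name and sample inputs are illustrative, not the actual bootstrap.sh contents:

```shell
# Succeed only if the policy listing ($1) mentions the required policy ($2).
# `case` is a shell built-in, so this works in images that ship no
# grep/awk/sed — exactly the minio/mc situation described above.
policy_bound() {
  case "$1" in
    *"$2"*) return 0 ;;  # substring present: user is bound to the policy
    *)      return 1 ;;  # absent: bootstrap must fail loudly
  esac
}

policy_bound 'PolicyName: archiv-app-policy' archiv-app-policy && echo "bound"
policy_bound 'PolicyName: readwrite'         archiv-app-policy || echo "not bound"
```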

🟡 Nora — MEDIUM resolved

fail2ban widened — 7e430998 security(fail2ban): widen jail to /forgot-password and rate-limit 429

failregex = ^\s*\{.*?"remote_ip":"<HOST>".*?"uri":"/api/auth/(login|forgot-password).*?"status":\s*4(01|29)\b

Covers both endpoints Nora flagged, plus 429 in case a future in-app rate-limiter responds before fail2ban hits the volume threshold.
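
The match/no-match contract can be sanity-checked outside fail2ban — a sketch using `grep -E`, with the `<HOST>` placeholder replaced by an IP pattern and the non-greedy `.*?` relaxed to `.*` (ERE has no laziness; on single log lines the greedy form decides match/no-match identically):

```shell
# ERE approximation of the failregex, for quick local checks only —
# CI still runs the real filter through fail2ban-regex.
regex='^[[:space:]]*[{].*"remote_ip":"[0-9.]+".*"uri":"/api/auth/(login|forgot-password).*"status":[[:space:]]*4(01|29)'

hit='{"remote_ip":"203.0.113.7","uri":"/api/auth/forgot-password","status":401}'
miss='{"remote_ip":"203.0.113.7","uri":"/api/auth/login","status":200}'

printf '%s\n' "$hit"  | grep -Eq "$regex" && echo "forgot-password 401 matched"
printf '%s\n' "$miss" | grep -Eq "$regex" || echo "login 200 ignored"
```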

🏛️ Markus — Blockers resolved

Three new ADRs for the architectural decisions that needed lasting records:

  • 59bc81d3 docs(adr): ADR-009 standalone docker-compose.prod.yml — captures why overlay was retired now that MinIO stays, and the cost (env-var duplication) vs benefit (one file, no merge surprises) trade-off
  • b57afb9a docs(adr): ADR-010 MinIO stays self-hosted — Hetzner OBS deferred; four explicit triggers documented to re-open the decision (50 GB threshold, healthcheck latency, VPS upgrade cost, backup runtime)
  • 6a6a1c43 docs(adr): ADR-011 single-tenant Gitea runner — operational assumption + migration path to --env-file <(printf …) when the trust boundary changes

Auth-flow diagram — 03d47884 docs(arch): show Caddy + X-Forwarded-Proto in auth-flow diagram

Updated seq-auth-flow.puml to show the Caddy hop with X-Forwarded-Proto: https, the Jetty ForwardedRequestCustomizer reading it, and the Secure flag on the Set-Cookie response that depends on it.

DEPLOYMENT.md mermaid — a7a80f8c docs(deployment): route SSE through Caddy in topology mermaid

The top-level mermaid now agrees with the C4 L2 diagram: SSE notifications transit Caddy, the backend is never reachable directly.

👨‍💻 Felix / 🧪 Sara — Test pyramid concerns

ForwardHeadersConfigurationTest rewrite — e5d953de test(config): rewrite as context-less binder test

Replaced @SpringBootTest + Postgres container + @MockitoBean S3Client (the worst-of-both: slow as integration, weak as contract) with a context-less Spring Binder test that loads application.yaml and binds the property to the typed ServerProperties.ForwardHeadersStrategy enum.

Trade-off note: the suggested @WebMvcTest-with-X-Forwarded-Proto-header path doesn't actually work for strategy: native — that's a Jetty-layer customizer, not a Spring filter, so MockMvc would never exercise it. A true behavioural test would need @SpringBootTest(webEnvironment = RANDOM_PORT) + a test controller, which keeps the Postgres container and is heavier than the original. The Binder + typed-enum approach catches what Sara and I both worried about — typos (nativ, Native), case errors, and Spring renames all fail with BindException. Verified red on nativ, green on native.

🔧 Tobi — Suggestions resolved

--pull on docker compose build — f2ec8154 ci(deploy): add --pull for CVE pickup

Both nightly.yml and release.yml now force a base-image re-fetch so CVE re-publications of pinned tags are picked up at the next nightly instead of "next manual prune".

Smoke-test hairpin — fe1451f5 ci(smoke): pin curl to 127.0.0.1 via --resolve

Smoke step uses curl --resolve "$HOST:443:127.0.0.1" against both archiv.raddatz.cloud and staging.raddatz.cloud. SNI still uses the public hostname (cert validates) but the connection goes to loopback — no hairpin NAT dependency on the host router.
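
In workflow terms the step plausibly looks like this — a sketch, with the step name and exact flag set assumed rather than copied from release.yml:

```yaml
- name: Smoke test via loopback
  run: |
    # SNI and Host stay on the public name (cert + vhost match),
    # but the TCP connection is pinned to the local Caddy.
    curl --fail --silent --show-error \
         --resolve "archiv.raddatz.cloud:443:127.0.0.1" \
         "https://archiv.raddatz.cloud/" > /dev/null
```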

Renovate aspiration — 33300e4a chore(infra): drop aspirational Renovate comments

The repo's renovate.json is minimal (only TipTap grouping); Renovate isn't bumping the production images. Stripped the "Renovate keeps it current" lines from docker-compose.prod.yml. Filed #500 for the bootstrap work.

🧪 Sara — CI gaps resolved

fail2ban-regex CI job — 9652894a test(ci): add fail2ban-regex regression job

New fail2ban-regex CI job pipes a deterministic Caddy access-log line through fail2ban-regex against the live filter and asserts the expected match count. The regex widening commit (7e430998) added matching cases for /forgot-password 401, /login 429, and a negative case (/api/documents 401). A future Caddy upgrade that reorders JSON fields will fail this job loudly at PR time instead of silently breaking the jail.

Compose idempotency CI job — 156afa14 test(ci): add compose bucket-bootstrap idempotency job

Runs docker compose run --rm create-buckets twice against a throwaway minio stack and asserts both invocations exit 0. Catches re-deploy breakage before the third nightly run at 02:00.
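
The shape of the job is generic enough to reuse — a minimal sketch; the `run_twice` helper is illustrative, and in CI the wrapped command is the compose invocation shown in the comment:

```shell
# Run the same command twice and fail unless BOTH invocations exit 0 —
# the definition of idempotency this job enforces.
run_twice() {
  "$@" || { echo "first run failed" >&2; return 1; }
  "$@" || { echo "re-run failed: bootstrap is not idempotent" >&2; return 1; }
}

# In the CI job this would be roughly:
#   run_twice docker compose -f docker-compose.prod.yml -p test-idem run --rm create-buckets
run_twice true && echo "idempotent"
```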

📋 Elicit — Open questions closed

  • OQ-1 (Renovate active?) — No, filed #500 as the bootstrap follow-up
  • OQ-2 (CI image smoke-test filed?) — Filed #501
  • OQ-3 (Backup pipeline filed?) — Filed #502
  • OQ-4 (Observability filed?) — Already #498, confirmed open

🎨 Leonie

No code action needed — the prerendered help page was already allow-listed in PUBLIC_PATHS. Thanks for the senior-audience UX framing.

Verification before push

  • ./mvnw test — 1566 tests, 0 failures, 0 errors, 0 skipped
  • ./mvnw clean package -DskipTests — BUILD SUCCESS
  • docker compose -f docker-compose.prod.yml -p test-idem run --rm create-buckets twice locally — both exit 0 with new policy bound
  • python3 -c "import yaml; yaml.safe_load(open('.gitea/workflows/<each>'))" — all three workflows parse
  • ForwardHeadersConfigurationTest verified red on typo + green on native

Commits on this branch since the review

a7a80f8c docs(deployment): route SSE through Caddy in topology mermaid
03d47884 docs(arch): show Caddy + X-Forwarded-Proto in auth-flow diagram
6a6a1c43 docs(adr): ADR-011 single-tenant Gitea runner with on-disk env-files
b57afb9a docs(adr): ADR-010 MinIO stays self-hosted, Hetzner OBS deferred
59bc81d3 docs(adr): ADR-009 standalone docker-compose.prod.yml, not overlay
33300e4a chore(infra): drop aspirational Renovate comments from compose
fe1451f5 ci(smoke): pin curl to 127.0.0.1 via --resolve
f2ec8154 ci(deploy): add --pull to docker compose build for CVE pickup
7e430998 security(fail2ban): widen jail to /forgot-password and rate-limit 429
156afa14 test(ci): add compose bucket-bootstrap idempotency job
91f70e65 security(minio): scope archiv-app to bucket-only IAM policy
9652894a test(ci): add fail2ban-regex regression job
e5d953de test(config): rewrite ForwardHeadersConfigurationTest as context-less binder test

Follow-up issues filed

  • #500 — devops: bootstrap Renovate config for production deps
  • #501 — test(ci): production image smoke-test job
  • #502 — devops: nightly backup pipeline — pg_dump + mc mirror over Tailscale

Ready for a re-review pass.

— Felix

🤖 Generated with Claude Code

Author
Owner

🏛️ Markus Keller — Application Architect

Verdict: Approved

What I checked

  • Module boundaries — none affected; this is pure infrastructure/config.
  • Architecture decisions captured as ADRs.
  • C4 + sequence diagrams kept in sync with the new topology.
  • Transport-protocol choices and the proxy hop they imply.
  • Single-monolith deployment story.

What I like

  1. Three ADRs landed with the change, not after it — 009 (standalone vs overlay), 010 (MinIO stays self-hosted), 011 (single-tenant runner with secrets-on-disk env-files). Each has Context, Decision, Alternatives, Consequences, and — crucially for 010 and 011 — explicit re-evaluation triggers. This is exactly how ADRs should be written: future-me reading these will know when to revisit, not just what was decided.
  2. C4 L2 updated to put Caddy outside the System_Boundary(archiv, "Docker Compose") (docs/architecture/c4/l2-containers.puml:6-7) — accurate, because Caddy is host-installed, not in the compose stack. The accompanying Rel(caddy, frontend, …) / Rel(caddy, backend, …) edges make the new topology unambiguous.
  3. Auth-flow seq diagram updated (docs/architecture/c4/seq-auth-flow.puml) — the X-Forwarded-Proto hop is called out with a note pinned to the Backend lifeline. A reader who never opened application.yaml still understands why server.forward-headers-strategy: native matters.
  4. The persistence story stays a single Postgres + a single object store (ADR-010). No premature migration to Hetzner OBS. The "Triggers to re-evaluate" list (>50 GB and >5 GB/mo growth; >200 ms p95 healthcheck latency; backup duration >30 min) gives concrete signals, not a vague "scale later".
  5. Project-name namespacing for staging vs production cohabitation (-p archiv-production / -p archiv-staging) is the right Compose-native isolation primitive. No bespoke YAML mangling.

Suggestions (non-blocking)

  1. One naming inconsistency — the docker network is archive-net (docker-compose.prod.yml:464) while every other identifier in the prod compose is archiv (no -e): archiv user/db/bucket, archiv-app service account, archiv-production / archiv-staging project names. Not load-bearing, but the diff reviewer's eye stutters on it. If you're touching the prod compose for the follow-up backup PR, consider archiv-net.
  2. The "what owns mc?" question in the L2 diagram — Container(mc, "Bucket / Service-Account Init", …) is shown inside the boundary, but it's a one-shot init container, not a long-running service. The current label ("One-shot container on startup") handles this in prose. I would not change it; just noting for the next architect reviewing the diagram.
  3. No new transport protocols introduced — the SSE flow now transits Caddy (browser → Caddy → backend) and you've documented this both in the L2 description and the topology mermaid in DEPLOYMENT.md. Worth a brief sanity check that long-lived SSE connections survive Caddy's default idle timeouts (Caddy 2's reverse_proxy is buffer-less for SSE by default, but a future config change could regress this). Not a blocker — flag as a smoke-test item for the first staging deploy.
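
If that regression guard is ever wanted in config rather than in a smoke test, Caddy can be told to stream unbuffered explicitly — a hypothetical fragment (the /api/notifications path and upstream port are illustrative, not from the shipped Caddyfile):

```
handle /api/notifications* {
    reverse_proxy 127.0.0.1:8080 {
        # -1 disables response buffering entirely, so a future global
        # buffering change cannot stall long-lived SSE streams.
        flush_interval -1
    }
}
```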

The architecture story this PR lands is consistent end-to-end: ADRs justify the structural choices, the compose file matches the ADRs, the C4 diagrams match the compose file, and the deployment doc matches the diagrams. That's the order I look for, and it holds. Ship it.


👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

What I checked

  • TDD evidence on the one piece of Java code touched.
  • Naming and intent of every new identifier.
  • Code-level cleanliness of the shell scripts and the hooks.server.ts change.
  • Guard clauses / early-return discipline.

What I like

  1. ForwardHeadersConfigurationTest is the cheapest possible config-binding test. (backend/src/test/java/.../config/ForwardHeadersConfigurationTest.java:1) No @SpringBootTest, no Testcontainers, no embedded server. It binds server.forward-headers-strategy from the YAML via Binder + ConfigurationPropertySources and asserts the enum value. If somebody types nativ or Native or framework (trailing space), the binder throws BindException and the test fails loudly — exactly the failure mode the comment promises. This is unit-test discipline applied to a config invariant, which is rare and right.
  2. The assertion message is the documentation.
    .as("Spring must trust X-Forwarded-Proto from Caddy so that "
        + "request.getScheme(), redirect URLs, and the Spring Session "
        + "'Secure' cookie reflect the original https client request.")
    
    A failing test in CI tells the next developer why this was wired up. No external doc lookup needed.
  3. hooks.server.ts diff is two lines (frontend/src/hooks.server.ts:8-15) — /hilfe/transkription added to PUBLIC_PATHS. The trailing comment ("prerendered help page — must be reachable without an auth cookie") is the why, not the what. That's the only kind of comment worth writing.
  4. infra/minio/bootstrap.sh is readable shell. The header comment lays out the 7-step contract; each step is one or two lines of mc + idempotency-tolerant ||. The fatal-assertion case block at the bottom is exactly the kind of fail-loud guard rail I want at the boundary of a system — if the policy didn't bind, the script exits 1 and the compose deploy fails, instead of silently shipping a backend that gets AccessDenied from MinIO three days later.
  5. One commit per logical step. TDD-ordered (feat(security): trust X-Forwarded-Proto lands the test first, then the YAML change). The PR description's commit table matches the actual git log — easy to bisect.
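
The idempotency-tolerant `||` plus fail-loud `case` shape praised in point 4 can be sketched in pure POSIX shell. The stub functions below stand in for the real `mc` invocations; all names are illustrative, not lifted from the actual bootstrap.sh:

```shell
#!/bin/sh
# Sketch of the bootstrap pattern: each step tolerates "already exists",
# but a final guard asserts the end state and aborts the deploy otherwise.

# Stubs standing in for `mc admin user add` / `mc admin policy attach`.
add_user()      { echo "user exists" >&2; return 1; }  # re-run: already created
attach_policy() { return 0; }
show_info()     { echo "PolicyName: archiv-app-policy"; }

# Idempotency-tolerant steps: a failure is assumed to mean "already done".
add_user archiv-app || true
attach_policy archiv-app-policy archiv-app || true

# Fail-loud guard: verify the policy actually bound, or exit 1 so the
# compose deploy fails here instead of three days later with AccessDenied.
INFO="$(show_info archiv-app)"
case "$INFO" in
  *archiv-app-policy*) echo "policy bound" ;;
  *) echo "FATAL: archiv-app-policy not attached" >&2; exit 1 ;;
esac
```

The key property is that optimistic steps are forgiven but the end state is asserted exactly once, at the boundary.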

Suggestions (non-blocking)

  1. bootstrap.sh — one POSIX-portability nit. The script's shebang is #!/bin/sh and the comment promises POSIX. The quoted here-doc
    cat > /tmp/archiv-app-policy.json <<'POLICY'
    
    is fine as-is, but case "$INFO" in *archiv-app-policy*) will match any substring, including comment lines or other policy names containing that substring. If you ever rename the policy to e.g. archiv-app-policy-v2, the *archiv-app-policy* glob would still match the v2 string. Not a today-problem; flag it if you ever rotate the policy name.
  2. The || idempotency pattern in bootstrap.sh swallows the first command's stderr. If mc admin user add fails for a non-idempotency reason (e.g. MinIO unreachable mid-script), the next line (mc admin user enable) will also fail, but the operator sees only the second failure. This is recoverable from logs and not worth restructuring, but I'd add 2>&1 to the first invocation if I were touching this again so the original error reaches stdout.
  3. ForwardHeadersConfigurationTest covers only the happy path. A second test asserting the binder rejects an invalid value (e.g. server.forward-headers-strategy: yes) would lock the "fail loud on typo" contract that the comment claims. Not a blocker — the binder behavior is framework-tested upstream and your assertion message already documents the intent.
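
Suggestion 1's substring pitfall is easy to demonstrate in pure POSIX shell — the glob accepts the hypothetical renamed policy too:

```shell
#!/bin/sh
# Demo: the `*archiv-app-policy*` glob matches by substring, so a renamed
# policy ("archiv-app-policy-v2", hypothetical) still satisfies the guard.
matches() {
  case "$1" in
    *archiv-app-policy*) echo yes ;;
    *) echo no ;;
  esac
}

matches "PolicyName: archiv-app-policy"     # the intended match
matches "PolicyName: archiv-app-policy-v2"  # also matches: substring, not exact
matches "PolicyName: other-policy"          # no match
```

An exact-field comparison (or anchoring the pattern) would close the gap if the policy name ever rotates.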

TDD discipline held: test commit precedes the YAML change in the history. Shell scripts read top-to-bottom. The hooks.server.ts change is one line of intent with a one-line why-comment. Ship it.


🛠️ Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

This is the cleanest infra PR I've seen on this project. Let me walk through what I checked.

What I checked

  • Image tag pinning across every service in the prod compose.
  • Named volumes vs bind mounts for persistent data in prod.
  • Host port exposure — anything reachable from the public internet that shouldn't be?
  • Root credentials vs least-privilege service accounts.
  • Healthcheck coverage and depends_on: condition: service_healthy chains.
  • Secrets handling in CI (env-files on disk, cleanup steps, hardcoded values).
  • Deprecated action versions.
  • CI test coverage of the load-bearing infrastructure invariants.

What's right

  1. Every image tag is pinned.

    • postgres:16-alpine (docker-compose.prod.yml:475)
    • minio/minio:RELEASE.2025-02-28T09-55-16Z (line 494)
    • minio/mc:RELEASE.2025-08-13T08-35-41Z (line 516)
    • axllent/mailpit:v1.29.7 (line 533)
    • node:20.19.0-alpine3.21 (frontend/Dockerfile:1, repeated on every stage)

    No :latest anywhere in production. The --pull flag on docker compose build in both workflows means CVE re-publications of these pinned tags actually get re-fetched on every deploy instead of being served from a stale layer cache. That's the right interpretation of "pin tags + still patch CVEs" — a nuance most projects get wrong.

  2. Named volumes throughout the prod compose (postgres-data, minio-data, ocr-models, ocr-cache). No bind mounts. archiv-production_postgres-data and archiv-staging_postgres-data are auto-namespaced by -p. Survives docker compose down, survives container rebuilds, survives the operator's curiosity.

  3. Ports bound to 127.0.0.1 only (docker-compose.prod.yml:587, 631). Caddy is the sole external entry point. Combined with ufw allow 22,80,443 and the Hetzner cloud firewall, that's a three-layer net.

  4. MinIO root creds are never the backend's creds. The backend is wired to the bucket-scoped archiv-app service account (docker-compose.prod.yml:594-595), and the archiv-app-policy in bootstrap.sh:50-65 is genuinely least-privilege — GetObject/PutObject/DeleteObject on familienarchiv/*, ListBucket/GetBucketLocation on the bucket only. Not s3:*. A backend RCE cannot escalate beyond this bucket.

  5. Healthchecks on every service, depends_on: condition: service_healthy on every dependency. Backend waits for db + minio + ocr. Frontend waits for backend. No race-on-startup.

  6. CI proves the two invariants that would silently fail otherwise:

    • fail2ban-regex job (.gitea/workflows/ci.yml:118+) pins the Caddy-JSON → fail2ban-filter contract against 5 deterministic samples (3 positive, 2 negative). A Caddy upgrade that reorders log keys would now fail CI instead of silently disabling brute-force protection in production.
    • compose-idempotency job runs create-buckets twice and asserts both exit 0. The bootstrap script is the only piece of stateful init in the stack; this prevents a "second deploy fails" surprise.
  7. Smoke tests in the deploy workflows are minimal but right — /login returns 200, HSTS header present, /actuator/health returns 404. Three things that will break if Caddy isn't reloaded or the snippet imports are wrong. The --resolve trick to pin archiv.raddatz.cloud → 127.0.0.1 avoids depending on the host router's hairpin NAT — that's the kind of detail that bites you at 2 a.m. when the deploy worked but the smoke test "fails" because the router doesn't loopback.

  8. actions/checkout@v4 — current. No deprecated actions in the workflows.

  9. Rollback procedure is one command (docs/DEPLOYMENT.md:283+):

    TAG=v1.0.0 docker compose -f docker-compose.prod.yml -p archiv-production --env-file /opt/familienarchiv/.env.production up -d --wait --remove-orphans
    

    And the doc explicitly flags that Flyway migrations are not auto-rolled-back — exactly the warning operators need before they try this.

Suggestions (non-blocking)

  1. No actions/cache for Maven/node in the deploy workflows. The in-workflow comment acknowledges this (host docker layer cache is authoritative). For a nightly + tag-trigger cadence on a self-hosted runner with persistent disk, that's acceptable — a cold rebuild after docker system prune costs 5-10 min once, not on every run. Don't add actions/cache here; the host cache is the right tool for a single-tenant self-hosted runner.

  2. OCR_MEM_LIMIT defaults to 12g (docker-compose.prod.yml:554). Documented in DEPLOYMENT.md and called out in the env-var table. Set to 6g if you ever downsize to CX32. Good as-is.

  3. No Renovate config in this PR. The inline comments at minio: and mailpit: say "bumped manually until Renovate is bootstrapped". Open a follow-up issue for the Renovate config — three pinned images is the floor at which manual bumps stop being a chore and start being a security smell.

  4. mailpit healthcheck uses wget — fine, but wget is a non-obvious dependency on the axllent/mailpit:v1.29.7 image. If that image ever drops wget (unlikely; it's alpine based), the healthcheck silently fails. Cheap follow-up: switch to nc -z localhost 8025 which has no binary dependency.

  5. The archive-net network name (one extra e) is the only typo in the file. Everything else is archiv-*. Cosmetic; fix on next infra touch.

  6. Idempotency CI is good; rollback CI is not. The current release.yml build-and-deploy chain is one-way. There's no automated test that a previous tag can be redeployed onto the current state. The doc walks through it; an "actually try rollback in staging" job would catch the case where a destructive migration breaks the rollback path. Open as a follow-up; not blocking v1.
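
Suggestion 4's `nc` variant would look roughly like this in the prod compose. A sketch — the service block is abbreviated and the interval/timeout values are illustrative, not taken from this PR:

```yaml
mailpit:
  image: axllent/mailpit:v1.29.7
  healthcheck:
    # nc ships in busybox on alpine-based images, so this check has no
    # dependency on wget surviving a future image slim-down.
    test: ["CMD", "nc", "-z", "localhost", "8025"]
    interval: 30s
    timeout: 5s
    retries: 3
```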

Monthly cost stays ~18 EUR (1× CX32 + DNS + SMTP). MinIO data on-VPS, no Object Storage line item until ADR-010 fires. Good.

Operationally I'd ship this today.


🔒 Nora "NullX" Steiner — Application Security

Verdict: ⚠️ Approved with concerns

This is a well-considered security baseline. Most of what I look for in a first production deploy is present and explicit. I'm flagging two concerns I'd want addressed before or shortly after merge, and a few suggestions for hardening.

What I checked

  • Reverse-proxy header hygiene and trust boundaries.
  • Security headers — what's set, what's missing, why.
  • Secrets handling in CI and on the host.
  • Brute-force protection on auth endpoints.
  • Privilege boundaries between MinIO root and the application identity.
  • Defense-in-depth at the /actuator/* boundary.
  • Admin password bootstrap risk.

What's right

  1. server.forward-headers-strategy: native (backend/src/main/resources/application.yaml:41) with a unit test pinning the binding. The threat model — backend is only reachable from Caddy on 127.0.0.1 and from other containers on the internal archive-net — makes blanket X-Forwarded-Proto trust safe. Production exposure to the public internet is via Caddy only.
  2. HSTS, X-Content-Type-Options, Referrer-Policy, -Server in the Caddyfile snippet (security_headers) (infra/caddy/Caddyfile:17-23).
  3. HSTS value is correct for preload-list eligibility: max-age=31536000; includeSubDomains; preload — one year, subdomains, preload-ready. Don't actually submit to the preload list yet (it's a one-way decision affecting raddatz.cloud siblings) but the header is right.
  4. X-Frame-Options deliberately NOT set in Caddy, with a comment explaining Spring Security configures frame-options SAMEORIGIN for the PDF preview iframe. Conflict avoided. The comment is the audit trail.
  5. /actuator/* → 404 at the Caddy layer (infra/caddy/Caddyfile:25-31) AND the internal Prometheus scrape goes via the Docker network, not Caddy. Defense in depth. /actuator/heapdump cannot leak.
  6. Bucket-scoped MinIO service account (infra/minio/bootstrap.sh:50-66) — archiv-app-policy grants s3:GetObject/PutObject/DeleteObject on familienarchiv/* and s3:ListBucket/GetBucketLocation on familienarchiv. Not the MinIO built-in readwrite which is s3:* on *. A backend RCE or SSRF cannot pivot to mc admin user list or another bucket. The fatal-assertion at script end (case "$INFO" in *archiv-app-policy*)) catches a regression where the policy attach silently fails.
  7. fail2ban watches /api/auth/login and /api/auth/forgot-password on 401 and 429 (infra/fail2ban/filter.d/familienarchiv-auth.conf:32). Covering forgot-password is the right call — it stops email-enumeration via 401 differential responses. Including 429 future-proofs against an in-app rate limiter being added later.
  8. CI regression test for the fail2ban regex (.gitea/workflows/ci.yml:fail2ban-regex job) with 5 deterministic samples covering positive + negative cases. A Caddy upgrade that re-orders the JSON log keys would now fail CI instead of silently disabling the jail — this is the single most important non-obvious test in the PR.
  9. Admin password lock-in is explicitly documented (DEPLOYMENT.md §3.5, ADR-011 comments). UserDataInitializer only seeds when the row doesn't exist; secret rotation after first-boot does nothing. This is captured in two places. Good.
  10. Secrets handling: ${{ secrets.* }} for all sensitive values, env-file written + cleaned with if: always(). ADR-011 documents the single-tenant runner assumption that makes this safe, and the migration path to --env-file <(printf '%s' …) if a second repo ever shares the runner. The assumption is load-bearing and is captured in three places (ADR-011, inline comments in both workflow YAMLs).
  11. No hardcoded credentials in any committed file. The CI idempotency job uses a stub .env.test written at runtime, not committed.

Concerns (non-blocking but worth addressing soon)

  1. Missing Content-Security-Policy header. None of the Caddy snippets emit a CSP. A first-pass Content-Security-Policy: default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; frame-ancestors 'self'; would block reflected XSS escalation paths and align with the frame-options SAMEORIGIN you already enforce at the Spring level. SvelteKit's adapter-node can emit nonces for inline scripts if the policy goes stricter later. I would open this as a follow-up issue, not block this PR — CSP misconfigurations break apps in subtle ways and need their own staging-validation cycle.

  2. ALLOWED_PDF_HOSTS is not explicitly set on the prod ocr-service (docker-compose.prod.yml:559-563). The Python OCR service defaults to minio,localhost,127.0.0.1, which is correct for this deployment (the internal Docker hostname is minio), but the env var isn't pinned in the prod compose. If a future commit changes the Python default (or someone overrides via host env), SSRF protection silently loosens. Suggested fix: add ALLOWED_PDF_HOSTS: "minio" explicitly to the prod compose's ocr-service.environment block. Five characters of YAML, lifetime guarantee.
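
Concern 2's pin is one stanza in the prod compose. A sketch — the surrounding `ocr-service` definition is abbreviated, and only the `ALLOWED_PDF_HOSTS` line is the proposed addition:

```yaml
ocr-service:
  environment:
    # Pin the SSRF allow-list explicitly rather than relying on the Python
    # default ("minio,localhost,127.0.0.1") never changing.
    ALLOWED_PDF_HOSTS: "minio"
```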

Suggestions

  1. fail2ban tuning is defensible but worth reviewing in 30 days. maxretry=10 / findtime=10m / bantime=30m is generous for a family-archive context (a 67-year-old typing their password wrong eight times should not get banned). The trade-off is that a distributed credential-stuffing attack with low per-IP rate evades this jail entirely. The right counter is an in-app per-account rate limit, not stricter fail2ban. Track as a follow-up issue.

  2. Permissions-Policy header is also missing. Lower-impact than CSP but nearly free: Permissions-Policy: camera=(), microphone=(), geolocation=() denies APIs the app doesn't use, reducing the blast radius if an XSS lands. Add to the (security_headers) snippet whenever you re-touch the Caddyfile.

  3. The if: always() env-file cleanup is the linchpin of the ADR-011 trust model. Every workflow refactor must preserve it. Worth a comment in the workflow itself (something stronger than the current one) explaining that the line is load-bearing — but the ADR already does this, so this is a "nice to have" not a blocker.
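
Both header suggestions would slot into the existing snippet roughly like this. A sketch against the `(security_headers)` name from this PR — the CSP value is the first-pass policy from concern 1 and needs its own staging-validation cycle before production:

```caddyfile
(security_headers) {
    header {
        # ... existing HSTS / X-Content-Type-Options / Referrer-Policy / -Server lines ...
        # Proposed additions — validate the CSP in staging first:
        Permissions-Policy "camera=(), microphone=(), geolocation=()"
        Content-Security-Policy "default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; frame-ancestors 'self'"
    }
}
```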

Summary

Three things lift this above a typical first-production-deploy security posture: the bucket-scoped service account with a fatal-assertion bootstrap, the CI regression test for the fail2ban regex, and the explicit single-tenant trust model in ADR-011. The two concerns (CSP, explicit ALLOWED_PDF_HOSTS) are follow-ups, not blockers.

## 🔒 Nora "NullX" Steiner — Application Security

**Verdict: ⚠️ Approved with concerns**

This is a well-considered security baseline. Most of what I look for in a first production deploy is present and explicit. I'm flagging two concerns I'd want addressed before or shortly after merge, and a few suggestions for hardening.

### What I checked

- Reverse-proxy header hygiene and trust boundaries.
- Security headers — what's set, what's missing, why.
- Secrets handling in CI and on the host.
- Brute-force protection on auth endpoints.
- Privilege boundaries between MinIO root and the application identity.
- Defense-in-depth at the `/actuator/*` boundary.
- Admin password bootstrap risk.

### What's right

1. **`server.forward-headers-strategy: native`** (`backend/src/main/resources/application.yaml:41`) with a unit test pinning the binding. The threat model — backend is only reachable from Caddy on `127.0.0.1` and from other containers on the internal `archive-net` — makes blanket X-Forwarded-Proto trust safe. Production exposure to the public internet is via Caddy only.
2. **HSTS, X-Content-Type-Options, Referrer-Policy, `-Server`** in the Caddyfile snippet `(security_headers)` (`infra/caddy/Caddyfile:17-23`).
3. **HSTS value is correct for preload-list eligibility**: `max-age=31536000; includeSubDomains; preload` — one year, subdomains, preload-ready. Don't actually submit to the preload list yet (it's a one-way decision affecting `raddatz.cloud` siblings) but the header is right.
4. **`X-Frame-Options` deliberately NOT set in Caddy**, with a comment explaining Spring Security configures `frame-options SAMEORIGIN` for the PDF preview iframe. Conflict avoided. The comment is the audit trail.
5. **`/actuator/* → 404`** at the Caddy layer (`infra/caddy/Caddyfile:25-31`) AND the internal Prometheus scrape goes via the Docker network, not Caddy. Defense in depth. `/actuator/heapdump` cannot leak.
6. **Bucket-scoped MinIO service account** (`infra/minio/bootstrap.sh:50-66`) — `archiv-app-policy` grants `s3:GetObject/PutObject/DeleteObject` on `familienarchiv/*` and `s3:ListBucket/GetBucketLocation` on `familienarchiv`. **Not** the MinIO built-in `readwrite`, which is `s3:*` on `*`. A backend RCE or SSRF cannot pivot to `mc admin user list` or another bucket. The fatal assertion at script end (`case "$INFO" in *archiv-app-policy*)`) catches a regression where the policy attach silently fails.
7. **fail2ban watches `/api/auth/login` and `/api/auth/forgot-password` on 401 and 429** (`infra/fail2ban/filter.d/familienarchiv-auth.conf:32`). Covering forgot-password is the right call — it stops email-enumeration via 401 differential responses. Including 429 future-proofs against an in-app rate limiter being added later.
8. **CI regression test for the fail2ban regex** (`.gitea/workflows/ci.yml:fail2ban-regex` job) with 5 deterministic samples covering positive + negative cases. A Caddy upgrade that re-orders the JSON log keys would now fail CI instead of silently disabling the jail — this is the single most important non-obvious test in the PR.
9. **Admin password lock-in is explicitly documented** (DEPLOYMENT.md §3.5, ADR-011 comments). `UserDataInitializer` only seeds when the row doesn't exist; secret rotation after first-boot does nothing. This is captured in two places. Good.
10. **Secrets handling**: `${{ secrets.* }}` for all sensitive values, env-file written + cleaned with `if: always()`. ADR-011 documents the **single-tenant runner** assumption that makes this safe, and the migration path to `--env-file <(printf '%s' …)` if a second repo ever shares the runner. The assumption is *load-bearing* and is captured in three places (ADR-011, inline comments in both workflow YAMLs).
11. **No hardcoded credentials** in any committed file. The CI idempotency job uses a stub `.env.test` written at runtime, not committed.

### Concerns (non-blocking but worth addressing soon)

1. **Missing `Content-Security-Policy` header.** None of the Caddy snippets emit a CSP. A first-pass `Content-Security-Policy: default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; frame-ancestors 'self';` would block reflected-XSS escalation paths and align with the `frame-options SAMEORIGIN` you already enforce at the Spring level. SvelteKit's adapter-node can emit nonces for inline scripts if the policy goes stricter later. I would open this as a follow-up issue, not block this PR — CSP misconfigurations break apps in subtle ways and need their own staging-validation cycle.
2. **`ALLOWED_PDF_HOSTS` is not explicitly set on the prod `ocr-service`** (`docker-compose.prod.yml:559-563`). The Python OCR service defaults to `minio,localhost,127.0.0.1`, which is correct for this deployment (the internal Docker hostname is `minio`), but the env var isn't pinned in the prod compose. If a future commit changes the Python default (or someone overrides via host env), SSRF protection silently loosens. **Suggested fix**: add `ALLOWED_PDF_HOSTS: "minio"` explicitly to the prod compose's `ocr-service.environment` block. Five characters of YAML, lifetime guarantee.

### Suggestions

3. **fail2ban tuning is defensible but worth reviewing in 30 days.** `maxretry=10 / findtime=10m / bantime=30m` is generous for a family-archive context (a 67-year-old typing their password wrong eight times should not get banned). The trade-off is that a distributed credential-stuffing attack with a low per-IP rate evades this jail entirely. The right counter is an in-app per-account rate limit, not stricter fail2ban. Track as a follow-up issue.
4. **`Permissions-Policy` header is also missing.** Lower-impact than CSP but nearly free: `Permissions-Policy: camera=(), microphone=(), geolocation=()` denies APIs the app doesn't use, reducing the blast radius if an XSS lands. Add to the `(security_headers)` snippet whenever you re-touch the Caddyfile.
5. **The `if: always()` env-file cleanup is the linchpin of the ADR-011 trust model.** Every workflow refactor must preserve it. Worth a comment in the workflow itself (something stronger than the current one) explaining that the line is load-bearing — but the ADR already does this, so this is a "nice to have", not a blocker.

### Summary

Three things lift this above a typical first-production-deploy security posture: the bucket-scoped service account with a fatal-assertion bootstrap, the CI regression test for the fail2ban regex, and the explicit single-tenant trust model in ADR-011. The two concerns (CSP, explicit `ALLOWED_PDF_HOSTS`) are follow-ups, not blockers.
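For concreteness, the first-pass policy from concern 1 could land in the existing shared snippet roughly like this. A sketch only, since (as noted) CSP needs its own staging-validation cycle; the `(security_headers)` snippet name and the other header values are from the PR, the CSP line is the proposal above:

```
(security_headers) {
	header {
		# Existing headers from the PR (abbreviated):
		Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
		X-Content-Type-Options "nosniff"
		-Server
		# Proposed first-pass CSP. Validate in staging before production:
		Content-Security-Policy "default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; frame-ancestors 'self'"
	}
}
```

Note that `frame-ancestors 'self'` overlaps with Spring's `frame-options SAMEORIGIN`; browsers that support CSP ignore `X-Frame-Options` when `frame-ancestors` is present, so the two should not conflict.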
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: Approved

What I checked

  • Test pyramid coverage for the new code: static / unit / integration / CI / smoke.
  • Whether the load-bearing infrastructure invariants are pinned by automated checks.
  • Test reliability — deterministic, no sleep, no shared state.
  • Whether the test naming reveals intent.
  • Whether regression coverage is sufficient against the failure modes the PR description claims to prevent.

What's right

  1. The right tests landed at the right layers. This is an infra PR, so the bulk of testing is at the CI-job layer, not the unit layer — and the chosen jobs match the actual failure modes:
    • fail2ban-regex (5 cases) → contract test for Caddy-JSON-log → fail2ban-filter compatibility.
    • compose-idempotency → contract test for create-buckets re-runnability.
    • Smoke tests in nightly.yml / release.yml → post-deploy assertion on the three things most likely to silently break (HTTP 200 on /login, HSTS header, /actuator/* blocked).
    • ForwardHeadersConfigurationTest → unit-level pin on the forward-headers-strategy: native binding.
  2. Each test fails for one reason, with a meaningful message.
    .as("Spring must trust X-Forwarded-Proto from Caddy so that "
        + "request.getScheme(), redirect URLs, and the Spring Session "
        + "'Secure' cookie reflect the original https client request.")
    .isEqualTo(ForwardHeadersStrategy.NATIVE);
    
    When this fires in CI, the developer reads the message and knows exactly what they regressed. That's how an assertion message should read.
  3. The smoke tests assert consequences, not implementation. They don't grep for import security_headers in the Caddyfile — they curl the live endpoint and check the response. That's the right level for a deploy smoke test: it catches "Caddy not reloaded", "snippet not imported", "container not bound to loopback" all at once.
  4. --resolve pinning in the smoke tests (.gitea/workflows/nightly.yml:228) avoids the classic hairpin-NAT flake where the workflow passes locally but fails on the deployed host because the home router doesn't loop traffic back. SNI still uses the public hostname so the cert validates. This is the kind of detail that distinguishes a robust smoke test from one that teams learn to ignore.
  5. Negative tests are present in fail2ban-regex. Two of the five cases (Does not match /api/auth/login 200, Does not match /api/documents 401) assert the filter doesn't match unrelated traffic. False-positive coverage matters as much as true-positive coverage — without it, a regex that matches everything would still pass the positive cases.
  6. The compose-idempotency job tears down with down -v (.gitea/workflows/ci.yml:117) under if: always(). No leftover volumes contaminating the next run.
  7. set -e in bootstrap.sh + the final case assertion → script fails loudly if MinIO state ever drifts. This is a test executed at deploy time, not at CI time, and it's the right place for it.

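A post-deploy smoke step of the shape points 3 and 4 describe might look roughly like the following. This is a sketch, not the PR's actual workflow YAML; the step name and the `HOST_IP` variable are illustrative, while the hostname and paths are the ones named in this review:

```yaml
- name: Smoke-check staging deploy
  run: |
    # --resolve pins DNS to the deploy host (avoids hairpin NAT);
    # SNI stays on the public hostname so the certificate validates.
    curl -sS --resolve staging.raddatz.cloud:443:"$HOST_IP" \
      -o /dev/null -w '%{http_code}' https://staging.raddatz.cloud/login | grep -q '^200$'
    curl -sSI --resolve staging.raddatz.cloud:443:"$HOST_IP" \
      https://staging.raddatz.cloud/login | grep -qi 'strict-transport-security'
    # /actuator/* must be blocked at the Caddy layer:
    curl -sS --resolve staging.raddatz.cloud:443:"$HOST_IP" \
      -o /dev/null -w '%{http_code}' https://staging.raddatz.cloud/actuator/health | grep -q '^404$'
```
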
Suggestions (non-blocking)

  1. ForwardHeadersConfigurationTest has only the happy path. A second test asserting that Binder.bind(..., ForwardHeadersStrategy.class) rejects an invalid value (e.g. binding nativ would throw) would lock the failure-mode the docstring promises. The framework guarantees this behavior, so I'm not asking for it — but if you ever extend this test class, add the negative case.
  2. compose-idempotency doesn't verify the second create-buckets run actually re-attached the policy correctly. It only asserts both docker compose run create-buckets invocations exit 0. The bootstrap.sh fatal-assert at the end (case "$INFO" in *archiv-app-policy*) is what guarantees the policy actually bound — so if the second run silently passed without attaching, the script's own assert would catch it. Coverage is adequate; flagging only because the test name implies it asserts state, when it actually asserts exit code.
  3. No load test / soak test in this PR. Acceptable for a "land the pipeline" PR; the Production v1 milestone presumably has a follow-up. Once SSE-through-Caddy is live in staging, I'd want a k6 script that holds 50 SSE connections for 10 min and confirms no drops — Caddy's reverse_proxy is buffer-less for SSE by default, but a future config change could regress this and the current smoke test wouldn't catch it.
  4. The smoke test asserts HSTS header is present via grep -qi 'strict-transport-security'. That's a presence check, not a value check. If a future Caddyfile edit drops includeSubDomains; preload, this passes. Tightening to grep -qi 'max-age=31536000.*includeSubDomains.*preload' costs nothing and pins the security-team-relevant invariant.
  5. No CI job exercises the rollback path. release.yml happy-path is well-covered; rolling back to a previous tag is documented but not tested. If you ever ship a destructive migration and the rollback breaks at 2am, you'll wish this existed. Follow-up issue, not blocking.
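The gap named in suggestion 4 can be demonstrated offline against sample header values (the degraded value below is hypothetical, not taken from the PR):

```shell
# Preload-eligible value vs. a degraded one that a presence check still accepts.
good='strict-transport-security: max-age=31536000; includeSubDomains; preload'
bad='strict-transport-security: max-age=3600'

# Presence check: passes for both values.
echo "$good" | grep -qi 'strict-transport-security' && echo "good: present"
echo "$bad"  | grep -qi 'strict-transport-security' && echo "bad: present"

# Value check: only the preload-eligible value passes.
pattern='max-age=31536000.*includeSubDomains.*preload'
echo "$good" | grep -qi "$pattern" && echo "good: value ok"
echo "$bad"  | grep -qi "$pattern" || echo "bad: rejected"
```

The value check costs the same one `grep` invocation as the presence check, which is why tightening it is close to free.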

The test pyramid for this PR's surface area looks like:

Static    — YAML parse checks (PR description: docker compose config --quiet) — passed
Unit      — ForwardHeadersConfigurationTest — added in this PR
Contract  — fail2ban-regex, compose-idempotency — added in this PR
Smoke     — workflow smoke step — added in this PR
E2E       — Playwright in ci.yml — unchanged, still runs

That's the right shape. Ship it.
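Point 7 and suggestion 2 above both lean on the `bootstrap.sh` fatal assertion. Its shape, simplified with a stubbed `$INFO` (in the real script the variable is presumably populated from MinIO's `mc` admin output), is:

```shell
set -e

# Stub: in bootstrap.sh this holds the service-account info returned by mc.
INFO='AccessKey: archiv-app  PolicyName: archiv-app-policy  Status: enabled'

# Fatal assertion: fail loudly if the policy attach silently did not stick.
case "$INFO" in
  *archiv-app-policy*) echo "OK: archiv-app-policy attached" ;;
  *) echo "FATAL: archiv-app-policy not attached" >&2; exit 1 ;;
esac
```

Because `set -e` is active and the failure branch exits non-zero, any caller (the deploy workflow, or the `compose-idempotency` CI job) inherits the failure instead of seeing a quiet exit 0.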

Author
Owner

🎨 Leonie Voss — UX & Accessibility

Verdict: Approved (LGTM — out of scope)

What I checked

  • Whether any user-facing rendering, layout, color, typography, or interaction pattern changed.
  • Whether /hilfe/transkription being moved to the public-paths list has any user-visible effect.
  • Whether the auth-flow changes alter the cookie behavior in a way that affects assistive tech or autofill.

Findings

This PR is pure infrastructure. The only frontend-adjacent change is:

  • frontend/src/hooks.server.ts:8-15 — /hilfe/transkription added to PUBLIC_PATHS, with the inline comment explaining the prerender requirement. The page already exists; it's now reachable without an auth cookie because the build process needs to prerender it. No visual change, no layout change, no accessibility impact.

The auth-flow seq diagram changes are documentation-only; the runtime behavior of the existing login screen is preserved (Caddy now terminates TLS in front of it, but the user-visible login form, the auth_token cookie's Secure flag, and the redirect behavior are unchanged from the user's perspective).

No new colors, no new typography, no new components, no new responsive breakpoints. Brand palette and design tokens untouched.

One forward-looking note

The /hilfe/transkription page becoming publicly accessible (without auth) is a small surface widening from a UX-of-helpfulness perspective: it means a user who lands on the help page from a shared link can read it without first having an account. Good for documentation discoverability; aligned with the dual-audience design where senior transcribers (60+) might be sent the link by a younger family member before they've registered.

Worth a one-line check at first staging deploy: load the page at 320px in a logged-out session and confirm the navigation chrome doesn't render broken links to auth-required pages. The page's own content is fine; the global nav around it might show "Profile" or "Documents" links that 404 for an unauthenticated visitor. Not a blocker for this PR — flag as a smoke-check.

Nothing for me to fix.

Author
Owner

📋 "Elicit" — Requirements Engineer (Brownfield mode)

Verdict: Approved

What I checked

  • Whether the PR implements the requirements in issue #497 completely, or leaves gaps.
  • Whether the acceptance criteria implicit in the PR description are testable.
  • Whether deferred items are tracked and traceable.
  • Whether the documentation that future operators will rely on is consistent with the shipped artifacts.

What's right

  1. Traceability is explicit. The PR description maps each commit to a concrete decision from issue #497's review comments (8331, 8333). Anyone six months from now can walk: issue #497 → ADR-009/010/011 → PR #499 commits → docker-compose.prod.yml. That trace is rare on infra PRs.

  2. The "test plan" section is split into pre-merge and post-merge. Pre-merge items are all checked (1566 backend tests, both compose files parse, prod image builds and serves /login, Caddyfile validates, YAML workflows parse). Post-merge items are concrete and verifiable (configure 16 secrets, install Caddy + ufw + fail2ban + Tailscale, trigger nightly via dispatch, confirm smoke checks, then push a v1.0.0 tag). This is how a Definition of Done reads when the work shipped is genuinely complete.

  3. Deferred work is named, not hidden. Three follow-ups are explicit in the description:

    • CI image smoke-test job (Sara's suggestion)
    • Backup pipeline (nightly pg_dump + mc mirror over Tailscale to heim-nas)
    • Observability stack (#498)

    Each has a stated reason for deferral ("keeps this PR focused"). That's exactly how a release-ready scope decision should read — Won't-this-release items, not silently dropped.

  4. The admin-password lock-in is documented in three places (DEPLOYMENT.md §3.5, ADR-011 context, prod compose env var comment). For a foot-gun this sharp (first-deploy decision permanently freezes the credential), redundant documentation is correct.

  5. ADRs include re-evaluation triggers, not just decisions. ADR-010's "Triggers to re-evaluate" gives concrete signals (minio-data > 50 GB, healthcheck p95 > 200 ms, etc.). ADR-011 names the migration trigger (second repo on the runner, public-fork PRs accepted). This converts decisions-with-time-bombs into decisions-with-tripwires. Excellent practice.

  6. The deployment runbook is a 6-step concrete sequence (docs/DEPLOYMENT.md §3.1–3.5), not a vague narrative. The 16-secrets table is a checklist, not prose. Rollback is one command, with the data-shape caveat explicit.

Open questions / TBD (track but don't block)

  • OQ-499-1: When does the minio-data volume cross the 50 GB / 5 GB-per-month threshold that triggers the Hetzner OBS migration per ADR-010? No monitoring exists yet to surface this. Owner: deferred to observability follow-up (#498).
  • OQ-499-2: What's the operational owner for the off-site backup pipeline (pg_dump + mc mirror to heim-nas)? The PR description treats it as a follow-up but doesn't assign or date it. Owner: should be the next infra issue created.
  • OQ-499-3: Is there a documented incident-response runbook for the first time fail2ban actually bans a legitimate user (e.g. a 67-year-old transcriber from the dual-audience set)? The jail tuning (maxretry=10 / bantime=30m) is forgiving, but no doc tells the operator "if you get a phone call, here's how to un-ban an IP". Owner: doc follow-up.

Recommendations (non-blocking)

  1. Surface the post-merge test plan as a Gitea checklist on issue #497 so closure is gated on the post-merge items actually being executed, not just on this PR landing.
  2. Open the three deferred items as Gitea issues now, even if they sit unassigned. The PR description names them; if they live only in the description, they fall off the radar the day this PR is merged.
  3. Add a glossary entry (docs/GLOSSARY.md) for "archiv-app service account" — a new domain term introduced by this PR that doesn't otherwise appear in the codebase outside of bootstrap.sh and the prod compose.

This is requirements-clean: scoped, traced, deferred work named, and the doc that survives the PR is internally consistent with the artifacts shipped.

marcel added 7 commits 2026-05-11 14:18:47 +02:00
Replaces the presence-only `grep -qi strict-transport-security` smoke
assertion in both nightly.yml and release.yml with a value-pinning
regex that requires `max-age=31536000`, `includeSubDomains`, and
`preload`. A future Caddyfile edit that drops any of those three
parts now fails the deploy smoke step instead of passing silently.

Verified locally that the new pattern matches the preload-eligible
value and rejects three degraded forms (short max-age, missing
includeSubDomains, missing preload). Addresses @sara's round-2 note
on PR #499 — "presence check, not value check".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `Permissions-Policy: camera=(), microphone=(), geolocation=()` to
the shared (security_headers) snippet, so both archiv vhosts and the
git vhost deny browser APIs the app does not use. Reduces blast radius
of an XSS landing in a privileged origin.

The deploy smoke steps in nightly.yml and release.yml gain a matching
assertion against the canonical header value, so a future Caddyfile
edit that drops or loosens the header (e.g. `camera=(self)`) fails the
deploy instead of regressing silently.

`caddy validate` against caddy:2 passes; both workflow YAMLs parse.
Addresses @nora's round-2 suggestion on PR #499 — "lower-impact than
CSP but nearly free".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
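In Caddyfile terms the change amounts to one extra line in the shared snippet. A sketch of the relevant lines only, with the sibling headers abbreviated:

```
(security_headers) {
	header {
		# Deny browser APIs the app does not use (round-2 addition):
		Permissions-Policy "camera=(), microphone=(), geolocation=()"
		# ... existing HSTS / X-Content-Type-Options / Referrer-Policy / -Server lines ...
	}
}
```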
Production never sources PDFs from localhost or 127.0.0.1 — the OCR
service only reads from MinIO over the internal docker network. The
Python default (`minio,localhost,127.0.0.1`) was permissive on
purpose for local dev, but in production a future change to that
default — or a host-env override — would silently broaden the SSRF
surface. Pinning the env var explicitly here freezes the allowlist
to the one hostname production actually needs.

`docker compose config --quiet` and `--profile staging config
--quiet` both still pass. Verified the resolved config emits
`ALLOWED_PDF_HOSTS: minio`. Addresses @nora's round-2 suggestion on
PR #499 — "five characters of YAML, lifetime guarantee".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
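The resulting compose fragment is small. A sketch of the relevant lines only, with sibling keys of the `ocr-service` definition omitted:

```yaml
services:
  ocr-service:
    environment:
      # Freeze the SSRF allowlist to the one internal hostname production
      # needs. Pinning it explicitly means a future change to the Python
      # default, or a host-env override, cannot silently broaden it.
      ALLOWED_PDF_HOSTS: "minio"
```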
The mailpit service healthcheck previously assumed `wget` ships in
the axllent/mailpit image. That's true for v1.29.7 but is not part
of the image's contract — a future Alpine slim-down could drop wget
and silently disable the healthcheck. Switched to BusyBox `nc -z
localhost 8025`, which is a TCP-port open check with no dependency
beyond BusyBox itself.

Verified inside axllent/mailpit:v1.29.7 that `nc` is present
(/usr/bin/nc, BusyBox v1.37.0) and that the proposed command
returns 0 against an open port and non-zero against a closed one.
Compose still parses with `--profile staging`. Addresses @tobi's
round-2 suggestion on PR #499.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
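As a compose fragment, the switched healthcheck looks roughly like this; the `interval` and `retries` values are illustrative, not taken from the PR:

```yaml
services:
  mailpit:
    image: axllent/mailpit:v1.29.7
    profiles: ["staging"]
    healthcheck:
      # BusyBox nc: plain TCP-open check on the web UI port,
      # with no dependency on wget shipping in the image.
      test: ["CMD", "nc", "-z", "localhost", "8025"]
      interval: 30s   # illustrative
      retries: 3      # illustrative
```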
The `if: always()` conditional on the env-file cleanup step in both
deploy workflows is what makes the ADR-011 single-tenant runner trust
model safe: secrets land on disk before each deploy and are wiped
unconditionally afterwards. A future workflow refactor that drops
`if: always()` would silently leave plaintext secrets on the runner
on any failed deploy.

The ADR documents this; the workflow file did not. Adds a prominent
inline comment so the next reader of the YAML sees the constraint
without having to cross-reference ADR-011. No behaviour change — both
workflows still parse. Addresses @nora's round-2 suggestion on PR
#499 — "linchpin of the ADR-011 trust model".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
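The annotated cleanup step has roughly this shape. A sketch: the step name and env-file path are illustrative, the `if: always()` conditional is the constraint the commit documents:

```yaml
- name: Remove deploy env-file
  # LOAD-BEARING (ADR-011): `if: always()` guarantees the plaintext secrets
  # written for this deploy are wiped from the runner even when the deploy
  # fails. Do not drop this conditional without re-evaluating the
  # single-tenant runner trust model.
  if: always()
  run: rm -f /tmp/deploy.env
```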
The docker network was the only `archive-*` identifier in either
compose file; everything else (user, db, bucket, service account,
project name) uses the `archiv-*` spelling. Reviewers' eyes stuttered
on it on the prod compose review (round 2 of PR #499 — Markus and
Tobi). Renamed in both prod and dev compose for consistency and
updated the single doc reference to the dev-project-prefixed
network name.

Operational note: applying this change to a running stack will
recreate the network on the next `docker compose up`; containers
restart, named volumes are unaffected.

`docker compose config --quiet` passes for both compose files and
for the staging profile. Sweep confirms zero `archive-net`
references remain in the tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(glossary): add archiv-app service account entry
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m48s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m5s
CI / fail2ban Regex (pull_request) Successful in 37s
CI / Compose Bucket Idempotency (pull_request) Failing after 50s
CI / Unit & Component Tests (push) Failing after 2m46s
CI / OCR Service Tests (push) Successful in 15s
CI / Backend Unit Tests (push) Successful in 4m4s
CI / fail2ban Regex (push) Successful in 37s
CI / Compose Bucket Idempotency (push) Failing after 50s
4d4d5793bb
`archiv-app` is the bucket-scoped MinIO service account introduced
in PR #499 alongside the production deploy pipeline. Until now the
term only appeared in `infra/minio/bootstrap.sh` and the prod compose
file; a reader encountering `S3_ACCESS_KEY: archiv-app` had no
single-page reference distinguishing it from the MinIO root account.

Adds a new "Infrastructure Terms" section to docs/GLOSSARY.md so the
distinction (root account vs. application service account) and the
attached `archiv-app-policy` scope live in the canonical glossary
location. Cross-links to ADR-010 for the MinIO-stays-self-hosted
rationale. Addresses @elicit's round-2 recommendation on PR #499.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Round-2 follow-up pass

Worked through the non-blocking round-2 concerns. Seven atomic commits pushed. No blockers remained — every reviewer had approved on round 2; these are the small near-free items that came out of the "approved with concerns" / "suggestions" lists.

Concerns addressed

🔒 Nora — security suggestions

| Commit | Concern | Resolution |
|---|---|---|
| `a4f2047b` security(ocr): pin ALLOWED_PDF_HOSTS=minio in prod ocr-service env | #8384 — *"ALLOWED_PDF_HOSTS is not explicitly set on the prod ocr-service"* | Explicit `ALLOWED_PDF_HOSTS: "minio"` env var in `docker-compose.prod.yml`. Future changes to the Python default or host-env overrides can no longer silently broaden the SSRF allowlist. Resolved config emits `ALLOWED_PDF_HOSTS: minio`. |
| `09680557` security(caddy): add Permissions-Policy header | #8384 — *"Permissions-Policy header is also missing… nearly free"* | Added `Permissions-Policy: camera=(), microphone=(), geolocation=()` to the `(security_headers)` Caddy snippet. Both deploy smoke steps now grep-assert the canonical value, so a future Caddyfile edit that loosens (e.g. `camera=(self)`) or drops the header fails the deploy. `caddy validate` against `caddy:2` passes. |
| `440a1911` infra(workflows): annotate env-file cleanup as load-bearing | #8384 — *"`if: always()` is the linchpin of the ADR-011 trust model… worth a comment"* | Inline comment in both `nightly.yml` and `release.yml` explaining that the conditional anchors the single-tenant runner trust model from ADR-011 and must not be dropped without re-evaluating it. |
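The pin is cheap to sanity-check against the resolved compose output. A minimal sketch — the heredoc-style variable below is a stand-in for the real `docker compose -f docker-compose.prod.yml config` output, not a copy of it:

```shell
#!/bin/sh
# Stand-in for one line of the resolved compose output; on the server the
# input would come from: docker compose -f docker-compose.prod.yml config
resolved='ALLOWED_PDF_HOSTS: minio'

# Assert the SSRF allowlist is pinned to the internal MinIO host and nothing else.
if echo "$resolved" | grep -Eq '^ALLOWED_PDF_HOSTS:[[:space:]]*minio$'; then
  echo "allowlist pinned"
else
  echo "allowlist NOT pinned" >&2
  exit 1
fi
```

Anchoring on the resolved config (rather than the YAML source) means host-env overrides that would widen the allowlist are caught too.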

CSP is deferred as Nora recommended (its own staging-validation cycle).

🧪 Sara — smoke pyramid

| Commit | Concern | Resolution |
|---|---|---|
| `8fcf653c` ci(smoke): pin HSTS to preload-list-eligible value | #8385 — *"presence check, not a value check; if a future Caddyfile edit drops `includeSubDomains; preload`, this passes"* | Replaced `grep -qi 'strict-transport-security'` with `grep -Eqi 'strict-transport-security:[[:space:]]*max-age=31536000.*includeSubDomains.*preload'` in both workflows. Verified locally that the new pattern matches the preload-eligible value and rejects three degraded forms (short `max-age`, missing `includeSubDomains`, missing `preload`). |
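The tightened pattern can be exercised locally without a live vhost. A sketch with sample header lines (the header values below are illustrative inputs, not captured responses):

```shell
#!/bin/sh
# The value-check pattern now used by both deploy smoke steps.
pat='strict-transport-security:[[:space:]]*max-age=31536000.*includeSubDomains.*preload'

good='strict-transport-security: max-age=31536000; includeSubDomains; preload'
degraded='strict-transport-security: max-age=300; includeSubDomains'

# The canonical, preload-eligible value passes...
echo "$good" | grep -Eqi "$pat" && echo "canonical: accepted"

# ...while a degraded value (short max-age, missing preload) is rejected.
echo "$degraded" | grep -Eqi "$pat" || echo "degraded: rejected"
```

The old presence-only check would have accepted both lines; the value check is what turns a silent Caddyfile regression into a failed deploy.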

🔧 Tobi — operational nits

| Commit | Concern | Resolution |
|---|---|---|
| `1873f50f` infra(mailpit): use nc -z healthcheck instead of wget | #8383 — *"`wget` is a non-obvious dependency on the `axllent/mailpit:v1.29.7` image… switch to `nc -z localhost 8025`"* | Switched to BusyBox `nc -z localhost 8025`. Verified inside `axllent/mailpit:v1.29.7` that `nc` is present (`/usr/bin/nc`, BusyBox v1.37.0) and that the command returns 0 on open ports and non-zero on closed ones. |
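For reference, the healthcheck now depends only on BusyBox. A sketch of the relevant compose stanza — the `interval`/`timeout`/`retries` values here are illustrative, not copied from the actual `docker-compose.prod.yml`:

```yaml
# Sketch: mailpit healthcheck via BusyBox nc (staging profile only).
mailpit:
  image: axllent/mailpit:v1.29.7
  profiles: [staging]
  healthcheck:
    # nc -z only probes the TCP port: exit 0 if it accepts, non-zero otherwise.
    test: ["CMD", "nc", "-z", "localhost", "8025"]
    interval: 30s   # illustrative value
    timeout: 5s     # illustrative value
    retries: 3      # illustrative value
```

Using the exec-array form (`CMD` rather than `CMD-SHELL`) avoids a further implicit dependency on a shell inside the image.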

🏛️ Markus + 🔧 Tobi — consistency

| Commit | Concern | Resolution |
|---|---|---|
| `9adde3cd` refactor(compose): rename docker network archive-net to archiv-net | #8381 + #8383 — *"docker network is `archive-net` while every other identifier is `archiv` (no `-e`)"* | Renamed across both `docker-compose.prod.yml` and `docker-compose.yml`, plus the lone `familienarchiv_archive-net` reference in `docs/infrastructure/ci-gitea.md`. Final sweep: zero `archive-net` references remain. Operational caveat: the next `docker compose up` on a running stack will recreate the network (containers restart; named volumes are unaffected). |
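The final sweep is mechanical. A minimal self-contained sketch in a throwaway directory — the single generated file stands in for the repo tree:

```shell
#!/bin/sh
# Stand-in tree: one compose file already using the corrected network name.
tmp=$(mktemp -d)
printf 'networks:\n  archiv-net:\n    driver: bridge\n' > "$tmp/docker-compose.prod.yml"

# The sweep: any hit on the old name is a regression. Note that `archiv-net`
# is NOT a substring match for `archive-net`, so a clean tree prints "clean".
grep -rIn 'archive-net' "$tmp" || echo "clean"

rm -rf "$tmp"
```

Run against the real checkout, the same `grep -rIn 'archive-net' .` (excluding `.git`) is what "zero references remain" means concretely.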

📋 Elicit — traceability

| Commit | Concern | Resolution |
|---|---|---|
| `4d4d5793` docs(glossary): add archiv-app service account entry | #8387 — *"Add a glossary entry for 'archiv-app service account' — a new domain term"* | New "Infrastructure Terms" section in `docs/GLOSSARY.md` documenting the `archiv-app` MinIO service account, the distinction from the MinIO root account (`archiv`), and the bucket-scoped `archiv-app-policy`. Cross-links to ADR-010. |
Explicitly deferred (reviewers themselves recommended deferral)

  • CSP header (Nora) — "its own staging-validation cycle"
  • Rollback CI (Tobi + Sara) — "not blocking v1"
  • Load/SSE soak test (Sara) — "once SSE-through-Caddy is live in staging"
  • bootstrap.sh 2>&1 on first invocation (Felix-self) — "not worth restructuring"
  • Negative-case binder test for ForwardHeadersConfigurationTest (Felix-self + Sara) — "framework-tested upstream"
  • Policy-glob tightening in bootstrap.sh (Felix-self) — "flag if you ever rename"
  • fail2ban 30-day tuning review (Nora) — track separately

Verification before push

  • ./mvnw test — 1566 tests, 0 failures, 0 errors, 0 skipped
  • ./mvnw clean package -DskipTests — BUILD SUCCESS
  • docker run … caddy:2 caddy validate — Valid configuration (Permissions-Policy header included)
  • docker compose -f docker-compose.prod.yml … config --quiet — passes for prod, staging profile, and dev compose
  • Live nc verification against axllent/mailpit:v1.29.7 — open port returns 0, closed port returns non-zero
  • Workflow YAMLs (nightly.yml, release.yml, ci.yml) all parse
  • Final sweep — zero archive-net references in tree

Commits on this branch since the round-2 reviews

4d4d5793 docs(glossary): add archiv-app service account entry
9adde3cd refactor(compose): rename docker network archive-net to archiv-net
440a1911 infra(workflows): annotate env-file cleanup as load-bearing
1873f50f infra(mailpit): use nc -z healthcheck instead of wget
a4f2047b security(ocr): pin ALLOWED_PDF_HOSTS=minio in prod ocr-service env
09680557 security(caddy): add Permissions-Policy header
8fcf653c ci(smoke): pin HSTS to preload-list-eligible value

Ready for merge.

— Felix

🤖 Generated with Claude Code

marcel merged commit 4d4d5793bb into main 2026-05-11 14:29:33 +02:00
marcel deleted branch feat/issue-497-prod-deploy 2026-05-11 14:29:34 +02:00

Reference: marcel/familienarchiv#499