feat(infra): production deployment pipeline — Caddy, staging, Gitea Actions (#497) #499
Reference in New Issue
Block a user
Delete Branch "feat/issue-497-prod-deploy"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Lands the production deployment pipeline described in #497. Two environments, both deployed by Gitea Actions on the existing self-hosted runner (Docker-out-of-Docker, no registry):
archiv.raddatz.cloud— production, onv*tag push (release.yml)staging.raddatz.cloud— staging, nightly frommain(nightly.yml)All decisions from Tobias's and Nora's review summaries (comments 8331 and 8333) are implemented.
Commits (TDD-ordered, atomic)
feat(security): trust X-Forwarded-Proto behind reverse proxyserver.forward-headers-strategy: native+ failing-first integration test inbackend/.../config/ForwardHeadersConfigurationTestfix(auth): mark /hilfe/transkription as public for prerendernpm run buildin the production Dockerfile stage. Surfaced by #497.feat(frontend): add production stage to Dockerfiledevelopment/build/production; pinnednode:20.19.0-alpine3.21; dev compose now specifiestarget: developmentso the dev workflow is unchanged.feat(infra): add docker-compose.prod.yml for production/stagingarchiv-appservice account, not root; mailpit underprofiles: [staging]; OCR healthcheck +mem_limit: 12g; ports bound to127.0.0.1only; admin creds wired through env.feat(infra): add production Caddyfile-Server;/actuator/*→ 404. NoX-Frame-Options(Spring Security configures SAMEORIGIN for PDF iframes — would conflict). Validated withcaddy validateagainstcaddy:2.feat(ci): add nightly staging deploy workflow0 2 * * *+workflow_dispatch;DOCKER_BUILDKIT=1;--profile staging(mailpit);up -d --wait; env-file cleanupif: always().feat(ci): add release production deploy workflowv*tag; tags built images with${{ gitea.ref_name }}so rollback is aTAG=<previous>one-liner.docs(deployment): rewrite for Gitea Actions / Caddy / prod composeMINIO_APP_PASSWORDand the admin-password first-deploy warning), 16-secrets bootstrap checklist, fail2ban + Tailscale install steps, rollback procedure.docs: retire overlay narrative; add Caddy to C4 L2 diagramdocs/infrastructure/production-compose.mdtrimmed (live files now in-tree);docs/architecture/c4/l2-containers.pumlgains Caddy.Test plan
Local verification (all green at HEAD of this branch):
./mvnw test— full backend suite, 1566 tests pass (incl. newForwardHeadersConfigurationTest)docker compose config --quiet— dev compose still parses, still resolvestarget: developmentfor frontenddocker compose -f docker-compose.prod.yml --env-file <stub> config --quiet— prod compose parsesdocker compose -f docker-compose.prod.yml --env-file <stub> --profile staging config --quiet— staging profile parses, mailpit includeddocker build --target production -f frontend/Dockerfile frontend/— production image builds;docker runconfirms it serves/loginHTTP 200docker run --rm -v ./infra/caddy/Caddyfile:/etc/caddy/Caddyfile:ro caddy:2 caddy validate— Valid configurationpython3 -c "import yaml; yaml.safe_load(open('.gitea/workflows/nightly.yml'))"and same forrelease.yml— both parseTest plan for after merge:
docs/DEPLOYMENT.md§3.3infra/caddy/Caddyfile→/etc/caddy/Caddyfile,systemctl reload caddynightly.ymlviaworkflow_dispatch, confirmhttps://staging.raddatz.cloud/returns 200 with HSTS header and/actuator/*returns 404v1.0.0tag, confirmrelease.ymldeploys production; verify rollback by runningTAG=<prev> docker compose ... up -d --waitDeferred to follow-up issues (per #497 discussion)
pg_dump+ MinIOmc mirror+ rsync over Tailscale toheim-nas)Closes #497.
🤖 Generated with Claude Code
Adds server.forward-headers-strategy: native so that Jetty honours X-Forwarded-{Proto,For,Host} from Caddy. Without this, getScheme(), redirect URLs, and Spring Session "Secure" cookies reflect the internal http hop instead of the original https client request. Refs #497. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>Adds two files mirroring the on-host install layout: infra/fail2ban/filter.d/familienarchiv-auth.conf infra/fail2ban/jail.d/familienarchiv.conf Filter parses the JSON access log emitted by Caddy (previous commit) and matches 401 responses on /api/auth/login. Jail bans the offending IP for 30 min after 10 attempts in a 10-minute window. Verified the failregex against four sample log lines via fail2ban-regex in an alpine container: - 2 brute-force 401 attempts → matched (ban) - 1 successful login (POST /api/auth/login 200) → not matched - 1 unrelated GET /login 200 → not matched Date template "ts":{EPOCH} parses Caddy's Unix-epoch ts field. The previous review iteration described this jail in DEPLOYMENT.md prose only; committing it makes the security posture reproducible from a fresh server build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>The two deploy workflows make two non-obvious assumptions that future maintainers should not have to rediscover by reading the diff: 1. Single-tenant self-hosted runner — the .env.* file lands on disk during the deploy and is cleaned up unconditionally. Multi-tenant usage would require switching to stdin-piped env input. 2. Host docker layer cache is authoritative — there is no actions/cache directive; a host-level `docker system prune` will cold-start the next build. Both notes added as block comments at the top of each workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>Updates DEPLOYMENT.md to match the infra changes in this PR: §1 OCR memory — point operators at the new OCR_MEM_LIMIT env var instead of telling them to edit "the prod overlay". §2 OCR env vars — add OCR_MEM_LIMIT to the table. §3.1 server setup — replace fail2ban prose with concrete `ln -sf` commands referencing the committed jail/filter. Document the single-tenant runner assumption near the runner-registration step. §3.4 first deploy — describe the new automated smoke test step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
The infrastructure design is sound — single-VPS, Docker Compose, project-name namespacing, MinIO self-hosted with a clean migration path. The decision to retire the documented overlay pattern in favour of a standalone
docker-compose.prod.ymlis the right call now that MinIO stays. My concerns are about documentation currency, not the implementation.Blockers — doc-currency table violations (CLAUDE.md, ARCHITECTURE.md)
Missing ADRs for three architectural decisions with lasting consequences. My review table says "Architectural decision with lasting consequences (new tech, new transport protocol, new pattern) → New ADR in
docs/adr/". This PR makes three such decisions and commits none:docs/infrastructure/production-compose.md's previous direction. Migration path is amc mirror+ 3 env vars but the decision and its cost/benefit (13GB on-VPS now, ~5 EUR/mo OBS later) belongs in an ADR.--env-file <(stdin)). That's the textbook content of an ADR: context, decision, alternatives, consequences.docs/architecture/c4/seq-auth-flow.pumlnot updated.server.forward-headers-strategy: nativechanges how Spring sees the scheme during cookie creation — the Secure cookie flag now depends onX-Forwarded-Proto: httpsfrom Caddy. My table says "Auth flow change →docs/architecture/c4/seq-auth-flow.puml". Update the sequence diagram to show Caddy → Spring withX-Forwarded-Proto: httpsand the Secure cookie set in the response.Suggestions
docs/DEPLOYMENT.mdmermaid (lines ~28–35 of the diff context) still showsBrowser -->|SSE direct| Backend. In production all browser traffic transits Caddy. The C4 L2 diagram correctly says "SSE notifications … fronted by Caddy" but this top-level deployment doc lags. Small inconsistency.production-compose.mdrewrite is a model trim — retires the overlay narrative, points to live files, retains the cost/sizing rationale. Good shape.What I'd keep
forward-headers-strategy: nativewith a backing config test — pushing trust to the edge of the system where it belongs.Once the three ADRs and the auth-flow diagram land, I approve.
— Markus
🤖 Generated with Claude Code
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Most of the diff is infra (Compose, workflows, Caddy, fail2ban) — outside the code style that bites me daily. The code-side changes are small and clean. Two test-design notes.
Suggestions
ForwardHeadersConfigurationTestasserts the YAML string, not the behaviour.backend/src/test/java/org/raddatz/familienarchiv/config/ForwardHeadersConfigurationTest.java:36-43This catches deletion of the property. It does not catch a typo (
nativ,Native,framework) — Spring'sServerProperties.ForwardHeadersStrategyenum coerces unknown values toNONEsilently in some configurations. The behaviour we care about is "Spring trustsX-Forwarded-Proto: httpsandrequest.getScheme()returnshttps". A behavioural test that dispatches a request with the header and assertsgetScheme()would actually verify the contract — and would survive a future strategy rename.@SpringBootTest+ Postgres container to verify a@Value-injected property is overkill.Same file, line 19-23:
@MockitoBean S3Clientis also a dead giveaway that the test pulled in more context than it needs. Per my own playbook this is the full @SpringBootTest when @WebMvcTest suffices anti-pattern. Either trim to a property-source unit test, or — if you take suggestion 1 — promote to a@WebMvcTestthat asserts the scheme behaviour end-to-end. The Postgres container is pure tax here.What I'd keep
frontend/Dockerfilemulti-stage: pinnednode:20.19.0-alpine3.21, dev / build / production explicitly named, comments explain why the dev stage stillCOPY . .'s (cold-start image for devcontainer). Clean.frontend/src/hooks.server.tsPUBLIC_PATHS addition: comment says why (prerendered help page — must be reachable without an auth cookie), not what. Exactly the form of comment I want — explains a constraint, doesn't narrate the line above.docker-compose.yml'starget: developmentone-liner with a comment explaining why the explicit target matters now that the Dockerfile is multi-stage. The kind of forensic detail future-me will need.ForwardHeadersConfigurationTest, then the YAML change, then the latent prerender fix, then infra commits. Reads as a clean trail.— Felix
🤖 Generated with Claude Code
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is a good production-deploy PR. Every concrete operational concern from my pre-merge review (#8331) landed: pinned image tags, named volumes, 127.0.0.1-only port binding, MinIO service account,
--wait+ smoke test, env-file cleanup, OCR healthcheck + parameterized mem_limit, fail2ban, rollback procedure. The remaining items are real but minor.Suggestions
Renovate is promised in comments but not configured in this PR.
docker-compose.prod.yml:48says "Pinned MinIO release for reproducible deploys; Renovate keeps it current." Same comment on:69(mc) and:94(mailpit). I don't seerenovate.json/.gitea/renovate.json/.github/renovate.jsonin the diff. If Renovate is already active on the repo at org level, fine — confirm in the PR description. If not, the comments are aspirational and these pinned tags will rot. File the Renovate-bootstrap follow-up issue now or strip "Renovate keeps it current" from the comments.docker compose … builddoesn't--pullthe base images..gitea/workflows/nightly.yml:78-85andrelease.yml:192-198. The runner relies on the host docker layer cache. If a CVE drops innode:20.19.0-alpine3.21orpostgres:16-alpineand someone re-publishes the same tag, the host will keep serving the cached layer until the cache is manually cleared. Add--pullto both build steps OR document the manualdocker system prune --filtercadence indocs/DEPLOYMENT.md§3.1. Single line of YAML for a big increase in patch-supply-chain hygiene.Smoke test assumes the runner can reach
staging.raddatz.cloudover the public internet..gitea/workflows/nightly.yml:96-108. If the self-hosted runner is on the same host as the deployed stack, hairpin NAT may not work on every router/firewall combo. A safer form:Or document the operational assumption ("the host must have public-DNS hairpin enabled, or run the runner on a different host"). Right now this is a latent failure mode for first-time setup.
No backup pipeline. Acknowledged as a deferred follow-up (
docs/DEPLOYMENT.md§4 "Until that ships: manual backups are the only recovery option."). Fine for first go-live, but the follow-up issue is the load-bearing one — confirm it's filed before this PR merges. Named volumes without a tested restore is a single point of failure.What I'd keep (the good stuff)
postgres:16-alpine,minio/minio:RELEASE.2025-02-28T09-55-16Z,minio/mc:RELEASE.2025-08-13T08-35-41Z,axllent/mailpit:v1.29.7,node:20.19.0-alpine3.21. Reproducible builds.127.0.0.1:only; Caddy is the front door. Same pattern as the canonical stack.archiv-appdistinct from root; thecreate-bucketsjob fails loud if the policy attach silently 404s (mc admin user info … | grep -q readwrite || exit 1). That's the operator's friend at 2am.DOCKER_BUILDKIT=1inenv:so the backend Dockerfile'sRUN --mount=type=cacheactually keeps the Maven cache warm. Easy miss; you caught it.up -d --wait+ smoke step — workflow fails loudly on a bad deploy. Exactly what I asked for.if: always()env-file cleanup in both workflows. Single-tenant runner discipline.OCR_MEM_LIMITparameterized with 12g default and CX32 documentation. Sized for the actual host tiers.infra/fail2ban/with symlink install steps in §3.1 — this is the config-as-code shape I want. fail2ban configs don't usually get versioned and that bites.Lock in Renovate (or strip the comment) and this is shippable.
— Tobi
🤖 Generated with Claude Code
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: 🚫 Changes requested
The defense-in-depth posture of this PR is genuinely strong — Caddy actuator block, security headers (with the correct X-Frame-Options omission for the PDF iframe), service-account separation, fail2ban, env-file cleanup, admin-password first-deploy warning,
forward-headers-strategy: nativewith a regression test. Almost all of my pre-merge findings (#8333) landed.But one HIGH finding regressed during implementation and a second is a smaller miss.
🔴 HIGH —
archiv-appgetss3:*on all resources, not just thefamilienarchivbucketFile:
docker-compose.prod.yml:84-86The discussion summary (#8333) committed to:
The implementation attaches MinIO's built-in
readwritepolicy, which grantss3:*onarn:aws:s3:::*— every bucket, present and future. Today MinIO has only one bucket so the effective surface is the same. The moment a second bucket appears (logs, backups, mc-mirror staging),archiv-appgets full read/write/delete on it. The whole point of the service account was that a backend RCE/SSRF cannot escalate beyond its bucket — this regression silently restores that escalation path.Threat model: RCE/SSRF in the backend, which uses
S3_ACCESS_KEY=archiv-app. Withreadwrite/*, attacker can list all buckets, exfiltrate or destroy a future backup bucket, and pivot. With the agreed bucket-scoped policy, the blast radius isfamilienarchivonly.Fix:
The existing
grep -q readwrite || exit 1fatal check — which I praised in #8333 — currently enforces the wrong policy. It must be updated alongside the policy fix or it will reject the correct, least-privilege configuration.🟡 MEDIUM — fail2ban watches
/api/auth/loginonly;/forgot-passwordis also brute-forceableFile:
infra/fail2ban/filter.d/familienarchiv-auth.conf:23My pre-merge review (#8323, recommendation 1 + #8333 summary) flagged both
/api/auth/loginand/api/auth/forgot-password./forgot-passwordis an email-enumeration vector and (depending on how the endpoint signals "user not found") a slow brute-force surface for resetting accounts whose addresses leak.Either widen the regex to match both endpoints, or add a second jail with a higher
findtime. Suggested widening:(Adding 429 catches rate-limit responses too — useful if you later layer an in-app rate limiter.)
🟢 What I'd keep (the good stuff)
forward-headers-strategy: nativewithForwardHeadersConfigurationTest— the security fix has a regression test, exactly the discipline I ask for in #8333./actuator/*returns 404 at the Caddy layer ((block_actuator)snippet) — defense in depth against any futuremanagement.endpoints.web.exposure.includemistake.-Server) applied via reusable(security_headers)snippet imported into both vhosts andgit.raddatz.cloud.X-Frame-Optionsdeliberately NOT set in Caddy with the inline comment explaining Spring'sframeOptions.sameOrigin()for PDF iframes. Exactly the conflict I asked you to avoid; the comment will save the next reviewer 30 minutes.archiv-appcreds are separated; backend never sees the root password.if: always()env-file cleanup on both workflows — secrets don't persist on the runner.profiles: [staging]— no real SMTP credentials in staging.Fix the MinIO policy and extend the fail2ban filter to
/forgot-passwordand I approve.— Nora
🤖 Generated with Claude Code
🧪 Sara Holt — QA Engineer
Verdict: ⚠️ Approved with concerns
The deploy verification story is in much better shape than the pre-merge issue.
up -d --waitmakes failures loud, the workflow smoke step asserts three concrete things, and the security-config change has a regression test. My remaining concerns are about what the regression test asserts and where coverage is still thin.Concerns
ForwardHeadersConfigurationTesttests the configuration string, not the behaviour.backend/src/test/java/org/raddatz/familienarchiv/config/ForwardHeadersConfigurationTest.java:36-43The test asserts
forwardHeadersStrategy.equals("native"). That catches deletion of the property; it does not catch a typo (nativ,Native), a refactor toframework, or a future Spring change to the strategy enum semantics. The contract under test is "Spring trustsX-Forwarded-Proto: httpsfrom Caddy and the resultingHttpServletRequest.getScheme()returnshttps" — that's what a regression test should pin. A@WebMvcTest+MockMvc.header("X-Forwarded-Proto", "https")+assertThat(request.getScheme()).isEqualTo("https")would test the actual behaviour.Same test is
@SpringBootTest+PostgresContainerConfigfor a@Value-injection assertion.Per my own pyramid: full
@SpringBootTestshould be reserved for integration paths. Reading a property doesn't need a Postgres container. The unused@MockitoBean S3Clientis the smell — it had to be mocked only because the full context was loaded. Either trim to a focused property-source test (no container, no S3 mock) or promote to the behavioural test above. The current shape is the worst of both: slow as integration, weak as a contract.fail2ban regex has no automated test.
infra/fail2ban/filter.d/familienarchiv-auth.confmakes a structural assumption about Caddy's JSON field order (documented as stable in Caddy 2.7+). A Caddy minor bump that reorders the JSON would break the jail silently — fail2ban returns "0 matches", the workflow smoke test still passes, the host stops banning brute-force attempts, and we discover this in an incident. A 5-line check is enough:This belongs in a small CI job, not the deploy workflow. Catches the next Caddy upgrade.
Smoke test assumes runner ↔ public-DNS hairpin.
.gitea/workflows/nightly.yml:96-108curlshttps://staging.raddatz.cloud/login. If the runner is on the same host as the stack, this routes via public DNS → router → back into the host. Many small-business routers don't do hairpin NAT correctly. Either pin withcurl --resolve "staging.raddatz.cloud:443:127.0.0.1"or document the operational requirement. Right now it's a deploy that may pass on day one and silently fail on day 90 after a router firmware update.No
create-bucketsidempotency test. The job is designed to run on everydocker compose up. The fatalgrep -q readwrite || exit 1check (docker-compose.prod.yml:86) makes it loud on failure, but I want a CI job that runsup -d --waittwice in sequence and asserts the second invocation also succeeds. Caught in CI now is much cheaper than caught at the third nightly deploy.What I'd keep
up -d --waitin both workflows. Single most impactful change for "did the deploy work?"docker compose config --quietfor prod + staging profile, prod-image build +curl /login200,caddy validate, both workflow YAMLs parse.Pyramid-correct the auth-headers test, add a fail2ban CI gate, and address the hairpin assumption (even just in docs) and I approve.
— Sara
🤖 Generated with Claude Code
🎨 Leonie Voss — UX & Accessibility
Verdict: ✅ Approved
Pure infrastructure PR — no layout, color, typography, contrast, or interaction changes for me to audit. I checked the diff for any user-visible UI deltas. One observation worth keeping:
A small positive for the senior audience
frontend/src/hooks.server.ts:5-13adds/hilfe/transkriptiontoPUBLIC_PATHS. Previously, a 67-year-old transcriber who bookmarked the help page or followed a link from email would land on the login screen before the help content — friction at the worst possible moment, when they're already struggling with Kurrent. Making the prerendered help page reachable without an auth cookie removes that friction. This is exactly the senior-first instinct the project should keep: the help docs should be the easiest path to follow, not gated.(No code feedback from me. The comment explaining why it's in PUBLIC_PATHS — "prerendered help page — must be reachable without an auth cookie" — is the right kind of comment.)
— Leonie
🤖 Generated with Claude Code
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Brownfield infrastructure PR with no functional/user-facing requirement change. I'm reviewing for traceability, NFR coverage, and open-question hygiene — not implementation.
Traceability — clean
All 8 acceptance criteria from issue #497 are accounted for in the PR description's table:
frontend/Dockerfile(commit 3)docker-compose.prod.yml(commit 4).gitea/workflows/nightly.yml(commit 6).gitea/workflows/release.yml(commit 7)infra/caddy/Caddyfile(commit 5)docker compose up(commit 3)docs/DEPLOYMENT.md §3.3docs/DEPLOYMENT.md §3.1–3.2Server-side ⏳ items are correctly transferred to the runbook with checklist semantics. No silent scope cuts.
NFR coverage — substantive
/actuator/*block, fail2ban on/api/auth/login, MinIO service account,forward-headers-strategy: native, env-file cleanup, admin-password lock-in warningup -d --wait, healthchecks on every service, rollback procedure in DEPLOYMENT.md §5Privacy / GDPR are unchanged from baseline (German company, EU data centres, family-only archive — no new external data exposure).
Open Questions / TBD register
docker-compose.prod.yml:48,69,94say "Renovate keeps it current" but norenovate.jsonis in the diff.pg_dump+ MinIOmc mirrorover Tailscale toheim-nas)?These are register items, not blockers. Surface and resolve.
What I'd keep
§3.1) reads like a textbook Definition-of-Done for a deploy: ufw → caddy → fail2ban → Tailscale → runner registration. Each step has a verify-with command.Resolve the four open questions (mostly a 4-line "yes/yes/yes/yes" reply or one Gitea issue per "no") and the requirements package is closed.
— Elicit
🤖 Generated with Claude Code
Replaces MinIO's built-in `readwrite` policy (which grants s3:* on arn:aws:s3:::* — every bucket present and future) with a bucket-scoped custom policy `archiv-app-policy`: - s3:GetObject / s3:PutObject / s3:DeleteObject on familienarchiv/* - s3:ListBucket / s3:GetBucketLocation on familienarchiv The previous configuration silently regressed the least-privilege guarantee that the service-account separation was supposed to provide: a future second bucket (logs, backups, mc-mirror staging) would have been read/write/delete-accessible to a compromised backend. While at it, two follow-on fixes: 1. Extract the entrypoint to infra/minio/bootstrap.sh. The previous inline `/bin/sh -c "..."` was already at the YAML-escaping ceiling; adding the policy-JSON heredoc would have made it unreadable. 2. Replace the `| grep -q readwrite || exit 1` fatal-check with a POSIX `case` substring match. The minio/mc image ships coreutils + bash but NOT grep/awk/sed — the original check was a no-op that ALWAYS exited 1 (verified locally). The new check passes on the first invocation and on every subsequent re-deploy. Idempotency verified locally: two consecutive `docker compose run --rm create-buckets` invocations both exit 0 with the user bound to the new policy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>Adds the Caddy hop to seq-auth-flow.puml and surfaces the two production-relevant header behaviours: - Caddy terminates TLS and forwards X-Forwarded-Proto: https - Spring Boot trusts this header (server.forward-headers-strategy: native, ForwardedRequestCustomizer at the Jetty layer), so request.getScheme() returns "https" - The Set-Cookie response carries the Secure flag because the observed scheme is https — without forward-headers-strategy this would silently drop to plain http and the cookie would lose Secure Closes the doc-currency gap flagged in the Markus review on PR #499: "Auth flow change → docs/architecture/c4/seq-auth-flow.puml". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>👨💻 Felix Brandt — Implementation pass on the multi-persona review
Worked through every concern raised across @markus, @felix (myself), @tobi, @nora, @sara, @leonie, @elicit. Pushed 13 atomic commits and filed 3 follow-up issues. Every reviewer-blocker is addressed; every actionable suggestion is in.
🔴 Nora — HIGH blocker resolved
MinIO policy scoping —
91f70e65security(minio): scope archiv-app to bucket-only IAM policyReplaced the built-in
readwritepolicy attach with a customarchiv-app-policy:s3:GetObject/s3:PutObject/s3:DeleteObjectonarn:aws:s3:::familienarchiv/*s3:ListBucket/s3:GetBucketLocationonarn:aws:s3:::familienarchivWhile in there, caught a latent bug worth flagging: the
minio/mcimage ships coreutils + bash but NOTgrep/awk/sed. The| grep -q readwrite || exit 1fatal-check that Nora praised in #8333 and again in #8353 always exited 1 — i.e. the entrypoint always failed (verified locally). Replaced with a POSIXcasesubstring match, which actually works. Extracted the whole entrypoint toinfra/minio/bootstrap.shbecause the policy JSON heredoc + escaping would have made the inline YAML unreadable.Idempotency verified locally: two consecutive
docker compose run --rm create-bucketsinvocations both exit 0 witharchiv-appbound toarchiv-app-policy.🟡 Nora — MEDIUM resolved
fail2ban widened —
7e430998security(fail2ban): widen jail to /forgot-password and rate-limit 429failregex = ^\s*\{.*?"remote_ip":"<HOST>".*?"uri":"/api/auth/(login|forgot-password).*?"status":\s*4(01|29)\bCovers both endpoints Nora flagged, plus 429 in case a future in-app rate-limiter responds before fail2ban hits the volume threshold.
🏛️ Markus — Blockers resolved
Three new ADRs for the architectural decisions that needed lasting records:
59bc81d3docs(adr): ADR-009 standalone docker-compose.prod.yml — captures why overlay was retired now that MinIO stays, and the cost (env-var duplication) vs benefit (one file, no merge surprises) trade-offb57afb9adocs(adr): ADR-010 MinIO stays self-hosted — Hetzner OBS deferred; four explicit triggers documented to re-open the decision (50 GB threshold, healthcheck latency, VPS upgrade cost, backup runtime)6a6a1c43docs(adr): ADR-011 single-tenant Gitea runner — operational assumption + migration path to--env-file <(printf …)when the trust boundary changesAuth-flow diagram —
03d47884docs(arch): show Caddy + X-Forwarded-Proto in auth-flow diagramUpdated
seq-auth-flow.pumlto show the Caddy hop withX-Forwarded-Proto: https, the JettyForwardedRequestCustomizerreading it, and theSecureflag on the Set-Cookie response that depends on it.DEPLOYMENT.md mermaid —
a7a80f8cdocs(deployment): route SSE through Caddy in topology mermaidThe top-level mermaid now agrees with the C4 L2 diagram: SSE notifications transit Caddy, the backend is never reachable directly.
👨💻 Felix / 🧪 Sara — Test pyramid concerns
ForwardHeadersConfigurationTestrewrite —e5d953detest(config): rewrite as context-less binder testReplaced
@SpringBootTest+ Postgres container +@MockitoBean S3Client(the worst-of-both: slow as integration, weak as contract) with a context-less SpringBindertest that loadsapplication.yamland binds the property to the typedServerProperties.ForwardHeadersStrategyenum.Trade-off note: the suggested
@WebMvcTest-with-X-Forwarded-Proto-header path doesn't actually work forstrategy: native— that's a Jetty-layer customizer, not a Spring filter, soMockMvcwould never exercise it. A true behavioural test would need@SpringBootTest(webEnvironment = RANDOM_PORT)+ a test controller, which keeps the Postgres container and is heavier than the original. The Binder + typed-enum approach catches what Sara and I both worried about — typos (nativ,Native), case errors, and Spring renames all fail withBindException. Verified red onnativ, green onnative.🔧 Tobi — Suggestions resolved
--pullon docker compose build —f2ec8154ci(deploy): add --pull for CVE pickupBoth
nightly.ymlandrelease.ymlnow force a base-image re-fetch so CVE re-publications of pinned tags are picked up at the next nightly instead of "next manual prune".Smoke-test hairpin —
fe1451f5ci(smoke): pin curl to 127.0.0.1 via --resolveSmoke step uses
curl --resolve "$HOST:443:127.0.0.1"against botharchiv.raddatz.cloudandstaging.raddatz.cloud. SNI still uses the public hostname (cert validates) but the connection goes to loopback — no hairpin NAT dependency on the host router.Renovate aspiration —
33300e4achore(infra): drop aspirational Renovate commentsThe repo's
renovate.jsonis minimal (only TipTap grouping); Renovate isn't bumping the production images. Stripped the "Renovate keeps it current" lines fromdocker-compose.prod.yml. Filed #500 for the bootstrap work.🧪 Sara — CI gaps resolved
fail2ban-regex CI job —
9652894atest(ci): add fail2ban-regex regression jobNew
fail2ban-regexCI job pipes a deterministic Caddy access-log line throughfail2ban-regexagainst the live filter and asserts the expected match count. The regex widening commit (7e430998) added matching cases for/forgot-password 401,/login 429, and a negative case (/api/documents 401). A future Caddy upgrade that reorders JSON fields will fail this job loudly at PR time instead of silently breaking the jail.Compose idempotency CI job —
156afa14test(ci): add compose bucket-bootstrap idempotency jobRuns
docker compose run --rm create-bucketstwice against a throwaway minio stack and asserts both invocations exit 0. Catches re-deploy breakage before the third nightly run at 02:00.📋 Elicit — Open questions closed
🎨 Leonie
No code action needed — the prerendered help page was already correctly opened in PUBLIC_PATHS. Thanks for the senior-audience UX framing.
Verification before push
./mvnw test— 1566 tests, 0 failures, 0 errors, 0 skipped./mvnw clean package -DskipTests— BUILD SUCCESSdocker compose -f docker-compose.prod.yml -p test-idem run --rm create-bucketstwice locally — both exit 0 with new policy boundpython3 -c "import yaml; yaml.safe_load(open('.gitea/workflows/<each>'))"— all three workflows parsenativeCommits on this branch since the review
Follow-up issues filed
Ready for a re-review pass.
— Felix
🤖 Generated with Claude Code
🏛️ Markus Keller — Application Architect
Verdict: ✅ Approved
What I checked
What I like
System_Boundary(archiv, "Docker Compose")(docs/architecture/c4/l2-containers.puml:6-7) — accurate, because Caddy is host-installed, not in the compose stack. The accompanyingRel(caddy, frontend, …)/Rel(caddy, backend, …)edges make the new topology unambiguous.docs/architecture/c4/seq-auth-flow.puml) — the X-Forwarded-Proto hop is called out with a note pinned to the Backend lifeline. A reader who never openedapplication.yamlstill understands whyserver.forward-headers-strategy: nativematters.-p archiv-production/-p archiv-staging) is the right Compose-native isolation primitive. No bespoke YAML mangling.Suggestions (non-blocking)
archive-net(docker-compose.prod.yml:464) while every other identifier in the prod compose isarchiv(no-e):archivuser/db/bucket,archiv-appservice account,archiv-production/archiv-stagingproject names. Not load-bearing, but the diff reviewer's eye stutters on it. If you're touching the prod compose for the follow-up backup PR, considerarchiv-net.mc?" question in the L2 diagram —Container(mc, "Bucket / Service-Account Init", …)is shown inside the boundary, but it's a one-shot init container, not a long-running service. The current label ("One-shot container on startup") handles this in prose. I would not change it; just noting for the next architect reviewing the diagram.The architecture story this PR lands is consistent end-to-end: ADRs justify the structural choices, the compose file matches the ADRs, the C4 diagrams match the compose file, and the deployment doc matches the diagrams. That's the order I look for, and it holds. Ship it.
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
What I checked
What I like
ForwardHeadersConfigurationTestis the cheapest possible config-binding test. (backend/src/test/java/.../config/ForwardHeadersConfigurationTest.java:1) No@SpringBootTest, no Testcontainers, no embedded server. It bindsserver.forward-headers-strategyfrom the YAML viaBinder+ConfigurationPropertySourcesand asserts the enum value. If somebody typesnativorNativeorframework(trailing space), the binder throwsBindExceptionand the test fails loudly — exactly the failure mode the comment promises. This is unit-test discipline applied to a config invariant, which is rare and right.frontend/src/hooks.server.ts:8-15) —/hilfe/transkriptionadded toPUBLIC_PATHS. The trailing comment ("prerendered help page — must be reachable without an auth cookie") is the why, not the what. That's the only kind of comment worth writing.infra/minio/bootstrap.shis readable shell. The header comment lays out the 7-step contract; each step is one or two lines ofmc+ idempotency-tolerant||. The fatal-assertioncaseblock at the bottom is exactly the kind of fail-loud guard rail I want at the boundary of a system — if the policy didn't bind, the script exits 1 and the compose deploy fails, instead of silently shipping a backend that getsAccessDeniedfrom MinIO three days later.feat(security): trust X-Forwarded-Protolands the test first, then the YAML change). The PR description's commit table matches the actual git log — easy to bisect.Suggestions (non-blocking)
#!/bin/shand the comment promises POSIX. The linecase "$INFO" in *archiv-app-policy*)will match any substring including comment lines or other policy names containing that substring. If you ever rename the policy to e.g.archiv-app-policy-v2, the*archiv-app-policy*glob would still match the v2 string. Not a today-problem; flag it if you ever rotate the policy name.||idempotency pattern in bootstrap.sh swallows the first command's stderr. Ifmc admin user addfails for a non-idempotency reason (e.g. MinIO unreachable mid-script), the next line (mc admin user enable) will also fail, but the operator sees only the second failure. This is recoverable from logs and not worth restructuring, but I'd add2>&1to the first invocation if I were touching this again so the original error reaches stdout.ForwardHeadersConfigurationTestcovers only the happy path. A second test asserting the binder rejects an invalid value (e.g.server.forward-headers-strategy: yes) would lock the "fail loud on typo" contract that the comment claims. Not a blocker — the binder behavior is framework-tested upstream and your assertion message already documents the intent.TDD discipline held: test commit precedes the YAML change in the history. Shell scripts read top-to-bottom. The hooks.server.ts change is one line of intent with a one-line why-comment. Ship it.
🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
This is the cleanest infra PR I've seen on this project. Let me walk through what I checked.
What I checked
depends_on: condition: service_healthychains.What's right
Every image tag is pinned.
postgres:16-alpine(docker-compose.prod.yml:475)minio/minio:RELEASE.2025-02-28T09-55-16Z(line 494)minio/mc:RELEASE.2025-08-13T08-35-41Z(line 516)axllent/mailpit:v1.29.7(line 533)node:20.19.0-alpine3.21(frontend/Dockerfile:1, repeated on every stage)No
:latestanywhere in production. The--pullflag ondocker compose buildin both workflows means CVE re-publications of these pinned tags actually get re-fetched on every deploy instead of being served from a stale layer cache. That's the right interpretation of "pin tags + still patch CVEs" — a nuance most projects get wrong.Named volumes throughout the prod compose (
postgres-data,minio-data,ocr-models,ocr-cache). No bind mounts.archiv-production_postgres-dataandarchiv-staging_postgres-dataare auto-namespaced by-p. Survivesdocker compose down, survives container rebuilds, survives the operator's curiosity.Ports bound to
127.0.0.1only (docker-compose.prod.yml:587, 631). Caddy is the sole external entry point. Combined withufw allow 22,80,443and the Hetzner cloud firewall, that's a three-layer net.MinIO root creds are never the backend's creds. The backend is wired to the bucket-scoped
archiv-appservice account (docker-compose.prod.yml:594-595), and thearchiv-app-policyinbootstrap.sh:50-65is genuinely least-privilege —GetObject/PutObject/DeleteObjectonfamilienarchiv/*,ListBucket/GetBucketLocationon the bucket only. Nots3:*. A backend RCE cannot escalate beyond this bucket.Healthchecks on every service,
depends_on: condition: service_healthyon every dependency. Backend waits for db + minio + ocr. Frontend waits for backend. No race-on-startup.CI proves the two invariants that would silently fail otherwise:
fail2ban-regexjob (.gitea/workflows/ci.yml:118+) pins the Caddy-JSON → fail2ban-filter contract against 5 deterministic samples (3 positive, 2 negative). A Caddy upgrade that reorders log keys would now fail CI instead of silently disabling brute-force protection in production.compose-idempotencyjob runscreate-bucketstwice and asserts both exit 0. The bootstrap script is the only piece of stateful init in the stack; this prevents a "second deploy fails" surprise.Smoke tests in the deploy workflows are minimal but right —
/loginreturns 200, HSTS header present,/actuator/healthreturns 404. Three things that will break if Caddy isn't reloaded or the snippet imports are wrong. The--resolvetrick to pinarchiv.raddatz.cloud → 127.0.0.1avoids depending on the host router's hairpin NAT — that's the kind of detail that bites you at 2 a.m. when the deploy worked but the smoke test "fails" because the router doesn't loopback.actions/checkout@v4— current. No deprecated actions in the workflows.Rollback procedure is one command (
docs/DEPLOYMENT.md:283+):And the doc explicitly flags that Flyway migrations are not auto-rolled-back — exactly the warning operators need before they try this.
Suggestions (non-blocking)
No
actions/cachefor Maven/node in the deploy workflows. The in-workflow comment acknowledges this (host docker layer cache is authoritative). For a nightly + tag-trigger cadence on a self-hosted runner with persistent disk, that's acceptable — a cold rebuild afterdocker system prunecosts 5-10 min once, not on every run. Don't addactions/cachehere; the host cache is the right tool for a single-tenant self-hosted runner.OCR_MEM_LIMITdefaults to 12g (docker-compose.prod.yml:554). Documented in DEPLOYMENT.md and called out in the env-var table. Set to 6g if you ever downsize to CX32. Good as-is.No Renovate config in this PR. The inline comments at
minio:andmailpit:say "bumped manually until Renovate is bootstrapped". Open a follow-up issue for the Renovate config — three pinned images is the floor at which manual bumps stop being a chore and start being a security smell.mailpithealthcheck useswget— fine, butwgetis a non-obvious dependency on theaxllent/mailpit:v1.29.7image. If that image ever drops wget (unlikely; it'salpinebased), the healthcheck silently fails. Cheap follow-up: switch tonc -z localhost 8025which has no binary dependency.The
archive-netnetwork name (one extrae) is the only typo in the file. Everything else isarchiv-*. Cosmetic; fix on next infra touch.Idempotency CI is good; rollback CI is not. The current
release.ymlbuild-and-deploy chain is one-way. There's no automated test that a previous tag can be redeployed onto the current state. The doc walks through it; an "actually try rollback in staging" job would catch the case where a destructive migration breaks the rollback path. Open as a follow-up; not blocking v1.Monthly cost stays ~18 EUR (1× CX32 + DNS + SMTP). MinIO data on-VPS, no Object Storage line item until ADR-010 fires. Good.
Operationally I'd ship this today.
🔒 Nora "NullX" Steiner — Application Security
Verdict: ⚠️ Approved with concerns
This is a well-considered security baseline. Most of what I look for in a first production deploy is present and explicit. I'm flagging two concerns I'd want addressed before or shortly after merge, and a few suggestions for hardening.
What I checked
/actuator/*boundary.What's right
server.forward-headers-strategy: native(backend/src/main/resources/application.yaml:41) with a unit test pinning the binding. The threat model — backend is only reachable from Caddy on127.0.0.1and from other containers on the internalarchive-net— makes blanket X-Forwarded-Proto trust safe. Production exposure to the public internet is via Caddy only.-Serverin the Caddyfile snippet(security_headers)(infra/caddy/Caddyfile:17-23).max-age=31536000; includeSubDomains; preload— one year, subdomains, preload-ready. Don't actually submit to the preload list yet (it's a one-way decision affectingraddatz.cloudsiblings) but the header is right.X-Frame-Optionsdeliberately NOT set in Caddy, with a comment explaining Spring Security configuresframe-options SAMEORIGINfor the PDF preview iframe. Conflict avoided. The comment is the audit trail./actuator/* → 404at the Caddy layer (infra/caddy/Caddyfile:25-31) AND the internal Prometheus scrape goes via the Docker network, not Caddy. Defense in depth./actuator/heapdumpcannot leak.infra/minio/bootstrap.sh:50-66) —archiv-app-policygrantss3:GetObject/PutObject/DeleteObjectonfamilienarchiv/*ands3:ListBucket/GetBucketLocationonfamilienarchiv. Not the MinIO built-inreadwritewhich iss3:*on*. A backend RCE or SSRF cannot pivot tomc admin user listor another bucket. The fatal-assertion at script end (case "$INFO" in *archiv-app-policy*)) catches a regression where the policy attach silently fails./api/auth/loginand/api/auth/forgot-passwordon 401 and 429 (infra/fail2ban/filter.d/familienarchiv-auth.conf:32). Covering forgot-password is the right call — it stops email-enumeration via 401 differential responses. Including 429 future-proofs against an in-app rate limiter being added later..gitea/workflows/ci.yml:fail2ban-regexjob) with 5 deterministic samples covering positive + negative cases. A Caddy upgrade that re-orders the JSON log keys would now fail CI instead of silently disabling the jail — this is the single most important non-obvious test in the PR.UserDataInitializeronly seeds when the row doesn't exist; secret rotation after first-boot does nothing. This is captured in two places. Good.${{ secrets.* }}for all sensitive values, env-file written + cleaned withif: always(). ADR-011 documents the single-tenant runner assumption that makes this safe, and the migration path to--env-file <(printf '%s' …)if a second repo ever shares the runner. The assumption is load-bearing and is captured in three places (ADR-011, inline comments in both workflow YAMLs)..env.testwritten at runtime, not committed.Concerns (non-blocking but worth addressing soon)
Missing
Content-Security-Policyheader. None of the Caddy snippets emit a CSP. A first-passContent-Security-Policy: default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; frame-ancestors 'self';would block reflected XSS escalation paths and align with theframe-options SAMEORIGINyou already enforce at the Spring level. SvelteKit's adapter-node can emit nonces for inline scripts if the policy goes stricter later. I would open this as a follow-up issue, not block this PR — CSP misconfigurations break apps in subtle ways and need their own staging-validation cycle.ALLOWED_PDF_HOSTSis not explicitly set on the prodocr-service(docker-compose.prod.yml:559-563). The Python OCR service defaults tominio,localhost,127.0.0.1, which is correct for this deployment (the internal Docker hostname isminio), but the env var isn't pinned in the prod compose. If a future commit changes the Python default (or someone overrides via host env), SSRF protection silently loosens. Suggested fix: addALLOWED_PDF_HOSTS: "minio"explicitly to the prod compose'socr-service.environmentblock. Five characters of YAML, lifetime guarantee.Suggestions
fail2ban tuning is defensible but worth reviewing in 30 days.
maxretry=10 / findtime=10m / bantime=30mis generous for a family-archive context (a 67-year-old typing their password wrong eight times should not get banned). The trade-off is that a distributed credential-stuffing attack with low per-IP rate evades this jail entirely. The right counter is an in-app per-account rate limit, not stricter fail2ban. Track as a follow-up issue.Permissions-Policyheader is also missing. Lower-impact than CSP but nearly free:Permissions-Policy: camera=(), microphone=(), geolocation=()denies APIs the app doesn't use, reducing the blast radius if an XSS lands. Add to the(security_headers)snippet whenever you re-touch the Caddyfile.The
if: always()env-file cleanup is the linchpin of the ADR-011 trust model. Every workflow refactor must preserve it. Worth a comment in the workflow itself (something stronger than the current one) explaining that the line is load-bearing — but the ADR already does this, so this is a "nice to have" not a blocker.Summary
Three things lift this above a typical first-production-deploy security posture: the bucket-scoped service account with a fatal-assertion bootstrap, the CI regression test for the fail2ban regex, and the explicit single-tenant trust model in ADR-011. The two concerns (CSP, explicit
ALLOWED_PDF_HOSTS) are follow-ups, not blockers.🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ✅ Approved
What I checked
sleep, no shared state.What's right
fail2ban-regex(5 cases) → contract test for Caddy-JSON-log → fail2ban-filter compatibility.compose-idempotency→ contract test forcreate-bucketsre-runnability.nightly.yml/release.yml→ post-deploy assertion on the three things most likely to silently break (HTTP 200 on/login, HSTS header,/actuator/*blocked).ForwardHeadersConfigurationTest→ unit-level pin on theforward-headers-strategy: nativebinding.import security_headersin the Caddyfile — they curl the live endpoint and check the response. That's the right level for a deploy smoke test: it catches "Caddy not reloaded", "snippet not imported", "container not bound to loopback" all at once.--resolvepinning in the smoke tests (.gitea/workflows/nightly.yml:228) avoids the classic hairpin-NAT flake where the workflow passes locally but fails on the deployed host because the home router doesn't loop traffic back. SNI still uses the public hostname so the cert validates. This is the kind of detail that distinguishes a robust smoke test from one that teams learn to ignore.fail2ban-regex. Two of the five cases (Does not match /api/auth/login 200,Does not match /api/documents 401) assert the filter doesn't match unrelated traffic. False-positive coverage matters as much as true-positive coverage — without it, a regex that matches everything would still pass the positive cases.compose-idempotencyjob tears down withdown -v(.gitea/workflows/ci.yml:117) underif: always(). No leftover volumes contaminating the next run.set -einbootstrap.sh+ the finalcaseassertion → script fails loudly if MinIO state ever drifts. This is a test executed at deploy time, not at CI time, and it's the right place for it.Suggestions (non-blocking)
ForwardHeadersConfigurationTesthas only the happy path. A second test asserting thatBinder.bind(..., ForwardHeadersStrategy.class)rejects an invalid value (e.g. bindingnativwould throw) would lock the failure-mode the docstring promises. The framework guarantees this behavior, so I'm not asking for it — but if you ever extend this test class, add the negative case.compose-idempotencydoesn't verify the secondcreate-bucketsrun actually re-attached the policy correctly. It only asserts bothdocker compose run create-bucketsinvocations exit 0. Thebootstrap.shfatal-assert at the end (case "$INFO" in *archiv-app-policy*) is what guarantees the policy actually bound — so if the second run silently passed without attaching, the script's own assert would catch it. Coverage is adequate; flagging only because the test name implies it asserts state, when it actually asserts exit code.HSTS header is presentviagrep -qi 'strict-transport-security'. That's a presence check, not a value check. If a future Caddyfile edit dropsincludeSubDomains; preload, this passes. Tightening togrep -qi 'max-age=31536000.*includeSubDomains.*preload'costs nothing and pins the security-team-relevant invariant.release.ymlhappy-path is well-covered; rolling back to a previous tag is documented but not tested. If you ever ship a destructive migration and the rollback breaks at 2am, you'll wish this existed. Follow-up issue, not blocking.The test pyramid for this PR's surface area looks like:
That's the right shape. Ship it.
🎨 Leonie Voss — UX & Accessibility
Verdict: ✅ Approved (LGTM — out of scope)
What I checked
/hilfe/transkriptionbeing moved to the public-paths list has any user-visible effect.Findings
This PR is pure infrastructure. The only frontend-adjacent change is:
frontend/src/hooks.server.ts:8-15—/hilfe/transkriptionadded toPUBLIC_PATHS, with the inline comment explaining the prerender requirement. The page already exists; it's now reachable without an auth cookie because the build process needs to prerender it. No visual change, no layout change, no accessibility impact.The auth-flow seq diagram changes are documentation-only; the runtime behavior of the existing login screen is preserved (Caddy now terminates TLS in front of it, but the user-visible login form, the
auth_tokencookie'sSecureflag, and the redirect behavior are unchanged from the user's perspective).No new colors, no new typography, no new components, no new responsive breakpoints. Brand palette and design tokens untouched.
One forward-looking note
The
/hilfe/transkriptionpage becoming publicly accessible (without auth) is a small surface widening from a UX-of-helpfulness perspective: it means a user who lands on the help page from a shared link can read it without first having an account. Good for documentation discoverability; aligned with the dual-audience design where senior transcribers (60+) might be sent the link by a younger family member before they've registered.Worth a one-line check at first staging deploy: load the page at 320px in a logged-out session and confirm the navigation chrome doesn't render broken links to auth-required pages. The page's own content is fine; the global nav around it might show "Profile" or "Documents" links that 404 for an unauthenticated visitor. Not a blocker for this PR — flag as a smoke-check.
Nothing for me to fix.
📋 "Elicit" — Requirements Engineer (Brownfield mode)
Verdict: ✅ Approved
What I checked
What's right
Traceability is explicit. The PR description maps each commit to a concrete decision from issue #497's review comments (8331, 8333). Anyone six months from now can walk:
issue #497 → ADR-009/010/011 → PR #499 commits → docker-compose.prod.yml. That trace is rare on infra PRs.The "test plan" section is split into pre-merge and post-merge. Pre-merge items are all checked (1566 backend tests, both compose files parse, prod image builds and serves
/login, Caddyfile validates, YAML workflows parse). Post-merge items are concrete and verifiable (configure 16 secrets, install Caddy + ufw + fail2ban + Tailscale, trigger nightly via dispatch, confirm smoke checks, then push av1.0.0tag). This is how a Definition of Done reads when the work shipped is genuinely complete.Deferred work is named, not hidden. Three follow-ups are explicit in the description:
pg_dump+mc mirrorover Tailscale to heim-nas)Each has a stated reason for deferral ("keeps this PR focused"). That's exactly how a release-ready scope decision should read — Won't-this-release items, not silently dropped.
The admin-password lock-in is documented in three places (DEPLOYMENT.md §3.5, ADR-011 context, prod compose env var comment). For a foot-gun this sharp (first-deploy decision permanently freezes the credential), redundant documentation is correct.
ADRs include re-evaluation triggers, not just decisions. ADR-010's "Triggers to re-evaluate" gives concrete signals (
minio-data > 50 GB, healthcheck p95 > 200 ms, etc.). ADR-011 names the migration trigger (second repo on the runner, public-fork PRs accepted). This converts decisions-with-time-bombs into decisions-with-tripwires. Excellent practice.The deployment runbook is a 6-step concrete sequence (
docs/DEPLOYMENT.md§3.1–3.5), not a vague narrative. The 16-secrets table is a checklist, not prose. Rollback is one command, with the data-shape caveat explicit.Open questions / TBD (track but don't block)
minio-datavolume cross the 50 GB / 5 GB-per-month threshold that triggers the Hetzner OBS migration per ADR-010? No monitoring exists yet to surface this. Owner: deferred to observability follow-up (#498).pg_dump+mc mirrorto heim-nas)? The PR description treats it as a follow-up but doesn't assign or date it. Owner: should be the next infra issue created.maxretry=10 / bantime=30m) is forgiving, but no doc tells the operator "if you get a phone call, here's how to un-ban an IP". Owner: doc follow-up.Recommendations (non-blocking)
docs/GLOSSARY.md) for "archiv-app service account" — a new domain term introduced by this PR that doesn't otherwise appear in the codebase outside of bootstrap.sh and the prod compose.This is requirements-clean: scoped, traced, deferred work named, and the doc that survives the PR is internally consistent with the artifacts shipped.
👨💻 Felix Brandt — Round-2 follow-up pass
Worked through the non-blocking round-2 concerns. Seven atomic commits pushed. No blockers remained — every reviewer had approved on round 2; these are the small near-free items that came out of the "approved with concerns" / "suggestions" lists.
Concerns addressed
🔒 Nora — security suggestions
a4f2047bsecurity(ocr): pin ALLOWED_PDF_HOSTS=minio in prod ocr-service envALLOWED_PDF_HOSTS: "minio"env var indocker-compose.prod.yml. Future changes to the Python default or host-env overrides can no longer silently broaden the SSRF allowlist. Resolved config emitsALLOWED_PDF_HOSTS: minio.09680557security(caddy): add Permissions-Policy headerPermissions-Policy: camera=(), microphone=(), geolocation=()to the(security_headers)Caddy snippet. Both deploy smoke steps now grep-assert the canonical value, so a future Caddyfile edit that loosens (e.g.camera=(self)) or drops the header fails the deploy.caddy validateagainstcaddy:2passes.440a1911infra(workflows): annotate env-file cleanup as load-bearingif: always()is the linchpin of the ADR-011 trust model… worth a comment"nightly.ymlandrelease.ymlexplaining the conditional anchors the single-tenant runner trust model from ADR-011 and must not be dropped without re-evaluating.CSP is deferred as Nora recommended (its own staging-validation cycle).
🧪 Sara — smoke pyramid
8fcf653cci(smoke): pin HSTS to preload-list-eligible valueincludeSubDomains; preload, this passes"grep -qi 'strict-transport-security'withgrep -Eqi 'strict-transport-security:[[:space:]]*max-age=31536000.*includeSubDomains.*preload'in both workflows. Verified locally that the new pattern matches the preload-eligible value and rejects three degraded forms (shortmax-age, missingincludeSubDomains, missingpreload).🔧 Tobi — operational nits
1873f50finfra(mailpit): use nc -z healthcheck instead of wgetwgetis a non-obvious dependency on theaxllent/mailpit:v1.29.7image… switch tonc -z localhost 8025"nc -z localhost 8025. Verified insideaxllent/mailpit:v1.29.7thatncis present (/usr/bin/nc, BusyBox v1.37.0) and the command returns 0 on open ports and non-zero on closed ones.🏛️ Markus + 🔧 Tobi — consistency
9adde3cdrefactor(compose): rename docker network archive-net to archiv-netarchive-netwhile every other identifier isarchiv(no-e)"docker-compose.prod.ymlanddocker-compose.ymlplus the lonefamilienarchiv_archive-netreference indocs/infrastructure/ci-gitea.md. Final sweep: zeroarchive-netreferences remain. Operational caveat: nextdocker compose upon a running stack will recreate the network (containers restart; named volumes are unaffected).📋 Elicit ��� traceability
4d4d5793docs(glossary): add archiv-app service account entrydocs/GLOSSARY.mddocumenting thearchiv-appMinIO service account, the distinction from the MinIO root account (archiv), and the bucket-scopedarchiv-app-policy. Cross-links to ADR-010.Explicitly deferred (reviewers themselves recommended deferral)
2>&1on first invocation (Felix-self) — "not worth restructuring"Verification before push
./mvnw test— 1566 tests, 0 failures, 0 errors, 0 skipped./mvnw clean package -DskipTests— BUILD SUCCESSdocker run … caddy:2 caddy validate— Valid configuration (Permissions-Policy header included)docker compose -f docker-compose.prod.yml … config --quiet— passes for prod, staging profile, and dev composeaxllent/mailpit:v1.29.7— open port returns 0, closed port returns non-zeronightly.yml,release.yml,ci.yml) all parsearchive-netreferences in treeCommits on this branch since the round-2 reviews
Ready for merge.
— Felix
🤖 Generated with Claude Code