devops: production deployment — Caddy, staging env, and Gitea Actions CI/CD #497

New Issue

marcel · 2026-05-10T19:52:35+02:00

marcel commented

2026-05-10 19:52:35 +02:00

Context

Set up the full production deployment pipeline for the Familienarchiv app on the root server. Covers two environments:

archiv.raddatz.cloud — production, deployed on git tag push (v*)
staging.raddatz.cloud — staging, deployed nightly from main

The runner already uses Docker-out-of-Docker (DooD) via the mounted socket, so CI builds go directly to the host daemon — no registry needed.

Spring Boot production profile

No application-prod.yaml is needed. The base application.yaml is already production-ready:

All sensitive values come from env vars
open-in-view: false ✓
show-sql: false ✓
Swagger/OpenAPI disabled ✓

The dev profile only enables Swagger and SQL logging. In production we simply don't activate it (SPRING_PROFILES_ACTIVE is not set to dev).

Codebase changes

1. `frontend/Dockerfile` — add production stage

Currently dev-only (runs npm run dev). Needs a production target using the Node adapter output:

# ── Development (default) ────────────────────────────────────────────────────
FROM node:20-alpine AS development
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
EXPOSE 5173
CMD ["npm", "run", "dev"]

# ── Production ───────────────────────────────────────────────────────────────
FROM node:20-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine AS production
WORKDIR /app
COPY --from=build /app/build ./build
COPY --from=build /app/package.json ./
RUN npm ci --omit=dev
EXPOSE 3000
ENV NODE_ENV=production
CMD ["node", "build"]

The dev docker-compose.yml is unaffected — its bind mount overrides the COPY and the CMD is already overridden.

2. `docker-compose.prod.yml` — new file

Key differences from the dev compose:

Named volumes for all data (no ./data/ bind mounts)
Frontend uses target: production build stage
Ports bound to 127.0.0.1 only (Caddy handles external traffic)
No Mailpit, no Vite dev server, no source bind mounts
SPRING_PROFILES_ACTIVE not set to dev

networks:
  archive-net:
    driver: bridge

volumes:
  postgres-data:
  minio-data:
  ocr-models:
  ocr-cache:

services:
  db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: archiv
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: archiv
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - archive-net
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U archiv -d archiv"]
      interval: 10s
      timeout: 5s
      retries: 5

  minio:
    image: minio/minio:latest
    restart: unless-stopped
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: archiv
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - minio-data:/data
    networks:
      - archive-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  create-buckets:
    image: minio/mc
    depends_on:
      minio:
        condition: service_healthy
    networks:
      - archive-net
    entrypoint: >
      /bin/sh -c "
      /usr/bin/mc alias set myminio http://minio:9000 archiv ${MINIO_PASSWORD};
      /usr/bin/mc mb myminio/familienarchiv --ignore-existing;
      /usr/bin/mc anonymous set private myminio/familienarchiv;
      exit 0;
      "

  ocr-service:
    build:
      context: ./ocr-service
    restart: unless-stopped
    volumes:
      - ocr-models:/models
      - ocr-cache:/cache
    environment:
      TRAINING_TOKEN: ${OCR_TRAINING_TOKEN}
    networks:
      - archive-net

  backend:
    image: familienarchiv/backend:${TAG:-nightly}
    build:
      context: ./backend
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      minio:
        condition: service_healthy
    ports:
      - "127.0.0.1:${PORT_BACKEND:-8081}:8080"
    environment:
      SPRING_DATASOURCE_URL: jdbc:postgresql://db:5432/archiv
      SPRING_DATASOURCE_USERNAME: archiv
      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
      S3_ENDPOINT: http://minio:9000
      S3_ACCESS_KEY: archiv
      S3_SECRET_KEY: ${MINIO_PASSWORD}
      S3_BUCKET_NAME: familienarchiv
      S3_REGION: us-east-1
      APP_BASE_URL: https://${APP_DOMAIN}
      APP_OCR_BASE_URL: http://ocr-service:8000
      APP_OCR_TRAINING_TOKEN: ${OCR_TRAINING_TOKEN}
      MAIL_HOST: ${MAIL_HOST}
      MAIL_PORT: ${MAIL_PORT:-587}
      MAIL_USERNAME: ${MAIL_USERNAME}
      MAIL_PASSWORD: ${MAIL_PASSWORD}
      APP_MAIL_FROM: noreply@raddatz.cloud
      SPRING_MAIL_PROPERTIES_MAIL_SMTP_AUTH: "true"
      SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: "true"
    networks:
      - archive-net
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080/actuator/health | grep -q UP || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 10
      start_period: 30s

  frontend:
    image: familienarchiv/frontend:${TAG:-nightly}
    build:
      context: ./frontend
      target: production
    restart: unless-stopped
    depends_on:
      backend:
        condition: service_healthy
    ports:
      - "127.0.0.1:${PORT_FRONTEND:-3001}:3000"
    environment:
      API_INTERNAL_URL: http://backend:8080
    networks:
      - archive-net

3. `.gitea/workflows/nightly.yml`

Triggered at 02:00 every night and on workflow_dispatch. Deploys to staging (archiv-staging project, ports 8081/3001).

name: nightly

on:
  schedule:
    - cron: '0 2 * * *'
  workflow_dispatch:

jobs:
  deploy-staging:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Build images
        run: TAG=nightly docker compose -f docker-compose.prod.yml -p archiv-staging build

      - name: Write staging env
        run: |
          cat > .env.staging << EOF
          TAG=nightly
          PORT_BACKEND=8081
          PORT_FRONTEND=3001
          APP_DOMAIN=staging.raddatz.cloud
          POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }}
          MINIO_PASSWORD=${{ secrets.STAGING_MINIO_PASSWORD }}
          OCR_TRAINING_TOKEN=${{ secrets.STAGING_OCR_TRAINING_TOKEN }}
          MAIL_HOST=${{ secrets.MAIL_HOST }}
          MAIL_PORT=${{ secrets.MAIL_PORT }}
          MAIL_USERNAME=${{ secrets.MAIL_USERNAME }}
          MAIL_PASSWORD=${{ secrets.MAIL_PASSWORD }}
          EOF

      - name: Deploy
        run: |
          docker compose -f docker-compose.prod.yml \
            -p archiv-staging \
            --env-file .env.staging \
            up -d --remove-orphans

4. `.gitea/workflows/release.yml`

Triggered on v* tag push. Deploys to production (archiv-production project, ports 8080/3000).

name: release

on:
  push:
    tags:
      - 'v*'

jobs:
  deploy-production:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Build images
        run: TAG=${{ gitea.ref_name }} docker compose -f docker-compose.prod.yml -p archiv-production build

      - name: Write production env
        run: |
          cat > .env.production << EOF
          TAG=${{ gitea.ref_name }}
          PORT_BACKEND=8080
          PORT_FRONTEND=3000
          APP_DOMAIN=archiv.raddatz.cloud
          POSTGRES_PASSWORD=${{ secrets.PROD_POSTGRES_PASSWORD }}
          MINIO_PASSWORD=${{ secrets.PROD_MINIO_PASSWORD }}
          OCR_TRAINING_TOKEN=${{ secrets.PROD_OCR_TRAINING_TOKEN }}
          MAIL_HOST=${{ secrets.MAIL_HOST }}
          MAIL_PORT=${{ secrets.MAIL_PORT }}
          MAIL_USERNAME=${{ secrets.MAIL_USERNAME }}
          MAIL_PASSWORD=${{ secrets.MAIL_PASSWORD }}
          EOF

      - name: Deploy
        run: |
          docker compose -f docker-compose.prod.yml \
            -p archiv-production \
            --env-file .env.production \
            up -d --remove-orphans

Gitea secrets to configure

Repo → Settings → Secrets and Variables → Actions:

Secret	Notes
`PROD_POSTGRES_PASSWORD`	strong unique password
`PROD_MINIO_PASSWORD`	strong unique password
`PROD_OCR_TRAINING_TOKEN`	random token
`STAGING_POSTGRES_PASSWORD`	different from prod
`STAGING_MINIO_PASSWORD`	different from prod
`STAGING_OCR_TRAINING_TOKEN`	random token
`MAIL_HOST`	SMTP server hostname
`MAIL_PORT`	typically `587`
`MAIL_USERNAME`	SMTP user
`MAIL_PASSWORD`	SMTP password

Server one-time setup

Caddy (`/etc/caddy/Caddyfile`)

archiv.raddatz.cloud {
    handle /api/* {
        reverse_proxy localhost:8080
    }
    handle {
        reverse_proxy localhost:3000
    }
}

staging.raddatz.cloud {
    handle /api/* {
        reverse_proxy localhost:8081
    }
    handle {
        reverse_proxy localhost:3001
    }
}

git.raddatz.cloud {
    reverse_proxy localhost:3005
}

DNS records

archiv.raddatz.cloud   A   <server IP>
staging.raddatz.cloud  A   <server IP>
git.raddatz.cloud      A   <server IP>

Firewall

Ports 80 and 443 must be open. Port 222 for Git SSH.

Environment isolation

Docker project name (-p) namespaces all resources automatically:

Resource	Production	Staging
Project	`archiv-production`	`archiv-staging`
DB volume	`archiv-production_postgres-data`	`archiv-staging_postgres-data`
Backend port	`8080`	`8081`
Frontend port	`3000`	`3001`

Acceptance criteria

frontend/Dockerfile has a production stage; dev compose still works unchanged
docker-compose.prod.yml exists and starts all services with named volumes
nightly.yml workflow deploys to staging on schedule; manually triggerable
release.yml workflow deploys to production on v* tag push
All 10 Gitea secrets configured
Caddy routes archiv.raddatz.cloud and staging.raddatz.cloud correctly with TLS
DNS records pointing to server
docker compose up (dev) still works locally without changes

Effort

M — 1 day. Most time is server provisioning and first-deploy smoke testing.

## Context Set up the full production deployment pipeline for the Familienarchiv app on the root server. Covers two environments: - `archiv.raddatz.cloud` — production, deployed on git tag push (`v*`) - `staging.raddatz.cloud` — staging, deployed nightly from `main` The runner already uses Docker-out-of-Docker (DooD) via the mounted socket, so CI builds go directly to the host daemon — no registry needed. --- ## Spring Boot production profile **No `application-prod.yaml` is needed.** The base `application.yaml` is already production-ready: - All sensitive values come from env vars - `open-in-view: false` ✓ - `show-sql: false` ✓ - Swagger/OpenAPI disabled ✓ The `dev` profile only enables Swagger and SQL logging. In production we simply don't activate it (`SPRING_PROFILES_ACTIVE` is not set to `dev`). --- ## Codebase changes ### 1. `frontend/Dockerfile` — add production stage Currently dev-only (runs `npm run dev`). Needs a `production` target using the Node adapter output: ```dockerfile # ── Development (default) ──────────────────────────────────────────────────── FROM node:20-alpine AS development WORKDIR /app COPY package.json package-lock.json ./ RUN npm ci COPY . . EXPOSE 5173 CMD ["npm", "run", "dev"] # ── Production ─────────────────────────────────────────────────────────────── FROM node:20-alpine AS build WORKDIR /app COPY package.json package-lock.json ./ RUN npm ci COPY . . RUN npm run build FROM node:20-alpine AS production WORKDIR /app COPY --from=build /app/build ./build COPY --from=build /app/package.json ./ RUN npm ci --omit=dev EXPOSE 3000 ENV NODE_ENV=production CMD ["node", "build"] ``` The dev `docker-compose.yml` is unaffected — its bind mount overrides the COPY and the CMD is already overridden. ### 2. `docker-compose.prod.yml` — new file Key differences from the dev compose: - Named volumes for all data (no `./data/` bind mounts) - Frontend uses `target: production` build stage - Ports bound to `127.0.0.1` only (Caddy handles external traffic) - No Mailpit, no Vite dev server, no source bind mounts - `SPRING_PROFILES_ACTIVE` not set to `dev` ```yaml networks: archive-net: driver: bridge volumes: postgres-data: minio-data: ocr-models: ocr-cache: services: db: image: postgres:16-alpine restart: unless-stopped environment: POSTGRES_USER: archiv POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} POSTGRES_DB: archiv volumes: - postgres-data:/var/lib/postgresql/data networks: - archive-net healthcheck: test: ["CMD-SHELL", "pg_isready -U archiv -d archiv"] interval: 10s timeout: 5s retries: 5 minio: image: minio/minio:latest restart: unless-stopped command: server /data --console-address ":9001" environment: MINIO_ROOT_USER: archiv MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD} volumes: - minio-data:/data networks: - archive-net healthcheck: test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"] interval: 30s timeout: 20s retries: 3 create-buckets: image: minio/mc depends_on: minio: condition: service_healthy networks: - archive-net entrypoint: > /bin/sh -c " /usr/bin/mc alias set myminio http://minio:9000 archiv ${MINIO_PASSWORD}; /usr/bin/mc mb myminio/familienarchiv --ignore-existing; /usr/bin/mc anonymous set private myminio/familienarchiv; exit 0; " ocr-service: build: context: ./ocr-service restart: unless-stopped volumes: - ocr-models:/models - ocr-cache:/cache environment: TRAINING_TOKEN: ${OCR_TRAINING_TOKEN} networks: - archive-net backend: image: familienarchiv/backend:${TAG:-nightly} build: context: ./backend restart: unless-stopped depends_on: db: condition: service_healthy minio: condition: service_healthy ports: - "127.0.0.1:${PORT_BACKEND:-8081}:8080" environment: SPRING_DATASOURCE_URL: jdbc:postgresql://db:5432/archiv SPRING_DATASOURCE_USERNAME: archiv SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD} S3_ENDPOINT: http://minio:9000 S3_ACCESS_KEY: archiv S3_SECRET_KEY: ${MINIO_PASSWORD} S3_BUCKET_NAME: familienarchiv S3_REGION: us-east-1 APP_BASE_URL: https://${APP_DOMAIN} APP_OCR_BASE_URL: http://ocr-service:8000 APP_OCR_TRAINING_TOKEN: ${OCR_TRAINING_TOKEN} MAIL_HOST: ${MAIL_HOST} MAIL_PORT: ${MAIL_PORT:-587} MAIL_USERNAME: ${MAIL_USERNAME} MAIL_PASSWORD: ${MAIL_PASSWORD} APP_MAIL_FROM: noreply@raddatz.cloud SPRING_MAIL_PROPERTIES_MAIL_SMTP_AUTH: "true" SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: "true" networks: - archive-net healthcheck: test: ["CMD-SHELL", "wget -qO- http://localhost:8080/actuator/health | grep -q UP || exit 1"] interval: 15s timeout: 5s retries: 10 start_period: 30s frontend: image: familienarchiv/frontend:${TAG:-nightly} build: context: ./frontend target: production restart: unless-stopped depends_on: backend: condition: service_healthy ports: - "127.0.0.1:${PORT_FRONTEND:-3001}:3000" environment: API_INTERNAL_URL: http://backend:8080 networks: - archive-net ``` ### 3. `.gitea/workflows/nightly.yml` Triggered at 02:00 every night and on `workflow_dispatch`. Deploys to staging (`archiv-staging` project, ports 8081/3001). ```yaml name: nightly on: schedule: - cron: '0 2 * * *' workflow_dispatch: jobs: deploy-staging: runs-on: self-hosted steps: - uses: actions/checkout@v4 - name: Build images run: TAG=nightly docker compose -f docker-compose.prod.yml -p archiv-staging build - name: Write staging env run: | cat > .env.staging << EOF TAG=nightly PORT_BACKEND=8081 PORT_FRONTEND=3001 APP_DOMAIN=staging.raddatz.cloud POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }} MINIO_PASSWORD=${{ secrets.STAGING_MINIO_PASSWORD }} OCR_TRAINING_TOKEN=${{ secrets.STAGING_OCR_TRAINING_TOKEN }} MAIL_HOST=${{ secrets.MAIL_HOST }} MAIL_PORT=${{ secrets.MAIL_PORT }} MAIL_USERNAME=${{ secrets.MAIL_USERNAME }} MAIL_PASSWORD=${{ secrets.MAIL_PASSWORD }} EOF - name: Deploy run: | docker compose -f docker-compose.prod.yml \ -p archiv-staging \ --env-file .env.staging \ up -d --remove-orphans ``` ### 4. `.gitea/workflows/release.yml` Triggered on `v*` tag push. Deploys to production (`archiv-production` project, ports 8080/3000). ```yaml name: release on: push: tags: - 'v*' jobs: deploy-production: runs-on: self-hosted steps: - uses: actions/checkout@v4 - name: Build images run: TAG=${{ gitea.ref_name }} docker compose -f docker-compose.prod.yml -p archiv-production build - name: Write production env run: | cat > .env.production << EOF TAG=${{ gitea.ref_name }} PORT_BACKEND=8080 PORT_FRONTEND=3000 APP_DOMAIN=archiv.raddatz.cloud POSTGRES_PASSWORD=${{ secrets.PROD_POSTGRES_PASSWORD }} MINIO_PASSWORD=${{ secrets.PROD_MINIO_PASSWORD }} OCR_TRAINING_TOKEN=${{ secrets.PROD_OCR_TRAINING_TOKEN }} MAIL_HOST=${{ secrets.MAIL_HOST }} MAIL_PORT=${{ secrets.MAIL_PORT }} MAIL_USERNAME=${{ secrets.MAIL_USERNAME }} MAIL_PASSWORD=${{ secrets.MAIL_PASSWORD }} EOF - name: Deploy run: | docker compose -f docker-compose.prod.yml \ -p archiv-production \ --env-file .env.production \ up -d --remove-orphans ``` --- ## Gitea secrets to configure Repo → Settings → Secrets and Variables → Actions: | Secret | Notes | |---|---| | `PROD_POSTGRES_PASSWORD` | strong unique password | | `PROD_MINIO_PASSWORD` | strong unique password | | `PROD_OCR_TRAINING_TOKEN` | random token | | `STAGING_POSTGRES_PASSWORD` | different from prod | | `STAGING_MINIO_PASSWORD` | different from prod | | `STAGING_OCR_TRAINING_TOKEN` | random token | | `MAIL_HOST` | SMTP server hostname | | `MAIL_PORT` | typically `587` | | `MAIL_USERNAME` | SMTP user | | `MAIL_PASSWORD` | SMTP password | --- ## Server one-time setup ### Caddy (`/etc/caddy/Caddyfile`) ```caddyfile archiv.raddatz.cloud { handle /api/* { reverse_proxy localhost:8080 } handle { reverse_proxy localhost:3000 } } staging.raddatz.cloud { handle /api/* { reverse_proxy localhost:8081 } handle { reverse_proxy localhost:3001 } } git.raddatz.cloud { reverse_proxy localhost:3005 } ``` ### DNS records ``` archiv.raddatz.cloud A <server IP> staging.raddatz.cloud A <server IP> git.raddatz.cloud A <server IP> ``` ### Firewall Ports 80 and 443 must be open. Port 222 for Git SSH. --- ## Environment isolation Docker project name (`-p`) namespaces all resources automatically: | Resource | Production | Staging | |---|---|---| | Project | `archiv-production` | `archiv-staging` | | DB volume | `archiv-production_postgres-data` | `archiv-staging_postgres-data` | | Backend port | `8080` | `8081` | | Frontend port | `3000` | `3001` | --- ## Acceptance criteria - [ ] `frontend/Dockerfile` has a `production` stage; dev compose still works unchanged - [ ] `docker-compose.prod.yml` exists and starts all services with named volumes - [ ] `nightly.yml` workflow deploys to staging on schedule; manually triggerable - [ ] `release.yml` workflow deploys to production on `v*` tag push - [ ] All 10 Gitea secrets configured - [ ] Caddy routes `archiv.raddatz.cloud` and `staging.raddatz.cloud` correctly with TLS - [ ] DNS records pointing to server - [ ] `docker compose up` (dev) still works locally without changes ## Effort M — 1 day. Most time is server provisioning and first-deploy smoke testing.

marcel added the P2-medium devops phase-2: container-images phase-3: prod-compose phase-4: spring-prod-profile labels 2026-05-10 19:52:46 +02:00

marcel commented

2026-05-10 19:56:54 +02:00

🏛️ Markus Keller — Application Architect

Observations

Standalone vs overlay — diverges from existing docs. docs/infrastructure/production-compose.md documents an overlay pattern: docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d, where MinIO is disabled via profiles: ["dev"] in favour of Hetzner Object Storage. This issue proposes a self-contained docker-compose.prod.yml that re-introduces MinIO as a production service. The two approaches are mutually exclusive and the docs need to win or be updated — not silently bypassed.
MinIO in production vs Hetzner OBS — an architectural decision, not an implementation detail. The existing architecture docs made an explicit choice: Hetzner OBS for production (S3-compatible, no MinIO to operate, no data on the VPS itself). The issue reverses this without naming the reason. Both approaches are valid, but the choice has lasting consequences for backup strategy, storage costs, and operational complexity.
Architecture diagram update required. Per the doc-update table in CLAUDE.md: "New Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md." If Caddy runs as a host service (not a Docker container), it should still appear in the C4 L2 diagram as an infrastructure component. The current l2-containers.puml references it only as an implicit boundary, not as a named container. This PR must update that diagram before merge.
DooD (Docker-out-of-Docker) is the right call here. Building directly on the host daemon avoids a registry entirely on a single-VPS setup. The project name namespacing (archiv-staging / archiv-production) cleanly isolates volumes, networks, and containers between environments.
"No application-prod.yaml is needed" — verified correct. Checked application.yaml: springdoc.api-docs.enabled: false, springdoc.swagger-ui.enabled: false, open-in-view: false, show-sql: false. Swagger and SQL logging are already off at baseline. The dev profile re-enables them. Simply not setting SPRING_PROFILES_ACTIVE=dev is sufficient.
docs/DEPLOYMENT.md references the overlay approach in its "Dev vs production differences" table (Spring profile: prod). This will be stale after the issue is implemented.

Recommendations

Choose MinIO vs Hetzner OBS now and state the reason in the issue body. If the decision is MinIO (simpler, self-contained, no Hetzner dependency), update docs/infrastructure/production-compose.md and the DEPLOYMENT.md table to reflect it. If the decision is Hetzner OBS, keep the overlay pattern and add the profiles: ["dev"] gate. Don't leave the docs contradicting the code.
Update docs/architecture/c4/l2-containers.puml to add Caddy as an infrastructure component and show the two port paths (:3000 → frontend, :8080 → backend).
Update docs/DEPLOYMENT.md "Dev vs production differences" table to match whichever compose strategy is chosen.

Open Decisions

MinIO in production vs Hetzner Object Storage. The existing docs say Hetzner OBS; this issue uses MinIO. Options: (A) MinIO — simpler, everything on VPS, but adds ~500MB RAM overhead and requires backup strategy for MinIO data. (B) Hetzner OBS — no MinIO to operate, built-in replication, S3-compatible, ~5 EUR/month, but adds external dependency. The right answer depends on whether Marcel wants the archive fully self-contained or is comfortable with Hetzner as a storage provider. (Raised by: Markus)

## 🏛️ Markus Keller — Application Architect ### Observations - **Standalone vs overlay — diverges from existing docs.** `docs/infrastructure/production-compose.md` documents an overlay pattern: `docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d`, where MinIO is disabled via `profiles: ["dev"]` in favour of Hetzner Object Storage. This issue proposes a self-contained `docker-compose.prod.yml` that re-introduces MinIO as a production service. The two approaches are mutually exclusive and the docs need to win or be updated — not silently bypassed. - **MinIO in production vs Hetzner OBS — an architectural decision, not an implementation detail.** The existing architecture docs made an explicit choice: Hetzner OBS for production (S3-compatible, no MinIO to operate, no data on the VPS itself). The issue reverses this without naming the reason. Both approaches are valid, but the choice has lasting consequences for backup strategy, storage costs, and operational complexity. - **Architecture diagram update required.** Per the doc-update table in CLAUDE.md: "New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`." If Caddy runs as a host service (not a Docker container), it should still appear in the C4 L2 diagram as an infrastructure component. The current `l2-containers.puml` references it only as an implicit boundary, not as a named container. This PR must update that diagram before merge. - **DooD (Docker-out-of-Docker) is the right call here.** Building directly on the host daemon avoids a registry entirely on a single-VPS setup. The project name namespacing (`archiv-staging` / `archiv-production`) cleanly isolates volumes, networks, and containers between environments. - **"No `application-prod.yaml` is needed" — verified correct.** Checked `application.yaml`: `springdoc.api-docs.enabled: false`, `springdoc.swagger-ui.enabled: false`, `open-in-view: false`, `show-sql: false`. Swagger and SQL logging are already off at baseline. The dev profile re-enables them. Simply not setting `SPRING_PROFILES_ACTIVE=dev` is sufficient. - **`docs/DEPLOYMENT.md` references the overlay approach** in its "Dev vs production differences" table (`Spring profile: prod`). This will be stale after the issue is implemented. ### Recommendations - **Choose MinIO vs Hetzner OBS now and state the reason in the issue body.** If the decision is MinIO (simpler, self-contained, no Hetzner dependency), update `docs/infrastructure/production-compose.md` and the DEPLOYMENT.md table to reflect it. If the decision is Hetzner OBS, keep the overlay pattern and add the `profiles: ["dev"]` gate. Don't leave the docs contradicting the code. - **Update `docs/architecture/c4/l2-containers.puml`** to add Caddy as an infrastructure component and show the two port paths (`:3000` → frontend, `:8080` → backend). - **Update `docs/DEPLOYMENT.md`** "Dev vs production differences" table to match whichever compose strategy is chosen. ### Open Decisions - **MinIO in production vs Hetzner Object Storage.** The existing docs say Hetzner OBS; this issue uses MinIO. Options: (A) MinIO — simpler, everything on VPS, but adds ~500MB RAM overhead and requires backup strategy for MinIO data. (B) Hetzner OBS — no MinIO to operate, built-in replication, S3-compatible, ~5 EUR/month, but adds external dependency. The right answer depends on whether Marcel wants the archive fully self-contained or is comfortable with Hetzner as a storage provider. _(Raised by: Markus)_

marcel commented

2026-05-10 19:57:12 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

Frontend Dockerfile production stage is clean. Multi-stage build is correct: build stage runs npm run build, production stage copies the output and runs npm ci --omit=dev. The node build CMD matches the SvelteKit Node adapter's default output directory.
BuildKit not explicitly enabled in CI workflows. The backend Dockerfile uses RUN --mount=type=cache,target=/root/.m2 — a BuildKit-only feature. Both nightly.yml and release.yml run docker compose build without setting DOCKER_BUILDKIT=1. On most modern Docker installations (23+), BuildKit is the default, but on the self-hosted NAS runner (currently running Docker 24.x per the existing ci.yml), it may not be set. Without BuildKit, the --mount=type=cache directive is silently ignored — builds still succeed but Maven re-downloads all dependencies on every run, adding several minutes.
node:20-alpine is unpinned in the production stage. The issue adds a production stage using FROM node:20-alpine. The existing dev stage also uses node:20-alpine. In production, unpinned base images mean docker compose build on different dates can pull different Node patch versions. This is a reproducibility risk.
Runtime dependencies are correctly scoped. npm ci --omit=dev in the production stage correctly excludes dev tooling. No concern here — the SvelteKit Node adapter output (build/) is self-contained and its runtime deps are in dependencies, not devDependencies.
Backend Dockerfile already has a good multi-stage build (builder → JRE). Nothing new is proposed for the backend Dockerfile. The existing image is production-ready.
.env.staging and .env.production written as heredocs in CI. This works but leaves secret-containing files in the workspace on disk. If the runner reuses the workspace directory (which Gitea's self-hosted runners do by default), these files persist across workflow runs. They should be cleaned up after use.

Recommendations

Add DOCKER_BUILDKIT=1 to both workflow files as a top-level env: to guarantee BuildKit is active and the Maven cache mount works:
```
env:
  DOCKER_BUILDKIT: "1"
```
Pin the production Node base image to a specific patch version, e.g. node:20.19.0-alpine3.21, to match what's already tested in CI and prevent silent runtime differences.

Add cleanup step after deploy in both workflows:

- name: Cleanup env file
  if: always()
  run: rm -f .env.staging   # or .env.production

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Observations - **Frontend Dockerfile production stage is clean.** Multi-stage build is correct: `build` stage runs `npm run build`, `production` stage copies the output and runs `npm ci --omit=dev`. The `node build` CMD matches the SvelteKit Node adapter's default output directory. - **BuildKit not explicitly enabled in CI workflows.** The backend `Dockerfile` uses `RUN --mount=type=cache,target=/root/.m2` — a BuildKit-only feature. Both `nightly.yml` and `release.yml` run `docker compose build` without setting `DOCKER_BUILDKIT=1`. On most modern Docker installations (23+), BuildKit is the default, but on the self-hosted NAS runner (currently running Docker 24.x per the existing `ci.yml`), it may not be set. Without BuildKit, the `--mount=type=cache` directive is silently ignored — builds still succeed but Maven re-downloads all dependencies on every run, adding several minutes. - **`node:20-alpine` is unpinned in the production stage.** The issue adds a `production` stage using `FROM node:20-alpine`. The existing dev stage also uses `node:20-alpine`. In production, unpinned base images mean `docker compose build` on different dates can pull different Node patch versions. This is a reproducibility risk. - **Runtime dependencies are correctly scoped.** `npm ci --omit=dev` in the production stage correctly excludes dev tooling. No concern here — the SvelteKit Node adapter output (`build/`) is self-contained and its runtime deps are in `dependencies`, not `devDependencies`. - **Backend Dockerfile already has a good multi-stage build (builder → JRE).** Nothing new is proposed for the backend Dockerfile. The existing image is production-ready. - **`.env.staging` and `.env.production` written as heredocs in CI.** This works but leaves secret-containing files in the workspace on disk. If the runner reuses the workspace directory (which Gitea's self-hosted runners do by default), these files persist across workflow runs. They should be cleaned up after use. ### Recommendations - **Add `DOCKER_BUILDKIT=1` to both workflow files** as a top-level `env:` to guarantee BuildKit is active and the Maven cache mount works: ```yaml env: DOCKER_BUILDKIT: "1" ``` - **Pin the production Node base image** to a specific patch version, e.g. `node:20.19.0-alpine3.21`, to match what's already tested in CI and prevent silent runtime differences. - **Add cleanup step** after deploy in both workflows: ```yaml - name: Cleanup env file if: always() run: rm -f .env.staging # or .env.production ```

marcel commented

2026-05-10 19:57:44 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Observations

minio/minio:latest in the prod compose. :latest is not a version — it's a pointer that moves. Two deploys a month apart can run different MinIO versions without any record of what changed. The dev compose also uses :latest, but that's acceptable for local iteration. Production needs a pinned tag. Check the MinIO release page and pin to the current stable, e.g. minio/minio:RELEASE.2025-02-28T09-55-16Z. Add Renovate to automate future bumps.
MinIO root credentials used as application S3 credentials. The prod compose sets S3_ACCESS_KEY: archiv and S3_SECRET_KEY: ${MINIO_PASSWORD} — the same account that is MINIO_ROOT_USER: archiv. The root account has full MinIO admin rights: creating and deleting buckets, managing users, resetting passwords. If the backend is compromised, an attacker has full MinIO admin access, not just read/write on the archive bucket. Create a dedicated service account: mc admin user add myminio archiv-app <strong-password> and attach a bucket-scoped policy.
No post-deploy smoke test. Both workflows end at docker compose up -d --remove-orphans. If Flyway finds a migration conflict, the backend container crash-loops silently. There is no step to verify the stack actually came up healthy. Add a health check step:
```
- name: Verify deployment
  run: |
    sleep 10
    docker compose -p archiv-<env> exec backend \
      wget -qO- http://localhost:8080/actuator/health | grep -q '"status":"UP"'
```
No backup strategy. The prod compose defines named volumes (postgres-data, minio-data) but the issue has no section on backup. For a family archive containing irreplaceable digitised documents, "named volume without backup = single point of failure." At minimum, a nightly pg_dump to Hetzner S3 and a MinIO mc mirror to a second location should be part of this milestone.
OCR service memory limit missing from prod compose. The dev compose sets mem_limit: 12g and memswap_limit: 12g for the OCR service (documented with a comment: "Surya OCR loads ~5GB of transformer models"). The prod compose omits this entirely. On a CX32 (8GB RAM), an unconstrained OCR service can consume all available memory and OOM-kill other services including PostgreSQL. Add mem_limit: 6g for CX32 or mem_limit: 12g for CX42.
OCR service has no healthcheck in prod compose. Dev compose has start_period: 120s because model loading takes 30–50 seconds. Without this in prod, the backend's depends_on: ocr-service: condition: service_healthy would need a healthcheck to be useful — but the prod compose omits the healthcheck definition entirely, so the condition silently falls back to service_started.
.env.staging / .env.production persist on disk. Gitea's self-hosted runner reuses the workspace directory between runs. The heredoc step writes a file containing POSTGRES_PASSWORD, MINIO_PASSWORD, and SMTP credentials. These are not cleaned up. Either pipe to docker compose --env-file /dev/stdin, or add rm -f .env.* in an always() cleanup step.
No observability stack. docs/infrastructure/production-compose.md includes Prometheus, Grafana, Loki, and Alertmanager. This issue deploys a production environment with no metrics, no log aggregation, and no alerting. Operations will be blind. This may be intentional for a first-deploy milestone, but it should be a named gap and a follow-up issue.
Standalone vs overlay. The existing docs describe an overlay pattern. The standalone approach doubles the service definitions between dev and prod compose. If a service-level change (e.g. new env var, new volume) is made in docker-compose.yml, the prod compose won't inherit it and will silently drift. With the overlay approach, common config lives in one place. Recommend the overlay pattern — or if standalone is intentional, extract shared sections into a docker-compose.base.yml.

Recommendations

Pin MinIO image tag, add Renovate config for the prod compose.
Create a MinIO service account scoped to the archive bucket. Document the setup steps in docs/DEPLOYMENT.md.
Add a smoke test step to both workflows.
Add OCR memory limits appropriate to the target VPS tier (document the CX32 vs CX42 difference).
Add OCR healthcheck to prod compose (copy from dev compose).
Add env file cleanup in if: always() after deploy.
Create a follow-up issue for the observability stack (Prometheus + Grafana + Loki).
Resolve the overlay vs standalone design decision with Markus before coding starts.

Open Decisions

Standalone docker-compose.prod.yml vs overlay pattern. Standalone is simpler to reason about for a first deploy; overlay avoids drift between dev and prod service definitions. The cost of standalone: any new env var added to a service in docker-compose.yml must also be manually added to docker-compose.prod.yml. The existing docs say overlay. (Raised by: Tobias)

## 🔧 Tobias Wendt — DevOps & Platform Engineer ### Observations - **`minio/minio:latest` in the prod compose.** `:latest` is not a version — it's a pointer that moves. Two deploys a month apart can run different MinIO versions without any record of what changed. The dev compose also uses `:latest`, but that's acceptable for local iteration. Production needs a pinned tag. Check the MinIO release page and pin to the current stable, e.g. `minio/minio:RELEASE.2025-02-28T09-55-16Z`. Add Renovate to automate future bumps. - **MinIO root credentials used as application S3 credentials.** The prod compose sets `S3_ACCESS_KEY: archiv` and `S3_SECRET_KEY: ${MINIO_PASSWORD}` — the same account that is `MINIO_ROOT_USER: archiv`. The root account has full MinIO admin rights: creating and deleting buckets, managing users, resetting passwords. If the backend is compromised, an attacker has full MinIO admin access, not just read/write on the archive bucket. Create a dedicated service account: `mc admin user add myminio archiv-app <strong-password>` and attach a bucket-scoped policy. - **No post-deploy smoke test.** Both workflows end at `docker compose up -d --remove-orphans`. If Flyway finds a migration conflict, the backend container crash-loops silently. There is no step to verify the stack actually came up healthy. Add a health check step: ```yaml - name: Verify deployment run: | sleep 10 docker compose -p archiv-<env> exec backend \ wget -qO- http://localhost:8080/actuator/health | grep -q '"status":"UP"' ``` - **No backup strategy.** The prod compose defines named volumes (`postgres-data`, `minio-data`) but the issue has no section on backup. For a family archive containing irreplaceable digitised documents, "named volume without backup = single point of failure." At minimum, a nightly `pg_dump` to Hetzner S3 and a MinIO `mc mirror` to a second location should be part of this milestone. - **OCR service memory limit missing from prod compose.** The dev compose sets `mem_limit: 12g` and `memswap_limit: 12g` for the OCR service (documented with a comment: "Surya OCR loads ~5GB of transformer models"). The prod compose omits this entirely. On a CX32 (8GB RAM), an unconstrained OCR service can consume all available memory and OOM-kill other services including PostgreSQL. Add `mem_limit: 6g` for CX32 or `mem_limit: 12g` for CX42. - **OCR service has no healthcheck in prod compose.** Dev compose has `start_period: 120s` because model loading takes 30–50 seconds. Without this in prod, the backend's `depends_on: ocr-service: condition: service_healthy` would need a healthcheck to be useful — but the prod compose omits the healthcheck definition entirely, so the condition silently falls back to `service_started`. - **`.env.staging` / `.env.production` persist on disk.** Gitea's self-hosted runner reuses the workspace directory between runs. The heredoc step writes a file containing `POSTGRES_PASSWORD`, `MINIO_PASSWORD`, and SMTP credentials. These are not cleaned up. Either pipe to `docker compose --env-file /dev/stdin`, or add `rm -f .env.*` in an `always()` cleanup step. - **No observability stack.** `docs/infrastructure/production-compose.md` includes Prometheus, Grafana, Loki, and Alertmanager. This issue deploys a production environment with no metrics, no log aggregation, and no alerting. Operations will be blind. This may be intentional for a first-deploy milestone, but it should be a named gap and a follow-up issue. - **Standalone vs overlay.** The existing docs describe an overlay pattern. The standalone approach doubles the service definitions between dev and prod compose. If a service-level change (e.g. new env var, new volume) is made in `docker-compose.yml`, the prod compose won't inherit it and will silently drift. With the overlay approach, common config lives in one place. Recommend the overlay pattern — or if standalone is intentional, extract shared sections into a `docker-compose.base.yml`. ### Recommendations 1. Pin MinIO image tag, add Renovate config for the prod compose. 2. Create a MinIO service account scoped to the archive bucket. Document the setup steps in `docs/DEPLOYMENT.md`. 3. Add a smoke test step to both workflows. 4. Add OCR memory limits appropriate to the target VPS tier (document the CX32 vs CX42 difference). 5. Add OCR healthcheck to prod compose (copy from dev compose). 6. Add env file cleanup in `if: always()` after deploy. 7. Create a follow-up issue for the observability stack (Prometheus + Grafana + Loki). 8. Resolve the overlay vs standalone design decision with Markus before coding starts. ### Open Decisions - **Standalone `docker-compose.prod.yml` vs overlay pattern.** Standalone is simpler to reason about for a first deploy; overlay avoids drift between dev and prod service definitions. The cost of standalone: any new env var added to a service in `docker-compose.yml` must also be manually added to `docker-compose.prod.yml`. The existing docs say overlay. _(Raised by: Tobias)_

marcel commented

2026-05-10 19:58:11 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

MEDIUM: Actuator not blocked at Caddy. The proposed Caddyfile uses a catch-all handle { reverse_proxy localhost:8080 }. This routes all paths — including /actuator/* — to the backend. I checked application.yaml: Spring Boot's management endpoints are not explicitly configured, which means only /actuator/health is exposed by default in Spring Boot 3+. This is safe today, but it is one misplaced config line away from exposing /actuator/env (which dumps all environment variables including POSTGRES_PASSWORD and MINIO_PASSWORD), /actuator/heapdump (full JVM heap with in-memory secrets), and /actuator/beans. The existing docs/DEPLOYMENT.md states: "Management port 8081 (Spring Actuator / Prometheus scrape) is internal only — the Caddy config blocks /actuator/* externally." This PR must implement that block. It is free defense:
```
archiv.raddatz.cloud {
    @actuator path /actuator/*
    respond @actuator 404
    handle /api/* {
        reverse_proxy localhost:8080
    }
    handle {
        reverse_proxy localhost:3000
    }
}
```
HIGH: MinIO root credentials as application credentials. S3_ACCESS_KEY: archiv + S3_SECRET_KEY: ${MINIO_PASSWORD} is the MinIO root account. The root account can: delete all buckets and their contents, create new users, change the root password, and access the MinIO console. A backend RCE or SSRF vulnerability would give an attacker complete control of the object store. Fix: create a MinIO service account with s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket permissions on arn:aws:s3:::familienarchiv/* only.
LOW: Secrets written to disk in CI. The Write staging env / Write production env steps write database passwords, MinIO passwords, SMTP credentials, and the OCR training token to .env.staging / .env.production files on the runner's filesystem. Gitea's self-hosted runner (running on the NAS) reuses the workspace directory between runs, so these files persist. Anyone with shell access to the runner can read them. Mitigation options: (A) pipe env directly via stdin (docker compose ... --env-file /dev/stdin <<< "$VARS"), or (B) add a cleanup step with if: always() that removes the file after deploy.
LOW: Shared SMTP credentials across staging and production. Both environments use MAIL_HOST, MAIL_USERNAME, and MAIL_PASSWORD from the same secret set. A staging misconfiguration — or a bug in a new mail flow being tested on staging — would send real emails from the production SMTP account to real addresses. Consider using a separate staging mail account (e.g. Mailpit exposed externally, or a dedicated SMTP credential) so staging email is sandboxed.

Missing security headers in Caddyfile. The proposed Caddyfile has no security headers. These are free defense:

header {
    Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    X-Content-Type-Options "nosniff"
    X-Frame-Options "DENY"
    Referrer-Policy "strict-origin-when-cross-origin"
    -Server
}

HSTS enforces HTTPS permanently. X-Frame-Options: DENY prevents clickjacking of the archive UI. -Server hides the Caddy version.

No rate limiting on auth endpoints. This is not introduced by this issue, but the first production deployment is the right moment to add it. /api/auth/login and /api/auth/forgot-password have no rate limiting at the Caddy layer. Caddy's rate_limit directive (or the community plugin) can cap these to 5 requests/minute per IP.

Recommendations

Add /actuator/* block to both vhosts in the Caddyfile — this is a blocker before production goes live.
Create MinIO service account for the backend before first deploy, update docs/DEPLOYMENT.md with the setup steps, and rotate the root password to something different from the service account password.
Add env file cleanup (rm -f .env.*) in if: always() in both workflow files.
Add security headers to the Caddyfile — these take five lines and protect all users.
Consider a staging-specific SMTP account or re-point staging to Mailpit exposed on a private port.

Open Decisions

Rate limiting at Caddy vs application layer. Caddy's community rate_limit plugin adds one install step; Spring can do it via bucket4j but requires code changes. For a first production deploy, Caddy-layer limiting is simpler. Decision: does Marcel want rate limiting now or as a follow-up? (Raised by: Nora)

## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Observations - **MEDIUM: Actuator not blocked at Caddy.** The proposed Caddyfile uses a catch-all `handle { reverse_proxy localhost:8080 }`. This routes all paths — including `/actuator/*` — to the backend. I checked `application.yaml`: Spring Boot's management endpoints are _not_ explicitly configured, which means only `/actuator/health` is exposed by default in Spring Boot 3+. This is safe today, but it is one misplaced config line away from exposing `/actuator/env` (which dumps all environment variables including `POSTGRES_PASSWORD` and `MINIO_PASSWORD`), `/actuator/heapdump` (full JVM heap with in-memory secrets), and `/actuator/beans`. The existing `docs/DEPLOYMENT.md` states: "Management port 8081 (Spring Actuator / Prometheus scrape) is internal only — the Caddy config blocks `/actuator/*` externally." This PR must implement that block. It is free defense: ```caddyfile archiv.raddatz.cloud { @actuator path /actuator/* respond @actuator 404 handle /api/* { reverse_proxy localhost:8080 } handle { reverse_proxy localhost:3000 } } ``` - **HIGH: MinIO root credentials as application credentials.** `S3_ACCESS_KEY: archiv` + `S3_SECRET_KEY: ${MINIO_PASSWORD}` is the MinIO root account. The root account can: delete all buckets and their contents, create new users, change the root password, and access the MinIO console. A backend RCE or SSRF vulnerability would give an attacker complete control of the object store. Fix: create a MinIO service account with `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` permissions on `arn:aws:s3:::familienarchiv/*` only. - **LOW: Secrets written to disk in CI.** The `Write staging env` / `Write production env` steps write database passwords, MinIO passwords, SMTP credentials, and the OCR training token to `.env.staging` / `.env.production` files on the runner's filesystem. Gitea's self-hosted runner (running on the NAS) reuses the workspace directory between runs, so these files persist. Anyone with shell access to the runner can read them. Mitigation options: (A) pipe env directly via stdin (`docker compose ... --env-file /dev/stdin <<< "$VARS"`), or (B) add a cleanup step with `if: always()` that removes the file after deploy. - **LOW: Shared SMTP credentials across staging and production.** Both environments use `MAIL_HOST`, `MAIL_USERNAME`, and `MAIL_PASSWORD` from the same secret set. A staging misconfiguration — or a bug in a new mail flow being tested on staging — would send real emails from the production SMTP account to real addresses. Consider using a separate staging mail account (e.g. Mailpit exposed externally, or a dedicated SMTP credential) so staging email is sandboxed. - **Missing security headers in Caddyfile.** The proposed Caddyfile has no security headers. These are free defense: ```caddyfile header { Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" X-Content-Type-Options "nosniff" X-Frame-Options "DENY" Referrer-Policy "strict-origin-when-cross-origin" -Server } ``` HSTS enforces HTTPS permanently. `X-Frame-Options: DENY` prevents clickjacking of the archive UI. `-Server` hides the Caddy version. - **No rate limiting on auth endpoints.** This is not introduced by this issue, but the first production deployment is the right moment to add it. `/api/auth/login` and `/api/auth/forgot-password` have no rate limiting at the Caddy layer. Caddy's `rate_limit` directive (or the community plugin) can cap these to 5 requests/minute per IP. ### Recommendations 1. **Add `/actuator/*` block to both vhosts in the Caddyfile** — this is a blocker before production goes live. 2. **Create MinIO service account for the backend** before first deploy, update `docs/DEPLOYMENT.md` with the setup steps, and rotate the root password to something different from the service account password. 3. **Add env file cleanup** (`rm -f .env.*`) in `if: always()` in both workflow files. 4. **Add security headers** to the Caddyfile — these take five lines and protect all users. 5. Consider a staging-specific SMTP account or re-point staging to Mailpit exposed on a private port. ### Open Decisions - **Rate limiting at Caddy vs application layer.** Caddy's community `rate_limit` plugin adds one install step; Spring can do it via bucket4j but requires code changes. For a first production deploy, Caddy-layer limiting is simpler. Decision: does Marcel want rate limiting now or as a follow-up? _(Raised by: Nora)_

marcel commented

2026-05-10 19:58:31 +02:00

🧪 Sara Holt — QA Engineer

Observations

Acceptance criteria are concrete and testable. All 8 criteria have clear pass/fail conditions. Good.
No automated post-deploy verification in workflows. Both nightly.yml and release.yml end with docker compose up -d --remove-orphans. If the deployment fails silently — Flyway migration conflict, missing env var, backend crash-loop — the workflow exits 0. The acceptance criterion "Caddy routes correctly with TLS" requires someone to manually open a browser. Automated verification should be part of the workflow, not the acceptance criterion.
No rollback procedure defined. The acceptance criteria don't include rollback. If v1.0.0 ships with a broken Flyway migration that crash-loops the backend, the recovery path is undefined. At minimum, docs/DEPLOYMENT.md should document: "To roll back: TAG=<previous-tag> docker compose -f docker-compose.prod.yml -p archiv-production up -d." Flyway rollbacks are harder and should also be addressed.
create-buckets service is correctly idempotent (--ignore-existing). It will not fail on re-deploys. No concern here.
Staging nightly criterion requires manual verification. "nightly.yml workflow deploys to staging on schedule" can only be verified by waiting overnight. Suggest testing it on first implementation via workflow_dispatch (which is already in the YAML — good), and noting this in the acceptance criteria.
The OCR service in the prod compose has no healthcheck, meaning backend's depends_on: ocr-service: condition: service_healthy will silently downgrade to service_started. This means the backend may receive its first OCR request before models are loaded (30–120 seconds), producing a 503 from the OCR service. This is the same issue Tobias raised from an ops angle — here it's a reliability concern.

No test coverage of the deployment artifacts. The existing CI workflow runs unit and integration tests against source code. It does not verify that docker compose build succeeds, or that the resulting images start and serve correctly. Consider adding a CI job that builds the production images and smoke-tests them:

- name: Build production images
  run: TAG=ci docker compose -f docker-compose.prod.yml build
- name: Smoke-test production images
  run: |
    TAG=ci docker compose -f docker-compose.prod.yml -p archiv-ci up -d
    sleep 20
    docker compose -p archiv-ci exec backend wget -qO- http://localhost:8080/actuator/health
    docker compose -p archiv-ci down

Recommendations

Add a health-verification step to both workflow files — at minimum, check actuator/health returns UP after deploy. Fail the workflow if it doesn't.
Document a rollback procedure in docs/DEPLOYMENT.md before the first production deploy.
Add OCR healthcheck to prod compose (copy from dev compose, adjust mem_limit for VPS tier). Otherwise the service_healthy condition silently falls back.
Consider a CI image smoke-test job in ci.yml that builds production images and starts them, so image build failures are caught on every PR rather than at deploy time.

## 🧪 Sara Holt — QA Engineer ### Observations - **Acceptance criteria are concrete and testable.** All 8 criteria have clear pass/fail conditions. Good. - **No automated post-deploy verification in workflows.** Both `nightly.yml` and `release.yml` end with `docker compose up -d --remove-orphans`. If the deployment fails silently — Flyway migration conflict, missing env var, backend crash-loop — the workflow exits 0. The acceptance criterion "Caddy routes correctly with TLS" requires someone to manually open a browser. Automated verification should be part of the workflow, not the acceptance criterion. - **No rollback procedure defined.** The acceptance criteria don't include rollback. If `v1.0.0` ships with a broken Flyway migration that crash-loops the backend, the recovery path is undefined. At minimum, `docs/DEPLOYMENT.md` should document: "To roll back: `TAG=<previous-tag> docker compose -f docker-compose.prod.yml -p archiv-production up -d`." Flyway rollbacks are harder and should also be addressed. - **`create-buckets` service is correctly idempotent** (`--ignore-existing`). It will not fail on re-deploys. No concern here. - **Staging nightly criterion requires manual verification.** "nightly.yml workflow deploys to staging on schedule" can only be verified by waiting overnight. Suggest testing it on first implementation via `workflow_dispatch` (which is already in the YAML — good), and noting this in the acceptance criteria. - **The OCR service in the prod compose has no healthcheck**, meaning `backend`'s `depends_on: ocr-service: condition: service_healthy` will silently downgrade to `service_started`. This means the backend may receive its first OCR request before models are loaded (30–120 seconds), producing a 503 from the OCR service. This is the same issue Tobias raised from an ops angle — here it's a reliability concern. - **No test coverage of the deployment artifacts.** The existing CI workflow runs unit and integration tests against source code. It does not verify that `docker compose build` succeeds, or that the resulting images start and serve correctly. Consider adding a CI job that builds the production images and smoke-tests them: ```yaml - name: Build production images run: TAG=ci docker compose -f docker-compose.prod.yml build - name: Smoke-test production images run: | TAG=ci docker compose -f docker-compose.prod.yml -p archiv-ci up -d sleep 20 docker compose -p archiv-ci exec backend wget -qO- http://localhost:8080/actuator/health docker compose -p archiv-ci down ``` ### Recommendations 1. **Add a health-verification step to both workflow files** — at minimum, check `actuator/health` returns `UP` after deploy. Fail the workflow if it doesn't. 2. **Document a rollback procedure** in `docs/DEPLOYMENT.md` before the first production deploy. 3. **Add OCR healthcheck** to prod compose (copy from dev compose, adjust `mem_limit` for VPS tier). Otherwise the `service_healthy` condition silently falls back. 4. **Consider a CI image smoke-test job** in `ci.yml` that builds production images and starts them, so image build failures are caught on every PR rather than at deploy time.

marcel commented

2026-05-10 19:58:36 +02:00

🎨 Leonie Voss — UX Design & Accessibility

No UX concerns from this issue — it's pure infrastructure. From a user perspective, this work is invisible and positive: Caddy's automatic TLS provisioning means users will always connect over HTTPS, which protects their session cookies and authentication credentials in transit. The separate staging environment also means new features can be user-tested before they reach the production archive.

One small note: once staging is live, consider whether the staging URL (staging.raddatz.cloud) should include a visible banner or <meta name="robots" content="noindex"> so family members who accidentally land on it aren't confused by staging data or half-finished features. This is a cosmetic concern for after the infrastructure is up.

## 🎨 Leonie Voss — UX Design & Accessibility No UX concerns from this issue — it's pure infrastructure. From a user perspective, this work is invisible and positive: Caddy's automatic TLS provisioning means users will always connect over HTTPS, which protects their session cookies and authentication credentials in transit. The separate staging environment also means new features can be user-tested before they reach the production archive. One small note: once staging is live, consider whether the staging URL (`staging.raddatz.cloud`) should include a visible banner or `<meta name="robots" content="noindex">` so family members who accidentally land on it aren't confused by staging data or half-finished features. This is a cosmetic concern for after the infrastructure is up.

marcel commented

2026-05-10 19:58:58 +02:00

🗳️ Decision Queue — Action Required

3 decisions need your input before implementation starts.

Architecture

MinIO in production vs Hetzner Object Storage. docs/infrastructure/production-compose.md already made this choice: Hetzner OBS for production, MinIO disabled via profiles: ["dev"]. This issue reverses that choice by keeping MinIO in the prod compose. Options: (A) MinIO — self-contained, all data on VPS, simpler networking, but ~500MB RAM overhead and you own the backup strategy for object storage. (B) Hetzner OBS — no MinIO to operate, built-in geo-replication, S3-compatible, ~5 EUR/month, data not on VPS. The decision determines whether the prod compose is standalone or an overlay. (Raised by: Markus)
Standalone docker-compose.prod.yml vs overlay pattern. The standalone approach (proposed in this issue) is easier to read and reason about on first deploy. The overlay pattern (docker compose -f docker-compose.yml -f docker-compose.prod.yml) avoids drift — any new env var or service added to the dev compose is automatically present in prod. Cost of standalone: every future change to a service definition must be applied to both files manually. The existing docs say overlay; this issue says standalone. Pick one and update the docs. (Raised by: Tobias, intersects with Markus's MinIO decision — they are linked)

Security

Rate limiting on auth endpoints: now vs follow-up. /api/auth/login and /api/auth/forgot-password are unprotected against brute-force in the proposed Caddyfile. Adding Caddy-layer rate limiting requires the community rate_limit plugin (one install step, ~10 lines of config). Option A: add it now as part of this issue, while the Caddyfile is being written. Option B: create a follow-up security issue and ship rate limiting separately. For a family archive that isn't publicly advertised, risk is low but not zero. (Raised by: Nora)

## 🗳️ Decision Queue — Action Required _3 decisions need your input before implementation starts._ ### Architecture - **MinIO in production vs Hetzner Object Storage.** `docs/infrastructure/production-compose.md` already made this choice: Hetzner OBS for production, MinIO disabled via `profiles: ["dev"]`. This issue reverses that choice by keeping MinIO in the prod compose. Options: (A) **MinIO** — self-contained, all data on VPS, simpler networking, but ~500MB RAM overhead and you own the backup strategy for object storage. (B) **Hetzner OBS** — no MinIO to operate, built-in geo-replication, S3-compatible, ~5 EUR/month, data not on VPS. The decision determines whether the prod compose is standalone or an overlay. _(Raised by: Markus)_ - **Standalone `docker-compose.prod.yml` vs overlay pattern.** The standalone approach (proposed in this issue) is easier to read and reason about on first deploy. The overlay pattern (`docker compose -f docker-compose.yml -f docker-compose.prod.yml`) avoids drift — any new env var or service added to the dev compose is automatically present in prod. Cost of standalone: every future change to a service definition must be applied to both files manually. The existing docs say overlay; this issue says standalone. Pick one and update the docs. _(Raised by: Tobias, intersects with Markus's MinIO decision — they are linked)_ ### Security - **Rate limiting on auth endpoints: now vs follow-up.** `/api/auth/login` and `/api/auth/forgot-password` are unprotected against brute-force in the proposed Caddyfile. Adding Caddy-layer rate limiting requires the community `rate_limit` plugin (one install step, ~10 lines of config). Option A: add it now as part of this issue, while the Caddyfile is being written. Option B: create a follow-up security issue and ship rate limiting separately. For a family archive that isn't publicly advertised, risk is low but not zero. _(Raised by: Nora)_

marcel referenced this issue

2026-05-10 20:47:48 +02:00

devops: production observability stack — Prometheus, Loki, Grafana, Alertmanager #498

marcel commented

2026-05-10 20:49:43 +02:00

🔧 Tobias Wendt — Ops Discussion Summary

Worked through all 8 open items from my review comment. All resolved.

Resolved decisions

MinIO vs Hetzner OBS — MinIO stays in production. Start with 13GB on-VPS, migrate to Hetzner OBS later. Switch is trivial: update three env vars + mc mirror. Migration path to be documented in docs/DEPLOYMENT.md.
Standalone vs overlay — standalone docker-compose.prod.yml is the chosen pattern. The overlay approach was designed around removing MinIO; since MinIO stays, standalone is cleaner. Update docs/infrastructure/production-compose.md to retire the overlay pattern.
MinIO root credentials — create a dedicated MinIO service account scoped to the familienarchiv bucket during server bootstrap. Add the mc admin user add + policy steps to the bootstrap checklist in docs/DEPLOYMENT.md. Use a separate MINIO_APP_PASSWORD secret; MINIO_PASSWORD stays root-only.
Post-deploy verification — replace docker compose up -d with docker compose up -d --wait in both nightly.yml and release.yml. The --wait flag blocks until all healthchecks report healthy, making the workflow fail loudly on a bad deploy. No separate smoke-test step needed.
Backup strategy — add Tailscale installation to the server one-time setup section of this issue (it's needed regardless). Backup implementation (nightly pg_dump + MinIO volume backup + rsync over Tailscale to heim-nas) goes in a separate follow-up issue. VPS uses Tailscale to reach heim-nas; Hetzner S3 as the eventual destination.
OCR memory + healthcheck — copy the healthcheck from the dev compose verbatim (start_period: 120s, interval: 10s, retries: 12) into the prod compose. Set mem_limit: 12g for consistency with dev. Not a safety concern on a 64GB host, but keeps environments aligned.
Observability — deliberate gap for the initial go-live. Follow-up issue #498 created with full spec: Prometheus + Loki + Grafana + Alertmanager, all internal-only, provisioned via config files.
Env file cleanup — add rm -f .env.staging / rm -f .env.production with if: always() after the deploy step in both workflow files.

Overall: the issue is well-scoped and the design is sound. The --wait flag and MinIO service account are the two things I'd consider blocking before first production deploy. Everything else is hardening.

## 🔧 Tobias Wendt — Ops Discussion Summary Worked through all 8 open items from my review comment. All resolved. ### Resolved decisions - **MinIO vs Hetzner OBS** — MinIO stays in production. Start with 13GB on-VPS, migrate to Hetzner OBS later. Switch is trivial: update three env vars + `mc mirror`. Migration path to be documented in `docs/DEPLOYMENT.md`. - **Standalone vs overlay** — standalone `docker-compose.prod.yml` is the chosen pattern. The overlay approach was designed around removing MinIO; since MinIO stays, standalone is cleaner. Update `docs/infrastructure/production-compose.md` to retire the overlay pattern. - **MinIO root credentials** — create a dedicated MinIO service account scoped to the `familienarchiv` bucket during server bootstrap. Add the `mc admin user add` + policy steps to the bootstrap checklist in `docs/DEPLOYMENT.md`. Use a separate `MINIO_APP_PASSWORD` secret; `MINIO_PASSWORD` stays root-only. - **Post-deploy verification** — replace `docker compose up -d` with `docker compose up -d --wait` in both `nightly.yml` and `release.yml`. The `--wait` flag blocks until all healthchecks report healthy, making the workflow fail loudly on a bad deploy. No separate smoke-test step needed. - **Backup strategy** — add Tailscale installation to the server one-time setup section of this issue (it's needed regardless). Backup implementation (nightly `pg_dump` + MinIO volume backup + rsync over Tailscale to `heim-nas`) goes in a separate follow-up issue. VPS uses Tailscale to reach `heim-nas`; Hetzner S3 as the eventual destination. - **OCR memory + healthcheck** — copy the healthcheck from the dev compose verbatim (`start_period: 120s`, `interval: 10s`, `retries: 12`) into the prod compose. Set `mem_limit: 12g` for consistency with dev. Not a safety concern on a 64GB host, but keeps environments aligned. - **Observability** — deliberate gap for the initial go-live. Follow-up issue #498 created with full spec: Prometheus + Loki + Grafana + Alertmanager, all internal-only, provisioned via config files. - **Env file cleanup** — add `rm -f .env.staging` / `rm -f .env.production` with `if: always()` after the deploy step in both workflow files. Overall: the issue is well-scoped and the design is sound. The `--wait` flag and MinIO service account are the two things I'd consider blocking before first production deploy. Everything else is hardening.

marcel commented

2026-05-10 21:10:09 +02:00

🔒 Nora "NullX" Steiner — Security Discussion Summary

Worked through all open items from my review comment, plus three additional findings from a code audit of SecurityConfig.java, UserDataInitializer.java, and application.yaml.

Resolved decisions

Actuator block at Caddy — add @actuator matcher with respond @actuator 404 to both vhosts (archiv.raddatz.cloud and staging.raddatz.cloud). Blocks /actuator/* regardless of what gets added to management.endpoints.web.exposure.include in future.

Security headers — add to both vhosts, with one correction (see X-Frame-Options below):

header {
    Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    X-Content-Type-Options "nosniff"
    Referrer-Policy "strict-origin-when-cross-origin"
    -Server
}

X-Frame-Options is intentionally excluded from Caddy — see finding below.

MinIO service account policy — least-privilege: s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket on arn:aws:s3:::familienarchiv and arn:aws:s3:::familienarchiv/* only. The arn:aws:s3::: prefix works on MinIO — it's part of their S3 compatibility layer, not Amazon-specific. Document the exact mc commands in the docs/DEPLOYMENT.md bootstrap checklist.
Staging SMTP isolation — staging points to a Mailpit container, not real SMTP. Add a mailpit service to docker-compose.prod.yml with profiles: [staging]. The nightly workflow starts it with --profile staging and sets MAIL_HOST=mailpit, MAIL_PORT=1025 — no real SMTP secrets used in staging at all.
Rate limiting — fail2ban jail on Caddy access log, watching for 401 responses on /api/auth/login. Thresholds: maxretry=10, findtime=10m, bantime=30m. Generous enough for a 60+ user who mistyped several times; stops bots instantly. Add to server one-time setup section alongside SSH hardening.

Additional findings from code audit

🔴 CRITICAL — Default admin password missing from prod compose

UserDataInitializer.java:37: admin account seeded with ${app.admin.password:admin123} on first startup. The default is admin123. Neither APP_ADMIN_EMAIL nor APP_ADMIN_PASSWORD appear in the prod compose environment: block or the Gitea secrets table — meaning first production deploy creates a full-admin account with a known password.

Fix:

Add PROD_APP_ADMIN_EMAIL and PROD_APP_ADMIN_PASSWORD to the Gitea secrets table in this issue
Add STAGING_APP_ADMIN_EMAIL and STAGING_APP_ADMIN_PASSWORD for staging
Pass both as env vars to the backend service in docker-compose.prod.yml
Note: the initializer is idempotent — it only creates the account if the email doesn't exist. First deploy sets the password permanently. Changing the secret after go-live has no effect unless the account is deleted and recreated.

🟡 MEDIUM — X-Frame-Options: DENY in Caddy conflicts with Spring Security

SecurityConfig.java:68-70 explicitly sets frameOptions.sameOrigin() — intentional, for PDF preview iframes. Adding X-Frame-Options: DENY in Caddy creates two conflicting response headers. Spring Security's SAMEORIGIN is the correct value for this app. Solution: omit X-Frame-Options from the Caddy header block entirely — Spring Security handles it correctly.

🟡 MEDIUM — Missing server.forward-headers-strategy in application.yaml

Not present in application.yaml. Behind Caddy, Spring Boot doesn't know it's serving HTTPS: it generates HTTP redirect URLs and won't set the Secure flag on Spring Session cookies. Add to application.yaml as a code change in this PR:

server:
  forward-headers-strategy: native

This tells Spring Boot to trust X-Forwarded-Proto: https from Caddy.

✅ CORS — Not a concern. The SvelteKit SSR architecture makes all browser requests same-origin via Caddy. No @CrossOrigin is needed and none exists.

✅ E2E profile in production — The e2e profile resets the admin password and creates test users on every startup. The prod compose correctly omits SPRING_PROFILES_ACTIVE: dev,e2e. No action needed.

Overall: the infrastructure design is solid. The three findings above — especially the default admin password — are blockers before any real user accesses the production instance.

## 🔒 Nora "NullX" Steiner — Security Discussion Summary Worked through all open items from my review comment, plus three additional findings from a code audit of `SecurityConfig.java`, `UserDataInitializer.java`, and `application.yaml`. ### Resolved decisions - **Actuator block at Caddy** — add `@actuator` matcher with `respond @actuator 404` to both vhosts (`archiv.raddatz.cloud` and `staging.raddatz.cloud`). Blocks `/actuator/*` regardless of what gets added to `management.endpoints.web.exposure.include` in future. - **Security headers** — add to both vhosts, with one correction (see X-Frame-Options below): ```caddyfile header { Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" X-Content-Type-Options "nosniff" Referrer-Policy "strict-origin-when-cross-origin" -Server } ``` `X-Frame-Options` is intentionally **excluded** from Caddy — see finding below. - **MinIO service account policy** — least-privilege: `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` on `arn:aws:s3:::familienarchiv` and `arn:aws:s3:::familienarchiv/*` only. The `arn:aws:s3:::` prefix works on MinIO — it's part of their S3 compatibility layer, not Amazon-specific. Document the exact `mc` commands in the `docs/DEPLOYMENT.md` bootstrap checklist. - **Staging SMTP isolation** — staging points to a Mailpit container, not real SMTP. Add a `mailpit` service to `docker-compose.prod.yml` with `profiles: [staging]`. The nightly workflow starts it with `--profile staging` and sets `MAIL_HOST=mailpit`, `MAIL_PORT=1025` — no real SMTP secrets used in staging at all. - **Rate limiting** — `fail2ban` jail on Caddy access log, watching for 401 responses on `/api/auth/login`. Thresholds: `maxretry=10`, `findtime=10m`, `bantime=30m`. Generous enough for a 60+ user who mistyped several times; stops bots instantly. Add to server one-time setup section alongside SSH hardening. --- ### Additional findings from code audit **🔴 CRITICAL — Default admin password missing from prod compose** `UserDataInitializer.java:37`: admin account seeded with `${app.admin.password:admin123}` on first startup. The default is `admin123`. Neither `APP_ADMIN_EMAIL` nor `APP_ADMIN_PASSWORD` appear in the prod compose `environment:` block or the Gitea secrets table — meaning first production deploy creates a full-admin account with a known password. Fix: - Add `PROD_APP_ADMIN_EMAIL` and `PROD_APP_ADMIN_PASSWORD` to the Gitea secrets table in this issue - Add `STAGING_APP_ADMIN_EMAIL` and `STAGING_APP_ADMIN_PASSWORD` for staging - Pass both as env vars to the backend service in `docker-compose.prod.yml` - Note: the initializer is idempotent — it only creates the account if the email doesn't exist. **First deploy sets the password permanently.** Changing the secret after go-live has no effect unless the account is deleted and recreated. **🟡 MEDIUM — `X-Frame-Options: DENY` in Caddy conflicts with Spring Security** `SecurityConfig.java:68-70` explicitly sets `frameOptions.sameOrigin()` — intentional, for PDF preview iframes. Adding `X-Frame-Options: DENY` in Caddy creates two conflicting response headers. Spring Security's `SAMEORIGIN` is the correct value for this app. Solution: omit `X-Frame-Options` from the Caddy header block entirely — Spring Security handles it correctly. **🟡 MEDIUM — Missing `server.forward-headers-strategy` in `application.yaml`** Not present in `application.yaml`. Behind Caddy, Spring Boot doesn't know it's serving HTTPS: it generates HTTP redirect URLs and won't set the `Secure` flag on Spring Session cookies. Add to `application.yaml` as a code change in this PR: ```yaml server: forward-headers-strategy: native ``` This tells Spring Boot to trust `X-Forwarded-Proto: https` from Caddy. **✅ CORS** — Not a concern. The SvelteKit SSR architecture makes all browser requests same-origin via Caddy. No `@CrossOrigin` is needed and none exists. **✅ E2E profile in production** — The `e2e` profile resets the admin password and creates test users on every startup. The prod compose correctly omits `SPRING_PROFILES_ACTIVE: dev,e2e`. No action needed. --- Overall: the infrastructure design is solid. The three findings above — especially the default admin password — are blockers before any real user accesses the production instance.