Commit Graph

18 Commits

Author SHA1 Message Date
Marcel
4c8a23ff14 devops(caddy): add Grafana and GlitchTip vhosts
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m33s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Successful in 7m10s
CI / fail2ban Regex (pull_request) Successful in 1m55s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s
grafana.archiv.raddatz.cloud → 127.0.0.1:3003 (with security headers)
glitchtip.archiv.raddatz.cloud → 127.0.0.1:3002 (no security headers —
  GlitchTip manages its own; the Sentry SDK also POSTs here)

Requires A records for both subdomains pointing at the server before
the next `systemctl reload caddy`.
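In Caddyfile terms, the two vhosts presumably look something like this sketch (only the hostnames, upstream ports, and header behaviour are stated by the commit; the directive details are assumptions):

```
grafana.archiv.raddatz.cloud {
    import security_headers
    reverse_proxy 127.0.0.1:3003
}

glitchtip.archiv.raddatz.cloud {
    # no header snippet: GlitchTip manages its own headers, and the
    # Sentry SDKs POST events to this origin
    reverse_proxy 127.0.0.1:3002
}
```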

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:27:07 +02:00
Marcel
f3f8345b03 feat(observability): add Grafana with provisioned datasources and dashboards
Add obs-grafana service (grafana/grafana-oss:11.6.1) to docker-compose.observability.yml.
Datasources (Prometheus, Loki, Tempo) are auto-provisioned via
infra/observability/grafana/provisioning/datasources/datasources.yml with
cross-datasource linking (Loki traceId → Tempo, Tempo → Loki, service map via Prometheus).
Three dashboards are pre-loaded: Node Exporter Full (1860), Spring Boot Observability (17175),
Loki Logs (13639) — datasource template variables replaced with provisioned UIDs.
GRAFANA_ADMIN_PASSWORD added to .env.example.
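A provisioning file with the described cross-datasource links might look roughly like this (container hostnames, UIDs, and the matcher regex are assumptions; the three datasources and their link directions come from the commit):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://obs-prometheus:9090
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://obs-tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # Tempo -> Loki
      serviceMap:
        datasourceUid: prometheus  # service map via Prometheus
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://obs-loki:3100
    jsonData:
      derivedFields:               # Loki traceId -> Tempo
        - name: traceId
          matcherRegex: '"traceId":"(\w+)"'
          datasourceUid: tempo
          url: '$${__value.raw}'
```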

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 04:03:33 +02:00
Marcel
de08ffe989 devops(observability): add Tempo for distributed trace storage (OTLP receiver)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m32s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 56s
Closes #575
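A minimal Tempo config for this role might be sketched as follows (everything beyond "OTLP receiver" is an assumption):

```yaml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:        # OTLP over gRPC, default :4317
        http:        # OTLP over HTTP, default :4318
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/blocks
```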

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 03:01:22 +02:00
Marcel
c1406a32f1 devops(observability): fix C4 diagram and security comment, add Loki compactor block
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m33s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 56s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 02:25:34 +02:00
Marcel
22e1b25398 devops(observability): add Loki + Promtail for centralised container log aggregation
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m31s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
- Add obs-loki (grafana/loki:3.4.2) to docker-compose.observability.yml
  with healthcheck (wget /ready), expose-only port 3100, named volume loki_data
- Add obs-promtail (grafana/promtail:3.4.2) bridging archiv-net + obs-net,
  depends_on loki service_healthy, docker.sock:ro, promtail_positions volume
  for restart-safe position tracking
- Create infra/observability/loki/loki-config.yml: single-node TSDB schema v13,
  30-day retention, auth disabled (obs-net only), telemetry off
- Create infra/observability/promtail/promtail-config.yml: Docker SD scrape,
  container_name / compose_service / compose_project / logstream labels
- Update docs/DEPLOYMENT.md §4 with service table and Loki quick-check commands
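The Promtail side of this can be sketched roughly as follows (the push URL, refresh interval, and relabel regexes are assumptions; the label set is the one listed above):

```yaml
positions:
  filename: /positions/positions.yaml   # promtail_positions volume

clients:
  - url: http://obs-loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'                  # strip the leading slash
        target_label: container_name
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: compose_service
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: compose_project
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: logstream
```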

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 02:18:22 +02:00
Marcel
0c66f6298b devops(observability): fix Prometheus port binding, scrape port, and update DEPLOYMENT.md
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m35s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
- Fix spring-boot scrape target from backend:8080 to backend:8081 (actuator/management port)
- Restrict Prometheus host port binding to 127.0.0.1 to prevent unintended external exposure
- Add observability stack (Prometheus, Node Exporter, cAdvisor) to topology description
- Add PORT_PROMETHEUS env var to DEPLOYMENT.md reference table
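The two fixes amount to something like this (job name, metrics path, and env-var plumbing are assumptions; the port numbers are from the commit):

```yaml
# docker-compose.observability.yml (excerpt)
obs-prometheus:
  ports:
    - "127.0.0.1:${PORT_PROMETHEUS}:9090"   # loopback only

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: spring-boot
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['backend:8081']           # management port, not 8080
```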

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:52:28 +02:00
Marcel
0c9973fdff devops(observability): add Prometheus + Node Exporter + cAdvisor for host and container metrics
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m40s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:47:07 +02:00
Marcel
1d42be9882 devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m19s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m28s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 55s
Creates the skeleton observability stack (no running services yet) that all
subsequent Grafana LGTM + GlitchTip issues depend on:

- docker-compose.observability.yml: external archiv-net join, obs-net bridge,
  named volumes for all five services, placeholder comments for each service
  group (Metrics/Logs/Traces/Dashboards/Error Tracking), startup-order note
- infra/observability/{prometheus,loki,promtail,tempo,grafana/provisioning/{datasources,dashboards}}/.gitkeep
- .env.example: new # --- Observability --- section with PORT_GRAFANA,
  PORT_GLITCHTIP, PORT_PROMETHEUS, GLITCHTIP_DOMAIN, GLITCHTIP_SECRET_KEY
  (with generation hint), SENTRY_DSN, VITE_SENTRY_DSN

Verified: docker compose -f docker-compose.observability.yml config exits 0
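Structurally, the skeleton presumably looks something like this (volume names are assumptions; the networks and section comments follow the bullet list above):

```yaml
networks:
  archiv-net:
    external: true   # join the existing application network
  obs-net:
    driver: bridge   # private network for the observability stack

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:
  glitchtip_data:

# --- Metrics --- / --- Logs --- / --- Traces --- /
# --- Dashboards --- / --- Error Tracking ---
services: {}         # services land in the follow-up commits
```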

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:23:03 +02:00
Marcel
9686e304c2 fix(caddy): wrap actuator block in handle so it takes precedence over catch-all
Some checks failed
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
Closes #512.

The previous `(block_actuator)` snippet emitted `respond @actuator 404`
at the top level of each archive vhost. But each vhost also has a
catch-all `handle { reverse_proxy ... }` that matches /actuator/*
too. Caddy's `handle` blocks are mutually exclusive — once one matches,
the request never reaches a top-level `respond`. So /actuator/health
was being proxied to the backend, which 302s to /login.

Wrap the actuator response in its own `handle /actuator/*` block.
Caddy sorts `handle` blocks by path specificity, so /actuator/* wins
over the catch-all and the 404 is actually returned.

Verified with `caddy validate` against the caddy:2 image.

Also unblocks the nightly.yml smoke test's `/actuator/health → 404`
assertion, which has been failing since the first staging deploy.
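Schematically, the change inside each archive vhost (the upstream name is an assumption):

```
# Before: a top-level respond that the catch-all handle shadows
@actuator path /actuator/*
respond @actuator 404
handle {
    reverse_proxy backend:8080
}

# After: sibling handle blocks; Caddy orders them by path
# specificity, so /actuator/* is matched before the catch-all
handle /actuator/* {
    respond 404
}
handle {
    reverse_proxy backend:8080
}
```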

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 17:15:03 +02:00
Marcel
f8f0951bd5 fix(minio): bake bootstrap.sh into image instead of bind-mounting
Some checks failed
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (pull_request) Failing after 2m50s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m9s
CI / fail2ban Regex (pull_request) Failing after 12s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
Closes #506.

Under Docker-out-of-Docker (the production Gitea Actions runner), the
host daemon resolves the relative bind-mount path against the host
filesystem — not the runner container's /workspace. The script is not
there, so Docker creates an empty directory at /bootstrap.sh and the
entrypoint fails with `/bootstrap.sh: Is a directory`.

Bake the script into a tiny derived image (infra/minio/Dockerfile) so
there is no runtime path resolution. Works in DooD, regular Docker,
and CI.

Unblocks the staging / production deploy pipelines from #497 / #499
and turns the Compose Bucket Idempotency CI job green.

Verified locally:
- `docker compose ... config --quiet` parses
- `docker compose ... build create-buckets` builds the image
- bootstrap.sh exists as a +x file at /bootstrap.sh inside the image
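The derived image is likely just a few lines (the base tag is an assumption):

```dockerfile
# infra/minio/Dockerfile
FROM minio/mc:latest
COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh
ENTRYPOINT ["/bootstrap.sh"]
```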

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 15:32:36 +02:00
Marcel
e5363913ec fix(fail2ban): pin polling backend so jail actually reads Caddy access log
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m49s
CI / OCR Service Tests (push) Successful in 16s
CI / Backend Unit Tests (push) Successful in 4m8s
CI / fail2ban Regex (push) Successful in 37s
CI / Compose Bucket Idempotency (push) Failing after 53s
CI / Unit & Component Tests (pull_request) Failing after 2m46s
CI / OCR Service Tests (pull_request) Successful in 15s
CI / Backend Unit Tests (pull_request) Successful in 4m14s
CI / fail2ban Regex (pull_request) Successful in 37s
CI / Compose Bucket Idempotency (pull_request) Failing after 50s
Closes #503.

Debian's fail2ban package ships defaults-debian.conf with
`[DEFAULT] backend = systemd`. Without an explicit override, our
familienarchiv-auth jail inherits the systemd backend at runtime,
reads from journald, and never inspects /var/log/caddy/access.log.
A live login brute-force would not be banned.

Add `backend = polling` to the jail and a CI step that links the jail
into /etc/fail2ban/ and asserts `fail2ban-client -d` resolves it to
the polling backend, not the inherited systemd backend.
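Combined with the earlier jail commit, the jail presumably reads roughly like this (only the backend line and the ban parameters appear in this history; the rest is an assumption):

```ini
# infra/fail2ban/jail.d/familienarchiv.conf
[familienarchiv-auth]
enabled  = true
backend  = polling   # override defaults-debian.conf's systemd backend
filter   = familienarchiv-auth
logpath  = /var/log/caddy/access.log
maxretry = 10
findtime = 10m
bantime  = 30m
```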

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 14:59:40 +02:00
Marcel
09680557ef security(caddy): add Permissions-Policy header
Adds `Permissions-Policy: camera=(), microphone=(), geolocation=()` to
the shared (security_headers) snippet, so both archiv vhosts and the
git vhost deny browser APIs the app does not use. Reduces blast radius
of an XSS landing in a privileged origin.

The deploy smoke steps in nightly.yml and release.yml gain a matching
assertion against the canonical header value, so a future Caddyfile
edit that drops or loosens the header (e.g. `camera=(self)`) fails the
deploy instead of regressing silently.

`caddy validate` against caddy:2 passes; both workflow YAMLs parse.
Addresses @nora's round-2 suggestion on PR #499 — "lower-impact than
CSP but nearly free".
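In the shared snippet this is a one-line addition (the surrounding header values are assumptions based on the earlier Caddyfile commit; the Permissions-Policy value is the one stated above):

```
(security_headers) {
    header {
        Strict-Transport-Security "max-age=31536000"
        X-Content-Type-Options "nosniff"
        Referrer-Policy "strict-origin-when-cross-origin"
        Permissions-Policy "camera=(), microphone=(), geolocation=()"
        -Server
    }
}
```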

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 14:06:13 +02:00
Marcel
7e430998b8 security(fail2ban): widen jail to /forgot-password and rate-limit 429
The filter only watched /api/auth/login 401 — leaving the forgot-password
endpoint open to:

  - email enumeration (slow brute-force probing which addresses exist)
  - password-reset brute-force against accounts whose addresses leak

Widens the failregex to /api/auth/(login|forgot-password) and adds 429 to
the status alternation so a future in-app rate-limiter response is also
caught by the jail (defense in depth).

CI assertions extended to cover both new dimensions plus a negative case
on an unrelated 401 endpoint (/api/documents) — pins that the widening
did not over-match.
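The widened expression is presumably along these lines (a sketch only; the JSON field layout follows Caddy's default access-log format, and the exact regex is an assumption):

```ini
failregex = ^.*"remote_ip":"<HOST>".*"uri":"/api/auth/(login|forgot-password)".*"status":(401|429)
```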

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 13:10:08 +02:00
Marcel
91f70e652d security(minio): scope archiv-app to bucket-only IAM policy
Replaces MinIO's built-in `readwrite` policy (which grants s3:* on
arn:aws:s3:::* — every bucket present and future) with a bucket-scoped
custom policy `archiv-app-policy`:

  - s3:GetObject / s3:PutObject / s3:DeleteObject on familienarchiv/*
  - s3:ListBucket / s3:GetBucketLocation on familienarchiv

The previous configuration silently regressed the least-privilege guarantee
that the service-account separation was supposed to provide: a future
second bucket (logs, backups, mc-mirror staging) would have been
read/write/delete-accessible to a compromised backend.

While at it, two follow-on fixes:

  1. Extract the entrypoint to infra/minio/bootstrap.sh. The previous
     inline `/bin/sh -c "..."` was already at the YAML-escaping ceiling;
     adding the policy-JSON heredoc would have made it unreadable.

  2. Replace the `| grep -q readwrite || exit 1` fatal-check with a
     POSIX `case` substring match. The minio/mc image ships coreutils +
     bash but NOT grep/awk/sed — so the original check could never
     succeed and ALWAYS exited 1 (verified locally). The new check
     passes on the first invocation and on every subsequent re-deploy.

Idempotency verified locally: two consecutive `docker compose run --rm
create-buckets` invocations both exit 0 with the user bound to the
new policy.
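The bucket-scoped policy JSON follows directly from the two bullets above (standard S3 policy grammar; only the statement grouping is an assumption):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::familienarchiv/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::familienarchiv"]
    }
  ]
}
```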

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 13:07:56 +02:00
Marcel
ad69d7cb83 feat(infra): commit fail2ban jail for /api/auth/login
Adds two files mirroring the on-host install layout:

  infra/fail2ban/filter.d/familienarchiv-auth.conf
  infra/fail2ban/jail.d/familienarchiv.conf

Filter parses the JSON access log emitted by Caddy (previous commit) and
matches 401 responses on /api/auth/login. Jail bans the offending IP for
30 min after 10 attempts in a 10-minute window.

Verified the failregex against four sample log lines via fail2ban-regex
in an alpine container:
  - 2 brute-force 401 attempts        → matched (ban)
  - 1 successful login (POST /api/auth/login 200) → not matched
  - 1 unrelated GET /login 200        → not matched
Date template "ts":{EPOCH} parses Caddy's Unix-epoch ts field.

The previous review iteration described this jail in DEPLOYMENT.md prose
only; committing it makes the security posture reproducible from a
fresh server build.
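A filter matching that description would look roughly like this (the regex body is an assumption; the date template is the one stated above):

```ini
# infra/fail2ban/filter.d/familienarchiv-auth.conf
[Definition]
failregex   = ^.*"remote_ip":"<HOST>".*"uri":"/api/auth/login".*"status":401
datepattern = "ts":{EPOCH}
```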

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 12:04:06 +02:00
Marcel
8d27c82e6d feat(infra): write Caddy JSON access logs for fail2ban
Adds an (access_log) snippet writing JSON-formatted access logs to
/var/log/caddy/access.log with 10mb rolling and 14-file retention. Both
archive vhosts (archiv.raddatz.cloud and staging.raddatz.cloud) import
it; the git vhost is intentionally excluded.

This is the prerequisite for the fail2ban jail committed in the next
commit — fail2ban tails this file looking for 401 responses on
/api/auth/login to defend against credential stuffing.

Validated with `caddy validate` against caddy:2.
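The snippet as described maps almost one-to-one onto Caddy's log directive (a sketch; the size, retention, and format values are from the commit):

```
(access_log) {
    log {
        output file /var/log/caddy/access.log {
            roll_size 10mb
            roll_keep 14
        }
        format json
    }
}
```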

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 12:02:28 +02:00
Marcel
56e55ff488 feat(infra): add production Caddyfile
Reverse proxy for the Familienarchiv host, validated against Caddy 2.
Includes both vhosts (production and staging), the Gitea vhost, and:

- HSTS, X-Content-Type-Options, Referrer-Policy headers on every site
- "-Server" header strip to hide the Caddy version
- /actuator/* responds 404 on both archive vhosts (defense in depth
  for Spring Boot's management endpoints)

X-Frame-Options is intentionally not set in Caddy: Spring Security
configures frame-options SAMEORIGIN for the in-app PDF preview
iframe; a DENY header here would conflict.

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 21:54:38 +02:00
Marcel
e85057bed2 refactor(document): move document domain core to document/ package
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:39:20 +02:00