Compare commits

...

63 Commits

Author SHA1 Message Date
Marcel
13dd38a590 docs(obs): document promtail job label mapping in DEPLOYMENT.md
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m0s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 2m44s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
The job label (derived from the Docker Compose service name) is what
powers {job="backend"} queries in Loki dashboards and populates the
Grafana "App" variable dropdown. Operators need to know this mapping
when writing custom Loki queries.

Addresses @markus non-blocker suggestion from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 13:56:51 +02:00
Marcel
060bacf36c docs(adr): add ADR-017 — Spring Boot 4.0 management port shares main security filter chain
Documents the architectural decision behind the dedicated management
SecurityFilterChain, the discovery that SB4+Jetty removed the isolated
management child-context security, and the consequences for actuator
endpoint exposure.

Addresses @markus blocker from PR #606 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 13:56:20 +02:00
Marcel
c80d0c810a fix(obs): add management security chain and split Prometheus IT tests
- Add @Order(1) managementFilterChain scoped to /actuator/** with explicit
  401 entry point, blocking all non-public actuator paths without the
  form-login redirect that the main chain uses for browser clients.
- Split single combined test into two focused assertions
  (prometheus_endpoint_returns_200_without_credentials,
   prometheus_endpoint_returns_jvm_metrics).
- Add negative regression test: actuator_metrics_requires_authentication
  verifies that /actuator/metrics returns 401 without credentials.

Addresses reviewer concerns from @sara (missing negative test, split
assertions) and @nora (dedicated management security layer).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 13:55:28 +02:00
Marcel
91a227f5c8 fix(obs): wire Prometheus endpoint for Spring Boot 4.0
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m1s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 2m41s
CI / fail2ban Regex (pull_request) Successful in 40s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
Four Spring Boot 4.0-specific issues prevented /actuator/prometheus from working:

1. spring-boot-starter-micrometer-metrics missing — Spring Boot 4.0 splits
   Micrometer metrics export (including the Prometheus scrape endpoint) out of
   spring-boot-starter-actuator into its own starter. Added dependency.

2. management.prometheus.metrics.export.enabled not set — Spring Boot 4.0
   defaults metrics export to false (opt-in). Added the property to
   application.yaml.

3. SecurityConfig did not permit /actuator/prometheus — Spring Boot 4.0
   with Jetty serves the management port (8081) via the same security filter
   chain as the main port (8080). The previous commit's exclusion of
   ManagementWebSecurityAutoConfiguration was a no-op (that class no longer
   exists in Spring Boot 4.0); removed it and added the correct permitAll()
   rule. Updated the architecture comment in application.yaml to reflect the
   true filter-chain behaviour.

4. Reverted invalid FamilienarchivApplication.java change from the prior
   commit (ManagementWebSecurityAutoConfiguration import compiled against a
   class that does not exist in the Spring Boot 4.0 BOM).

Also adds ActuatorPrometheusIT — an integration test that asserts the
/actuator/prometheus endpoint returns 200 with jvm_memory_used_bytes without
credentials, serving as regression protection against future Spring Boot
upgrades silently breaking metrics collection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 12:08:20 +02:00
Marcel
11320ecd43 fix(obs): wire Prometheus metrics and Loki job label for Grafana dashboards
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m3s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Failing after 29s
CI / fail2ban Regex (pull_request) Successful in 40s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
Three root causes confirmed via live server investigation (issue #604):

1. ManagementWebSecurityAutoConfiguration applied HTTP Basic auth to the
   management port (8081), causing Prometheus to receive 401 HTML responses
   instead of metrics. Excluded the auto-config — the Docker network
   (archiv-net) provides the security boundary for this internal port.

2. promtail-config.yml had no `job` relabel rule. Grafana's Loki dashboards
   query {job="$app"} which matched nothing; logs were in Loki under
   compose_service but invisible to every dashboard panel.

3. prometheus.yml had a stale comment claiming the spring-boot target would
   be DOWN until micrometer-registry-prometheus was added — it has been
   present in pom.xml for some time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 11:20:59 +02:00
Marcel
6145a25fe2 fix(obs): correct GlitchTip port and healthcheck for v6.x
Some checks failed
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
GlitchTip 6.x moved its internal listen port from 8080 to 8000.
The ports mapping was forwarding to the wrong port (host traffic
never reached the app), and the healthcheck was probing 8080 with
wget (not present in the image), causing the container to stay
permanently unhealthy.

Fix: map to port 8000, check with bash /dev/tcp (no external tools
needed, available in the Python base image).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 10:31:07 +02:00
Marcel
c43f45a472 Merge branch 'fix/issue-601-obs-stack-permanent'
Some checks failed
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
2026-05-16 10:19:59 +02:00
Marcel
134f1e2ae0 chore(runner): mount /opt/familienarchiv into job containers
The live runner config was missing /opt/familienarchiv in valid_volumes
and options, so deploy steps wrote files into the ephemeral job
container rather than the host — silently discarded on exit.

Updated /root/docker/gitea/runner-config.yaml on the server and
restarted gitea-runner. Repo file now matches the server exactly,
including the network: gitea_gitea setting that was previously
only on the server.

DEPLOYMENT.md: clarifies that /opt/familienarchiv does not need to be
in the runner container's own volumes (DooD spawns job containers from
the host daemon directly); updates restart command from systemctl to
docker restart; narrows the cp-r stale-file note to manual ops only
(CI uses rm -rf before copying).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 10:19:09 +02:00
Marcel
55ccd5f3c0 ci(obs): replace rsync with rm+cp in deploy step
rsync is not present in the act_runner job container image. rm -rf +
cp -r gives identical semantics (including removal of deleted files)
using only coreutils, which are always available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 10:18:42 +02:00
3658733003 fix(obs): add GlitchTip healthcheck on /_health/ (port 8080)
Some checks failed
CI / Unit & Component Tests (push) Waiting to run
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / OCR Service Tests (push) Successful in 42s
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
2026-05-16 09:37:17 +02:00
0bb0a314ad ci(obs): add obs-glitchtip to health assertion loop (now has /_health/ healthcheck)
Some checks are pending
CI / Unit & Component Tests (pull_request) Waiting to run
CI / OCR Service Tests (pull_request) Waiting to run
CI / Backend Unit Tests (pull_request) Waiting to run
CI / fail2ban Regex (pull_request) Waiting to run
CI / Compose Bucket Idempotency (pull_request) Waiting to run
2026-05-16 09:36:37 +02:00
b194b565f6 ci(obs): reference #603 in keep-in-sync comments; add obs-glitchtip to health assertion
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
2026-05-16 09:35:43 +02:00
Marcel
6720a5aeb2 chore(obs): improve deploy maintainability from review feedback
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m45s
CI / OCR Service Tests (pull_request) Successful in 47s
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
- Move POSTGRES_USER to obs.env (non-secret, constant across envs)
- Replace cp -r with rsync -a --delete so removed config files are
  purged from /opt/familienarchiv on next deploy instead of lingering
- Document --env-file ordering contract in validate + start steps:
  obs.env first (defaults), obs-secrets.env second (wins on dupes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:20:08 +02:00
Marcel
a7f60ebed8 docs(obs): add cp-r stale-file cleanup note to DEPLOYMENT.md
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m39s
CI / OCR Service Tests (pull_request) Successful in 46s
CI / Backend Unit Tests (pull_request) Failing after 9m24s
CI / fail2ban Regex (pull_request) Successful in 2m52s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m24s
CI uses 'cp -r' which does not remove deleted files. Documents the
manual cleanup step for config files removed from git.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:04:41 +02:00
Marcel
25062be657 ci(obs): quote heredoc delimiter in release obs-secrets.env write
Same fix as nightly.yml: prevents shell expansion of '$' in secret
values after Gitea renders them. Keep in sync with nightly.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:04:12 +02:00
Marcel
9662ff5f8c ci(obs): quote heredoc delimiter in nightly obs-secrets.env write
Prevents shell from expanding '$' in Gitea-rendered secret values.
Without the quote, a password like 'P@$s5w0rd' has '$s5w0rd' silently
expanded to '' — writing a truncated value to obs-secrets.env.
'<<'EOF'' suppresses shell expansion; Gitea's '${{ }}' template
rendering already ran before the shell sees the script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:03:46 +02:00
Marcel
f5c7be932b ci(obs): document POSTGRES_HOST derivation from Compose project name
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m38s
CI / OCR Service Tests (pull_request) Successful in 45s
CI / Backend Unit Tests (pull_request) Failing after 10m48s
CI / fail2ban Regex (pull_request) Successful in 2m51s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m16s
The container names archiv-staging-db-1 and archiv-production-db-1 are
derived from the Compose project name + service name. A project rename
silently breaks the obs stack DB connection. Add a comment at the point
of definition so the dependency is obvious when someone changes it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:54:17 +02:00
Marcel
dec0001bd1 ci(obs): chmod 600 obs-secrets.env after creation in both workflows
The heredoc creates the file with default umask permissions (644 —
world-readable). Setting 600 immediately after creation prevents other
processes on the host from reading the Grafana, GlitchTip, and Postgres
credentials. Defence-in-depth for the single-tenant VPS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:53:49 +02:00
Marcel
f628ab6435 ci(obs): add validate + health assertion steps to release.yml
nightly.yml had two observability gates that release.yml lacked:
- "Validate observability compose config" (docker compose config --quiet)
  catches missing env vars and YAML errors before any containers start
- "Assert observability stack health" checks obs-loki/prometheus/grafana/tempo
  are healthy after up --wait, covering services without healthcheck directives

Mirrors the nightly.yml steps verbatim so the production deploy path is at
least as well-verified as the nightly staging path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:53:18 +02:00
Marcel
4c5ee96e36 docs(adr): correct ADR-016 Decision section to match two-source env model
The Decision section described an operator-managed /opt/familienarchiv/.env
that CI does not touch. The actual implementation is a two-source model:
obs.env (git-tracked, non-secret config) + obs-secrets.env (CI-written
fresh from Gitea secrets on every deploy). Also updates the Consequences
bullet that incorrectly stated secrets are decoupled from CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 08:52:42 +02:00
Marcel
53cf1837b2 fix(obs): set POSTGRES_HOST per environment — staging/prod use compose auto-names not archive-db
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 2m58s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 2m39s
CI / fail2ban Regex (pull_request) Successful in 40s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:21:53 +02:00
Marcel
d83ed7254d docs(obs): document obs vs main stack env model, obs.env + obs-secrets.env approach
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:20:21 +02:00
Marcel
1ae4bfe325 ci(obs): GitOps obs env split in release — deploy to /opt/familienarchiv/, secrets fresh from Gitea
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:19:12 +02:00
Marcel
c5139851b8 ci(obs): GitOps obs env split in nightly — obs.env in git, secrets fresh from Gitea
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:18:38 +02:00
Marcel
f9baf02b86 feat(obs): add GF_SERVER_ROOT_URL to Grafana service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:17:47 +02:00
Marcel
b67bd201b2 feat(obs): add obs.env with non-secret config tracked in git
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:17:07 +02:00
Marcel
79735e23e0 ci(obs): assert obs-loki/prometheus/grafana/tempo are healthy after stack up
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 2m58s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 2m36s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:48 +02:00
Marcel
df37113d38 ci(obs): add compose config dry-run before obs stack up to catch .env substitution errors
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:01:17 +02:00
Marcel
c7d2eeb3f0 docs(ci): harden runner-config.yaml security comment for /opt/familienarchiv/ write access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:00:44 +02:00
Marcel
4e94d85d7e docs(adr): add ADR-016 for obs stack co-location and CI-push config sync
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 00:00:07 +02:00
Marcel
dec6b8139b docs(c4): update l2-containers obs boundary to show /opt/familienarchiv/ permanent path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:59:11 +02:00
Marcel
7b7d0c92a8 docs(obs): update DEPLOYMENT.md with /opt/familienarchiv/ ops section, env keys, runner restart
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:58:42 +02:00
Marcel
448c3cdcdb docs(obs): update .env.example for PORT_GRAFANA 3003, POSTGRES_HOST, $$ escaping
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 23:57:31 +02:00
Marcel
7e52494880 fix(ci): deploy obs configs to /opt/familienarchiv/ before starting stack
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m4s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m42s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
The observability stack's bind-mount sources pointed to workspace-relative
paths. When CI wiped the workspace between runs, containers kept running but
their config files disappeared — causing Docker to auto-create directories
at the missing paths and crash the services on next restart.

Fix: mount /opt/familienarchiv/ into CI job containers via runner-config.yaml,
then copy infra/observability/ and docker-compose.observability.yml there before
docker compose up. Compose runs from the permanent path, so bind mounts resolve
to stable host paths that survive workspace wipes.

Docker Compose reads /opt/familienarchiv/.env automatically (no --env-file flag),
which is managed on the server and persists between CI runs.

Closes #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 21:59:23 +02:00
Marcel
1181b97f94 fix(obs): make Postgres host configurable and fix PORT_GRAFANA default
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m6s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 2m43s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
POSTGRES_HOST variable (default: archive-db) lets the observability stack
connect to a different Postgres container — needed when only the staging
stack is running (container name: archiv-staging-db-1).

PORT_GRAFANA default changed from 3001 to 3003 to avoid collision with
the staging frontend which occupies 3001.

Closes #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 21:46:11 +02:00
Marcel
458968ded5 fix(obs): remove invalid processors block from tempo metrics_generator
Tempo 2.7.2 removed `processors` from the top-level metrics_generator
config; the field is only valid under `overrides.defaults.metrics_generator`.
The setting was already present there, so this only removes the now-rejected
duplicate at the top level.

Closes part of #601

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 21:45:49 +02:00
Marcel
23515b8542 fix(eslint): remove projectService from Svelte parser — restores fast lint
Some checks failed
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / Unit & Component Tests (push) Successful in 3m23s
CI / OCR Service Tests (push) Successful in 17s
CI / Backend Unit Tests (push) Successful in 2m37s
CI / fail2ban Regex (push) Successful in 44s
CI / Compose Bucket Idempotency (push) Successful in 1m1s
nightly / deploy-staging (push) Failing after 2m33s
5646e739 added svelte-kit sync before lint so .svelte-kit/tsconfig.json
always exists. This activated projectService: true for every run, which
builds the full TypeScript language service for all .svelte files and
caused CI lint to take 7+ minutes.

None of the rules in the Svelte-specific block need type information —
they are all AST-selector-based no-restricted-syntax checks. Removing
projectService restores the previous fast path without losing any lint
coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 20:08:52 +02:00
Marcel
e4ac5f08e7 docs(ci): document workspace bind-mount setup for DooD runners
Some checks failed
CI / OCR Service Tests (pull_request) Successful in 59s
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
CI / Backend Unit Tests (pull_request) Successful in 5m22s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
CI / Unit & Component Tests (pull_request) Failing after 14m44s
Add the /srv/gitea-workspace prerequisite step to DEPLOYMENT.md §3.1
and a new "Workspace bind-mount setup" subsection plus failure mode 4
to ci-gitea.md, covering the root cause, one-time host setup, disk
management, and troubleshooting for the bind-mount resolution fix
introduced in ADR-015.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:46:54 +02:00
Marcel
15ef079eff docs(adr): add ADR-015 for DooD workspace bind-mount approach
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m36s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
Documents the decision to use workdir_parent + identical host<->container
path instead of the overlay2 MergedDir sync that was in the initial fix.
Captures the alternatives (nsenter sync, image-baked configs, path mismatch)
and the operational consequences (prereq directory, out-of-band compose.yaml).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:38:18 +02:00
Marcel
56c3e51657 fix(ci): replace overlay2 sync with workspace bind mount for DooD
runner-config.yaml: correct path to /srv/gitea-workspace (VPS, not Synology).
docker-compose.observability.yml: revert 5 bind mounts to plain relative paths;
  OBS_CONFIG_DIR variable is no longer needed.
nightly.yml / release.yml: remove OBS_CONFIG_DIR env injection and the
  "Sync observability configs to host" step from both workflows.

With workdir_parent=/srv/gitea-workspace and an identical host<->container
bind mount, $(pwd) inside job containers resolves to a real host path the
daemon can find — no privileged container, no overlay2 inspection, no nsenter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:36:55 +02:00
Marcel
2cc8b1174b fix(ci): configure workspace bind mount for DooD bind-mount resolution
Set workdir_parent to /volume1/gitea-workspace so act_runner stores job
workspaces at a real NAS path. Mounting that path at the same absolute
location in job containers means $(pwd) inside any job container resolves
to a host path the daemon can find — no overlay2 tricks needed.

Prerequisite (NAS): mkdir -p /volume1/gitea-workspace and add
  - /volume1/gitea-workspace:/volume1/gitea-workspace
to the runner service volumes in gitea's docker-compose.yml, then restart
the runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:33:36 +02:00
Marcel
1fc47888d5 fix(ci): sync observability configs to host before docker compose up (#598)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m26s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 2m40s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
DooD runner only shares /var/run/docker.sock — no workspace directory is
mapped to the host daemon. Relative bind mounts in
docker-compose.observability.yml resolved to paths that didn't exist on
the host; Docker auto-created directories in their place, causing
'not a directory' mount failures for all five config files.

Fix:
- docker-compose.observability.yml: replace hardcoded ./infra/observability/
  prefix with ${OBS_CONFIG_DIR:-./infra/observability} so the path is
  configurable while remaining backwards-compatible for local use.
- nightly.yml / release.yml: add a 'Sync observability configs to host'
  step that finds the job container's overlay2 MergedDir (the container's
  full filesystem as seen from the host mount namespace), then uses the
  existing nsenter/alpine pattern to cp the config tree into a stable host
  path (/srv/familienarchiv-{staging,production}/obs-configs).
  OBS_CONFIG_DIR is injected into the env file so Compose picks it up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 19:02:53 +02:00
Marcel
d435b2b0e4 fix(infra): pin GlitchTip image to 6.1.6 (v4 tag never existed)
All checks were successful
CI / Unit & Component Tests (push) Successful in 3m30s
CI / OCR Service Tests (push) Successful in 16s
CI / Backend Unit Tests (push) Successful in 2m34s
CI / fail2ban Regex (push) Successful in 40s
CI / Compose Bucket Idempotency (push) Successful in 57s
glitchtip/glitchtip:v4 is not a real tag — GlitchTip does not use a
v-prefix in its Docker image versioning. Latest stable release is 6.1.6.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 18:07:32 +02:00
Marcel
fed427dc4a fix(infra): set OTEL_EXPORTER_OTLP_ENDPOINT in docker-compose.prod.yml
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
CI / fail2ban Regex (push) Has been cancelled
CI / Compose Bucket Idempotency (push) Has been cancelled
The endpoint belongs in the compose file (hardcoded to the in-network
Tempo service) rather than per-environment workflow files. This covers
both staging (nightly.yml) and production (release.yml) with a single
change and removes the duplicate from the nightly env-file block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:43:23 +02:00
Marcel
cf78ab2f8e fix(staging): correct backend healthcheck port and OTel endpoint
Some checks failed
CI / OCR Service Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Unit & Component Tests (pull_request) Has been cancelled
Two bugs introduced when the management port was split from the app port:

1. Backend healthcheck hit localhost:8080/actuator/health (app port) —
   actuator is on management.server.port=8081, so every probe got a 404
   from the main MVC dispatcher, marking the container permanently unhealthy.
   Fix: change the probe to localhost:8081.

2. OTEL_EXPORTER_OTLP_ENDPOINT was not set in .env.staging, so the exporter
   fell back to http://localhost:4317 (the CI-safe default) instead of
   http://tempo:4317 (the in-network Tempo service). Fix: inject the correct
   endpoint in the nightly env-file generation step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:37:15 +02:00
Marcel
c8883d0e40 fix(ci): isolate compose-idempotency network from archiv-net collisions
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m40s
CI / OCR Service Tests (pull_request) Successful in 34s
CI / Backend Unit Tests (pull_request) Successful in 7m8s
CI / fail2ban Regex (pull_request) Successful in 1m58s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m41s
CI / Unit & Component Tests (push) Successful in 5m37s
CI / OCR Service Tests (push) Successful in 28s
CI / Backend Unit Tests (push) Successful in 6m59s
CI / fail2ban Regex (push) Successful in 1m59s
CI / Compose Bucket Idempotency (push) Successful in 1m44s
The name: archiv-net declaration (needed so docker-compose.observability.yml
can join the network as external: true) caused the compose-idempotency CI job
to collide with any archiv-net left on the runner from staging or a previous
run. mc would resolve 'minio' to the wrong container and fail with a signature
mismatch.

Make the network name interpolable via COMPOSE_NETWORK_NAME (default: archiv-net
so production/staging behaviour is unchanged). Inject COMPOSE_NETWORK_NAME=
test-idem-archiv-net into the stub env file so the idempotency test always
gets a fully isolated network.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:33:07 +02:00
Marcel
7154092547 fix(deps): pin opentelemetry-bom to 1.61.0 to fix startup crash
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m34s
CI / OCR Service Tests (pull_request) Successful in 30s
CI / Backend Unit Tests (pull_request) Successful in 7m6s
CI / fail2ban Regex (pull_request) Successful in 1m49s
CI / Compose Bucket Idempotency (pull_request) Failing after 1m26s
opentelemetry-spring-boot-starter:2.27.0 was built against
opentelemetry-api:1.61.0. Spring Boot 4.0.0 only manages 1.55.0,
which is missing GlobalOpenTelemetry.getOrNoop(). The backend crashed
at startup with NoSuchMethodError on the first staging nightly.

Add a <dependencyManagement> import of opentelemetry-bom:1.61.0 before
the Spring Boot BOM applies, so all OTel core artifacts resolve to the
version the instrumentation starter actually requires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:05:44 +02:00
Marcel
ada3a3ccaf devops(ci): add --remove-orphans to observability stack deploy steps
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m27s
CI / OCR Service Tests (pull_request) Successful in 34s
CI / Backend Unit Tests (pull_request) Successful in 7m13s
CI / fail2ban Regex (pull_request) Successful in 1m51s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m47s
CI / Unit & Component Tests (push) Successful in 5m45s
CI / OCR Service Tests (push) Successful in 36s
CI / Backend Unit Tests (push) Successful in 7m12s
CI / fail2ban Regex (push) Successful in 1m54s
CI / Compose Bucket Idempotency (push) Successful in 1m41s
Both nightly and release workflows were missing --remove-orphans on the
observability compose up, while the main app deploy step already had it.
Without it, containers removed from docker-compose.observability.yml
linger as unnamed orphans until manually pruned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:55:28 +02:00
Marcel
8cf3a2a726 devops(caddy): apply full security_headers snippet to GlitchTip vhost
The GlitchTip vhost only had a manual HSTS header; the rest of the
(security_headers) snippet (X-Content-Type-Options, Referrer-Policy,
Permissions-Policy, -Server removal) was missing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:54:54 +02:00
Marcel
553e2f8898 docs(deployment): add observability secrets to §3.3 Gitea secrets table
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m35s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Successful in 7m10s
CI / fail2ban Regex (pull_request) Successful in 1m54s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m39s
GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, and SENTRY_DSN were
referenced in the workflow env files but absent from the secrets table,
leaving the first-run operator without a complete checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:46:01 +02:00
Marcel
4a7349543a devops(ci): wire SENTRY_DSN into staging and production env files
Adds SENTRY_DSN as an optional secret (empty by default) so it can be
set after GlitchTip first-run without requiring another code change.
Backend reads it via application.yaml; empty value keeps Sentry disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:45:07 +02:00
Marcel
f15e004645 devops(ci): add --wait to observability stack startup
Prometheus, Loki, Tempo, and Grafana all define healthchecks in
docker-compose.observability.yml. Without --wait, the step exits 0
as soon as containers are created, masking startup failures silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:44:16 +02:00
Marcel
b137e3e72d devops(caddy): add HSTS to GlitchTip vhost
Caddy does not set Strict-Transport-Security on GlitchTip because the
full security_headers snippet is intentionally omitted (Permissions-Policy
interferes with the Sentry SDK CORS). Adding HSTS alone guarantees
HTTPS enforcement at the Caddy layer without breaking SDK ingestion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:43:35 +02:00
Marcel
4c8a23ff14 devops(caddy): add Grafana and GlitchTip vhosts
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 5m33s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Successful in 7m10s
CI / fail2ban Regex (pull_request) Successful in 1m55s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s
grafana.archiv.raddatz.cloud → 127.0.0.1:3003 (with security headers)
glitchtip.archiv.raddatz.cloud → 127.0.0.1:3002 (no security headers —
  GlitchTip manages its own; the Sentry SDK also POSTs here)

Requires A records for both subdomains pointing at the server before
the next `systemctl reload caddy`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:27:07 +02:00
Marcel
d7d225af77 devops(observability): wire observability stack into nightly and release deploys
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m32s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m3s
CI / fail2ban Regex (pull_request) Successful in 1m55s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m42s
- docker-compose.prod.yml: add `name: archiv-net` so the network has a
  stable Docker name regardless of compose project name (-p flag).
  Both staging and production share the same host-level network, which
  is correct since the observability stack is a single shared instance.

- nightly.yml / release.yml: add observability env vars (POSTGRES_USER,
  PORT_GRAFANA=3003, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090,
  GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY, GLITCHTIP_DOMAIN) to the
  env file, then `docker compose -f docker-compose.observability.yml up -d`
  after the app deploy step. PORT_GRAFANA=3003 avoids collision with
  staging frontend on 3001.

  Requires two new Gitea secrets: GRAFANA_ADMIN_PASSWORD, GLITCHTIP_SECRET_KEY.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:22:37 +02:00
Marcel
4358997482 perf(test): replace DirtiesContext(AFTER_EACH_TEST_METHOD) with @Transactional
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m40s
CI / OCR Service Tests (pull_request) Successful in 18s
CI / Backend Unit Tests (pull_request) Successful in 3m20s
CI / fail2ban Regex (pull_request) Successful in 47s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m4s
CI / Unit & Component Tests (push) Successful in 4m20s
CI / OCR Service Tests (push) Successful in 16s
CI / Backend Unit Tests (push) Successful in 3m8s
CI / fail2ban Regex (push) Successful in 44s
CI / Compose Bucket Idempotency (push) Successful in 1m1s
4 integration test classes were restarting the full Spring context (and a new
Postgres Testcontainer, ~75s each) after every test method — 10 unnecessary
container startups adding ~12 minutes to CI. Fixed by:

- PersonServiceIntegrationTest, DocumentSearchPagedIntegrationTest,
  GeschichteServiceIntegrationTest: swap to @Transactional so each test
  rolls back instead of destroying the context.
- AuditServiceIntegrationTest: cannot use @Transactional (logAfterCommit
  hooks into AFTER_COMMIT which requires a real commit); reset state with
  @BeforeEach deleteAll() instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 10:29:35 +02:00
Marcel
7c2e75facc fix(backend): switch to sentry-spring-boot-4:8.41.0 for Spring Boot 4/SF7 compatibility
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 6m12s
CI / OCR Service Tests (pull_request) Successful in 42s
CI / Backend Unit Tests (pull_request) Failing after 17m13s
CI / fail2ban Regex (pull_request) Successful in 2m37s
CI / Compose Bucket Idempotency (pull_request) Successful in 2m6s
sentry-spring-boot-starter-jakarta 8.5.0 does not support Spring Boot 4.0 —
it logs an "Incompatible Spring Boot Version" warning and its SentryAutoConfiguration
crashes SF7 bean-name generation. sentry-spring-boot-4 (added in 8.21.0) is the
dedicated Spring Boot 4 module with a fixed auto-configuration class.

- Replace sentry-spring-boot-starter-jakarta:8.5.0 with sentry-spring-boot-4:8.41.0
- Delete SentryConfig.java — workaround no longer needed, auto-config handles init
- Remove spring.autoconfigure.exclude from application.yaml + application-test.yaml
- Delete SentryConfigTest.java — tested the deleted workaround class
- Update ApplicationContextTest: assert Sentry.isEnabled() is false when no DSN set

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:51:53 +02:00
Marcel
7b05b9d5a0 test(context): assert SentryAutoConfiguration is excluded from Spring context
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:45:32 +02:00
Marcel
20edc0474c test(exception): verify handleGeneric captures exception in Sentry and returns 500
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:44:10 +02:00
Marcel
fa191b5c05 test(config): unit-test SentryConfig blank-DSN no-op and non-blank init paths
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:43:08 +02:00
Marcel
2139d600f5 fix(backend): exclude SentryAutoConfiguration — Spring Boot 4/SF7 bean name incompatibility
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 6m26s
CI / OCR Service Tests (pull_request) Successful in 43s
CI / fail2ban Regex (pull_request) Has been cancelled
CI / Compose Bucket Idempotency (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
SentryAutoConfiguration$HubConfiguration$SentrySpanRestClientConfiguration is a triply-
nested @Configuration class conditionally loaded when RestClient is on the classpath
(always true on Spring Framework 7). Spring Framework 7's bean name generator fails
on such deeply-nested @Import-ed classes, crashing every @SpringBootTest context.

Replace the broken auto-configuration with a minimal SentryConfig bean that calls
Sentry.init() with the same properties (DSN, environment, sample rate, PII guard,
DomainException filter). Unexpected 5xx exceptions are forwarded to Sentry via
Sentry.captureException() in GlobalExceptionHandler.handleGeneric().

Also add management.server.port=0 to application-test.yaml to eliminate TIME_WAIT
conflicts from @DirtiesContext restarts on the fixed management port 8081 (see #593).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:25:14 +02:00
Marcel
68e4ff4121 fix(backend): make sentry traces-sample-rate env-configurable
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 6m4s
CI / OCR Service Tests (pull_request) Successful in 32s
CI / Backend Unit Tests (pull_request) Failing after 7m9s
CI / fail2ban Regex (pull_request) Successful in 2m27s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m59s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 08:55:40 +02:00
Marcel
0a1d709c5f feat(backend): add sentry-spring-boot-starter-jakarta for GlitchTip error reporting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 08:55:40 +02:00
31 changed files with 761 additions and 45 deletions

View File

@@ -29,16 +29,17 @@ OCR_TRAINING_TOKEN=change-me-in-production
# --- Observability ---
# Optional stack — start with: docker compose -f docker-compose.observability.yml up -d
# Requires the main stack to already be running (docker compose up -d creates archiv-net).
# In production the stack is managed from /opt/familienarchiv/ (see docs/DEPLOYMENT.md §4).
# Ports for host access
PORT_GRAFANA=3001
PORT_GRAFANA=3003
PORT_GLITCHTIP=3002
PORT_PROMETHEUS=9090
# Grafana admin password — change this before exposing Grafana beyond localhost
GRAFANA_ADMIN_PASSWORD=changeme
# GlitchTip domain — production: use https://grafana.raddatz.cloud (must match Caddy vhost)
# GlitchTip domain — production: use https://glitchtip.archiv.raddatz.cloud (must match Caddy vhost)
GLITCHTIP_DOMAIN=http://localhost:3002
# GlitchTip secret key — Django SECRET_KEY equivalent, used to sign sessions and tokens.
@@ -47,9 +48,19 @@ GLITCHTIP_DOMAIN=http://localhost:3002
# Generate with: python3 -c "import secrets; print(secrets.token_hex(50))"
GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret
# PostgreSQL hostname for GlitchTip's db-init job and workers.
# Override when only the staging stack is running (container name differs from archive-db).
# Default (archive-db) is correct for production with the full stack up.
POSTGRES_HOST=archive-db
# $$ escaping note: passwords in /opt/familienarchiv/.env that contain a literal '$' must
# use '$$' so Docker Compose does not expand them as variable references.
# Example: a password 'p@$$word' should be written as 'p@$$$$word' in the .env file.
# Error reporting DSNs — leave empty to disable the SDK (safe default).
# SENTRY_DSN: backend (Spring Boot) — used by the GlitchTip/Sentry Java SDK
SENTRY_DSN=
SENTRY_TRACES_SAMPLE_RATE=
# VITE_SENTRY_DSN: frontend (SvelteKit) — injected at build time via Vite
VITE_SENTRY_DSN=
# Sentry/GlitchTip auth token for source map upload at build time (optional)

View File

@@ -305,6 +305,7 @@ jobs:
MAIL_PORT=1025
APP_MAIL_FROM=noreply@local
IMPORT_HOST_DIR=/tmp/dummy-import
COMPOSE_NETWORK_NAME=test-idem-archiv-net
EOF
- name: Bring up minio

View File

@@ -30,6 +30,9 @@ name: nightly
# STAGING_OCR_TRAINING_TOKEN
# STAGING_APP_ADMIN_USERNAME
# STAGING_APP_ADMIN_PASSWORD
# GRAFANA_ADMIN_PASSWORD
# GLITCHTIP_SECRET_KEY
# SENTRY_DSN (set after GlitchTip first-run; empty = Sentry disabled)
on:
schedule:
@@ -74,6 +77,8 @@ jobs:
MAIL_STARTTLS_ENABLE=false
APP_MAIL_FROM=noreply@staging.raddatz.cloud
IMPORT_HOST_DIR=/srv/familienarchiv-staging/import
POSTGRES_USER=archiv
SENTRY_DSN=${{ secrets.SENTRY_DSN }}
EOF
- name: Verify backend /import:ro mount is wired
@@ -120,6 +125,77 @@ jobs:
--profile staging \
up -d --wait --remove-orphans
- name: Deploy observability configs
# Copies the compose file and config tree from the workspace checkout
# into /opt/familienarchiv/ — the permanent location that persists
# between CI runs. Containers started in the next step bind-mount
# from there, so a future workspace wipe cannot corrupt a running
# config file.
#
# obs-secrets.env is written fresh from Gitea secrets on every run so
# Gitea is always the single source of truth for secret rotation.
# Non-secret config lives in infra/observability/obs.env (tracked in git).
run: |
rm -rf /opt/familienarchiv/infra/observability
mkdir -p /opt/familienarchiv/infra/observability
cp -r infra/observability/. /opt/familienarchiv/infra/observability/
cp docker-compose.observability.yml /opt/familienarchiv/
cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }}
POSTGRES_HOST=archiv-staging-db-1
EOF
# Note: POSTGRES_HOST is derived from the Compose project name (archiv-staging)
# and service name (db). A project rename requires updating this value.
chmod 600 /opt/familienarchiv/obs-secrets.env
- name: Validate observability compose config
# Dry-run: resolves all variable substitutions and reports any missing
# required keys before containers start. Catches undefined variables and
# YAML errors in config files updated by the previous step.
# --env-file order: obs.env first (git-tracked defaults), obs-secrets.env
# second (CI-written secrets). Later files win on duplicate keys, so
# obs-secrets.env overrides POSTGRES_HOST set in obs.env.
run: |
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
--env-file /opt/familienarchiv/infra/observability/obs.env \
--env-file /opt/familienarchiv/obs-secrets.env \
config --quiet
- name: Start observability stack
# Runs with absolute paths so bind mounts resolve to stable host paths
# that survive workspace wipes between nightly runs (see ADR-016).
# Non-secret config from obs.env (git-tracked); secrets from obs-secrets.env
# (written fresh from Gitea secrets above). --env-file order: obs.env first,
# obs-secrets.env second — later file wins on duplicate keys.
run: |
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
--env-file /opt/familienarchiv/infra/observability/obs.env \
--env-file /opt/familienarchiv/obs-secrets.env \
up -d --wait --remove-orphans
- name: Assert observability stack health
# docker compose up --wait covers services WITH healthcheck directives only.
# obs-promtail, obs-cadvisor, obs-node-exporter, and obs-glitchtip-worker have
# no healthcheck — they are considered "started" as soon as the process runs.
# This step explicitly asserts the five healthchecked critical services are
# healthy before the smoke test proceeds.
run: |
set -e
unhealthy=""
for svc in obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip; do
status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")
if [ "$status" != "healthy" ]; then
echo "::error::$svc is not healthy (status: $status)"
unhealthy="$unhealthy $svc"
fi
done
[ -z "$unhealthy" ] || exit 1
echo "All critical observability services are healthy"
- name: Reload Caddy
# Apply any committed Caddyfile changes before smoke-testing the
# public surface. Without this step, a Caddyfile edit lands in the

View File

@@ -34,6 +34,9 @@ name: release
# MAIL_PORT
# MAIL_USERNAME
# MAIL_PASSWORD
# GRAFANA_ADMIN_PASSWORD
# GLITCHTIP_SECRET_KEY
# SENTRY_DSN (set after GlitchTip first-run; empty = Sentry disabled)
on:
push:
@@ -72,6 +75,8 @@ jobs:
MAIL_STARTTLS_ENABLE=true
APP_MAIL_FROM=noreply@raddatz.cloud
IMPORT_HOST_DIR=/srv/familienarchiv-production/import
POSTGRES_USER=archiv
SENTRY_DSN=${{ secrets.SENTRY_DSN }}
EOF
- name: Build images
@@ -93,6 +98,75 @@ jobs:
--env-file .env.production \
up -d --wait --remove-orphans
- name: Deploy observability configs
# Mirrors the nightly approach: copies obs compose file and config tree
# to /opt/familienarchiv/ (permanent path, survives workspace wipes — ADR-016),
# then writes obs-secrets.env fresh from Gitea secrets.
# Non-secret config lives in infra/observability/obs.env (tracked in git).
run: |
rm -rf /opt/familienarchiv/infra/observability
mkdir -p /opt/familienarchiv/infra/observability
cp -r infra/observability/. /opt/familienarchiv/infra/observability/
cp docker-compose.observability.yml /opt/familienarchiv/
cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
POSTGRES_PASSWORD=${{ secrets.PROD_POSTGRES_PASSWORD }}
POSTGRES_HOST=archiv-production-db-1
EOF
# Note: POSTGRES_HOST is derived from the Compose project name (archiv-production)
# and service name (db). A project rename requires updating this value.
chmod 600 /opt/familienarchiv/obs-secrets.env
- name: Validate observability compose config
# Dry-run: resolves all variable substitutions and reports any missing
# required keys before containers start. Catches undefined variables and
# YAML errors in config files updated by the previous step.
# --env-file order: obs.env first (git-tracked defaults), obs-secrets.env
# second (CI-written secrets). Later files win on duplicate keys, so
# obs-secrets.env overrides POSTGRES_HOST set in obs.env.
# Keep in sync with the equivalent step in nightly.yml (#603).
run: |
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
--env-file /opt/familienarchiv/infra/observability/obs.env \
--env-file /opt/familienarchiv/obs-secrets.env \
config --quiet
- name: Start observability stack
# Runs with absolute paths so bind mounts resolve to stable host paths
# that survive workspace wipes between runs (see ADR-016).
# Non-secret config from obs.env (git-tracked); secrets from obs-secrets.env
# (written fresh from Gitea secrets above). --env-file order: obs.env first,
# obs-secrets.env second — later file wins on duplicate keys.
# Keep in sync with the equivalent step in nightly.yml (#603).
run: |
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
--env-file /opt/familienarchiv/infra/observability/obs.env \
--env-file /opt/familienarchiv/obs-secrets.env \
up -d --wait --remove-orphans
- name: Assert observability stack health
# docker compose up --wait covers services WITH healthcheck directives only.
# obs-promtail, obs-cadvisor, obs-node-exporter, and obs-glitchtip-worker have
# no healthcheck — they are considered "started" as soon as the process runs.
# This step explicitly asserts the five healthchecked critical services are
# healthy before the smoke test proceeds.
# Keep in sync with the equivalent step in nightly.yml (#603).
run: |
set -e
unhealthy=""
for svc in obs-loki obs-prometheus obs-grafana obs-tempo obs-glitchtip; do
status=$(docker inspect "$svc" --format '{{.State.Health.Status}}' 2>/dev/null || echo "missing")
if [ "$status" != "healthy" ]; then
echo "::error::$svc is not healthy (status: $status)"
unhealthy="$unhealthy $svc"
fi
done
[ -z "$unhealthy" ] || exit 1
echo "All critical observability services are healthy"
- name: Reload Caddy
# See nightly.yml — same rationale and mechanism: DooD job containers
# cannot call systemctl directly; nsenter via a privileged sibling

View File

@@ -29,11 +29,30 @@
<properties>
<java.version>21</java.version>
</properties>
<dependencyManagement>
<dependencies>
<!-- opentelemetry-spring-boot-starter:2.27.0 was built against opentelemetry-api:1.61.0,
but Spring Boot 4.0.0 BOM only manages 1.55.0 (missing GlobalOpenTelemetry.getOrNoop()).
Import the core OTel BOM here to override it before the Spring Boot BOM applies. -->
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-bom</artifactId>
<version>1.61.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Spring Boot 4.0 splits Micrometer metrics export (incl. Prometheus scrape endpoint) into its own starter -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-micrometer-metrics</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-validation</artifactId>
@@ -224,6 +243,15 @@
</exclusion>
</exclusions>
</dependency>
<!-- Sentry error reporting (GlitchTip-compatible) — sentry-spring-boot-4 is the
Spring Boot 4 / Spring Framework 7 compatible module (replaces the jakarta starter
which crashes with SF7 due to bean-name generation for triply-nested @Import classes) -->
<dependency>
<groupId>io.sentry</groupId>
<artifactId>sentry-spring-boot-4</artifactId>
<version>8.41.0</version>
</dependency>
</dependencies>

View File

@@ -2,6 +2,7 @@ package org.raddatz.familienarchiv.exception;
import java.util.stream.Collectors;
import io.sentry.Sentry;
import jakarta.validation.ConstraintViolationException;
import org.raddatz.familienarchiv.exception.DomainException;
import org.raddatz.familienarchiv.exception.ErrorCode;
@@ -63,6 +64,7 @@ public class GlobalExceptionHandler {
@ExceptionHandler(Exception.class)
public ResponseEntity<ErrorResponse> handleGeneric(Exception ex) {
Sentry.captureException(ex);
log.error("Unhandled exception", ex);
return ResponseEntity.internalServerError()
.body(new ErrorResponse(ErrorCode.INTERNAL_ERROR, "An unexpected error occurred"));

View File

@@ -3,13 +3,16 @@ package org.raddatz.familienarchiv.security;
import lombok.RequiredArgsConstructor;
import org.raddatz.familienarchiv.user.CustomUserDetailsService;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.core.env.Environment;
import org.springframework.security.authentication.dao.DaoAuthenticationProvider;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configurers.AbstractHttpConfigurer;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import org.springframework.security.crypto.password.PasswordEncoder;
import org.springframework.security.web.SecurityFilterChain;
@@ -34,6 +37,28 @@ public class SecurityConfig {
return authProvider;
}
@Bean
@Order(1)
public SecurityFilterChain managementFilterChain(HttpSecurity http) throws Exception {
http
.securityMatcher("/actuator/**")
.authorizeHttpRequests(auth -> {
// Health and Prometheus are open — Docker health checks and Prometheus scraping need no credentials.
auth.requestMatchers("/actuator/health", "/actuator/prometheus").permitAll();
// All other actuator endpoints (metrics, info, env, heapdump…) require authentication.
auth.anyRequest().authenticated();
})
// Explicitly return 401 for any unauthenticated actuator request.
// Without this override, Spring Security's DelegatingAuthenticationEntryPoint
// would redirect browser-like clients to the form-login page (302 → /login),
// making it impossible to distinguish "not authenticated" from "not found" in tests.
.exceptionHandling(ex -> ex.authenticationEntryPoint(
(req, res, e) -> res.setStatus(HttpServletResponse.SC_UNAUTHORIZED)))
.formLogin(AbstractHttpConfigurer::disable)
.csrf(AbstractHttpConfigurer::disable);
return http.build();
}
@Bean
public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
http
@@ -54,8 +79,10 @@ public class SecurityConfig {
.csrf(csrf -> csrf.disable())
.authorizeHttpRequests(auth -> {
// Health endpoint must be open so CI/Docker health checks work without credentials
auth.requestMatchers("/actuator/health").permitAll();
// Actuator endpoints are governed by managementFilterChain (@Order(1)) above.
// The permitAll() lines here are a belt-and-suspenders fallback in case any
// actuator path escapes that chain's securityMatcher. See docs/adr/017.
auth.requestMatchers("/actuator/health", "/actuator/prometheus").permitAll();
// Password reset endpoints are unauthenticated by nature
auth.requestMatchers("/api/auth/forgot-password", "/api/auth/reset-password").permitAll();
// Invite-based registration endpoints are public

View File

@@ -49,7 +49,8 @@ management:
# Management port is separate from the app port so that:
# (a) Caddy never proxies /actuator/* (it only routes :8080 → the app port)
# (b) Prometheus scrapes backend:8081 directly inside archiv-net, not via Caddy
# (c) Spring Security's session-authenticated filter chain on :8080 never sees actuator requests
# Note: in Spring Boot 4.0 the management port shares the security filter chain; /actuator/health
# and /actuator/prometheus must be explicitly permitted in SecurityConfig — see SecurityConfig.java.
port: 8081
endpoints:
web:
@@ -58,6 +59,11 @@ management:
endpoint:
prometheus:
enabled: true
# Spring Boot 4.0: metrics export is disabled by default — explicitly opt in for Prometheus
prometheus:
metrics:
export:
enabled: true
health:
mail:
enabled: false
@@ -118,3 +124,12 @@ ocr:
sender-model:
activation-threshold: 100
retrain-delta: 50
sentry:
dsn: ${SENTRY_DSN:}
environment: ${SPRING_PROFILES_ACTIVE:dev}
traces-sample-rate: ${SENTRY_TRACES_SAMPLE_RATE:1.0}
send-default-pii: false
enable-tracing: true
ignored-exceptions-for-type:
- org.raddatz.familienarchiv.exception.DomainException

View File

@@ -0,0 +1,63 @@
package org.raddatz.familienarchiv;
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.server.LocalManagementPort;
import org.springframework.context.annotation.Import;
import org.springframework.http.ResponseEntity;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.web.client.DefaultResponseErrorHandler;
import org.springframework.web.client.RestTemplate;
import software.amazon.awssdk.services.s3.S3Client;
import java.io.IOException;
import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
class ActuatorPrometheusIT {
@LocalManagementPort
private int managementPort;
@MockitoBean
S3Client s3Client;
@Test
void prometheus_endpoint_returns_200_without_credentials() {
ResponseEntity<String> response = noThrowTemplate().getForEntity(
"http://localhost:" + managementPort + "/actuator/prometheus", String.class);
assertThat(response.getStatusCode().value()).isEqualTo(200);
}
@Test
void prometheus_endpoint_returns_jvm_metrics() {
ResponseEntity<String> response = noThrowTemplate().getForEntity(
"http://localhost:" + managementPort + "/actuator/prometheus", String.class);
assertThat(response.getBody()).contains("jvm_memory_used_bytes");
}
@Test
void actuator_metrics_requires_authentication() {
ResponseEntity<String> response = noThrowTemplate().getForEntity(
"http://localhost:" + managementPort + "/actuator/metrics", String.class);
assertThat(response.getStatusCode().value()).isEqualTo(401);
}
private RestTemplate noThrowTemplate() {
RestTemplate template = new RestTemplate();
template.setErrorHandler(new DefaultResponseErrorHandler() {
@Override
public boolean hasError(org.springframework.http.client.ClientHttpResponse response) throws IOException {
return false;
}
});
return template;
}
}

View File

@@ -1,14 +1,18 @@
package org.raddatz.familienarchiv;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.testcontainers.service.connection.ServiceConnection;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Import;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.testcontainers.containers.PostgreSQLContainer;
import software.amazon.awssdk.services.s3.S3Client;
import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
@@ -17,9 +21,18 @@ class ApplicationContextTest {
@MockitoBean
S3Client s3Client;
@Autowired
ApplicationContext ctx;
@Test
void contextLoads() {
// verifies that the Spring context starts successfully with all beans wired,
// Flyway migrations applied, and no configuration errors
}
@Test
void sentry_is_disabled_when_no_dsn_is_configured() {
// application-test.yaml has no sentry.dsn — SDK must stay inactive so tests are clean
assertThat(io.sentry.Sentry.isEnabled()).isFalse();
}
}

View File

@@ -1,11 +1,11 @@
package org.raddatz.familienarchiv.audit;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.raddatz.familienarchiv.PostgresContainerConfig;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.context.annotation.Import;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.transaction.support.TransactionTemplate;
@@ -18,7 +18,6 @@ import static org.awaitility.Awaitility.await;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
@DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD)
class AuditServiceIntegrationTest {
@MockitoBean S3Client s3Client;
@@ -26,6 +25,11 @@ class AuditServiceIntegrationTest {
@Autowired AuditLogRepository auditLogRepository;
@Autowired TransactionTemplate transactionTemplate;
@BeforeEach
void resetAuditLog() {
auditLogRepository.deleteAll();
}
@Test
void logAfterCommit_writes_ANNOTATION_CREATED_row_after_transaction_commits() {
transactionTemplate.execute(status -> {

View File

@@ -12,9 +12,9 @@ import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.context.annotation.Import;
import org.springframework.data.domain.PageRequest;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.transaction.annotation.Transactional;
import software.amazon.awssdk.services.s3.S3Client;
import java.time.LocalDate;
@@ -33,7 +33,7 @@ import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
@DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD)
@Transactional
class DocumentSearchPagedIntegrationTest {
private static final int FIXTURE_SIZE = 120;

View File

@@ -0,0 +1,33 @@
package org.raddatz.familienarchiv.exception;
import io.sentry.Sentry;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.MockedStatic;
import org.mockito.junit.jupiter.MockitoExtension;
import org.springframework.http.ResponseEntity;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.mockStatic;
@ExtendWith(MockitoExtension.class)
class GlobalExceptionHandlerTest {
@InjectMocks
private GlobalExceptionHandler handler;
@Test
void handleGeneric_captures_exception_in_sentry_and_returns_500() {
RuntimeException ex = new RuntimeException("unexpected failure");
try (MockedStatic<Sentry> sentryMock = mockStatic(Sentry.class)) {
ResponseEntity<GlobalExceptionHandler.ErrorResponse> response = handler.handleGeneric(ex);
sentryMock.verify(() -> Sentry.captureException(ex));
assertThat(response.getStatusCode().value()).isEqualTo(500);
assertThat(response.getBody()).isNotNull();
assertThat(response.getBody().code()).isEqualTo(ErrorCode.INTERNAL_ERROR);
}
}
}

View File

@@ -19,9 +19,9 @@ import org.springframework.context.annotation.Import;
import org.springframework.security.authentication.UsernamePasswordAuthenticationToken;
import org.springframework.security.core.authority.SimpleGrantedAuthority;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.transaction.annotation.Transactional;
import software.amazon.awssdk.services.s3.S3Client;
import java.util.List;
@@ -32,7 +32,7 @@ import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
@DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD)
@Transactional
class GeschichteServiceIntegrationTest {
@MockitoBean

View File

@@ -8,9 +8,9 @@ import org.raddatz.familienarchiv.person.PersonRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.context.annotation.Import;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import org.springframework.transaction.annotation.Transactional;
import software.amazon.awssdk.services.s3.S3Client;
import static org.assertj.core.api.Assertions.assertThat;
@@ -18,7 +18,7 @@ import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.NONE)
@ActiveProfiles("test")
@Import(PostgresContainerConfig.class)
@DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD)
@Transactional
class PersonServiceIntegrationTest {
@MockitoBean S3Client s3Client;

View File

@@ -142,10 +142,11 @@ services:
container_name: obs-grafana
restart: unless-stopped
ports:
- "127.0.0.1:${PORT_GRAFANA:-3001}:3000"
- "127.0.0.1:${PORT_GRAFANA:-3003}:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
GF_USERS_ALLOW_SIGN_UP: "false"
GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003}
volumes:
- grafana_data:/var/lib/grafana
- ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
@@ -184,7 +185,7 @@ services:
- obs-net
obs-glitchtip:
image: glitchtip/glitchtip:v4
image: glitchtip/glitchtip:6.1.6
container_name: obs-glitchtip
restart: unless-stopped
depends_on:
@@ -193,7 +194,7 @@ services:
obs-glitchtip-db-init:
condition: service_completed_successfully
environment:
DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@archive-db:5432/glitchtip
DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST:-archive-db}:5432/glitchtip
REDIS_URL: redis://obs-redis:6379/0
SECRET_KEY: ${GLITCHTIP_SECRET_KEY}
GLITCHTIP_DOMAIN: ${GLITCHTIP_DOMAIN:-http://localhost:3002}
@@ -201,13 +202,19 @@ services:
EMAIL_URL: smtp://mailpit:1025
GLITCHTIP_MAX_EVENT_LIFE_DAYS: 90
ports:
- "127.0.0.1:${PORT_GLITCHTIP:-3002}:8080"
- "127.0.0.1:${PORT_GLITCHTIP:-3002}:8000"
healthcheck:
test: ["CMD", "bash", "-c", "echo > /dev/tcp/localhost/8000"]
interval: 30s
timeout: 10s
retries: 5
start_period: 60s
networks:
- archiv-net
- obs-net
obs-glitchtip-worker:
image: glitchtip/glitchtip:v4
image: glitchtip/glitchtip:6.1.6
container_name: obs-glitchtip-worker
restart: unless-stopped
command: ./bin/run-celery-with-beat.sh
@@ -215,7 +222,7 @@ services:
obs-redis:
condition: service_healthy
environment:
DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@archive-db:5432/glitchtip
DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST:-archive-db}:5432/glitchtip
REDIS_URL: redis://obs-redis:6379/0
SECRET_KEY: ${GLITCHTIP_SECRET_KEY}
networks:
@@ -229,10 +236,10 @@ services:
environment:
PGPASSWORD: ${POSTGRES_PASSWORD}
command: >
sh -c "psql -h archive-db -U ${POSTGRES_USER} -tc
sh -c "psql -h ${POSTGRES_HOST:-archive-db} -U ${POSTGRES_USER} -tc
\"SELECT 1 FROM pg_database WHERE datname = 'glitchtip'\" |
grep -q 1 ||
psql -h archive-db -U ${POSTGRES_USER} -c \"CREATE DATABASE glitchtip;\""
psql -h ${POSTGRES_HOST:-archive-db} -U ${POSTGRES_USER} -c \"CREATE DATABASE glitchtip;\""
networks:
- archiv-net

View File

@@ -39,6 +39,7 @@
networks:
archiv-net:
driver: bridge
name: ${COMPOSE_NETWORK_NAME:-archiv-net}
volumes:
postgres-data:
@@ -212,10 +213,11 @@ services:
APP_MAIL_FROM: ${APP_MAIL_FROM:-noreply@raddatz.cloud}
SPRING_MAIL_PROPERTIES_MAIL_SMTP_AUTH: ${MAIL_SMTP_AUTH:-true}
SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-true}
OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317
networks:
- archiv-net
healthcheck:
test: ["CMD-SHELL", "wget -qO- http://localhost:8080/actuator/health | grep -q UP || exit 1"]
test: ["CMD-SHELL", "wget -qO- http://localhost:8081/actuator/health | grep -q UP || exit 1"]
interval: 15s
timeout: 5s
retries: 10

View File

@@ -147,6 +147,8 @@ services:
SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-false}
APP_OCR_BASE_URL: http://ocr-service:8000
APP_OCR_TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"
SENTRY_DSN: ${SENTRY_DSN:-}
SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-1.0}
# Observability: send traces to Tempo inside archiv-net (OTLP gRPC port 4317)
# Tempo is defined in docker-compose.observability.yml (future issue).
# OTLP failures are non-fatal — backend starts cleanly without the observability stack.

View File

@@ -43,7 +43,7 @@ graph TD
- SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
- The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
- Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
- An optional observability stack (Prometheus, Node Exporter, cAdvisor) runs as a separate compose file: `docker compose -f docker-compose.observability.yml up -d`. It joins `archiv-net` and scrapes the backend's management port (`:8081`). Configuration lives under `infra/observability/`.
- An optional observability stack (Prometheus, Node Exporter, cAdvisor, Loki, Tempo, Grafana, GlitchTip) runs as a separate compose file. Configuration lives under `infra/observability/`. In production and CI, the stack is managed from `/opt/familienarchiv/` (CI copies it there on every nightly run) so bind mounts survive workspace wipes — see §4 for the ops procedure.
### OCR memory requirements
@@ -142,7 +142,8 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
| Variable | Purpose | Default | Required? | Sensitive? |
|---|---|---|---|---|
| `PORT_PROMETHEUS` | Host port for the Prometheus UI (bound to `127.0.0.1` only) | `9090` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3001` | — | — |
| `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3003` | — | — |
| `POSTGRES_HOST` | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and `archive-db` is not resolvable by that name. | `archive-db` | — | — |
| `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
| `PORT_GLITCHTIP` | Host port for the GlitchTip UI (bound to `127.0.0.1` only) | `3002` | — | — |
| `GLITCHTIP_DOMAIN` | Public-facing base URL for GlitchTip (used in email links and CORS) | `http://localhost:3002` | YES (prod) | — |
@@ -193,6 +194,29 @@ curl -fsSL https://tailscale.com/install.sh | sh && tailscale up
# files to disk during execution (cleaned up unconditionally on completion).
# A multi-tenant runner would need to switch to stdin-piped env files.
# (See https://docs.gitea.com/usage/actions/quickstart for the register step.)
# Runner workspace directory — required for DooD bind-mount resolution (ADR-015).
# act_runner stores job workspaces here so that docker compose bind mounts resolve
# to real host paths. The path must be identical on the host and inside job containers.
mkdir -p /srv/gitea-workspace
# Observability config permanent directory — the nightly CI job copies
# docker-compose.observability.yml and infra/observability/ here on every run.
# The obs stack is always started from this path, not from the workspace.
# See ADR-016 for why this directory is used instead of a server-pull approach.
mkdir -p /opt/familienarchiv/infra
# Both paths must also appear in the runner service volumes in ~/docker/gitea/compose.yaml:
# volumes:
# - /srv/gitea-workspace:/srv/gitea-workspace
# /opt/familienarchiv does NOT need to be in the runner container's volumes — job
# containers are spawned by the host daemon directly (DooD), so the host path is
# accessible to them as long as runner-config.yaml lists it in valid_volumes + options.
# See runner-config.yaml (workdir_parent + valid_volumes + options) and ADR-015/016.
# ⚠ IMPORTANT: after any change to runner-config.yaml (valid_volumes, options, workdir_parent),
# restart the Gitea Act runner for the new config to take effect:
# docker restart gitea-runner
# Until restarted, job containers are spawned with the old config and any new bind mounts
# (e.g. /opt/familienarchiv) will not be available inside job steps.
```
### 3.2 DNS records
@@ -223,6 +247,9 @@ git.raddatz.cloud A <server IP>
| `MAIL_PORT` | release.yml | typically `587` |
| `MAIL_USERNAME` | release.yml | SMTP user |
| `MAIL_PASSWORD` | release.yml | SMTP password |
| `GRAFANA_ADMIN_PASSWORD` | both | Grafana `admin` login — generate a strong password |
| `GLITCHTIP_SECRET_KEY` | both | Django secret key — `openssl rand -hex 32` |
| `SENTRY_DSN` | both | GlitchTip project DSN — set after first-run (§4); leave empty to keep Sentry disabled |
### 3.4 First deploy
@@ -272,13 +299,70 @@ docker compose logs --tail=200 <service>
### Observability stack
An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`. Start it after the main stack is up (which creates `archiv-net`):
An observability stack is available via `docker-compose.observability.yml`. Configuration lives under `infra/observability/`.
#### Dev — start from the workspace
```bash
docker compose up -d # creates archiv-net
docker compose -f docker-compose.observability.yml up -d
```
#### Why the obs stack is managed differently from the main app stack
The main app stack (`docker-compose.prod.yml`) has no config-file bind mounts — its containers read config from env vars and image defaults. The workspace is wiped after each CI run but that does not affect running containers, because they hold no references to workspace paths.
The obs stack is different: `prometheus.yml`, `tempo.yml`, Loki config, Grafana provisioning files, and Promtail config are all bind-mounted from the host filesystem into their containers. If those source paths disappear (workspace wipe), the containers can restart fine until a `docker compose up` is run again — at that point Docker tries to re-resolve the bind-mount source and fails because the workspace path no longer exists.
The fix is to keep the obs compose file and config tree at a **permanent path** that CI copies to on every run but which survives between runs: `/opt/familienarchiv/` (see ADR-016).
#### Production — managed from `/opt/familienarchiv/`
Every CI run (nightly + release) copies `docker-compose.observability.yml` and `infra/observability/` to `/opt/familienarchiv/` before starting the stack. Bind mounts then resolve to `/opt/familienarchiv/infra/observability/…` — a stable path that outlasts any workspace wipe.
**Environment variables** follow the same two-source model as the main stack:
| Source | What it contains | Managed by |
|---|---|---|
| `infra/observability/obs.env` | All non-secret config (ports, URLs, hostnames) | Git — reviewed in PRs |
| `/opt/familienarchiv/obs-secrets.env` | Passwords and secret keys only | CI — written fresh from Gitea secrets on every deploy |
Both files are passed explicitly via `--env-file` to the compose command, so there is no implicit auto-read `.env` and no operator-managed file to keep in sync.
**Non-secret config** (`infra/observability/obs.env`):
| Key | Value | Notes |
|---|---|---|
| `PORT_GRAFANA` | `3003` | Avoids collision with staging frontend on port 3001 |
| `PORT_GLITCHTIP` | `3002` | |
| `PORT_PROMETHEUS` | `9090` | |
| `GF_SERVER_ROOT_URL` | `https://grafana.archiv.raddatz.cloud` | Required for alert email links and OAuth redirects |
| `GLITCHTIP_DOMAIN` | `https://glitchtip.archiv.raddatz.cloud` | Must match the Caddy vhost |
| `POSTGRES_HOST` | `archive-db` | Override if only the staging stack is running |
**Secret keys** (set in Gitea secrets, injected by CI into `obs-secrets.env`):
| Gitea secret | Notes |
|---|---|
| `GRAFANA_ADMIN_PASSWORD` | Strong unique password; shared by nightly and release |
| `GLITCHTIP_SECRET_KEY` | `openssl rand -hex 32`; shared by nightly and release |
| `STAGING_POSTGRES_PASSWORD` / `PROD_POSTGRES_PASSWORD` | Must match the running PostgreSQL container |
To start or restart the obs stack manually on the server (after CI has run at least once):
```bash
docker compose \
-f /opt/familienarchiv/docker-compose.observability.yml \
--env-file /opt/familienarchiv/infra/observability/obs.env \
--env-file /opt/familienarchiv/obs-secrets.env \
up -d --wait --remove-orphans
```
> **Note (manual ops only):** CI clears the destination with `rm -rf` before copying, so deleted files are removed automatically on the next run. If you copy manually with `cp -r` without first removing the directory, stale files from deleted configs will persist until cleaned up:
> ```bash
> rm /opt/familienarchiv/infra/observability/<path-to-removed-file>
> ```
Current services:
| Service | Image | Purpose |
@@ -287,7 +371,7 @@ Current services:
| `obs-node-exporter` | `prom/node-exporter:v1.9.0` | Host-level CPU / memory / disk / network metrics |
| `obs-cadvisor` | `gcr.io/cadvisor/cadvisor:v0.52.1` | Per-container resource metrics |
| `obs-loki` | `grafana/loki:3.4.2` | Log aggregation — receives log streams from Promtail. Port 3100 is `expose`-only (not host-bound). |
| `obs-promtail` | `grafana/promtail:3.4.2` | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with `container_name`, `compose_service`, and `compose_project` labels |
| `obs-promtail` | `grafana/promtail:3.4.2` | Log shipping agent — reads all Docker container logs via the Docker socket and forwards them to Loki with `container_name`, `compose_service`, `compose_project`, and `job` labels. The `job` label is mapped from the Docker Compose service name (`com.docker.compose.service`) so that Grafana Loki dashboard queries (`{job="backend"}`, `{job="frontend"}`) work out of the box and the "App" variable dropdown is populated. |
| `obs-tempo` | `grafana/tempo:2.7.2` | Distributed trace storage — OTLP gRPC receiver on port 4317, OTLP HTTP on port 4318 (both `archiv-net`-internal). Grafana queries traces on port 3200 (`obs-net`-internal). All ports are `expose`-only (not host-bound). |
| `obs-grafana` | `grafana/grafana-oss:11.6.1` | Unified observability UI — metrics dashboards, log exploration, trace viewer. Bound to `127.0.0.1:${PORT_GRAFANA:-3001}` on the host. |
| `obs-glitchtip` | `glitchtip/glitchtip:v4` | Sentry-compatible error tracker. Receives frontend + backend error events, groups by fingerprint, provides issue UI with stack traces. Bound to `127.0.0.1:${PORT_GLITCHTIP:-3002}`. |
@@ -299,7 +383,7 @@ Current services:
| Item | Value |
|---|---|
| URL | `http://localhost:3001` (or `http://localhost:$PORT_GRAFANA`) |
| URL | `http://localhost:3003` (or `http://localhost:$PORT_GRAFANA`) |
| Username | `admin` |
| Password | `$GRAFANA_ADMIN_PASSWORD` (default: `changeme`**change before exposing to a network**) |
@@ -329,7 +413,7 @@ docker exec obs-loki wget -qO- \
**Prefer `compose_service` over `container_name` in LogQL queries**`container_name` differs between dev (`archive-backend`) and prod (`archiv-production-backend-1`), while `compose_service` is stable (`backend`, `db`, `minio`, etc.).
Prometheus port `9090` and Grafana port `3001` are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
Prometheus port `9090` and Grafana port `3003` (default; configurable via `PORT_GRAFANA`) are bound to `127.0.0.1` on the host. No other observability ports are host-bound.
#### GlitchTip

View File

@@ -0,0 +1,69 @@
# ADR-015: DooD workspace bind mount for Compose file bind-mount resolution
## Status
Accepted
## Context
The deploy workflows (`.gitea/workflows/nightly.yml`, `release.yml`) run job steps inside Docker containers via Docker-out-of-Docker (DooD): the Gitea runner mounts the host Docker socket, and act_runner spawns sibling containers for each job.
When a job step calls `docker compose -f docker-compose.observability.yml up`, Docker Compose resolves relative bind-mount sources against `$(pwd)` inside the job container and passes the resulting absolute paths to the **host** daemon. For example, `./infra/observability/prometheus/prometheus.yml` becomes `/some/path/infra/observability/prometheus/prometheus.yml`, and the host daemon tries to bind-mount that path from the **host filesystem**.
In the default DooD setup (`runner-config.yaml` with only `valid_volumes: ["/var/run/docker.sock"]`), job container workspaces live in the act_runner overlay2 layer. The host has no corresponding directory at the job container's `$(pwd)` path, so the daemon auto-creates an empty directory in its place. The container then fails to start because the mount target was expected to be a file, not a directory:
```
error mounting "…/prometheus/prometheus.yml" to rootfs at "/etc/prometheus/prometheus.yml": not a directory
```
This affected all five config file bind mounts in `docker-compose.observability.yml`.
## Decision
Configure act_runner to store job workspaces on a real host path (`/srv/gitea-workspace`) and mount that path into both the runner container and every job container at the **same absolute path**. The identity of the host path and container path is the key constraint: Compose resolves to an absolute path and hands it to the host daemon, which looks for that exact path on the host filesystem.
**runner-config.yaml changes:**
```yaml
container:
workdir_parent: /srv/gitea-workspace
valid_volumes:
- "/var/run/docker.sock"
- "/srv/gitea-workspace"
options: "-v /srv/gitea-workspace:/srv/gitea-workspace"
```
**Runner compose.yaml change** (host side — not in this repo):
```yaml
runner:
volumes:
- /srv/gitea-workspace:/srv/gitea-workspace
```
With this in place, `$(pwd)` inside a job container resolves to `/srv/gitea-workspace/<owner>/<repo>/`, which is a real directory on the host. Compose-managed bind mounts from that directory work without any additional steps.
## Alternatives Considered
| Alternative | Why rejected |
|---|---|
| **overlay2 `MergedDir` sync via privileged nsenter** (the previous approach, see PR #599 v1) | Required `--privileged --pid=host` (effective root on the host) plus fragile overlay2 driver assumption. Introduced stale-file risk on the host and a second stable path (`/srv/familienarchiv-*/obs-configs`) to maintain separately from the source tree. Replaced by this ADR. |
| **Build configs into a dedicated Docker image** (pattern used for MinIO bootstrap, see `infra/minio/Dockerfile`) | Viable for static files that change infrequently. Requires a build step and an image rebuild every time a config changes. Appropriate for bootstrap scripts; too heavy for frequently-tuned observability configs. |
| **Add workspace directory to runner-config `valid_volumes` only** (without `workdir_parent`) | `valid_volumes` whitelists paths that workflow steps may reference, but does not change where act_runner stores workspaces. Without `workdir_parent`, the workspace would still be in overlay2 and the bind-mount resolution problem would remain. |
| **Map workspace under a different host path than container path** (e.g. host `/srv/workspace`, container `/workspace`) | Compose resolves to the container-internal path (e.g. `/workspace/…`) and passes that to the host daemon. The host daemon interprets the source as a host path. If host `/workspace` does not exist, the daemon creates an empty directory — the original bug. The paths must be identical. |
## Consequences
- `/srv/gitea-workspace` must exist on the VPS before the runner starts. The directory was created as part of this change; it is not created automatically.
- The runner container's `compose.yaml` (maintained outside this repo at `~/docker/gitea/compose.yaml` on the VPS) must include the `- /srv/gitea-workspace:/srv/gitea-workspace` volume line. This is an out-of-band operational dependency; the prerequisite is documented in `runner-config.yaml`.
- `workdir_parent` applies to all jobs on this runner. Any future workflow that calls `docker compose` with relative bind mounts benefits automatically without further configuration.
- Job workspaces persist across runs under `/srv/gitea-workspace`. act_runner manages per-run subdirectory cleanup. Orphaned directories from interrupted runs should be cleaned up manually if disk space becomes a concern.
- Workflows that previously relied on `OBS_CONFIG_DIR` env var or the `obs-configs` stable path on the host no longer need those. Both were removed in this PR.
- This pattern does **not** apply to the `nsenter`-based Caddy reload step (ADR-012), which manages a host systemd service — a different problem class with no bind-mount equivalent.
## References
- ADR-011 — single-tenant runner trust model
- ADR-012 — nsenter via privileged container for host service management
- Issue #598 — original observability stack bind-mount failure
- `runner-config.yaml``workdir_parent`, `valid_volumes`, `options`

View File

@@ -0,0 +1,57 @@
# ADR-016: Observability stack co-location at `/opt/familienarchiv/` with CI-push config sync
## Status
Accepted
## Context
Issue #601 established that the observability stack must survive Gitea CI workspace wipes between nightly runs. When the nightly job completes, act_runner deletes the job workspace. Any Docker container that bind-mounts a config file from a workspace path (`/srv/gitea-workspace/…/infra/observability/prometheus/prometheus.yml`) then references a path that no longer exists on the host. On the next nightly run, Docker Compose either auto-creates an empty directory in its place (causing the container to fail to start because a file mount receives a directory) or finds a stale file from a previous run if the workspace happened to land at the same path.
ADR-015 solved the workspace bind-mount resolution problem: job workspaces are stored at `/srv/gitea-workspace` so `$(pwd)` inside the job container maps to a real host path. But it did not address persistence: the workspace is still wiped after the job, so bind mounts from workspace-relative paths remain fragile across runs.
### Decision drivers
1. Bind-mount sources must point to a host path that persists indefinitely, not to a path that disappears after each CI run.
2. Config files must reflect the committed state of the repo after every nightly run (no manual sync steps).
3. Secrets must not be written to the workspace or to any path managed by CI; they must survive independently of deployments.
4. The solution must not introduce new infrastructure dependencies (no SSH access from CI, no external registry, no additional server-side daemon).
### Alternatives considered
**A: Server-pull model** — a systemd timer or cron job on the server does `git pull` from the repo into `/opt/familienarchiv/` and then runs `docker compose up`. Rejected because: (1) requires git credentials on the server and a registered deploy key, (2) adds a second deployment mechanism that diverges from the CI-push model used for the main app stack, (3) timing coupling — the server pull must complete before CI's health checks run, requiring polling or a webhook.
**B: Separate directory (e.g. `/opt/obs/`)** — keeps obs configs isolated from the app stack. Rejected because: (1) the main app compose files are already in `/opt/familienarchiv/` (managed the same way), and (2) GlitchTip shares the `archive-db` PostgreSQL instance and `archiv-net` Docker network — it is architecturally part of the same deployment unit, not a separate one. Co-location reflects the actual coupling.
**C: Named Docker configs (Swarm)** — Docker Swarm supports first-class config objects that persist in the cluster. Rejected because the project does not use Swarm and introducing it solely for config persistence is a disproportionate dependency.
## Decision
The observability stack is co-located with the main application deployment at `/opt/familienarchiv/`:
- `docker-compose.observability.yml``/opt/familienarchiv/docker-compose.observability.yml`
- `infra/observability/``/opt/familienarchiv/infra/observability/`
Both the nightly CI job (`nightly.yml`) and the release job (`release.yml`) copy these files from the workspace checkout to `/opt/familienarchiv/` using `cp -r` on every run (CI-push model). Containers always read config from the permanent location; a workspace wipe has no effect on running containers.
Environment variables follow a two-source model:
- `infra/observability/obs.env` (git-tracked, non-secret): all non-sensitive config — host ports, public URLs (`GLITCHTIP_DOMAIN`, `GF_SERVER_ROOT_URL`), and the default `POSTGRES_HOST`. Changes go through PR review. No credentials.
- `/opt/familienarchiv/obs-secrets.env` (CI-written, per-deploy): passwords and secret keys only (`GRAFANA_ADMIN_PASSWORD`, `GLITCHTIP_SECRET_KEY`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`), injected fresh from Gitea secrets on every nightly and release deploy. Gitea is the single source of truth for secrets — rotating a secret takes effect on the next deploy without manual server action.
Both files are passed explicitly via `--env-file` to every obs compose command (config dry-run and `up`). There is no implicit auto-read `.env`. The required key inventory is documented in `docs/DEPLOYMENT.md §4`.
The CI runner mounts `/opt/familienarchiv` as a bind mount into job containers (see `runner-config.yaml`). This requires a one-time `mkdir -p /opt/familienarchiv/infra` on the server and a runner restart after updating `runner-config.yaml` (see ADR-015 and `docs/DEPLOYMENT.md §3.1`).
## Consequences
**Positive:**
- Bind-mount sources survive workspace wipes by definition — they are on a persistent host path.
- Config is always in sync with the repo after each nightly run.
- No new infrastructure dependencies; the CI-push model mirrors how the main app stack is deployed.
- Secret rotation requires no manual server action — Gitea secrets are the authoritative store; `obs-secrets.env` is rewritten from scratch on every deploy so a secret change takes effect on the next nightly or release run.
**Negative:**
- `cp -r` does not remove deleted files; a config file removed from the repo persists in `/opt/familienarchiv/infra/observability/` until manually deleted. Acceptable for this project's change frequency. A `rsync -a --delete` would give a clean mirror if this becomes a problem.
- Mounting `/opt/familienarchiv/` into CI job containers expands the blast radius of a compromised workflow step — a malicious step could overwrite app compose files and Caddy config. Acceptable because the runner is single-tenant (trusted code only). See `runner-config.yaml` security comment.
- Runner must be restarted (`systemctl restart gitea-runner`) after any change to `runner-config.yaml` for the new mount to take effect.

View File

@@ -0,0 +1,48 @@
# ADR-017: Spring Boot 4.0 management port shares the main security filter chain
## Status
Accepted
## Context
The Familienarchiv backend runs Spring Boot Actuator on a dedicated management port (8081) so that Caddy never proxies `/actuator/*` requests and Prometheus can reach the scrape endpoint directly inside `archiv-net`.
In earlier Spring Boot versions (< 4.0), the management server ran in an isolated child application context whose security was governed independently by `ManagementWebSecurityAutoConfiguration`. The main app's `SecurityConfig` filter chain (port 8080) never intercepted requests arriving on port 8081.
In Spring Boot 4.0 with Jetty, this isolation was removed. The management server now traverses the **same** Spring Security `FilterChainProxy` as the main application. Concretely:
- Any `SecurityFilterChain` bean in the application context is evaluated for requests arriving on the management port.
- There is no longer a separate "management security" child context.
This was discovered when Prometheus began receiving HTTP 401 responses from `/actuator/prometheus` despite the endpoint being exposed and the `micrometer-registry-prometheus` dependency being present. Prometheus rejected these responses with `received unsupported Content-Type "text/html"` because the main filter chain's form-login `DelegatingAuthenticationEntryPoint` was redirecting unauthenticated requests to `/login` (302 → HTML).
A secondary issue: Spring Boot 4.0 no longer auto-enables Prometheus metrics export — `management.prometheus.metrics.export.enabled` must be set explicitly, and the Prometheus scrape endpoint requires `spring-boot-starter-micrometer-metrics` (a new starter that was split out in Spring Boot 4.0).
## Decision
1. **Dedicated management `SecurityFilterChain`** scoped to `/actuator/**` at `@Order(1)` (highest precedence). This chain:
- `permitAll()` for `/actuator/health` and `/actuator/prometheus` — required for Docker health checks and unauthenticated Prometheus scraping.
- `authenticated()` for all other actuator endpoints — blocks `/actuator/metrics`, `/actuator/info`, etc. without credentials.
- Uses an explicit `401` entry point (not form-login redirect) so that API clients — including Prometheus — receive a machine-readable status code rather than an HTML redirect.
- No CSRF, no form login.
2. **Belt-and-suspenders `permitAll()` in the main `SecurityFilterChain`** for `/actuator/health` and `/actuator/prometheus`, in case a future configuration change causes these paths to escape the management chain's `securityMatcher`.
3. **Network isolation as the outer defense boundary.** Port 8081 is not published in `docker-compose.yml` and is not routed through Caddy. Only services inside `archiv-net` (primarily Prometheus and the Docker health checker) can reach the management port.
## Alternatives rejected
- **Exclude `ManagementWebSecurityAutoConfiguration`:** This auto-configuration no longer exists in Spring Boot 4.0. Exclusion is not applicable.
- **Keep `SecurityConfig` as the sole filter chain without `@Order(1)` management chain:** The main chain's form-login `DelegatingAuthenticationEntryPoint` redirects browser-like clients to `/login` (302). Prometheus and automated health check clients cannot follow this redirect, so the endpoint would be unreachable without a dedicated chain that returns plain 401 or 200.
- **Per-endpoint `@Order(1)` filter chain using `EndpointRequest.toAnyEndpoint()`:** The `spring-boot-security` artifact that provides `EndpointRequest` is not a transitive dependency of `spring-boot-starter-actuator` in Spring Boot 4.0. Using a path-based `securityMatcher("/actuator/**")` achieves the same scoping without an extra dependency.
## Consequences
- All actuator endpoints on port 8081 that are not explicitly `permitAll()`-ed require HTTP Basic credentials. Without valid credentials, the response is 401 (not a redirect).
- Adding a new actuator endpoint to `management.endpoints.web.exposure.include` implicitly protects it via `anyRequest().authenticated()` in the management chain — no additional `permitAll()` needed unless intentional.
- A regression test (`ActuatorPrometheusIT`) verifies:
- `/actuator/prometheus` returns 200 without credentials.
- `/actuator/metrics` returns 401 without credentials.
- Prometheus metric names are present in the response body.
- If port 8081 is ever accidentally published in `docker-compose.yml`, actuator endpoints other than health and prometheus are still protected by HTTP Basic. This reduces (but does not eliminate) the risk of inadvertent exposure.

View File

@@ -17,7 +17,7 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
Container(mc, "Bucket / Service-Account Init", "MinIO Client (mc)", "One-shot container on startup. Idempotent: creates the archive bucket, the archiv-app service account, and attaches the readwrite policy.")
}
System_Boundary(observability, "Observability Stack (docker-compose.observability.yml)") {
System_Boundary(observability, "Observability Stack (/opt/familienarchiv/docker-compose.observability.yml)") {
Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor. Retention: 30 days.")
Container(node_exporter, "Node Exporter", "prom/node-exporter:v1.9.0", "Host-level CPU, memory, disk, and network metrics.")
Container(cadvisor, "cAdvisor", "gcr.io/cadvisor/cadvisor:v0.52.1", "Per-container resource metrics.")

View File

@@ -19,6 +19,39 @@ Both containers live in the `gitea_gitea` Docker network on the VPS. The runner
The `gitea-runner` container mounts the host Docker socket (`/var/run/docker.sock`). When a workflow job runs, act_runner spawns a **sibling container** for each job. That job container also gets the Docker socket mounted (via `valid_volumes` in `runner-config.yaml`), enabling `docker compose` calls in workflow steps.
### Workspace bind-mount setup (DooD path resolution)
When a workflow step calls `docker compose up` with relative bind-mount sources (e.g. `./infra/observability/prometheus/prometheus.yml`), Compose resolves them against `$(pwd)` inside the job container and passes the resulting **absolute path** to the host Docker daemon. The host daemon then tries to bind-mount that path from the **host filesystem**.
In the default DooD setup the job container's workspace lives in the act_runner overlay2 layer — the host has no directory at that path, auto-creates an empty one, and the container fails with:
```
error mounting "…/prometheus/prometheus.yml" to rootfs at "/etc/prometheus/prometheus.yml": not a directory
```
**Solution (ADR-015):** store job workspaces on a real host path and mount it at the **same absolute path** inside the runner and every job container. `runner-config.yaml` configures this via `workdir_parent`, `valid_volumes`, and `options`.
**One-time host setup** (required on any fresh VPS):
```bash
mkdir -p /srv/gitea-workspace
# Then add to the runner service in ~/docker/gitea/compose.yaml:
# volumes:
# - /srv/gitea-workspace:/srv/gitea-workspace
# Restart the runner container for the change to take effect.
```
The path `/srv/gitea-workspace` is the canonical workspace root. It must be identical on the host and inside job containers — if the paths differ, Compose still resolves to the container-internal path, which the host daemon cannot find (the original bug).
**Disk management:** act_runner cleans per-run subdirectories on completion. Orphaned directories from interrupted runs accumulate under `/srv/gitea-workspace` and should be pruned manually if disk space becomes a concern:
```bash
# List workspace directories older than 7 days
find /srv/gitea-workspace -mindepth 3 -maxdepth 3 -type d -mtime +7
```
---
### Running host-level commands from CI (nsenter pattern)
Job containers are unprivileged and do not share the host's PID/mount/network namespaces. Commands like `systemctl` that target the host daemon are therefore unavailable by default. When a workflow step needs to manage a host service (e.g. `systemctl reload caddy`), it uses the Docker socket to spin up a **privileged sibling container** in the host PID namespace:
@@ -108,6 +141,33 @@ nsenter: failed to execute /bin/systemctl: No such file or directory
The first error means the Docker socket is not mounted into the job container — check `valid_volumes` in `/root/docker/gitea/runner-config.yaml` on the VPS. The second means the Alpine image is running but cannot enter the host mount namespace; verify `--privileged` and `--pid=host` are both present in the workflow step.
**Failure mode 4 — workspace bind-mount not configured (observability stack or any compose-with-file-mounts job)**
Symptom in CI log:
```
Error response from daemon: error while creating mount source path "…/prometheus/prometheus.yml": mkdir …: not a directory
```
Or the service starts but immediately crashes because a config file was mounted as an empty directory.
Cause: `/srv/gitea-workspace` does not exist on the host, or the runner container's `compose.yaml` is missing the `- /srv/gitea-workspace:/srv/gitea-workspace` volume line.
Diagnosis:
```bash
ssh root@<vps>
ls -la /srv/gitea-workspace # must exist and be a directory
docker inspect gitea-runner | grep -A5 Mounts # must show /srv/gitea-workspace
```
Recovery:
```bash
mkdir -p /srv/gitea-workspace
# Add volume line to runner compose.yaml, then:
docker compose -f ~/docker/gitea/compose.yaml up -d gitea-runner
```
See `docs/DEPLOYMENT.md §3.1` and ADR-015 for the full setup rationale.
---
## Gitea vs GitHub Actions Differences

View File

@@ -45,7 +45,6 @@ export default defineConfig(
files: ['**/*.svelte', '**/*.svelte.ts', '**/*.svelte.js'],
languageOptions: {
parserOptions: {
projectService: true,
extraFileExtensions: ['.svelte'],
parser: ts.parser,
svelteConfig

View File

@@ -88,3 +88,13 @@ git.raddatz.cloud {
import security_headers
reverse_proxy 127.0.0.1:3005
}
grafana.archiv.raddatz.cloud {
import security_headers
reverse_proxy 127.0.0.1:3003
}
glitchtip.archiv.raddatz.cloud {
import security_headers
reverse_proxy 127.0.0.1:3002
}

View File

@@ -0,0 +1,24 @@
# Non-secret observability stack configuration — tracked in git.
# Secret values (passwords, keys) are injected by CI from Gitea secrets
# into /opt/familienarchiv/obs-secrets.env at deploy time.
#
# For local dev the main .env file supplies these values instead;
# this file is only used in the CI/production path.
# Host ports (all bound to 127.0.0.1 — Caddy is the external entry point)
PORT_GRAFANA=3003
PORT_GLITCHTIP=3002
PORT_PROMETHEUS=9090
# Public URLs — used for internal redirects, alert email links, OAuth callbacks
GF_SERVER_ROOT_URL=https://grafana.archiv.raddatz.cloud
GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud
POSTGRES_USER=archiv
# PostgreSQL hostname for GlitchTip db-init and workers.
# The actual value depends on the Compose project name — it is not a fixed string.
# CI sets POSTGRES_HOST in obs-secrets.env per environment:
# staging: archiv-staging-db-1 (project archiv-staging + service db)
# production: archiv-production-db-1 (project archiv-production + service db)
# For local dev, set POSTGRES_HOST in your .env file (defaults to archive-db there).

View File

@@ -15,8 +15,6 @@ scrape_configs:
metrics_path: /actuator/prometheus
static_configs:
# Uses the Docker service name (not container_name) for reliable DNS resolution.
# Target will show as DOWN until backend instrumentation issue adds
# micrometer-registry-prometheus and exposes the endpoint — this is expected.
- targets: ['backend:8081']
- job_name: ocr-service

View File

@@ -28,3 +28,5 @@ scrape_configs:
target_label: 'compose_project'
- source_labels: ['__meta_docker_container_log_stream']
target_label: 'logstream'
- source_labels: ['__meta_docker_container_label_com_docker_compose_service']
target_label: 'job'

View File

@@ -36,9 +36,6 @@ metrics_generator:
source: tempo
storage:
path: /var/tempo/generator/wal
processors:
- service-graphs
- span-metrics
# Tempo HTTP API (port 3200) is unauthenticated. Access is controlled entirely
# by network isolation: only Grafana (on obs-net) should reach this port.

View File

@@ -1,16 +1,26 @@
# runner-config.yaml — only the relevant section
container:
# passed as DOCKER_HOST inside the job container
# join the same network Gitea is on, so job containers can resolve 'gitea'
# for actions/checkout and other internal API calls.
network: gitea_gitea
# passed as DOCKER_HOST inside the job container; act_runner auto-mounts
# this socket path into the job, so no explicit -v option is needed.
docker_host: "unix:///var/run/docker.sock"
# whitelists the socket path so workflows can mount it
# Job workspaces are stored here and mounted at the same absolute path
# inside job containers. Identical host <-> container path is the requirement:
# Compose resolves relative bind mounts to $(pwd) inside the job container
# and passes that absolute path to the host daemon, which must find the file
# at that exact host path. Prerequisite: /srv/gitea-workspace exists on the
# host and is bind-mounted in the runner container (see compose.yaml).
workdir_parent: /srv/gitea-workspace
# whitelists volumes that workflow steps may bind-mount
valid_volumes:
- "/var/run/docker.sock"
# appended to `docker run` when the runner spawns a job container
# SECURITY: Mounting the Docker socket grants job containers root-equivalent
# access to the host Docker daemon. Acceptable here because only trusted code
# from this private repo runs on this runner. Do NOT use on a runner that
# accepts untrusted PRs from external contributors.
options: "-v /var/run/docker.sock:/var/run/docker.sock"
# keep network mode default (bridge) — Testcontainers handles its own networking
- "/srv/gitea-workspace"
- "/opt/familienarchiv"
# mount the workspace and the permanent obs/config directory into job containers.
# /opt/familienarchiv is the stable path CI copies configs to (ADR-016); it must
# be mounted here so deploy steps can write through to the host filesystem.
options: "-v /srv/gitea-workspace:/srv/gitea-workspace -v /opt/familienarchiv:/opt/familienarchiv"
# keep behavior default — Testcontainers handles its own networking
force_pull: false