Commit Graph

15 Commits

Author SHA1 Message Date
Marcel
3ced565aa2 docs(observability): document GlitchTip services in DEPLOYMENT.md and C4 diagram
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 5m53s
CI / OCR Service Tests (pull_request) Successful in 32s
CI / Backend Unit Tests (pull_request) Failing after 23m39s
CI / fail2ban Regex (pull_request) Successful in 2m13s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m55s
Adds GlitchTip env vars to the observability env var table, extends the
services table, and adds a first-run section with superuser creation and
project setup steps. Updates the C4 L2 container diagram with GlitchTip
and Redis containers and their relationships.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 04:38:47 +02:00
Marcel
c99321e5cf docs(observability): document Grafana in DEPLOYMENT.md and C4 diagram
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m7s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 5m41s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
Add Grafana row to the observability services table, Grafana access details
(URL, credentials, auto-provisioned datasources, pre-loaded dashboards), and
GRAFANA_ADMIN_PASSWORD to the env vars table in DEPLOYMENT.md.
Update C4 l2-containers.puml: replace placeholder Grafana entry with pinned
image version, expand observability boundary with node_exporter and cadvisor
containers, and add Rel() edges for Grafana → Prometheus, Loki, and Tempo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 04:04:09 +02:00
Marcel
6ce6122384 docs: add OTEL and tracing env vars to DEPLOYMENT.md
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Failing after 2m33s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 54s
2026-05-15 03:29:38 +02:00
Marcel
de08ffe989 devops(observability): add Tempo for distributed trace storage (OTLP receiver)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 17s
CI / Backend Unit Tests (pull_request) Successful in 4m32s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 56s
Closes #575

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 03:01:22 +02:00
Marcel
22e1b25398 devops(observability): add Loki + Promtail for centralised container log aggregation
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m31s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
- Add obs-loki (grafana/loki:3.4.2) to docker-compose.observability.yml
  with healthcheck (wget /ready), expose-only port 3100, named volume loki_data
- Add obs-promtail (grafana/promtail:3.4.2) bridging archiv-net + obs-net,
  depends_on loki service_healthy, docker.sock:ro, promtail_positions volume
  for restart-safe position tracking
- Create infra/observability/loki/loki-config.yml: single-node TSDB schema v13,
  30-day retention, auth disabled (obs-net only), telemetry off
- Create infra/observability/promtail/promtail-config.yml: Docker SD scrape,
  container_name / compose_service / compose_project / logstream labels
- Update docs/DEPLOYMENT.md §4 with service table and Loki quick-check commands

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 02:18:22 +02:00
Marcel
0c66f6298b devops(observability): fix Prometheus port binding, scrape port, and update DEPLOYMENT.md
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m35s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
- Fix spring-boot scrape target from backend:8080 to backend:8081 (actuator/management port)
- Restrict Prometheus host port binding to 127.0.0.1 to prevent unintended external exposure
- Add observability stack (Prometheus, Node Exporter, cAdvisor) to topology description
- Add PORT_PROMETHEUS env var to DEPLOYMENT.md reference table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:52:28 +02:00
Marcel
cf8d22d81b docs: update DEPLOYMENT.md and C4 diagram for observability scaffold
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m31s
CI / OCR Service Tests (pull_request) Successful in 16s
CI / Backend Unit Tests (pull_request) Successful in 4m31s
CI / fail2ban Regex (pull_request) Successful in 38s
CI / Compose Bucket Idempotency (pull_request) Successful in 57s
Replace the stale "no monitoring infrastructure in place yet" note in
§4 with a brief description of the observability compose file and a
pointer to issue #581 for full docs.

Add a placeholder System_Boundary block for Prometheus + Loki + Grafana
to l2-containers.puml, showing the stack joins archiv-net.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:28:14 +02:00
Marcel
8536b2ebbd docs(deploy): note Caddyfile symlink is a CI dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 07:42:28 +02:00
Marcel
a40267e490 docs(deployment): document IMPORT_HOST_DIR and mass-import workflow
DEPLOYMENT.md line 81 declares any compose env var missing from §2 a
blocking review comment. IMPORT_HOST_DIR (added on this branch) was
unmentioned. Adds the row and rewrites §6.4 so the staging/prod operator
workflow (rsync host → set env → trigger import) is in the runbook,
not just buried in compose comments.

Addresses review feedback from Markus and Tobias on #526.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 20:05:14 +02:00
Marcel
a7a80f8c16 docs(deployment): route SSE through Caddy in topology mermaid
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m48s
CI / OCR Service Tests (push) Successful in 16s
CI / Unit & Component Tests (pull_request) Failing after 2m48s
CI / Backend Unit Tests (pull_request) Successful in 4m8s
CI / fail2ban Regex (pull_request) Successful in 37s
CI / Compose Bucket Idempotency (pull_request) Failing after 49s
CI / Backend Unit Tests (push) Successful in 4m7s
CI / fail2ban Regex (push) Successful in 36s
CI / Compose Bucket Idempotency (push) Failing after 1m15s
CI / OCR Service Tests (pull_request) Successful in 16s
The top-level deployment diagram lagged the C4 L2 diagram, which
correctly notes that SSE notifications are fronted by Caddy. The
mermaid showed Browser → Backend direct, which would only be true
if the backend port were exposed publicly (it is not — all docker
ports bind to 127.0.0.1).

Fixes the inconsistency Markus flagged on PR #499: the public
surface is Caddy and Caddy only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 13:18:11 +02:00
Marcel
ba5bd9cb11 docs(deployment): document fail2ban symlink, OCR_MEM_LIMIT, smoke test
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m53s
CI / OCR Service Tests (push) Failing after 1m58s
CI / Backend Unit Tests (push) Failing after 1m23s
CI / Unit & Component Tests (pull_request) Failing after 3m59s
CI / Backend Unit Tests (pull_request) Successful in 5m39s
CI / OCR Service Tests (pull_request) Successful in 1m14s
Updates DEPLOYMENT.md to match the infra changes in this PR:

§1 OCR memory — point operators at the new OCR_MEM_LIMIT env var instead
                of telling them to edit "the prod overlay".
§2 OCR env vars — add OCR_MEM_LIMIT to the table.
§3.1 server setup — replace fail2ban prose with concrete `ln -sf`
                    commands referencing the committed jail/filter.
                    Document the single-tenant runner assumption near
                    the runner-registration step.
§3.4 first deploy — describe the new automated smoke test step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 12:07:59 +02:00
Marcel
2eade2b78f docs(deployment): rewrite for Gitea Actions / Caddy / prod compose
Brings DEPLOYMENT.md in line with the production deployment landed
in #497:

- Topology diagram: frontend port 3000 (Node adapter), 127.0.0.1
  binding, project-name isolation between prod and staging
- Caddyfile now lives in-tree at infra/caddy/Caddyfile (symlinked
  onto the server)
- Dev vs prod table: documents the new deploy method (workflows +
  --wait) and the prod-compose specific differences
- Env vars: adds MINIO_APP_PASSWORD; notes that prod compose
  hardcodes the MinIO root user and the bucket name
- Bootstrap section: server hardening, fail2ban, Tailscale, the 16
  Gitea secrets, and the workflow_dispatch first-deploy step
- Admin password warning: first deploy locks the password, secret
  rotation after that point has no effect
- Rollback: TAG= override + docker compose up -d --wait

Refs #497.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 21:58:51 +02:00
Marcel
a3c17750cd fix(docs): correct DEPLOYMENT.md env var name and prod overlay note
Some checks failed
CI / Unit & Component Tests (push) Has been cancelled
CI / OCR Service Tests (push) Has been cancelled
CI / Backend Unit Tests (push) Has been cancelled
- Security checklist: OCR_TRAINING_TOKEN → APP_OCR_TRAINING_TOKEN (backend)
  plus TRAINING_TOKEN (OCR service); both must share the same value
- Bootstrap: clarify docker-compose.prod.yml is not committed — must be
  created from docs/infrastructure/production-compose.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:35:23 +02:00
Marcel
83db80b867 docs(legibility): fix two blockers in DEPLOYMENT.md
- Use correct container name archive-db (not familienarchiv-db-1) in
  §5 backup/restore commands — verified against docker-compose.yml
- Add KRAKEN_MODEL_PATH to OCR service env vars table (was missing;
  set at docker-compose.yml:92 as /app/models/german_kurrent.mlmodel)

Refs #399
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:35:23 +02:00
Marcel
a944563560 docs(legibility): write docs/DEPLOYMENT.md — Day-1 checklist and operational reference
Covers: topology diagram (Mermaid), OCR memory/VPS sizing table,
dev-vs-prod differences, complete env vars table (all vars verified
against docker-compose.yml and application.yaml, including APP_ADMIN_*
and ALLOWED_PDF_HOSTS gaps not in .env.example), security checklist
before first boot, bootstrap sequence, logs, backup current state vs
planned, common operational tasks, and known limitations with ADR links.

Closes #399
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:35:23 +02:00