devops: nightly backup pipeline — pg_dump + mc mirror over Tailscale to heim-nas #502

Open
opened 2026-05-11 13:20:00 +02:00 by marcel · 0 comments
Owner

Background

PR #499 lands the production deployment but explicitly defers the backup pipeline. The rollback procedure in docs/DEPLOYMENT.md §5 notes that manual backups are the only recovery option until this ships. Named volumes (postgres-data, minio-data) without a tested restore path are a single point of failure for the family archive.

This was flagged by Tobi (DevOps review on PR #499, comment #8352) and Elicit (Requirements, comment #8356, OQ-3).

The target topology is documented in ADR-010 (docs/adr/010-minio-self-hosted-not-hetzner-obs.md): backup is the long-term safety net that lets MinIO stay self-hosted without giving up durability.

Scope

Add a cron-driven backup pipeline running on the production VPS:

  1. Postgres dump: nightly pg_dump -Fc of the archiv database, written to a local backup directory with a date-stamped name (archiv-YYYYMMDD.dump).
  2. MinIO bucket mirror: mc mirror --remove from local myminio/familienarchiv to a destination MinIO instance on heim-nas reached over Tailscale (private subnet, no public internet hop).
  3. Off-site sync: rsync of the Postgres dump directory to the same heim-nas destination over Tailscale.
  4. Rotation: keep the last 14 daily dumps, plus weekly snapshots for the last 8 weeks.
  5. Healthcheck push: ping a deadmanssnitch.com (or self-hosted equivalent) endpoint on success, so that a missed check-in after a silent backup failure pages an operator.
  6. Restore drill runbook: documented in docs/DEPLOYMENT.md — operator can verify restore quarterly by spinning up a sandbox stack from the latest dump.
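
The rotation rule in item 4 can be sketched as a pure function over the dump file names. This is a minimal sketch, not the project's implementation: the function names are hypothetical, and it assumes that a "weekly snapshot" means the dump taken on a Sunday, which the issue does not specify.

```python
from datetime import date, datetime, timedelta

DAILY_KEEP = 14   # "last 14 daily dumps"
WEEKLY_KEEP = 8   # "weekly snapshots for the last 8 weeks"


def parse_day(name: str) -> date:
    """Extract the date from a name like archiv-20260511.dump."""
    stamp = name.removeprefix("archiv-").removesuffix(".dump")
    return datetime.strptime(stamp, "%Y%m%d").date()


def dumps_to_keep(names: list[str], today: date) -> set[str]:
    """Return the dump names that the rotation policy retains."""
    by_day = {parse_day(n): n for n in names}
    keep: set[str] = set()

    # Daily tier: every dump from the last 14 calendar days, today included.
    daily_cutoff = today - timedelta(days=DAILY_KEEP - 1)
    keep.update(n for d, n in by_day.items() if daily_cutoff <= d <= today)

    # Weekly tier (assumption): the Sunday dump from each of the last 8 weeks.
    weekly_cutoff = today - timedelta(weeks=WEEKLY_KEEP)
    keep.update(
        n for d, n in by_day.items()
        if d.weekday() == 6 and weekly_cutoff < d <= today
    )
    return keep
```

Everything in the backup directory that is not in the returned set would be a deletion candidate. Keeping the selection separate from the deletion makes the policy easy to test before it is allowed to remove anything.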

Acceptance criteria

  • Cron job runs nightly and completes in under 30 minutes at the current archive size (~13 GB)
  • Restore from a 30-day-old dump verified in a sandbox stack
  • Backup monitor pages on silent failure (timeout, partial sync, full disk)
  • DEPLOYMENT.md §5 rollback procedure updated to reference the backup as the primary recovery path (replaces "manual backup is the only option")
  • Tailscale ACL audited: backup destination on heim-nas is reachable only from the production VPS's Tailscale identity
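
The silent-failure criterion relies on the check-in being sent only when every step has succeeded, so that any failure shows up as a missed check-in. A minimal sketch of that control flow follows. The check-in URL, the heimnas mc alias, the backup directory, and the rsync target path are placeholders invented for illustration, and the per-step timeout is an assumption rather than a requirement from this issue.

```python
import subprocess
import urllib.request
from typing import Callable

# Hypothetical values, not taken from the issue: the check-in URL, the
# "heimnas" mc alias, the backup directory and the rsync target path.
SNITCH_URL = "https://example.invalid/backup-check-in"
BACKUP_DIR = "/var/backups/archiv"


def build_steps(stamp: str) -> list[list[str]]:
    """Commands for scope items 1 to 3, in the order the issue lists them."""
    return [
        ["pg_dump", "-Fc", "-f", f"{BACKUP_DIR}/archiv-{stamp}.dump", "archiv"],
        ["mc", "mirror", "--remove", "myminio/familienarchiv", "heimnas/familienarchiv"],
        ["rsync", "-a", f"{BACKUP_DIR}/", "heim-nas:/backups/postgres/"],
    ]


def check_in() -> None:
    """Tell the monitor that tonight's run finished."""
    urllib.request.urlopen(SNITCH_URL, timeout=10).close()


def run_pipeline(
    steps: list[list[str]],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    ping: Callable[[], None] = check_in,
    step_timeout_s: int = 1800,
) -> bool:
    """Run each step in order; check in only if all of them succeed."""
    for argv in steps:
        try:
            result = run(argv, timeout=step_timeout_s)
        except (subprocess.TimeoutExpired, OSError):
            return False  # hung step or missing binary: no check-in
        if result.returncode != 0:
            return False  # partial sync, full disk, etc.: no check-in
    ping()
    return True
```

A cron entry would call something like `run_pipeline(build_steps("20260511"))` with the current date stamp. Because the function never pings on a failed or timed-out step, the monitor's own deadline is what turns a failure into a page.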

Threat model considerations (Nora to review when work lands)

  • mc mirror --remove is destructive on the destination — if the source is compromised and emptied, the destination follows. Mitigation: keep --remove off and rotate destination snapshots independently, OR run a separate immutable snapshot job on heim-nas.
  • The Tailscale identity on the production VPS is the only credential that gates access to the backup destination, so if the VPS is compromised, the backups are too. Mitigation: a create-only destination policy (new dated files can be written, existing ones cannot be overwritten).
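
The create-only idea in the second mitigation can be illustrated with exclusive file creation. This is only a sketch of the semantics, under the assumption that some receiving process on heim-nas writes the files; the helper name is hypothetical, and in practice the rule would more likely be enforced by the destination's access policy than by application code.

```python
import tempfile
from pathlib import Path


def store_once(dest_dir: Path, name: str, data: bytes) -> Path:
    """Write a new backup file, refusing to touch an existing one."""
    target = dest_dir / name
    # Mode "x" creates the file and raises FileExistsError if it already exists.
    with open(target, "xb") as fh:
        fh.write(data)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp)
        store_once(dest, "archiv-20260511.dump", b"first")
        try:
            store_once(dest, "archiv-20260511.dump", b"tampered")
        except FileExistsError:
            print("overwrite refused")
```

With this rule, a compromised sender can still add files but cannot replace a dated dump that has already landed.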

References

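  • PR #499, DevOps review (Tobi, #8352): https://git.raddatz.cloud/marcel/familienarchiv/pulls/499#issuecomment-8352
  • PR #499, Requirements review (Elicit, #8356, OQ-3 / NFR-AVAIL-001): https://git.raddatz.cloud/marcel/familienarchiv/pulls/499#issuecomment-8356
  • ADR-010: MinIO self-hosted; backup is the trigger for revisiting Hetzner OBS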
marcel added the P1-high, devops, and phase-5: backups labels 2026-05-11 13:20:04 +02:00
Reference: marcel/familienarchiv#502