devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure #584

marcel · 2026-05-15T01:24:21+02:00

marcel commented

2026-05-15 01:24:21 +02:00

Summary

Closes #572

Creates docker-compose.observability.yml at project root with archiv-net as external network, obs-net bridge network, and five named volumes (prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data)
Creates six .gitkeep placeholder directories under infra/observability/ for all upcoming service configs
Adds # --- Observability --- section to .env.example with all required env vars

Test plan

docker compose -f docker-compose.observability.yml config exits 0
All six infra/observability/ subdirectories exist with .gitkeep
.env.example contains the new Observability section

🤖 Generated with Claude Code

## Summary Closes #572 - Creates `docker-compose.observability.yml` at project root with `archiv-net` as external network, `obs-net` bridge network, and five named volumes (`prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data`) - Creates six `.gitkeep` placeholder directories under `infra/observability/` for all upcoming service configs - Adds `# --- Observability ---` section to `.env.example` with all required env vars ## Test plan - [ ] `docker compose -f docker-compose.observability.yml config` exits 0 - [ ] All six `infra/observability/` subdirectories exist with `.gitkeep` - [ ] `.env.example` contains the new Observability section 🤖 Generated with [Claude Code](https://claude.com/claude-code)

marcel added 1 commit 2026-05-15 01:24:22 +02:00

devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure

CI / Unit & Component Tests (pull_request) Successful in 3m19s

Details

CI / OCR Service Tests (pull_request) Successful in 16s

Details

CI / Backend Unit Tests (pull_request) Successful in 4m28s

Details

CI / fail2ban Regex (pull_request) Successful in 39s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 55s

Details

1d42be9882

Creates the skeleton observability stack (no running services yet) that all
subsequent Grafana LGTM + GlitchTip issues depend on:

- docker-compose.observability.yml: external archiv-net join, obs-net bridge,
  named volumes for all five services, placeholder comments for each service
  group (Metrics/Logs/Traces/Dashboards/Error Tracking), startup-order note
- infra/observability/{prometheus,loki,promtail,tempo,grafana/provisioning/{datasources,dashboards}}/.gitkeep
- .env.example: new # --- Observability --- section with PORT_GRAFANA,
  PORT_GLITCHTIP, PORT_PROMETHEUS, GLITCHTIP_DOMAIN, GLITCHTIP_SECRET_KEY
  (with generation hint), SENTRY_DSN, VITE_SENTRY_DSN

Verified: docker compose -f docker-compose.observability.yml config exits 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 01:25:33 +02:00

🏗️ Markus Keller — Application Architect

Verdict: 🚫 Changes requested

Blockers

1. docs/architecture/c4/l2-containers.puml is not updated.

The architecture persona rules are explicit: "New Docker service or infrastructure component → docs/architecture/c4/l2-containers.puml + docs/DEPLOYMENT.md" and "A doc omission is a blocker, not a concern — the PR does not merge until the diagram or text matches the code."

docker-compose.observability.yml introduces a second Docker Compose stack with a new external network (archiv-net) dependency and five new named volumes. The l2-containers.puml C4 diagram shows only the main stack — the observability stack boundary, and its relationship to archiv-net, must be added. Even a placeholder SystemBoundary(obs, "Observability Stack (phase 2)") block with the planned services annotated as ' -- added in subsequent issues' would satisfy the constraint. The diagram must reflect the infrastructure that exists in the repository.

2. docs/DEPLOYMENT.md section §4 is stale.

The current §4 "Logs + observability" says: "Phase 7 of the Production v1 milestone adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet."

This is no longer accurate — the scaffold Compose file and infra/observability/ directories now exist. §4 must acknowledge the new file, describe how to start the optional stack (docker compose -f docker-compose.observability.yml up -d), and note that services are populated in subsequent issues. The ownership note at the top of DEPLOYMENT.md is unambiguous: "Update this file in any PR that changes the container topology."

Suggestions (not blockers)

3. ADR consideration for the two-stack topology.

Splitting the stack into docker-compose.yml + docker-compose.observability.yml is an architectural decision with lasting operational consequences: startup order dependency, separate down commands, and the archiv-net: external: true coupling. This is small enough that a note in the PR body or a comment block at the top of the compose file explaining why the split was chosen (vs. a profiles: approach in the main compose) would suffice — no full ADR required at the scaffold stage, but the reasoning should be recorded before the services are wired in.

4. obs-net bridge has no name: override.

Without an explicit name:, Docker Compose will prefix the network as <project>_obs-net, where <project> defaults to the directory name. If the directory is familienarchiv, the network becomes familienarchiv_obs-net. This is fine for now, but when services are added that need to reference the network by name from outside the Compose context (e.g. Prometheus scrape config), the auto-generated name is fragile. Worth adding name: obs-net alongside the driver: bridge declaration, matching the archiv-net external pattern.

What is done well

The file structure is clean: the comment block at the top explains the startup order, the # No services defined yet placeholder is honest, and the volume names match Tobias's semantic naming convention exactly. The .env.example additions are thorough and the security comments on GLITCHTIP_SECRET_KEY are precisely what the comment policy requires.

@mkeller

## 🏗️ Markus Keller — Application Architect **Verdict: 🚫 Changes requested** ### Blockers **1. `docs/architecture/c4/l2-containers.puml` is not updated.** The architecture persona rules are explicit: _"New Docker service or infrastructure component → `docs/architecture/c4/l2-containers.puml` + `docs/DEPLOYMENT.md`"_ and _"A doc omission is a blocker, not a concern — the PR does not merge until the diagram or text matches the code."_ `docker-compose.observability.yml` introduces a second Docker Compose stack with a new external network (`archiv-net`) dependency and five new named volumes. The `l2-containers.puml` C4 diagram shows only the main stack — the observability stack boundary, and its relationship to `archiv-net`, must be added. Even a placeholder `SystemBoundary(obs, "Observability Stack (phase 2)")` block with the planned services annotated as `' -- added in subsequent issues'` would satisfy the constraint. The diagram must reflect the infrastructure that exists in the repository. **2. `docs/DEPLOYMENT.md` section §4 is stale.** The current §4 "Logs + observability" says: _"Phase 7 of the Production v1 milestone adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet."_ This is no longer accurate — the scaffold Compose file and `infra/observability/` directories now exist. §4 must acknowledge the new file, describe how to start the optional stack (`docker compose -f docker-compose.observability.yml up -d`), and note that services are populated in subsequent issues. The ownership note at the top of `DEPLOYMENT.md` is unambiguous: _"Update this file in any PR that changes the container topology."_ ### Suggestions (not blockers) **3. ADR consideration for the two-stack topology.** Splitting the stack into `docker-compose.yml` + `docker-compose.observability.yml` is an architectural decision with lasting operational consequences: startup order dependency, separate `down` commands, and the `archiv-net: external: true` coupling. This is small enough that a note in the PR body or a comment block at the top of the compose file explaining _why_ the split was chosen (vs. a `profiles:` approach in the main compose) would suffice — no full ADR required at the scaffold stage, but the reasoning should be recorded before the services are wired in. **4. `obs-net` bridge has no `name:` override.** Without an explicit `name:`, Docker Compose will prefix the network as `<project>_obs-net`, where `<project>` defaults to the directory name. If the directory is `familienarchiv`, the network becomes `familienarchiv_obs-net`. This is fine for now, but when services are added that need to reference the network by name from outside the Compose context (e.g. Prometheus scrape config), the auto-generated name is fragile. Worth adding `name: obs-net` alongside the `driver: bridge` declaration, matching the `archiv-net` external pattern. ### What is done well The file structure is clean: the comment block at the top explains the startup order, the `# No services defined yet` placeholder is honest, and the volume names match Tobias's semantic naming convention exactly. The `.env.example` additions are thorough and the security comments on `GLITCHTIP_SECRET_KEY` are precisely what the comment policy requires. --- _@mkeller_

marcel commented

2026-05-15 01:25:56 +02:00

⚙️ Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

Blockers

None. The compose file is structurally sound and would pass docker compose config.

Concerns (should fix before services are wired in)

1. Image tags will be needed in follow-up issues — flag the pattern now.

The file currently has services: {}, so there is nothing to pin yet. However, the comment block names five services (prometheus, loki, promtail, tempo, grafana, glitchtip). Every one of those follow-up PRs must pin to a specific version tag — no :latest. Noting this expectation in a comment alongside each placeholder would make it harder for a future PR to skip the discipline:

#   prometheus:      # image: prom/prometheus:v2.51.0 — pin when added (issue #573)
#   loki:            # image: grafana/loki:3.0.0        — pin when added (issue #574)

This is a suggestion for the scaffold, not a blocker here.

2. The five named volumes pre-exist the services that use them.

Docker Compose will create prometheus_data, loki_data, tempo_data, grafana_data, and glitchtip_data the moment docker compose -f docker-compose.observability.yml up -d is run — even with services: {}. On a clean host this is fine. On an existing production host, this creates five empty named volumes that accumulate alongside the main stack's volumes. Consider documenting this side-effect in docs/DEPLOYMENT.md so operators know to prune them if they abandon the observability stack before it is populated. Not a blocker at scaffold stage.

3. GLITCHTIP_DOMAIN default value is http://localhost:3002.

In .env.example the comment correctly warns about the production value. However, a developer who copies .env.example to .env and starts the stack will have a working GlitchTip at localhost:3002 — but any error reports from the backend will use this domain in envelope headers, which will cause GlitchTip to reject reports from other machines. The comment should explicitly say: "For any shared dev/staging environment, set this to the actual hostname before starting the stack." Minor — the inline comment is better than most projects manage.

4. PORT_PROMETHEUS=9090 — potential host collision.

Port 9090 is the canonical Prometheus port and is likely already used if the operator runs any other Prometheus instance on the host. The .env.example comment doesn't mention this. Suggest adding: "# Change if port 9090 is already in use on the host". Very minor.

What is done well

archiv-net: external: true is exactly the right pattern. It makes the startup dependency explicit and gives a clear error message if the main stack is not running.
Named volumes — five volumes, semantically named, no bind mounts. This follows the project convention precisely.
obs-net: driver: bridge — correct isolation pattern. Observability-internal traffic stays on this network; only Prometheus leaves it via archiv-net.
The services: {} scaffold approach — deliberately empty with issue references is far cleaner than a half-wired compose file that doesn't actually run.
.env.example additions — the GLITCHTIP_SECRET_KEY warning and SENTRY_DSN / VITE_SENTRY_DSN empty-default pattern (leave empty to disable) is well-thought-out. The "fail-closed" note on the secret key is excellent.
Six .gitkeep directories — right approach for preserving directory structure without committing config files that don't exist yet.

@tobiwendt

## ⚙️ Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** ### Blockers None. The compose file is structurally sound and would pass `docker compose config`. ### Concerns (should fix before services are wired in) **1. Image tags will be needed in follow-up issues — flag the pattern now.** The file currently has `services: {}`, so there is nothing to pin yet. However, the comment block names five services (prometheus, loki, promtail, tempo, grafana, glitchtip). Every one of those follow-up PRs must pin to a specific version tag — no `:latest`. Noting this expectation in a comment alongside each placeholder would make it harder for a future PR to skip the discipline: ```yaml # prometheus: # image: prom/prometheus:v2.51.0 — pin when added (issue #573) # loki: # image: grafana/loki:3.0.0 — pin when added (issue #574) ``` This is a suggestion for the scaffold, not a blocker here. **2. The five named volumes pre-exist the services that use them.** Docker Compose will create `prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, and `glitchtip_data` the moment `docker compose -f docker-compose.observability.yml up -d` is run — even with `services: {}`. On a clean host this is fine. On an existing production host, this creates five empty named volumes that accumulate alongside the main stack's volumes. Consider documenting this side-effect in `docs/DEPLOYMENT.md` so operators know to prune them if they abandon the observability stack before it is populated. Not a blocker at scaffold stage. **3. `GLITCHTIP_DOMAIN` default value is `http://localhost:3002`.** In `.env.example` the comment correctly warns about the production value. However, a developer who copies `.env.example` to `.env` and starts the stack will have a working GlitchTip at `localhost:3002` — but any error reports from the backend will use this domain in envelope headers, which will cause GlitchTip to reject reports from other machines. The comment should explicitly say: _"For any shared dev/staging environment, set this to the actual hostname before starting the stack."_ Minor — the inline comment is better than most projects manage. **4. `PORT_PROMETHEUS=9090` — potential host collision.** Port 9090 is the canonical Prometheus port and is likely already used if the operator runs any other Prometheus instance on the host. The `.env.example` comment doesn't mention this. Suggest adding: _"# Change if port 9090 is already in use on the host"_. Very minor. ### What is done well - **`archiv-net: external: true`** is exactly the right pattern. It makes the startup dependency explicit and gives a clear error message if the main stack is not running. - **Named volumes** — five volumes, semantically named, no bind mounts. This follows the project convention precisely. - **`obs-net: driver: bridge`** — correct isolation pattern. Observability-internal traffic stays on this network; only Prometheus leaves it via `archiv-net`. - **The `services: {}` scaffold approach** — deliberately empty with issue references is far cleaner than a half-wired compose file that doesn't actually run. - **`.env.example` additions** — the `GLITCHTIP_SECRET_KEY` warning and `SENTRY_DSN` / `VITE_SENTRY_DSN` empty-default pattern (leave empty to disable) is well-thought-out. The "fail-closed" note on the secret key is excellent. - **Six `.gitkeep` directories** — right approach for preserving directory structure without committing config files that don't exist yet. --- _@tobiwendt_

marcel commented

2026-05-15 01:26:16 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

Blockers

None. This is an infrastructure scaffold with no running services and no attack surface yet. The patterns established here are sound.

Concerns

1. GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret — placeholder value in .env.example.

The comment above it is exactly right: "REQUIRED in production — must not be empty or 'changeme'. Fail-closed: GlitchTip will refuse to start with an invalid key." This is the correct way to document a secret in .env.example. No issue here — documenting the intent is appropriate.

One gap to address before GlitchTip is wired in: the production secret must flow through Gitea Secrets → workflow env: → .env.production (matching the pattern already used for PROD_POSTGRES_PASSWORD, PROD_MINIO_PASSWORD, etc.). The .env.example comment should explicitly mention the Gitea secret name that should carry this value in CI/CD, so the operator knows where to put it:

# Generate with: python3 -c "import secrets; print(secrets.token_hex(50))"
# Production: set as Gitea secret PROD_GLITCHTIP_SECRET_KEY (see nightly.yml / release.yml)
GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret

This prevents the operator from setting it only in .env and forgetting to wire it through the workflow.

2. SENTRY_DSN and VITE_SENTRY_DSN are empty by default — correct, but document the principle explicitly.

The comment "leave empty to disable the SDK (safe default)" is good. When these are eventually populated, they must come from Gitea Secrets as well (they are sensitive — a DSN allows anyone who knows it to inject error reports). A brief note in .env.example to that effect would close the loop:

# Production: obtain DSN from GlitchTip UI. Set as Gitea secret PROD_SENTRY_DSN.
SENTRY_DSN=

3. PORT_GLITCHTIP=3002 — host-bound port.

When GlitchTip is added, its container port should be bound to 127.0.0.1:3002 only (matching the existing pattern for other services), not 0.0.0.0:3002. The .env.example does not configure this, but the follow-up PRs that add the actual service definition must use "127.0.0.1:${PORT_GLITCHTIP}:8000" not "${PORT_GLITCHTIP}:8000". The scaffold PR sets the precedent — a brief comment in .env.example to this effect would help:

# Ports are bound to 127.0.0.1 only — Caddy is the public entry point
PORT_GRAFANA=3001
PORT_GLITCHTIP=3002
PORT_PROMETHEUS=9090

What is done well

Fail-closed documentation on GLITCHTIP_SECRET_KEY — exactly the right pattern. Too many projects document secrets as optional when they are mandatory in production.
Empty DSN defaults — disabling the SDK when the env var is absent is the safest possible default.
No secrets committed — the compose file and .gitkeep files contain nothing sensitive.
archiv-net: external: true — the explicit external network dependency prevents the observability stack from starting without the main stack, which is a correct operational guard.

@NullX

## 🔒 Nora "NullX" Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** ### Blockers None. This is an infrastructure scaffold with no running services and no attack surface yet. The patterns established here are sound. ### Concerns **1. `GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret` — placeholder value in `.env.example`.** The comment above it is exactly right: _"REQUIRED in production — must not be empty or 'changeme'. Fail-closed: GlitchTip will refuse to start with an invalid key."_ This is the correct way to document a secret in `.env.example`. No issue here — documenting the intent is appropriate. **One gap to address before GlitchTip is wired in:** the production secret must flow through Gitea Secrets → workflow `env:` → `.env.production` (matching the pattern already used for `PROD_POSTGRES_PASSWORD`, `PROD_MINIO_PASSWORD`, etc.). The `.env.example` comment should explicitly mention the Gitea secret name that should carry this value in CI/CD, so the operator knows where to put it: ```ini # Generate with: python3 -c "import secrets; print(secrets.token_hex(50))" # Production: set as Gitea secret PROD_GLITCHTIP_SECRET_KEY (see nightly.yml / release.yml) GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret ``` This prevents the operator from setting it only in `.env` and forgetting to wire it through the workflow. **2. `SENTRY_DSN` and `VITE_SENTRY_DSN` are empty by default — correct, but document the principle explicitly.** The comment _"leave empty to disable the SDK (safe default)"_ is good. When these are eventually populated, they must come from Gitea Secrets as well (they are sensitive — a DSN allows anyone who knows it to inject error reports). A brief note in `.env.example` to that effect would close the loop: ```ini # Production: obtain DSN from GlitchTip UI. Set as Gitea secret PROD_SENTRY_DSN. SENTRY_DSN= ``` **3. `PORT_GLITCHTIP=3002` — host-bound port.** When GlitchTip is added, its container port should be bound to `127.0.0.1:3002` only (matching the existing pattern for other services), not `0.0.0.0:3002`. The `.env.example` does not configure this, but the follow-up PRs that add the actual service definition must use `"127.0.0.1:${PORT_GLITCHTIP}:8000"` not `"${PORT_GLITCHTIP}:8000"`. The scaffold PR sets the precedent — a brief comment in `.env.example` to this effect would help: ```ini # Ports are bound to 127.0.0.1 only — Caddy is the public entry point PORT_GRAFANA=3001 PORT_GLITCHTIP=3002 PORT_PROMETHEUS=9090 ``` ### What is done well - **Fail-closed documentation on `GLITCHTIP_SECRET_KEY`** — exactly the right pattern. Too many projects document secrets as optional when they are mandatory in production. - **Empty DSN defaults** — disabling the SDK when the env var is absent is the safest possible default. - **No secrets committed** — the compose file and `.gitkeep` files contain nothing sensitive. - **`archiv-net: external: true`** — the explicit external network dependency prevents the observability stack from starting without the main stack, which is a correct operational guard. --- _@NullX_

marcel commented

2026-05-15 01:26:30 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ✅ Approved

This is a pure scaffold PR — no services are running, no application code is changed, and no new behaviour is introduced. There is nothing to test at the unit or integration layer.

Checked

PR test plan is complete and correct.

The three checklist items in the PR description are the right ones for this change:

docker compose -f docker-compose.observability.yml config exits 0 — this validates the YAML syntax and network/volume references without starting anything.
All six infra/observability/ subdirectories exist with .gitkeep — a simple git ls-files or ls check.
.env.example contains the new Observability section — a grep check.

These are exactly the level of testing appropriate for a scaffold. They are fast, deterministic, and can run in CI without a Docker daemon.

Suggestions (not blockers)

Add the docker compose config check to CI.

The main compose file already passes through CI linting (implicitly via the deployment workflow). The observability compose file does not. It would be worth adding a step to an appropriate CI job — perhaps the existing unit-tests or build workflow — to run:

- name: Validate observability compose
  run: docker compose -f docker-compose.observability.yml config > /dev/null

This catches syntax errors and undefined variable references before they reach production. Without this, a typo in a follow-up PR could go undetected until someone actually tries to start the stack.

The test plan items are all manual. That is acceptable for a scaffold, but when the first real service (Prometheus, per issue #573) is added, a CI smoke test should be introduced that starts the stack and checks the health endpoint. That is a concern for the #573 PR, not this one.

What is done well

The test plan in the PR description is explicit, specific, and verifiable — better than most infrastructure PRs.
The six .gitkeep files are committed correctly (empty, in the right directories), so the directory structure is verifiable via git.
The services: {} scaffold approach means there is no risk of an accidentally-running-but-misconfigured service in CI.

@saraholt

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ✅ Approved** This is a pure scaffold PR — no services are running, no application code is changed, and no new behaviour is introduced. There is nothing to test at the unit or integration layer. ### Checked **PR test plan is complete and correct.** The three checklist items in the PR description are the right ones for this change: - `docker compose -f docker-compose.observability.yml config` exits 0 — this validates the YAML syntax and network/volume references without starting anything. - All six `infra/observability/` subdirectories exist with `.gitkeep` — a simple `git ls-files` or `ls` check. - `.env.example` contains the new Observability section — a `grep` check. These are exactly the level of testing appropriate for a scaffold. They are fast, deterministic, and can run in CI without a Docker daemon. ### Suggestions (not blockers) **Add the `docker compose config` check to CI.** The main compose file already passes through CI linting (implicitly via the deployment workflow). The observability compose file does not. It would be worth adding a step to an appropriate CI job — perhaps the existing `unit-tests` or `build` workflow — to run: ```yaml - name: Validate observability compose run: docker compose -f docker-compose.observability.yml config > /dev/null ``` This catches syntax errors and undefined variable references before they reach production. Without this, a typo in a follow-up PR could go undetected until someone actually tries to start the stack. **The test plan items are all manual.** That is acceptable for a scaffold, but when the first real service (Prometheus, per issue #573) is added, a CI smoke test should be introduced that starts the stack and checks the health endpoint. That is a concern for the #573 PR, not this one. ### What is done well - The test plan in the PR description is explicit, specific, and verifiable — better than most infrastructure PRs. - The six `.gitkeep` files are committed correctly (empty, in the right directories), so the directory structure is verifiable via `git`. - The `services: {}` scaffold approach means there is no risk of an accidentally-running-but-misconfigured service in CI. --- _@saraholt_

marcel commented

2026-05-15 01:26:42 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

This PR contains no application code — no Java, no TypeScript, no Svelte, no Python. My review is scoped to what I can meaningfully evaluate: YAML structure, naming discipline, and documentation hygiene.

Checked

YAML structure and naming.

docker-compose.observability.yml follows the project's conventions cleanly:

Volume names are semantic: prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data — each tells you what it stores.
Comment blocks explain why, not what (the startup dependency comment at the top is a textbook example).
The services: {} placeholder with issue references is clear and honest — no dead config disguised as real config.

.env.example additions.

The new section follows the existing pattern: grouped under a header comment, each var explained, sensitive values documented with generation instructions. The GLITCHTIP_SECRET_KEY comment is particularly good — it says what happens if you get it wrong, not just what the value is.

.gitkeep files.

Six empty files in six correct directories. Nothing to flag.

Suggestions (not blockers)

glitchtip_data volume — note the Django dependency.

GlitchTip is a Django application and its volume will hold Django's database (SQLite by default, or a connected PostgreSQL). The follow-up PR that adds the GlitchTip service should configure it to use the project's existing PostgreSQL instance rather than embedding SQLite — this keeps the data model consistent and backup-able via the existing pg_dump procedure. This is a concern for the GlitchTip service PR, not this one, but worth flagging now so the volume design decision (separate PostgreSQL volume vs. shared database) is made deliberately.

What is done well

The file is minimal, honest, and well-commented. It establishes the right directory structure for the follow-up work without over-building. The PR does exactly what it says it does.

@felixbrandt

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** This PR contains no application code — no Java, no TypeScript, no Svelte, no Python. My review is scoped to what I can meaningfully evaluate: YAML structure, naming discipline, and documentation hygiene. ### Checked **YAML structure and naming.** `docker-compose.observability.yml` follows the project's conventions cleanly: - Volume names are semantic: `prometheus_data`, `loki_data`, `tempo_data`, `grafana_data`, `glitchtip_data` — each tells you what it stores. - Comment blocks explain _why_, not _what_ (the startup dependency comment at the top is a textbook example). - The `services: {}` placeholder with issue references is clear and honest — no dead config disguised as real config. **`.env.example` additions.** The new section follows the existing pattern: grouped under a header comment, each var explained, sensitive values documented with generation instructions. The `GLITCHTIP_SECRET_KEY` comment is particularly good — it says what happens if you get it wrong, not just what the value is. **.gitkeep files.** Six empty files in six correct directories. Nothing to flag. ### Suggestions (not blockers) **`glitchtip_data` volume — note the Django dependency.** GlitchTip is a Django application and its volume will hold Django's database (SQLite by default, or a connected PostgreSQL). The follow-up PR that adds the GlitchTip service should configure it to use the project's existing PostgreSQL instance rather than embedding SQLite — this keeps the data model consistent and backup-able via the existing `pg_dump` procedure. This is a concern for the GlitchTip service PR, not this one, but worth flagging now so the volume design decision (separate PostgreSQL volume vs. shared database) is made deliberately. ### What is done well The file is minimal, honest, and well-commented. It establishes the right directory structure for the follow-up work without over-building. The PR does exactly what it says it does. --- _@felixbrandt_

marcel commented

2026-05-15 01:26:58 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

Requirements engineering review is largely out-of-scope for a pure infrastructure scaffold PR. My check focuses on whether the PR faithfully delivers what issue #572 specified, and whether the work introduced creates any requirements gaps downstream.

Checked

Traceability: PR → Issue #572.

The PR description maps cleanly to a scaffold deliverable:

✅ docker-compose.observability.yml created with the correct network topology
✅ Six infra/observability/ subdirectories with .gitkeep placeholders
✅ .env.example updated with the observability section

The PR title, commit references, and "Closes #572" are all present. This is good issue-driven development discipline.

Test plan completeness.

The three acceptance criteria in the PR test plan are all verifiable and correspond to observable outcomes (not internal state). They match what a Requirements Engineer would write as Given-When-Then criteria for a scaffold story. No gaps.

Observations (not blockers)

glitchtip_data volume raises an unresolved data model question.

GlitchTip is an error tracking service. It captures exception data from the backend (Java SDK via SENTRY_DSN) and frontend (Vite build via VITE_SENTRY_DSN). This data includes stack traces, user context, and request metadata. Before GlitchTip is wired in, the following requirements questions should be answered in the relevant follow-up issue:

Data retention: How long are error reports retained? Is there a maximum?
User context: Does the GlitchTip SDK send user identifiers (email, username) with error reports? If so, this has GDPR implications for a family archive containing personal historical data.
Database: Does GlitchTip connect to the project's existing PostgreSQL, or run its own? The glitchtip_data volume name is ambiguous on this point.

These are questions for the GlitchTip service PR (a future issue), not this one. But they should be included in the acceptance criteria of that issue.

DSN wiring creates a build-time dependency for the frontend.

VITE_SENTRY_DSN is injected at build time (per the comment). This means the frontend Docker image must be rebuilt whenever the DSN changes. This constraint should be documented in the issue that adds the frontend SDK integration — if a developer assumes it can be changed at runtime (like most env vars), they will be surprised.

What is done well

The PR description is unusually thorough for an infrastructure scaffold. The summary, test plan, and issue reference are all present and correct. This is what good issue-driven delivery looks like.

@Elicit

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** Requirements engineering review is largely out-of-scope for a pure infrastructure scaffold PR. My check focuses on whether the PR faithfully delivers what issue #572 specified, and whether the work introduced creates any requirements gaps downstream. ### Checked **Traceability: PR → Issue #572.** The PR description maps cleanly to a scaffold deliverable: - ✅ `docker-compose.observability.yml` created with the correct network topology - ✅ Six `infra/observability/` subdirectories with `.gitkeep` placeholders - ✅ `.env.example` updated with the observability section The PR title, commit references, and "Closes #572" are all present. This is good issue-driven development discipline. **Test plan completeness.** The three acceptance criteria in the PR test plan are all verifiable and correspond to observable outcomes (not internal state). They match what a Requirements Engineer would write as Given-When-Then criteria for a scaffold story. No gaps. ### Observations (not blockers) **`glitchtip_data` volume raises an unresolved data model question.** GlitchTip is an error tracking service. It captures exception data from the backend (Java SDK via `SENTRY_DSN`) and frontend (Vite build via `VITE_SENTRY_DSN`). This data includes stack traces, user context, and request metadata. Before GlitchTip is wired in, the following requirements questions should be answered in the relevant follow-up issue: - **Data retention:** How long are error reports retained? Is there a maximum? - **User context:** Does the GlitchTip SDK send user identifiers (email, username) with error reports? If so, this has GDPR implications for a family archive containing personal historical data. - **Database:** Does GlitchTip connect to the project's existing PostgreSQL, or run its own? The `glitchtip_data` volume name is ambiguous on this point. These are questions for the GlitchTip service PR (a future issue), not this one. But they should be included in the acceptance criteria of that issue. **DSN wiring creates a build-time dependency for the frontend.** `VITE_SENTRY_DSN` is injected at build time (per the comment). This means the frontend Docker image must be rebuilt whenever the DSN changes. This constraint should be documented in the issue that adds the frontend SDK integration — if a developer assumes it can be changed at runtime (like most env vars), they will be surprised. ### What is done well The PR description is unusually thorough for an infrastructure scaffold. The summary, test plan, and issue reference are all present and correct. This is what good issue-driven delivery looks like. --- _@Elicit_

marcel commented

2026-05-15 01:27:01 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

This PR contains no UI changes — no Svelte components, no routes, no CSS, no i18n strings. There is nothing for me to evaluate from a UX, accessibility, or design system perspective.

I have checked that no frontend files were touched (confirmed: 8 changed files, all infrastructure — .env.example, docker-compose.observability.yml, and six .gitkeep files).

LGTM from a UI/UX standpoint. I will pick this up again when the Grafana and GlitchTip dashboards are wired in and exposed through the UI.

@leonievoss

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** This PR contains no UI changes — no Svelte components, no routes, no CSS, no i18n strings. There is nothing for me to evaluate from a UX, accessibility, or design system perspective. I have checked that no frontend files were touched (confirmed: 8 changed files, all infrastructure — `.env.example`, `docker-compose.observability.yml`, and six `.gitkeep` files). LGTM from a UI/UX standpoint. I will pick this up again when the Grafana and GlitchTip dashboards are wired in and exposed through the UI. --- _@leonievoss_

marcel added 1 commit 2026-05-15 01:28:50 +02:00

docs: update DEPLOYMENT.md and C4 diagram for observability scaffold

CI / Unit & Component Tests (pull_request) Successful in 3m31s

Details

CI / OCR Service Tests (pull_request) Successful in 16s

Details

CI / Backend Unit Tests (pull_request) Successful in 4m31s

Details

CI / fail2ban Regex (pull_request) Successful in 38s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 57s

Details

cf8d22d81b

Replace the stale "no monitoring infrastructure in place yet" note in
§4 with a brief description of the observability compose file and a
pointer to issue #581 for full docs.

Add a placeholder System_Boundary block for Prometheus + Loki + Grafana
to l2-containers.puml, showing the stack joins archiv-net.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-15 01:29:00 +02:00

Review blockers fixed in cf8d22d8:

Blocker 1 — docs/DEPLOYMENT.md §4 Logs + observability
Replaced the stale "No monitoring infrastructure is in place yet" paragraph with a brief note that docker-compose.observability.yml and infra/observability/ exist, the stack joins archiv-net, and full documentation is tracked in issue #581.

Blocker 2 — docs/architecture/c4/l2-containers.puml
File exists. Added a System_Boundary block for the observability stack with placeholder containers for Prometheus, Loki, and Grafana, each noting they connect via archiv-net and referencing issue #581 for wiring details.

Review blockers fixed in cf8d22d8: **Blocker 1 — `docs/DEPLOYMENT.md` §4 Logs + observability** Replaced the stale "No monitoring infrastructure is in place yet" paragraph with a brief note that `docker-compose.observability.yml` and `infra/observability/` exist, the stack joins `archiv-net`, and full documentation is tracked in issue #581. **Blocker 2 — `docs/architecture/c4/l2-containers.puml`** File exists. Added a `System_Boundary` block for the observability stack with placeholder containers for Prometheus, Loki, and Grafana, each noting they connect via `archiv-net` and referencing issue #581 for wiring details.

marcel merged commit 52508e9dea into main

2026-05-15 01:45:15 +02:00

marcel deleted branch feat/issue-572-observability-scaffold

2026-05-15 01:45:16 +02:00

marcel referenced this issue from a commit

2026-05-15 01:45:16 +02:00

Merge pull request 'devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure' (#584) from feat/issue-572-observability-scaffold into main

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#584