devops(observability): scaffold docker-compose.observability.yml and infra/observability/ directory structure #572

Closed
opened 2026-05-14 15:03:37 +02:00 by marcel · 10 comments
Owner

Context

The application lives in docker-compose.yml. We need a separate docker-compose.observability.yml at the project root so the observability stack can be started and stopped without touching the application. The two stacks communicate via the existing archiv-net Docker bridge network, which the observability compose joins as an external network.

This issue creates the skeleton only — no running services yet, just the file/directory structure and network wiring that all subsequent observability issues depend on.

Acceptance Criteria

  • docker-compose.observability.yml exists at the project root
  • archiv-net is declared as external: true in the observability compose — so Prometheus can reach archive-backend by container name
  • Named Docker volumes declared: prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data
  • Config directories exist (empty, with a .gitkeep): infra/observability/prometheus/, infra/observability/loki/, infra/observability/promtail/, infra/observability/tempo/, infra/observability/grafana/provisioning/datasources/, infra/observability/grafana/provisioning/dashboards/
  • .env.example has a new # --- Observability --- section with these vars and sensible defaults:
    PORT_GRAFANA=3001
    PORT_GLITCHTIP=3002
    PORT_PROMETHEUS=9090
    GLITCHTIP_DOMAIN=http://localhost:3002
    GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret
    SENTRY_DSN=
    VITE_SENTRY_DSN=
    
  • docker compose -f docker-compose.observability.yml config exits 0 (all referenced env vars have defaults or are in .env.example)

Implementation Notes

  • No actual service definitions yet — the file can contain placeholder comments for each upcoming service grouped by concern (Metrics, Logs, Traces, Dashboards, Error Tracking); a sketch follows these notes
  • The archiv-net network is created by the main docker-compose.yml; the observability compose must not attempt to create it
  • All bind-mount config paths are relative to the project root, e.g. ./infra/observability/prometheus/prometheus.yml
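
A minimal sketch of the resulting skeleton under these notes (illustrative, not the committed file; the service-group comments mirror the concerns listed above):

    # docker-compose.observability.yml (skeleton)
    networks:
      archiv-net:
        external: true   # created by the main docker-compose.yml; never created here

    volumes:
      prometheus_data:
      loki_data:
      tempo_data:
      grafana_data:
      glitchtip_data:

    # Compose requires a services key; an empty mapping keeps `config` passing.
    services: {}
    # --- Metrics: Prometheus ---
    # --- Logs: Loki + Promtail ---
    # --- Traces: Tempo ---
    # --- Dashboards: Grafana ---
    # --- Error Tracking: GlitchTip ---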

Files to Create / Modify

Action   Path
Create   docker-compose.observability.yml
Create   infra/observability/prometheus/.gitkeep
Create   infra/observability/loki/.gitkeep
Create   infra/observability/promtail/.gitkeep
Create   infra/observability/tempo/.gitkeep
Create   infra/observability/grafana/provisioning/datasources/.gitkeep
Create   infra/observability/grafana/provisioning/dashboards/.gitkeep
Modify   .env.example — add # --- Observability --- block

Definition of Done

  • All acceptance criteria are checked
  • docker compose -f docker-compose.observability.yml config exits 0
  • Committed on a feature branch, PR opened against main
marcel added this to the Observability Stack — Grafana LGTM + GlitchTip milestone 2026-05-14 15:03:37 +02:00
marcel added the P2-medium, devops, phase-7: monitoring labels 2026-05-14 15:06:08 +02:00
Author
Owner

🏗️ Markus Keller — Senior Application Architect

Observations

  • This issue is correctly scoped as a skeleton-only issue — no running services yet, just file structure and network wiring. That's the right incremental approach for a multi-issue observability rollout.
  • The archiv-net: external: true pattern is the correct way to join an existing network without attempting to own it. Good call — this is the main structural decision and it's made correctly.
  • Five named volumes (prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data) are declared upfront before any services exist. This creates orphaned volume declarations in docker compose config output — Docker Compose will warn about volumes with no consumer. That's cosmetic for now but worth noting.
  • The six new observability components (Prometheus, Loki, Promtail, Tempo, Grafana, GlitchTip) represent a significant topology expansion. Per the documentation update table, a new Docker service or infrastructure component requires updates to docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md. These are not listed in the "Files to Create / Modify" table in the issue.
  • The issue says "No actual service definitions yet — the file can contain placeholder comments." If there are no service definitions at all, Docker Compose will reject the file as an invalid Compose file (the YAML itself is fine) unless a minimal services block is present. The acceptance criterion docker compose … config exits 0 forces the implementor to write at least a skeleton structure — this constraint is correct and will surface the problem immediately.

Recommendations

  • Add a minimal services: {} block (or a single placeholder comment under services:) so docker compose config validates without errors. Docker Compose requires at least a services key.
  • Add docs/architecture/c4/l2-containers.puml and docs/DEPLOYMENT.md to the "Files to Create / Modify" table. Even for a skeleton issue, the container diagram should reflect that an observability stack is incoming — a comment block in the PlantUML with ' TODO: observability services (issue #572+) is sufficient for now and prevents doc drift.
  • Consider whether an ADR is warranted for the Grafana LGTM + GlitchTip technology choice. This is a non-trivial infrastructure commitment (5 new services, ~1GB RAM at runtime). An ADR capturing "why LGTM + GlitchTip over alternatives" would be the memory of this decision. It doesn't block this issue but should be filed in the same milestone.

Open Decisions

  • Volume declaration timing — Should named volumes be declared in this skeleton issue (as specified), or deferred to the issue that actually introduces each service? The current spec declares all 5 upfront, which means docker compose config will warn about unused volumes. Both approaches work; the upfront approach is slightly cleaner for the milestone arc. No objection to the spec as written — noting it as a conscious choice.
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • This is a pure infrastructure/config issue with no frontend or backend code changes. My scope here is narrow: implementation quality of the YAML/shell deliverables and the .env.example additions.
  • The acceptance criterion docker compose -f docker-compose.observability.yml config exits 0 is a clean, executable definition of done. Good — this can be verified in CI without running any containers.
  • The .env.example additions include GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret. The word "changeme" in a secret key variable is the pattern Nora will flag on the security side, but from a DX perspective, the inline generation hint (python3 -c "import secrets; print(secrets.token_hex(32))") that already exists in .env.example for OCR_TRAINING_TOKEN should be applied here too — it's a much better developer hint than a freeform string.
  • SENTRY_DSN= and VITE_SENTRY_DSN= are listed as empty-by-default vars. An empty SENTRY_DSN is fine — GlitchTip/Sentry SDKs treat an empty DSN as "disabled." No issue there.
  • PORT_GRAFANA=3001 and PORT_GLITCHTIP=3002 do not collide with each other. The main docker-compose.yml already uses ports 5173 (frontend), 8080 (backend), 9000/9001 (MinIO), 5432 (DB), 8025/1025 (Mailpit). The observability ports 3001/3002 are new — no collision with the app stack. That's correct.

Recommendations

  • Mirror the OCR_TRAINING_TOKEN comment pattern for GLITCHTIP_SECRET_KEY:
    # Generate with: python3 -c "import secrets; print(secrets.token_hex(32))"
    GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret
    
    This is already the established pattern in .env.example — stay consistent.
  • The .gitkeep files in infra/observability/*/ should be committed as truly empty files, not as files containing a comment. Verify this in the implementation — some editors or templates add content.
  • When the skeleton YAML file is created, use commented-out service blocks rather than leaving the services key empty. This makes the intended structure readable and acts as scaffolding for the next implementor:
    services:
      # --- Metrics: Prometheus ---
      # prometheus: (see issue #573)
    
      # --- Logs: Loki + Promtail ---
      # loki: (see issue #574)
      # promtail: (see issue #575)
    
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Observations

  • The approach — separate compose file, external network join, named volumes upfront — is exactly right for this VPS setup. The app stack stays untouched; the observability stack joins the existing network as a peer. Clean separation of concerns.
  • The existing docker-compose.yml uses minio/minio:latest (flagged in my persona as an anti-pattern). The observability compose is a good opportunity to establish pinned tags as the pattern for new services. The issue doesn't mention image tags since no services are defined yet, but the follow-up issues must pin every image (e.g. prom/prometheus:v2.53.0, grafana/grafana:11.0.0, grafana/loki:3.0.0).
  • The PORT_GRAFANA=3001 and PORT_GLITCHTIP=3002 defaults make sense for local dev. In production, these services should not be publicly exposed — Grafana sits behind Caddy (internal-only or auth-gated), GlitchTip has its own auth. The issue is silent on this, which is fine for a skeleton issue, but the follow-up Grafana and GlitchTip issues should explicitly address Caddy routing and auth.
  • archiv-net is already declared in the main compose and is created by docker compose up. The external: true declaration in the observability compose is correct — it means the observability stack will fail with a clear error if the app stack isn't up, rather than silently creating a disconnected network.
  • The infra/observability/ directory structure mirrors the existing infra/caddy/ pattern. That's consistent and correct.
  • The existing infra/ directory already has caddy/, gitea/, minio/, fail2ban/ subdirectories — adding infra/observability/ is a natural extension.

Recommendations

  • Add a comment in docker-compose.observability.yml explaining the startup dependency:
    # Requires the main stack to be running first:
    # docker compose up -d   (creates archiv-net)
    # docker compose -f docker-compose.observability.yml up -d
    
    This saves the next operator from a confusing "network not found" error.
  • The docker compose -f docker-compose.observability.yml config validation in the acceptance criteria is excellent — consider adding it as a CI job step in the existing CI pipeline (a simple docker compose config dry-run job). This prevents the observability compose from silently drifting into an invalid state as env vars and services are added in subsequent issues.
  • Update docs/DEPLOYMENT.md §1 to mention the observability compose as a parallel optional stack. Even a single sentence ("An optional observability stack is available via docker-compose.observability.yml — see milestone 'Observability Stack'") prevents confusion when someone reads the deployment docs and wonders why there are two compose files.

Open Decisions

  • Grafana port exposure in production — Should Grafana be accessible via Caddy (auth-gated subdomain or path) or only accessible via SSH tunnel? Both are valid for a family project; the SSH-tunnel approach has zero exposure risk but is operationally inconvenient. This decision belongs in the Grafana-specific issue, not here — but flagging it now so it's on the radar.
Author
Owner

🔐 Nora Steiner (NullX) — Application Security Engineer

Observations

  • GLITCHTIP_SECRET_KEY=changeme-generate-a-real-secret — This is a Django SECRET_KEY equivalent. GlitchTip uses it to sign sessions and cryptographic tokens. Shipping a known-string default means anyone who deploys without changing it has predictable session signing. The existing OCR_TRAINING_TOKEN already sets a better precedent with a generation hint — apply the same pattern here.
  • SENTRY_DSN= and VITE_SENTRY_DSN= — Empty DSN is safe (SDK disables itself). No issue.
  • Port exposure — PORT_GRAFANA=3001 and PORT_GLITCHTIP=3002 as published host ports mean these services will be reachable on the VPS host on those ports. In the current dev setup, all ports are bound to all interfaces (0.0.0.0). The production setup binds to 127.0.0.1 only (per docs/DEPLOYMENT.md), so this is fine for production but a mild exposure risk in dev if the dev machine is accessible on a LAN. This issue is pre-existing in the app stack too — noting for completeness.
  • Network isolation — archiv-net: external: true means observability containers join the same bridge network as the application containers. This gives Prometheus, Loki, etc. direct network access to archive-backend, archive-db, and archive-minio by container name. This is intentional and necessary for scraping, but it also means a compromised Grafana container has network-level access to the database port. Acceptable for a family project on a single VPS; worth documenting.
  • No secrets in the skeleton — The skeleton issue creates no credentials, connection strings, or tokens beyond the .env.example vars. Clean.
  • GLITCHTIP_DOMAIN=http://localhost:3002 — Using HTTP (not HTTPS) for the GlitchTip domain means Sentry SDK error reports will be sent unencrypted in production unless this is overridden. The .env.example should comment that production requires https://your-domain.example.com/glitchtip (or similar).

Recommendations

  • Change the GLITCHTIP_SECRET_KEY default to empty with a mandatory generation notice, following the OCR_TRAINING_TOKEN pattern:
    # REQUIRED — generate with: python3 -c "import secrets; print(secrets.token_hex(50))"
    # GlitchTip uses this as Django SECRET_KEY. Must not be empty or 'changeme' in production.
    GLITCHTIP_SECRET_KEY=
    
    An empty default will cause GlitchTip to fail at startup rather than silently running with a weak key — fail closed is the correct behavior.
  • Add a comment to GLITCHTIP_DOMAIN:
    # Production: use https://your-domain.example.com (must match your Caddy vhost)
    GLITCHTIP_DOMAIN=http://localhost:3002
    
  • In a follow-up issue (not this one), add a network policy note to the architecture docs explaining that archiv-net is a shared trust boundary and observability services have full internal network access. This is a documented accepted risk for a single-VPS family project, not a blocker.

No hard blockers for this skeleton issue — the security concerns above are either documentation gaps or configuration defaults to improve before services are actually wired up.

Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Observations

  • This is a pure infrastructure scaffolding issue with no application logic, no API changes, and no frontend changes. The test pyramid implications are limited to the infrastructure validation layer.
  • The acceptance criteria boil down to one automated check: docker compose -f docker-compose.observability.yml config exits 0. This is the right gate for a skeleton issue — it's deterministic, fast, and verifiable in CI without running any containers.
  • There are no existing CI jobs that validate the observability compose. The main CI pipeline (based on recent commits about Surefire, timeouts, and artifact uploads) runs backend and frontend tests but has no docker compose validation step.
  • The .gitkeep files are the only deliverables that don't have an automated check. Their existence is implicitly verified by git — if they're not committed, the directories don't exist in the repo.

Recommendations

  • Add a CI validation job for docker compose config on the observability compose file. This is a one-liner that fits into the existing CI workflow:
    - name: Validate observability compose
      run: docker compose -f docker-compose.observability.yml config --quiet
    
    Without this, the compose file can drift into an invalid state as env vars and services are added in subsequent issues, and nobody will notice until someone tries to run it.
  • Verify .env.example completeness in CI — The acceptance criterion says "all referenced env vars have defaults or are in .env.example". This is currently a manual check. A simple script (e.g. docker compose config with a populated .env derived from .env.example) would make this automatic; a sketch follows this list. This is a broader CI improvement, not specific to this issue — but this issue is a good trigger to request it.
  • Definition of Done gap — The DoD says "All acceptance criteria are checked" but doesn't specify who checks the docker compose config exit code. Make it explicit in the PR checklist: the implementor must run docker compose -f docker-compose.observability.yml config locally before opening the PR, and the CI job must run it on every push.
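
A minimal sketch of the completeness check from the second recommendation (run from the project root; the warning text is matched against current Docker Compose v2 output and should be verified against the CI runner's version):

    # Populate .env from the template so interpolation sees exactly the
    # variables that .env.example defines.
    cp .env.example .env

    # Structural validation: exits non-zero on invalid YAML and on missing
    # variables that use the ${VAR:?} required syntax.
    docker compose -f docker-compose.observability.yml config --quiet

    # Plain ${VAR} references only warn when unset, so fail on the warning too.
    if docker compose -f docker-compose.observability.yml config 2>&1 >/dev/null \
         | grep -q "variable is not set"; then
      echo "compose references a variable missing from .env.example" >&2
      exit 1
    fi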

No blocking QA concerns for this skeleton issue. The risk profile is low — no production data, no running services, no network changes to the existing stack.

Author
Owner

📋 Elicit — Requirements Engineer

Observations

  • The issue is well-structured for a scaffolding story: clear context, explicit acceptance criteria with checkboxes, an implementation notes section, a file table, and a definition of done. This is close to the project's established spec-dense format.
  • The acceptance criteria are testable and verifiable — each one maps to a concrete observable outcome (file exists, network declared, volume declared, vars present, command exits 0). This is good requirements hygiene.
  • One ambiguity in the docker compose config criterion: "all referenced env vars have defaults or are in .env.example" — but docker compose config only validates YAML structure and interpolates known vars. It does not fail if a var is referenced in the compose file but missing from .env.example. The real mechanism is: if a var has no default and is not in .env, the command fails. The acceptance criterion should clarify that the validator must copy .env.example to .env before running the check, otherwise the criterion is misleading.
  • Missing acceptance criterion for directory structure: The issue lists directories to create but doesn't have a checkbox verifying they exist in the committed state. infra/observability/prometheus/.gitkeep being committed is implied but not explicit.
  • Scope boundary is correctly enforced: "No actual service definitions yet" is explicit and prevents scope creep. The implementation notes correctly defer all service config to follow-up issues.
  • The .env.example section lists SENTRY_DSN= and VITE_SENTRY_DSN= without explaining the distinction. A comment clarifying which is used server-side (backend Spring Boot / GlitchTip SDK) vs. client-side (Vite / frontend Sentry SDK) would improve the .env.example for operators who aren't familiar with the DSN split.

Recommendations

  • Clarify the docker compose config acceptance criterion:

    docker compose -f docker-compose.observability.yml config exits 0 when run with .env populated from .env.example (i.e., cp .env.example .env first)

  • Add an explicit checkbox for the directory structure:

    All six infra/observability/*/.gitkeep files are committed and visible in the PR diff.

  • Add inline comments to the .env.example additions distinguishing SENTRY_DSN (backend) from VITE_SENTRY_DSN (frontend/Vite build), e.g.:
    SENTRY_DSN=           # backend (Spring Boot): used by GlitchTip Java/Spring SDK
    VITE_SENTRY_DSN=      # frontend (SvelteKit): injected at build time via Vite
    

These are minor clarifications — the issue is implementation-ready as written. The ambiguities are edge cases that a careful implementor will handle correctly, but making them explicit removes any guesswork.

Author
Owner

🎨 Leonie Voss — UX Designer & Accessibility Strategist

No UI or frontend changes in this issue — it is purely infrastructure scaffolding (Docker Compose file, config directories, .env.example additions).

From my angle, I checked:

  • No new routes or frontend components — nothing to review for accessibility, brand compliance, or responsive design.
  • No user-visible changes — the observability stack is operator-facing infrastructure, not end-user UI.
  • Future consideration: When Grafana and GlitchTip are wired up in subsequent issues, ensure that any links to their UIs (e.g., in an admin panel or internal operations page) follow the project's rel="noopener noreferrer" convention for external tool links and use the standard <BackButton> pattern for navigation.

No concerns from my angle on this skeleton issue. I'll re-engage when there are UI touchpoints in the observability milestone (e.g., if a status/health indicator is added to the admin panel).

Author
Owner

🗳️ Decision Queue — Action Required

3 decisions need your input before implementation starts.

Architecture

  • Volume declaration timing — Should named volumes (prometheus_data, loki_data, etc.) be declared in this skeleton issue (as the spec says), or deferred to the issue that introduces each service? The upfront approach means docker compose config will warn about volumes with no consumer; the deferred approach keeps each issue self-contained. No strong objection to either — but pick one and stick to it across the milestone. (Raised by: Markus)

  • ADR for LGTM + GlitchTip — The Grafana LGTM stack + GlitchTip represents ~5 new services and ~1 GB RAM at runtime. Should an ADR be written in this milestone capturing "why LGTM + GlitchTip over alternatives (Datadog, Sentry SaaS, single Prometheus only, etc.)"? The ADR doesn't block this skeleton issue but should be filed before the stack goes live. (Raised by: Markus)

Infrastructure

  • Grafana exposure in production — Should Grafana be accessible via Caddy (auth-gated path or subdomain, e.g. archiv.example.com/grafana) or only via SSH tunnel (ssh -L 3001:localhost:3001)? Caddy approach is more convenient but adds a public-facing auth surface; SSH tunnel has zero exposure risk but requires operator access for every use. This belongs in the Grafana-specific follow-up issue, but the decision affects whether PORT_GRAFANA needs a Caddy vhost entry. (Raised by: Tobias)
Author
Owner

Arch: Volumes can stay here. Create an ADR.
Grafana should be accessible via subdomain: grafana.raddatz.cloud
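
For the follow-up Grafana issue, a minimal Caddyfile sketch of that subdomain decision (the upstream name archive-grafana and Grafana's default internal port 3000 are assumptions; auth and hardening belong to the Grafana issue):

    # Caddy terminates TLS for the subdomain and proxies into archiv-net.
    # "archive-grafana" is an assumed container name, not yet defined anywhere.
    grafana.raddatz.cloud {
        reverse_proxy archive-grafana:3000
    }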

Author
Owner

Implementation complete — Felix Brandt

All acceptance criteria met. Branch feat/issue-572-observability-scaffold is pushed and ready for PR.

Commit

1d42be98 — devops(observability): scaffold docker-compose.observability.yml and infra/observability/ structure

What was delivered

docker-compose.observability.yml (project root)

  • archiv-net declared as external: true — joins the app stack's network without attempting to create it
  • obs-net declared as driver: bridge — internal observability-to-observability traffic (Grafana → Prometheus/Loki/Tempo)
  • Named volumes declared upfront: prometheus_data, loki_data, tempo_data, grafana_data, glitchtip_data
  • services: {} — valid Docker Compose skeleton; placeholder comments document which issue adds each service group (Metrics #573, Logs #574/#575, Traces, Dashboards, Error Tracking)
  • Startup-order comment explains that docker compose up -d must run first to create archiv-net

infra/observability/ directory structure — 6 empty .gitkeep files committed:

  • prometheus/, loki/, promtail/, tempo/
  • grafana/provisioning/datasources/, grafana/provisioning/dashboards/

.env.example — new # --- Observability --- section with:

  • PORT_GRAFANA=3001, PORT_GLITCHTIP=3002, PORT_PROMETHEUS=9090
  • GLITCHTIP_DOMAIN=http://localhost:3002 (with production HTTPS note)
  • GLITCHTIP_SECRET_KEY with generation command (python3 -c "import secrets; print(secrets.token_hex(50))") — mirrors the OCR_TRAINING_TOKEN pattern per Felix + Nora's recommendations
  • SENTRY_DSN= and VITE_SENTRY_DSN= with separate comment lines clarifying backend vs. frontend usage
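
Pieced together from the bullets above, the delivered block should look roughly like this (comment wording is reconstructed, and the empty GLITCHTIP_SECRET_KEY default follows Nora's fail-closed recommendation; the committed .env.example is authoritative):

    # --- Observability ---
    PORT_GRAFANA=3001
    PORT_GLITCHTIP=3002
    PORT_PROMETHEUS=9090

    # Production: use https://your-domain.example.com (must match your Caddy vhost)
    GLITCHTIP_DOMAIN=http://localhost:3002

    # REQUIRED: generate with python3 -c "import secrets; print(secrets.token_hex(50))"
    GLITCHTIP_SECRET_KEY=

    # Backend (Spring Boot): used by the server-side Sentry/GlitchTip SDK
    SENTRY_DSN=
    # Frontend (SvelteKit): injected at build time via Vite
    VITE_SENTRY_DSN=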

Reviewer recommendations applied

  • Felix: GLITCHTIP_SECRET_KEY generation hint added (mirrors OCR_TRAINING_TOKEN pattern)
  • Felix: .gitkeep files committed as truly empty (created with touch)
  • Nora: GLITCHTIP_DOMAIN production HTTPS note added; GLITCHTIP_SECRET_KEY generation command uses token_hex(50) per her recommendation
  • Tobias: Startup-order comment in compose file (docker compose up -d must precede the observability stack)
  • Elicit: SENTRY_DSN / VITE_SENTRY_DSN distinction clarified via separate comment lines

Verification

docker compose -f docker-compose.observability.yml config --quiet
# Exit code: 0

Next step

Open PR against main and run /review-pr. Then proceed to issue #573 (Prometheus service definition).
