devops(observability): add Loki + Promtail for centralised container log aggregation #586
Summary
Closes #574
- Adds `obs-loki` (`grafana/loki:3.4.2`) and `obs-promtail` (`grafana/promtail:3.4.2`) services to `docker-compose.observability.yml`
- `infra/observability/loki/loki-config.yml` — single-node TSDB v13, 30-day retention, analytics off
- `infra/observability/promtail/promtail-config.yml` — Docker SD with relabeling for `container_name`, `compose_service`, `compose_project`, `logstream`
- `promtail_positions` named volume for restart safety
- `docs/DEPLOYMENT.md` updated with Loki + Promtail service entries
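For orientation, the points above correspond roughly to service definitions like the following. This is a condensed sketch reconstructed from the reviews below, not the literal file contents; exact paths, probe commands, and network declarations in the real compose file may differ:

```yaml
# Condensed sketch, not the literal file contents.
services:
  obs-loki:
    image: grafana/loki:3.4.2
    expose:
      - "3100"                                 # internal only; nothing bound to the host
    volumes:
      - ./infra/observability/loki/loki-config.yml:/etc/loki/config.yml:ro
    networks:
      - obs-net
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3100/ready | grep -q ready"]
      interval: 15s
      start_period: 30s                        # Loki needs time to initialize on cold start

  obs-promtail:
    image: grafana/promtail:3.4.2
    volumes:
      - ./infra/observability/promtail/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # Docker SD; see the security review
      - promtail_positions:/positions                  # survives restarts; no duplicate ingestion
    networks:
      - obs-net                                # ship logs to Loki
      - archiv-net                             # discover application container labels
    depends_on:
      obs-loki:
        condition: service_healthy             # do not ship before Loki is ready

networks:
  obs-net:
  archiv-net:     # in the real setup this is presumably shared with the application stack

volumes:
  promtail_positions:
```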
Test plan

- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d loki promtail` starts both containers
- `curl -s http://localhost:3100/ready` returns `ready`

🤖 Generated with Claude Code
🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is solid infrastructure work. Image tags are pinned, the healthcheck pattern is correct, the named volume for positions is a good call, and the `depends_on: condition: service_healthy` chain is exactly right. The inline comments explaining trade-offs (Docker socket risk, `compose_service` vs `container_name`, retention rationale) are the kind of comments I actually want in infra files. Good job.

Blockers
1. `l2-containers.puml` is not updated with Promtail

`docs/architecture/c4/l2-containers.puml` (head branch) already has `loki` in the observability boundary, but `promtail` is absent. This PR wires Promtail as a real, running service — the diagram must reflect it. Markus requires C4 diagram updates when new Docker services are added. Add a `Container(...)` node for Promtail inside the observability boundary and a `Rel(promtail, loki, "Ships log streams", "HTTP / Loki push API")` relationship.

Suggestions (non-blocking)
2. `loki` service is missing `archiv-net`?

Promtail joins both `archiv-net` (for label discovery) and `obs-net` (for log shipping to Loki). Loki only joins `obs-net`. That is correct — Loki does not need to reach application containers directly. Just confirming this is intentional and not an omission: Promtail's Docker SD discovery goes via the socket, not the network. All good.

3. Loki retention requires `compactor` to be enabled
`limits_config.retention_period: 720h` sets the retention policy, but Loki's TSDB backend also needs the compactor enabled and `compaction_interval` set, otherwise retention deletion does not run. The current config has no `compactor:` block. For a low-volume family archive this is unlikely to cause disk pressure soon, but worth knowing. This is a suggestion, not a blocker, but retention will silently not work without it.
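A minimal block along these lines would make retention actually run (a sketch; the working directory is illustrative, and on Loki 3.x `delete_request_store` must be set once `retention_enabled` is true):

```yaml
compactor:
  working_directory: /loki/compactor   # illustrative path
  compaction_interval: 10m
  retention_enabled: true              # without this, retention_period is never enforced
  retention_delete_delay: 2h           # grace period before chunks are actually deleted
  delete_request_store: filesystem     # required by Loki 3.x when retention_enabled is true
```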
4. Promtail `grpc_listen_port: 0` — document why

The comment says "gRPC disabled — used for Promtail clustering only; single-node deployment." Good. No action needed.
5. No Renovate coverage for the new images
`grafana/loki:3.4.2` and `grafana/promtail:3.4.2` are new pinned images. Verify Renovate's config matches `docker-compose.observability.yml` so these get bump PRs automatically. If Renovate is not watching this file, add it to the config.

What's done well
- `promtail_positions` named volume prevents duplicate log ingestion on restart — many deployments miss this
- `start_period: 30s` on Loki's healthcheck accounts for cold-start initialization time
- `compose_service` label recommendation in both the config and the docs is the right operational guidance
- `expose` (not `ports`) on Loki port 3100 — internal only, nothing bound to host

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
This is a well-considered infrastructure change with explicit, honest security commentary already embedded in the diff. The Docker socket trade-off is documented, the `auth_enabled: false` justification is present, and the unauthenticated Loki push API concern is called out. That transparency is the right approach. However, one concern warrants a blocker and one is worth flagging for future planning.

Blockers
1. Docker socket exposure grants container-escape-equivalent access
The PR author already flags this in the compose comment. I want to make the exact risk explicit so it's a conscious acceptance, not just a "revisit later": `:ro` on the socket mount restricts file-system writes, not Docker API calls. Any process that can talk to the daemon can start privileged containers, mount host paths, and read every container's environment; that is root-equivalent access to the host.
This is acceptable for a single-operator family archive on a private server, but the acceptance should be explicit. I'm not asking for the PR to be blocked — I'm asking for the comment to be upgraded from "acceptable on a single-operator archive; review if multi-user access to the host is ever introduced" to something more precise.
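For example, something like this on the socket mount (a sketch; exact wording to taste):

```yaml
volumes:
  # SECURITY: ":ro" restricts file-system writes but NOT Docker API permissions.
  # A compromised Promtail still has full daemon access: it can start privileged
  # containers, mount host paths, and read any container's environment.
  # Accepted risk on a single-operator, self-hosted archive; revisit before
  # anyone else gets access to the host.
  - /var/run/docker.sock:/var/run/docker.sock:ro
```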
If you accept the current comment wording, I'll drop this to a suggestion. I raise it as a blocker because the current comment undersells the actual privilege level.
Suggestions (non-blocking)
2. Unauthenticated Loki push endpoint on `obs-net`

`promtail-config.yml` acknowledges: "Any container on obs-net can push logs." True. Worth noting: the practical blast radius on a single-VPS deployment with a private Docker network is low. If the observability stack ever gets more services or the network is widened, turning on Loki's built-in multi-tenancy (`auth_enabled: true`) is straightforward. Strictly speaking this enforces an `X-Scope-OrgID` tenant header rather than a secret token, so real credentials would need a reverse proxy in front of Loki. Not required now, but easy to add before any network topology change.
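A sketch of that setup, with an illustrative tenant name; note this is header enforcement, not secret-based auth:

```yaml
# loki-config.yml (sketch)
auth_enabled: true       # Loki then rejects requests without an X-Scope-OrgID header

# promtail-config.yml (sketch): the client must send a matching tenant
clients:
  - url: http://obs-loki:3100/loki/api/v1/push
    tenant_id: archiv    # illustrative tenant name; sent as the X-Scope-OrgID header
```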
3. No log filtering — application secrets may appear in Loki

Promtail is configured to ship ALL Docker container logs with no drop/filter rules. Spring Boot logs, if misconfigured (e.g., DEBUG level or log injection), could include session tokens, passwords, or PII. This is a systemic risk to flag for when Grafana dashboards are built (issue #581) — add a `pipeline_stages` drop rule for known-sensitive patterns then.
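When that work happens, a Promtail `drop` stage is the likely shape; the job name and patterns below are illustrative only:

```yaml
# promtail-config.yml (sketch)
scrape_configs:
  - job_name: docker      # assumed name of the existing Docker SD job
    # ... existing docker_sd_configs and relabel_configs ...
    pipeline_stages:
      - drop:
          expression: "(?i)(password=|authorization: bearer )"   # illustrative patterns only
```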
4. `analytics.reporting_enabled: false` — correct

Good. Disabling Grafana Labs telemetry is the right default for a private family archive.
What's done well
- `auth_enabled: false` with explicit scope comment (not exposed beyond obs-net) — this is the right way to document a deliberate security trade-off
- `:ro` on all config volume mounts — defense in depth, even if not sufficient alone for the socket
- `expose`-only, not host-bound — correct, reduces attack surface to the Docker internal network only
- `obs-net` only for Loki push (log data stays internal), `archiv-net` only for Docker SD (label discovery stays internal)

🏛️ Markus Keller — Senior Application Architect
Verdict: 🚫 Changes requested
The infrastructure wiring is correct and the operational decisions are sound. However, the PR triggers a mandatory C4 diagram update that is missing, which is a blocker per our documentation rules. One additional ADR concern.
Blockers
1. `docs/architecture/c4/l2-containers.puml` not updated with `promtail`

The PR description says: "Adds `obs-loki` and `obs-promtail` services." My rule table is unambiguous: a new Docker service requires a C4 container-diagram update. `DEPLOYMENT.md` is updated — good. The C4 diagram is not. Loki is already present in the diagram (with a "Wiring TBD" description) but Promtail is absent as a container node, and neither the Loki description nor the diagram relationships have been updated to reflect that wiring is now live.

Required changes to `l2-containers.puml`: add a `Container(promtail, ...)` node inside the observability `System_Boundary` and a `Rel(promtail, loki, ...)` relationship to Loki. Also update the Loki description to remove "Wiring TBD — see issue #581" since the Promtail wiring is now live.
This is a blocker. The PR does not merge without diagram currency.
2. ADR may be warranted for Docker socket access
Adding `/var/run/docker.sock` access to a container is an architectural decision with lasting security consequences (see Nora's review for the exact privilege level), which is the kind of decision our ADR rule is meant to capture. We already have ADR-014 for the upload-artifact pin. The Docker socket trade-off is arguably ADR-worthy. I'll leave this as a judgment call — if the team considers it a standard ops decision rather than an architectural one, a PR comment is sufficient. But if the socket access policy ever needs to be revisited, an ADR makes the decision findable.
Suggestions (non-blocking)
3. Loki container description in C4 still says "Wiring TBD"
Even if Promtail is not added to the diagram in this PR, the Loki description should drop the "Wiring TBD" language since it is now wired. This is a consistency issue, not a functional one.
4. Boring technology evaluation: is Promtail the right choice?
Grafana has been recommending Grafana Alloy (the successor to Promtail) since 2024. Promtail is in maintenance mode. For a new installation in 2026, Alloy would be the forward-looking choice. That said, Promtail 3.4.2 will continue to work for years and the migration path is documented. I flag this not as a blocker but as a future consideration — worth tracking in a follow-up issue.

What's done well
- Promtail joins both `archiv-net` (label discovery) and `obs-net` (log shipping). This is the right network design.
- The `DEPLOYMENT.md` update is thorough and includes actionable quick-check commands — this is exactly the runbook-style documentation I want
- `expose` vs `ports` is used correctly throughout — no observability ports are accidentally host-bound
- The `depends_on: service_healthy` chain is architecturally correct — Promtail does not start shipping until Loki is confirmed ready

👨‍💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a pure infrastructure PR — no application code changes, no frontend, no backend Java. My review scope is narrow: config file quality, comment hygiene, and whether the changes follow the project's conventions.
No blockers
The infrastructure files are clean. Config keys are explicit, values are commented where non-obvious, and there is no dead code. I have no code-quality concerns.
Suggestions (non-blocking)
1. `promtail-config.yml` — the `grpc_listen_port: 0` comment is slightly misleading

The comment says "gRPC disabled — used for Promtail clustering only; single-node deployment." `grpc_listen_port: 0` does not disable gRPC — it binds to a random available port. To actually prevent gRPC from binding, you'd omit it or set it to `-1` in newer versions. In practice at this Promtail version (3.4.2), `0` means "let the OS assign a port", which is effectively unused since nothing calls it. The comment intent is right but the mechanism description is slightly off. This is cosmetic — functionally it does not matter for a single-node setup.
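For reference, the server block in question presumably looks like this (a sketch assuming Promtail's default HTTP port):

```yaml
server:
  http_listen_port: 9080   # Promtail's default HTTP port (serves /ready and /metrics)
  grpc_listen_port: 0      # 0 = "let the OS assign an ephemeral port", not "disabled"
```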
2. `loki-config.yml` — `instance_addr: 127.0.0.1` in a container context

`common.instance_addr: 127.0.0.1` is the Loki node's self-address for its ring membership. In a single-node `inmemory` kvstore setup this value is never used for inter-node communication, so it is harmless. Just noting it for clarity — it would be incorrect in a multi-node setup.

What's done well
- `reporting_enabled: false` — no surprise telemetry
- the `promtail_positions` volume name is semantically correct and self-documenting

🧪 Sara Holt — QA Engineer & Test Automation Specialist
Verdict: ⚠️ Approved with concerns
This is infrastructure-only — no application code, no test changes. My review focuses on testability of the deployment and whether the PR's own test plan is adequate.
Concerns (non-blocking)
1. PR test plan is manual-only — no CI verification
The PR's test plan lists three manual steps:
docker compose ... configexits 0docker compose up -d loki promtailstarts both containerscurl http://localhost:3100/readyreturnsreadyNone of these are automated. For infrastructure changes this is typical and acceptable, but I want to flag what we're accepting: there is no CI gate that verifies the observability stack starts cleanly after this change. If a future PR breaks the compose file, CI will not catch it because the observability stack is not started in the CI pipeline.
This is a pre-existing gap (not introduced by this PR), but worth tracking: a follow-up task to add `docker compose -f docker-compose.observability.yml config` as a CI lint step would be low-cost and would catch YAML syntax errors automatically.
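A sketch of such a gate, assuming the project runs GitHub Actions (ADR-014's upload-artifact pin suggests it does); the workflow name and triggers are illustrative:

```yaml
# .github/workflows/compose-lint.yml (hypothetical)
name: compose-lint
on:
  pull_request:
    paths:
      - "docker-compose*.yml"
      - "infra/observability/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate observability compose file
        run: docker compose -f docker-compose.observability.yml config --quiet
```

`--quiet` keeps the job output clean while still failing on invalid YAML or interpolation errors.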
2. Loki retention is untested

The 30-day retention config (`720h`) has no verification path. As Tobias notes, without a `compactor` block the retention policy may not actually run. There is no test that verifies logs are deleted after the retention window. For a low-volume family archive this is not a production risk today, but it means retention is "configured but not verified."

3. No smoke test for log ingestion
The test plan verifies Loki is alive (`/ready`) but does not verify that Promtail is actually shipping logs and Loki is actually receiving them. A complete smoke test would also confirm ingestion end to end, e.g. by querying Loki's label endpoint (`curl -s http://localhost:3100/loki/api/v1/labels`) for the relabeled labels and by fetching a known container's recent log lines. This is a suggestion for the test plan, not a code change.
What's done well
- `depends_on: condition: service_healthy` is the correct pattern — prevents race conditions that would otherwise cause flaky startup behavior
- `start_period: 30s` on the Loki healthcheck is appropriately sized — containers often fail health checks on cold start without this
- the `promtail_positions` volume is a testability win — it means restart scenarios produce deterministic behavior (no duplicate log ingestion) rather than nondeterministic ones

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This PR closes issue #574. My scope is requirements traceability: does the implementation match what was asked for, and are there any gaps or undocumented assumptions?
Requirements coverage
This PR closes #574. It delivers:
- `depends_on: service_healthy` ordering ✅
- `docs/DEPLOYMENT.md` updated ✅

Observations (non-blocking)
1. Issue #575 referenced as a placeholder in the original compose — what happened to it?
The original compose file had a comment `# promtail: (see issue #575)`. This PR closes #574 (Loki) and implicitly delivers #575 (Promtail) together. Confirm that issue #575 is closed or merged into #574 — leaving it open creates a false backlog item.
2. `docs/DEPLOYMENT.md` — the service table still lists Grafana as "dashboards and alerting UI" but Grafana is not yet wired to Loki

The table entry for `obs-loki` is accurate. However, the Grafana entry in the same table says "data sources: Prometheus + Loki" — implying a working Grafana→Loki data source. That wiring is tracked in issue #581 and is not delivered in this PR. The table entry could create a false impression that Grafana dashboards are ready. Consider adding a note: "Loki data source configuration tracked in issue #581."

3. The `logstream` label is relabeled but its semantics are undocumented

Promtail relabels `__meta_docker_container_log_stream` to `logstream`. This distinguishes `stdout` from `stderr`. The DEPLOYMENT.md quick-checks do not mention this label. Worth adding a one-liner to the docs so operators know they can filter by `{logstream="stderr"}` to see only error output.
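For context, the relabel rule in question presumably looks like this standard Promtail Docker SD mapping (a sketch):

```yaml
relabel_configs:
  - source_labels: ["__meta_docker_container_log_stream"]
    target_label: "logstream"   # yields "stdout" or "stderr"
```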
What's done well

- `compose_service` vs `container_name` guidance in both the config and DEPLOYMENT.md is exactly the kind of operability knowledge that prevents future confusion

🎨 Leonie Voss — UI/UX Design Lead & Accessibility Strategist
Verdict: ✅ Approved
This PR makes no changes to the frontend, UI components, routes, or user-visible behavior. It adds backend infrastructure (Loki + Promtail) that operates entirely below the UI layer.
I verified:

- no changes to `.svelte` files

From a UX and accessibility perspective, this PR has no impact. No findings.
The only indirect user-facing consequence is improved operator visibility via log aggregation, which benefits users indirectly through faster incident resolution. That is a positive outcome.
LGTM.
Review blockers addressed in commit c1406a32:

Blocker 1 — C4 diagram (`docs/architecture/c4/l2-containers.puml`)
- Added `Container(promtail, "Promtail", "grafana/promtail:3.4.2", "Ships Docker container logs to Loki via Docker SD")` inside the observability `System_Boundary`
- Added `Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")`
- Loki description is now "Stores log streams from all containers." — "Wiring TBD" language removed; version pinned to `3.4.2`

Blocker 2 — Docker socket comment (`docker-compose.observability.yml`)
- Comment upgraded to: ":ro restricts file-system access but NOT Docker API permissions — a compromised Promtail has full daemon access. Accepted risk on single-operator self-hosted archive."

Bonus — Loki compactor (`infra/observability/loki/loki-config.yml`)
- Added a `compactor:` block after `limits_config:` so the 30-day retention policy actually runs (without a compactor, Loki ignores `retention_period`)