devops(observability): add Loki + Promtail for centralised container log aggregation #586
Summary
Closes #574
- Adds `obs-loki` (`grafana/loki:3.4.2`) and `obs-promtail` (`grafana/promtail:3.4.2`) services to `docker-compose.observability.yml`
- `infra/observability/loki/loki-config.yml` — single-node TSDB v13, 30-day retention, analytics off
- `infra/observability/promtail/promtail-config.yml` — Docker SD with relabeling for `container_name`, `compose_service`, `compose_project`, `logstream`
- `promtail_positions` named volume for restart safety
- `docs/DEPLOYMENT.md` updated with Loki + Promtail service entries
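For orientation, the points above correspond roughly to service definitions like the following. This is a condensed sketch reconstructed from the reviews below, not the literal file contents; exact paths, probe commands, and network declarations in the real compose file may differ:

```yaml
# Condensed sketch, not the literal file contents.
services:
  obs-loki:
    image: grafana/loki:3.4.2
    expose:
      - "3100"                                 # internal only; nothing bound to the host
    volumes:
      - ./infra/observability/loki/loki-config.yml:/etc/loki/config.yml:ro
    networks:
      - obs-net
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3100/ready | grep -q ready"]
      interval: 15s
      start_period: 30s                        # Loki needs time to initialize on cold start

  obs-promtail:
    image: grafana/promtail:3.4.2
    volumes:
      - ./infra/observability/promtail/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # Docker SD; see the security review
      - promtail_positions:/positions                  # survives restarts; no duplicate ingestion
    networks:
      - obs-net                                # ship logs to Loki
      - archiv-net                             # discover application container labels
    depends_on:
      obs-loki:
        condition: service_healthy             # do not ship before Loki is ready

networks:
  obs-net:
  archiv-net:     # in the real setup this is presumably shared with the application stack

volumes:
  promtail_positions:
```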
Test plan

- `docker compose -f docker-compose.observability.yml config` exits 0
- `docker compose -f docker-compose.observability.yml up -d loki promtail` starts both containers
- `curl -s http://localhost:3100/ready` returns `ready`

🤖 Generated with Claude Code
🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is solid infrastructure work. Image tags are pinned, the healthcheck pattern is correct, the named volume for positions is a good call, and the `depends_on: condition: service_healthy` chain is exactly right. The inline comments explaining trade-offs (Docker socket risk, `compose_service` vs `container_name`, retention rationale) are the kind of comments I actually want in infra files. Good job.

Blockers
1. `l2-containers.puml` is not updated with Promtail

`docs/architecture/c4/l2-containers.puml` (head branch) already has `loki` in the observability boundary, but `promtail` is absent. This PR wires Promtail as a real, running service — the diagram must reflect it. Markus requires C4 diagram updates when new Docker services are added. Add a `Container(...)` node for Promtail inside the observability boundary and a `Rel(promtail, loki, "Ships log streams", "HTTP / Loki push API")` relationship.

Suggestions (non-blocking)
2. `loki` service is missing `archiv-net`?

Promtail joins both `archiv-net` (for label discovery) and `obs-net` (for log shipping to Loki). Loki only joins `obs-net`. That is correct — Loki does not need to reach application containers directly. Just confirming this is intentional and not an omission: Promtail's Docker SD discovery goes via the socket, not the network. All good.

3. Loki retention requires `compactor` to be enabled
`limits_config.retention_period: 720h` sets the retention policy, but Loki's TSDB backend also needs the compactor enabled and `compaction_interval` set, otherwise retention deletion does not run. The current config has no `compactor:` block. For a low-volume family archive this is unlikely to cause disk pressure soon, but worth knowing. This is a suggestion, not a blocker, but retention will silently not work without it.
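A minimal block along these lines would make retention actually run (a sketch; the working directory is illustrative, and on Loki 3.x `delete_request_store` must be set once `retention_enabled` is true):

```yaml
compactor:
  working_directory: /loki/compactor   # illustrative path
  compaction_interval: 10m
  retention_enabled: true              # without this, retention_period is never enforced
  retention_delete_delay: 2h           # grace period before chunks are actually deleted
  delete_request_store: filesystem     # required by Loki 3.x when retention_enabled is true
```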
4. Promtail `grpc_listen_port: 0` — document why

The comment says "gRPC disabled — used for Promtail clustering only; single-node deployment." Good. No action needed.
5. No Renovate coverage for the new images
`grafana/loki:3.4.2` and `grafana/promtail:3.4.2` are new pinned images. Verify Renovate's config matches `docker-compose.observability.yml` so these get bump PRs automatically. If Renovate is not watching this file, add it to the config.

What's done well
- `promtail_positions` named volume prevents duplicate log ingestion on restart — many deployments miss this
- `start_period: 30s` on Loki's healthcheck accounts for cold-start initialization time
- `compose_service` label recommendation in both the config and the docs is the right operational guidance
- `expose` (not `ports`) on Loki port 3100 — internal only, nothing bound to host

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
This is a well-considered infrastructure change with explicit, honest security commentary already embedded in the diff. The Docker socket trade-off is documented, the `auth_enabled: false` justification is present, and the unauthenticated Loki push API concern is called out. That transparency is the right approach. However, one concern warrants a blocker and one is worth flagging for future planning.

Blockers
1. Docker socket exposure grants container-escape-equivalent access
The PR author already flags this in the compose comment. I want to make the exact risk explicit so it's a conscious acceptance, not just a "revisit later": `:ro` on the socket mount restricts file-system writes, not Docker API calls. Any process that can talk to the daemon can start privileged containers, mount host paths, and read every container's environment; that is root-equivalent access to the host.
This is acceptable for a single-operator family archive on a private server, but the acceptance should be explicit. I'm not asking for the PR to be blocked — I'm asking for the comment to be upgraded from "acceptable on a single-operator archive; review if multi-user access to the host is ever introduced" to something more precise.
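For example, something like this on the socket mount (a sketch; exact wording to taste):

```yaml
volumes:
  # SECURITY: ":ro" restricts file-system writes but NOT Docker API permissions.
  # A compromised Promtail still has full daemon access: it can start privileged
  # containers, mount host paths, and read any container's environment.
  # Accepted risk on a single-operator, self-hosted archive; revisit before
  # anyone else gets access to the host.
  - /var/run/docker.sock:/var/run/docker.sock:ro
```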
If you accept the current comment wording, I'll drop this to a suggestion. I raise it as a blocker because the current comment undersells the actual privilege level.
Suggestions (non-blocking)
2. Unauthenticated Loki push endpoint on `obs-net`

`promtail-config.yml` acknowledges: "Any container on obs-net can push logs." True. Worth noting: the practical blast radius on a single-VPS deployment with a private Docker network is low. If the observability stack ever gets more services or the network is widened, turning on Loki's built-in multi-tenancy (`auth_enabled: true`) is straightforward. Strictly speaking this enforces an `X-Scope-OrgID` tenant header rather than a secret token, so real credentials would need a reverse proxy in front of Loki. Not required now, but easy to add before any network topology change.
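A sketch of that setup, with an illustrative tenant name; note this is header enforcement, not secret-based auth:

```yaml
# loki-config.yml (sketch)
auth_enabled: true       # Loki then rejects requests without an X-Scope-OrgID header

# promtail-config.yml (sketch): the client must send a matching tenant
clients:
  - url: http://obs-loki:3100/loki/api/v1/push
    tenant_id: archiv    # illustrative tenant name; sent as the X-Scope-OrgID header
```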
3. No log filtering — application secrets may appear in Loki

Promtail is configured to ship ALL Docker container logs with no drop/filter rules. Spring Boot logs, if misconfigured (e.g., DEBUG level or log injection), could include session tokens, passwords, or PII. This is a systemic risk to flag for when Grafana dashboards are built (issue #581) — add a `pipeline_stages` drop rule for known-sensitive patterns then.
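When that work happens, a Promtail `drop` stage is the likely shape; the job name and patterns below are illustrative only:

```yaml
# promtail-config.yml (sketch)
scrape_configs:
  - job_name: docker      # assumed name of the existing Docker SD job
    # ... existing docker_sd_configs and relabel_configs ...
    pipeline_stages:
      - drop:
          expression: "(?i)(password=|authorization: bearer )"   # illustrative patterns only
```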
4. `analytics.reporting_enabled: false` — correct

Good. Disabling Grafana Labs telemetry is the right default for a private family archive.
What's done well
- `auth_enabled: false` with explicit scope comment (not exposed beyond obs-net) — this is the right way to document a deliberate security trade-off
- `:ro` on all config volume mounts — defense in depth, even if not sufficient alone for the socket
- `expose`-only, not host-bound — correct, reduces attack surface to the Docker internal network only
- `obs-net` only for Loki push (log data stays internal), `archiv-net` only for Docker SD (label discovery stays internal)

🏛️ Markus Keller — Senior Application Architect
Verdict: 🚫 Changes requested
The infrastructure wiring is correct and the operational decisions are sound. However, the PR triggers a mandatory C4 diagram update that is missing, which is a blocker per our documentation rules. One additional ADR concern.
Blockers
1. `docs/architecture/c4/l2-containers.puml` not updated with `promtail`

The PR description says: "Adds `obs-loki` and `obs-promtail` services." My rule table is unambiguous: a new Docker service requires a C4 container-diagram update. `DEPLOYMENT.md` is updated — good. The C4 diagram is not. Loki is already present in the diagram (with a "Wiring TBD" description) but Promtail is absent as a container node, and neither the Loki description nor the diagram relationships have been updated to reflect that wiring is now live.

Required changes to `l2-containers.puml`: add a `Container(promtail, ...)` node inside the observability `System_Boundary` and a `Rel(promtail, loki, ...)` relationship to Loki. Also update the Loki description to remove "Wiring TBD — see issue #581" since the Promtail wiring is now live.
This is a blocker. The PR does not merge without diagram currency.
2. ADR may be warranted for Docker socket access
Adding `/var/run/docker.sock` access to a container is an architectural decision with lasting security consequences (see Nora's review for the exact privilege level), which is the kind of decision our ADR rule is meant to capture. We already have ADR-014 for the upload-artifact pin. The Docker socket trade-off is arguably ADR-worthy. I'll leave this as a judgment call — if the team considers it a standard ops decision rather than an architectural one, a PR comment is sufficient. But if the socket access policy ever needs to be revisited, an ADR makes the decision findable.
Suggestions (non-blocking)
3. Loki container description in C4 still says "Wiring TBD"
Even if Promtail is not added to the diagram in this PR, the Loki description should drop the "Wiring TBD" language since it is now wired. This is a consistency issue, not a functional one.
4. Boring technology evaluation: is Promtail the right choice?
Grafana has been recommending Grafana Alloy (the successor to Promtail) since 2024. Promtail is in maintenance mode. For a new installation in 2026, Alloy would be the forward-looking choice. That said, Promtail 3.4.2 will continue to work for years and the migration path is documented. I flag this not as a blocker but as a future consideration — worth tracking in a follow-up issue.

What's done well
- Promtail joins both `archiv-net` (label discovery) and `obs-net` (log shipping). This is the right network design.
- The `DEPLOYMENT.md` update is thorough and includes actionable quick-check commands — this is exactly the runbook-style documentation I want
- `expose` vs `ports` is used correctly throughout — no observability ports are accidentally host-bound
- The `depends_on: service_healthy` chain is architecturally correct — Promtail does not start shipping until Loki is confirmed ready

👨‍💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a pure infrastructure PR — no application code changes, no frontend, no backend Java. My review scope is narrow: config file quality, comment hygiene, and whether the changes follow the project's conventions.
No blockers
The infrastructure files are clean. Config keys are explicit, values are commented where non-obvious, and there is no dead code. I have no code-quality concerns.
Suggestions (non-blocking)
1. `promtail-config.yml` — the `grpc_listen_port: 0` comment is slightly misleading

The comment says "gRPC disabled — used for Promtail clustering only; single-node deployment." `grpc_listen_port: 0` does not disable gRPC — it binds to a random available port. To actually prevent gRPC from binding, you'd omit it or set it to `-1` in newer versions. In practice at this Promtail version (3.4.2), `0` means "let the OS assign a port", which is effectively unused since nothing calls it. The comment intent is right but the mechanism description is slightly off. This is cosmetic — functionally it does not matter for a single-node setup.
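For reference, the server block in question presumably looks like this (a sketch assuming Promtail's default HTTP port):

```yaml
server:
  http_listen_port: 9080   # Promtail's default HTTP port (serves /ready and /metrics)
  grpc_listen_port: 0      # 0 = "let the OS assign an ephemeral port", not "disabled"
```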
2. `loki-config.yml` — `instance_addr: 127.0.0.1` in a container context

`common.instance_addr: 127.0.0.1` is the Loki node's self-address for its ring membership. In a single-node `inmemory` kvstore setup this value is never used for inter-node communication, so it is harmless. Just noting it for clarity — it would be incorrect in a multi-node setup.

What's done well
- `reporting_enabled: false` — no surprise telemetry
- the `promtail_positions` volume name is semantically correct and self-documenting

🧪 Sara Holt — QA Engineer & Test Automation Specialist
Verdict: ⚠️ Approved with concerns
This is infrastructure-only — no application code, no test changes. My review focuses on testability of the deployment and whether the PR's own test plan is adequate.
Concerns (non-blocking)
1. PR test plan is manual-only — no CI verification
The PR's test plan lists three manual steps:
docker compose ... configexits 0docker compose up -d loki promtailstarts both containerscurl http://localhost:3100/readyreturnsreadyNone of these are automated. For infrastructure changes this is typical and acceptable, but I want to flag what we're accepting: there is no CI gate that verifies the observability stack starts cleanly after this change. If a future PR breaks the compose file, CI will not catch it because the observability stack is not started in the CI pipeline.
This is a pre-existing gap (not introduced by this PR), but worth tracking: a follow-up task to add `docker compose -f docker-compose.observability.yml config` as a CI lint step would be low-cost and would catch YAML syntax errors automatically.
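A sketch of such a gate, assuming the project runs GitHub Actions (ADR-014's upload-artifact pin suggests it does); the workflow name and triggers are illustrative:

```yaml
# .github/workflows/compose-lint.yml (hypothetical)
name: compose-lint
on:
  pull_request:
    paths:
      - "docker-compose*.yml"
      - "infra/observability/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate observability compose file
        run: docker compose -f docker-compose.observability.yml config --quiet
```

`--quiet` keeps the job output clean while still failing on invalid YAML or interpolation errors.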
2. Loki retention is untested

The 30-day retention config (`720h`) has no verification path. As Tobias notes, without a `compactor` block the retention policy may not actually run. There is no test that verifies logs are deleted after the retention window. For a low-volume family archive this is not a production risk today, but it means retention is "configured but not verified."

3. No smoke test for log ingestion
The test plan verifies Loki is alive (`/ready`) but does not verify that Promtail is actually shipping logs and Loki is actually receiving them. A complete smoke test would also confirm ingestion end to end, e.g. by querying Loki's label endpoint (`curl -s http://localhost:3100/loki/api/v1/labels`) for the relabeled labels and by fetching a known container's recent log lines. This is a suggestion for the test plan, not a code change.
What's done well
- `depends_on: condition: service_healthy` is the correct pattern — prevents race conditions that would otherwise cause flaky startup behavior
- `start_period: 30s` on the Loki healthcheck is appropriately sized — containers often fail health checks on cold start without this
- the `promtail_positions` volume is a testability win — it means restart scenarios produce deterministic behavior (no duplicate log ingestion) rather than nondeterministic ones

📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
This PR closes issue #574. My scope is requirements traceability: does the implementation match what was asked for, and are there any gaps or undocumented assumptions?
Requirements coverage
This PR closes #574. It delivers:
- `depends_on: service_healthy` ordering ✅
- `docs/DEPLOYMENT.md` updated ✅

Observations (non-blocking)
1. Issue #575 referenced as a placeholder in the original compose — what happened to it?
The original compose file had a comment `# promtail: (see issue #575)`. This PR closes #574 (Loki) and implicitly delivers #575 (Promtail) together. Confirm that issue #575 is closed or merged into #574 — leaving it open creates a false backlog item.
2. `docs/DEPLOYMENT.md` — the service table still lists Grafana as "dashboards and alerting UI" but Grafana is not yet wired to Loki

The table entry for `obs-loki` is accurate. However, the Grafana entry in the same table says "data sources: Prometheus + Loki" — implying a working Grafana→Loki data source. That wiring is tracked in issue #581 and is not delivered in this PR. The table entry could create a false impression that Grafana dashboards are ready. Consider adding a note: "Loki data source configuration tracked in issue #581."

3. The `logstream` label is relabeled but its semantics are undocumented

Promtail relabels `__meta_docker_container_log_stream` to `logstream`. This distinguishes `stdout` from `stderr`. The DEPLOYMENT.md quick-checks do not mention this label. Worth adding a one-liner to the docs so operators know they can filter by `{logstream="stderr"}` to see only error output.
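For context, the relabel rule in question presumably looks like this standard Promtail Docker SD mapping (a sketch):

```yaml
relabel_configs:
  - source_labels: ["__meta_docker_container_log_stream"]
    target_label: "logstream"   # yields "stdout" or "stderr"
```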
What's done well

- `compose_service` vs `container_name` guidance in both the config and DEPLOYMENT.md is exactly the kind of operability knowledge that prevents future confusion

🎨 Leonie Voss — UI/UX Design Lead & Accessibility Strategist
Verdict: ✅ Approved
This PR makes no changes to the frontend, UI components, routes, or user-visible behavior. It adds backend infrastructure (Loki + Promtail) that operates entirely below the UI layer.
I verified:

- no changes to `.svelte` files

From a UX and accessibility perspective, this PR has no impact. No findings.
The only indirect user-facing consequence is improved operator visibility via log aggregation, which benefits users indirectly through faster incident resolution. That is a positive outcome.
LGTM.
Review blockers addressed in commit c1406a32:

Blocker 1 — C4 diagram (`docs/architecture/c4/l2-containers.puml`)
- Added `Container(promtail, "Promtail", "grafana/promtail:3.4.2", "Ships Docker container logs to Loki via Docker SD")` inside the observability `System_Boundary`
- Added `Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")`
- Loki description is now "Stores log streams from all containers." — "Wiring TBD" language removed; version pinned to `3.4.2`

Blocker 2 — Docker socket comment (`docker-compose.observability.yml`)
- Comment upgraded to: ":ro restricts file-system access but NOT Docker API permissions — a compromised Promtail has full daemon access. Accepted risk on single-operator self-hosted archive."

Bonus — Loki compactor (`infra/observability/loki/loki-config.yml`)
- Added a `compactor:` block after `limits_config:` so the 30-day retention policy actually runs (without a compactor, Loki ignores `retention_period`)