devops(observability): add Loki + Promtail for centralised container log aggregation #574
Context
Loki stores log streams from all Docker containers. Promtail is the shipping agent that reads Docker container log files from the host filesystem and forwards them to Loki with labels (`container_name`, `compose_service`). Once done, every container's stdout/stderr (backend, frontend, ocr-service, db, minio, etc.) is queryable and viewable in Grafana using LogQL.

Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Services to Add
Config Files to Create
- `infra/observability/loki/loki-config.yml`
- `infra/observability/promtail/promtail-config.yml`

Acceptance Criteria
- `docker compose -f docker-compose.observability.yml up -d loki promtail` starts both containers without error
- `curl -s http://localhost:3100/ready` returns `ready`
- `curl -s 'http://localhost:3100/loki/api/v1/labels' | jq '.data'` lists `container_name`
- Logs from `archive-backend` are queryable: `curl -G http://localhost:3100/loki/api/v1/query_range --data-urlencode 'query={container_name="archive-backend"}' --data-urlencode 'limit=5'` returns log lines with the expected `container_name` label values

Definition of Done
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Observations
- Pinned image version (`3.4.2`), named volume (`loki_data`), `restart: unless-stopped`, and no public port exposure for Loki. These are all correct defaults.
- Promtail bridges `archiv-net` (to read Docker labels/metadata from application containers) and `obs-net` (to push to Loki). This is the correct topology. Worth a comment in the YAML explaining why Promtail needs `archiv-net` — it's not for data access, it's for label discovery via the Docker socket.
- The `/var/run/docker.sock` mount is a known privilege escalation vector. Promtail uses it for Docker service discovery. This is standard practice for this use case but must be documented. Anyone with write access to the `obs-promtail` container can escape to the host via the socket. On a single-operator family archive this is acceptable — but name it explicitly in a comment.
- `positions.yaml` at `/tmp/positions.yaml` is ephemeral — it lives inside the container and is lost on restart. This means Promtail re-reads log files from the beginning on every container restart, causing duplicate log ingestion in Loki. Use a named volume or a bind mount to persist positions across restarts.
- `docs/infrastructure/production-compose.md` already flags that the observability stack will add ~2 GB. Loki's filesystem storage with 30-day retention on a busy app can grow further. Worth flagging that `loki_data` volume size should be monitored from day one.
- `docker-compose.observability.yml` is a separate file, which is consistent with how the prod compose is currently structured. However, `docs/infrastructure/production-compose.md` says the observability containers should "join `docker-compose.prod.yml` under a dedicated profile". That expectation conflicts with a separate file — worth aligning before implementation.
- The acceptance criteria use `curl http://localhost:3100/ready` to verify Loki is up. Add a `healthcheck` on the Loki service in the Compose definition so Promtail's `depends_on` can use `condition: service_healthy` instead of just `condition: service_started`.
- `docs/architecture/c4/l2-containers.puml` does not include Loki, Promtail, or the `obs-net` network. The `docs/DEPLOYMENT.md §1` topology diagram also has no observability containers. Both need updating when this lands.

Recommendations
Fix positions file persistence — add a named volume for Promtail's positions file:
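A minimal sketch of the Compose change, assuming the volume is named `promtail_positions` (the name the implementation commit below ends up using):

```yaml
services:
  promtail:
    volumes:
      - promtail_positions:/tmp   # positions.yaml now survives container restarts

volumes:
  promtail_positions:
```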
And in `promtail-config.yml`: `filename: /tmp/positions.yaml` — same path, now on a named volume.

Add healthcheck to Loki:
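A sketch of the healthcheck, assuming the `grafana/loki` image ships a busybox `wget` (the implementation notes below confirm a `wget /ready` check with `start_period: 30s`):

```yaml
services:
  loki:
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3100/ready"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s   # give Loki time to initialise before counting failures
```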
Then in Promtail:
`depends_on: { loki: { condition: service_healthy } }`.

Add a
`docker-compose.observability.yml` comment at the top of the file documenting the startup command, e.g.:
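A sketch of such a header comment (wording assumed):

```yaml
# Observability stack (Loki + Promtail). Runs separately from the app compose:
#   start:     docker compose -f docker-compose.observability.yml up -d
#   tear down: docker compose -f docker-compose.observability.yml down -v
#              (down -v also removes the loki_data volume)
```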
Align with `production-compose.md`: the existing docs say observability containers join `docker-compose.prod.yml` under a profile. Decide now: separate file OR profile-gated addition to the prod compose. Either is fine, but pick one and update `production-compose.md` accordingly.

Add a comment on the docker.sock risk in the Compose YAML so future operators understand the trade-off was intentional.
Open Decisions
`docs/infrastructure/production-compose.md` explicitly says observability will join `docker-compose.prod.yml` under a `monitoring` profile. This issue proposes a standalone `docker-compose.observability.yml`. Both work operationally, but they have different restart/upgrade ergonomics. Pick one before implementing so the PR doesn't need a revert later. Options: (a) standalone file — simpler to develop in isolation, requires separate `docker compose -f` invocations; (b) profile in prod compose — single file to operate, slightly more complex to isolate during development. (Raised by: Tobias)

🏛️ Markus Keller (@mkeller) — Application Architect
Observations
- The `obs-net` network introduces a second Docker network. The issue is clear that application containers stay on `archiv-net` and observability containers live on `obs-net`, with Promtail bridging both. This is clean. However, neither `docs/architecture/c4/l2-containers.puml` nor `docs/DEPLOYMENT.md §1` reflects this topology. Per the doc-update rule: new Docker service or infrastructure component → `l2-containers.puml` + `docs/DEPLOYMENT.md` must be updated in the same PR.
- The schema v13 + TSDB store is current as of Loki 3.x. The 30-day retention (`720h`) is a reasonable default. No architectural objection here.
- `auth_enabled: false` is correct for a single-tenant self-hosted deployment where Loki is not exposed beyond the internal Docker network. This should be documented with a comment in the config file explaining the trust boundary: "Loki is not exposed outside `obs-net`. Auth disabled because all clients are trusted internal containers."
- `analytics.reporting_enabled: false` — good. Loki's analytics reporting phones home to Grafana Labs. Disabling it is the right default for a privacy-conscious family archive.
- `ring.kvstore.store: inmemory` combined with `replication_factor: 1` is the correct single-node configuration. Do not change this to `etcd` or `consul` — there is no cluster here, and adding distributed coordination to a single-node setup is textbook accidental complexity.
- `docs/adr/`.

Recommendations
- Update `l2-containers.puml` in the same PR that adds the Compose services. The diagram currently shows no observability layer. Add `obs-loki`, `obs-promtail`, and `obs-net` to the container diagram with their relationships. Without this the architecture documentation immediately drifts from reality.
- Update `docs/DEPLOYMENT.md §4 Logs + observability` — currently reads "Phase 7 ... adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet." When Loki lands this text becomes a lie. Update it to show the new `docker compose -f docker-compose.observability.yml` commands alongside the existing `docker logs` commands.
- Add the `auth_enabled: false` justification comment to `loki-config.yml` — a future operator reading this config needs to know it's intentional, not an oversight. A sketch follows below.
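A minimal sketch of the relevant `loki-config.yml` fragments with that comment in place, assuming Loki 3.x option names (values taken from the issue; the `from` date is a placeholder):

```yaml
# auth disabled — Loki is not exposed outside obs-net; all clients are
# trusted internal containers. Add auth before exposing port 3100.
auth_enabled: false

analytics:
  reporting_enabled: false   # no usage telemetry to Grafana Labs

common:
  replication_factor: 1      # single node; no etcd/consul
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"     # placeholder; any date before first ingestion
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 720h     # 30 days
```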
Open Decisions (none)

The architectural choices in this issue are well-scoped and consistent with the existing stack. No open decisions from my side — just execute the doc updates alongside the config files.
🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
- The `/var/run/docker.sock` bind mount — CWE-284 (Improper Access Control). Any process running inside the `obs-promtail` container with write access to the socket can execute arbitrary Docker commands on the host, effectively escaping the container. This is a well-known Promtail operational requirement and there is no practical alternative for Docker service discovery on a single-host deployment. However: (1) the Promtail process should run as a non-root user where the image supports it, and (2) this risk must be named explicitly in a comment — an unattributed `docker.sock` mount is a red flag in any security review.
- The `/var/lib/docker/containers` bind mount as `:ro` — read-only is correct. Promtail reads log files; it must not be able to write to container log directories. The `:ro` flag is present in the issue snippet. Good.
- `auth_enabled: false` in the Loki config — this is acceptable only because Loki's port `3100` is declared with `expose:` (not `ports:`), meaning it is not reachable from the host or internet, only from containers on `obs-net`. If someone adds `ports: - "3100:3100"` to the Loki service in the future (e.g. to query it from a local Grafana), this config becomes a security issue. Add a comment: `# auth disabled — Loki is not exposed beyond obs-net. Add auth before exposing port 3100.`
- `reporting_enabled: false` — correct. Loki phones home to Grafana Labs with usage telemetry. Disabled is the right default for a family archive with private documents.
- The `relabel_configs` scrape ALL Docker containers. This is appropriate for a dev/ops observability tool. The log lines will contain whatever the application logs — including potentially usernames, session IDs, document titles, or email addresses from Spring's default request logging. Review what `archive-backend` actually logs and ensure PII does not appear in Loki. If it does, add a `pipeline_stages` drop filter in the Promtail config.
- The Loki HTTP API is unauthenticated internally — `http://loki:3100/loki/api/v1/push` in `promtail-config.yml`. Any container on `obs-net` that can reach this endpoint can inject arbitrary log entries into Loki. For a single-tenant family archive where only trusted containers join this network, this is acceptable. Name it.
- The Loki endpoint future clients will use (`http://loki:3100`) must not accidentally expose admin credentials. Loki with `auth_enabled: false` has no credentials — this is fine.

Recommendations
- Add a risk comment on the `promtail` service in the Compose YAML (a sketch follows this list).
- Add a comment in `loki-config.yml` on the `auth_enabled: false` line (wording as in the observation above).
- Audit `archive-backend` logs under the `INFO` level before enabling Loki in production. Check for email addresses, display names, or document titles in log lines. Add a Promtail `pipeline_stages` drop rule if PII is present.
- The `obs-net` network is `internal: false` (the default) only because Promtail needs it for Docker socket discovery — not because Loki should be internet-reachable. The `expose: ["3100"]` with no `ports:` is the correct guard. Verify this is not accidentally overridden in a prod compose overlay.
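A sketch of the `promtail` service with the risk comment in place (image tag and volume names taken from the implementation notes below):

```yaml
services:
  promtail:
    image: grafana/promtail:3.4.2
    restart: unless-stopped
    volumes:
      # SECURITY: the Docker socket grants host-level Docker control (CWE-284).
      # Required for Docker service discovery; accepted trade-off on this
      # single-operator host. Do not widen access to this container.
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - promtail_positions:/tmp
    networks:
      - archiv-net   # label discovery only, not data access
      - obs-net      # push path to Loki
```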
Open Decisions (none)

All security concerns are addressable with comments and a PII audit. No blocking decisions required.
👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Observations
This is a pure infrastructure issue — no application code changes. From a developer perspective, the main concerns are: (1) does this break the dev workflow, (2) is the config readable and maintainable, and (3) are there integration points with the application code that need attention.
- The stack lives in a separate `docker-compose.observability.yml`. Developers who don't need log aggregation for their current task can ignore it entirely. This is the right design — it's opt-in for development, not bundled into the default `docker compose up -d`. No friction for the normal development loop.
- The `relabel_configs` use explicit `source_labels` and `target_label` pairs. A developer reading this cold can understand what labels will be applied.
- `grpc_listen_port: 0` — this is correct. Setting the gRPC port to 0 disables the gRPC server (Promtail uses it for clustering, which is not needed here). Worth a comment for future readers who might wonder if this is a misconfiguration.
- The `relabel_config` relies on `__meta_docker_container_name` extracting the container name (e.g. `archive-backend`). The dev compose defines `container_name: archive-backend`. The prod compose does NOT set explicit container names — it relies on Docker Compose's project-namespaced names (`archiv-production-backend-1`). The regex `/(.*)` in the relabel config strips the leading slash, so the label will differ between dev (`archive-backend`) and prod (`archiv-production-backend-1`). This means LogQL queries written for dev will not work in prod without adjustment. Either standardise container naming or document the label difference.
- No changes to `application.yaml` or the logging config are required for this issue. Correct.

Recommendations
- Comment `grpc_listen_port: 0` in the Promtail config: `# gRPC disabled — used for Promtail clustering only; single-node deployment`
- Prefer `compose_service` as the primary label in LogQL queries (it is stable across environments: `backend`, `db`, `minio`) rather than `container_name`. The `compose_service` label is already extracted via `__meta_docker_container_label_com_docker_compose_service` — see the config sketch after this list.
- The acceptance `curl` commands use `container_name="archive-backend"` — but in prod this will be `archiv-production-backend-1`. The criteria should either use `compose_service="backend"` or note the environment difference explicitly.
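A sketch of the relevant `promtail-config.yml` fragments under these recommendations (ports and regex are illustrative; the meta label names are the standard Promtail Docker SD ones cited above):

```yaml
server:
  http_listen_port: 9080   # assumed port
  grpc_listen_port: 0      # gRPC disabled — used for Promtail clustering only; single-node deployment

positions:
  filename: /tmp/positions.yaml   # persisted via the promtail_positions volume

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # "/archive-backend" -> "archive-backend" (dev); prod names are
      # project-namespaced, e.g. "archiv-production-backend-1"
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container_name'
      # stable across dev and prod; prefer this label in LogQL
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'
```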
Open Decisions (none)

Infrastructure-only issue. No application code concerns that require a human decision.
🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Observations
This is an infrastructure provisioning issue — no unit or integration tests are expected. The acceptance criteria are specified as manual `curl` verification steps. That is appropriate for infrastructure bring-up checks, but a few gaps are worth flagging.

- The `curl` checks are concrete and verifiable. They cover: service health, label presence, and actual log queryability for a named container. This is a solid smoke-test checklist.
- Consider a scripted variant (`infra/observability/smoke-test.sh`) that runs all five checks and exits non-zero on failure. This would allow future CI pipelines or post-deploy verification to reuse the same checks programmatically. It's not required for this issue to be complete, but worth considering.
- A teardown note: `docker compose -f docker-compose.observability.yml down -v` removes both containers and the `loki_data` volume. This isn't a test gap, but it belongs in the commit or docs.
- `loki_data` volume retention: the Loki config sets `retention_period: 720h` (30 days). Loki's compactor must be enabled for retention to actually enforce deletion — by default Loki retains data indefinitely even if `retention_period` is set. Check whether the `common.storage` + `limits_config` combination activates the compactor automatically in Loki 3.4.x. If it does not, logs will accumulate indefinitely, filling the named volume. (A compactor sketch follows below.)
- Restart behaviour is untested: run `docker compose -f docker-compose.observability.yml restart loki` and verify Promtail re-delivers buffered logs.
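If the compactor does need enabling explicitly, the block would look roughly like this (Loki 3.x option names; paths assumed):

```yaml
compactor:
  working_directory: /loki/compactor
  retention_enabled: true           # without this, retention_period is not enforced
  retention_delete_delay: 2h
  delete_request_store: filesystem  # required when retention_enabled is true
```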
Recommendations

- Verify whether Loki 3.4.x needs an explicit `compactor` block in the config to enforce the `retention_period: 720h` setting. If it does, add the compactor block to `loki-config.yml` before closing the issue — otherwise the 30-day retention is a no-op and the `loki_data` volume grows unbounded.
- Add a restart-resilience acceptance criterion, e.g. "after `docker compose restart loki`, Promtail reconnects and logs continue to appear in Loki within 60 seconds." This validates the reconnect behaviour.
- Fold the `curl` acceptance checks into `infra/observability/smoke-test.sh` — even a simple shell script with `set -e` makes these repeatable without copying commands from the issue. A sketch follows below.
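A condensed sketch of such a script (three of the checks shown; endpoints and labels from the acceptance criteria):

```bash
#!/usr/bin/env bash
# infra/observability/smoke-test.sh — post-deploy checks for Loki + Promtail
set -euo pipefail

LOKI=http://localhost:3100

# 1. Loki reports ready
curl -sf "$LOKI/ready" | grep -q ready

# 2. The expected label is present
curl -sf "$LOKI/loki/api/v1/labels" | jq -e '.data | index("compose_service")' > /dev/null

# 3. Log lines are queryable for the backend service
curl -sfG "$LOKI/loki/api/v1/query_range" \
  --data-urlencode 'query={compose_service="backend"}' \
  --data-urlencode 'limit=5' | jq -e '.data.result | length > 0' > /dev/null

echo "smoke test passed"
```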
Open Decisions (none)

All test strategy concerns have concrete recommendations. No human decisions required before implementation.
📋 Elicit — Requirements Engineer
Observations
This is a well-specified infrastructure issue. The context is clear, the services-to-add section is precise, and the acceptance criteria are concrete and verifiable. A few gaps in scope definition and traceability:
- The dependency is stated as "scaffold issue (compose file and `infra/observability/` directory must exist first)." No issue number is given. If the scaffold issue has not been created yet, this issue has an unresolved blocker. If it exists, link it explicitly so the implementer does not start work on this before the dependency is resolved.
- The acceptance criteria verify the stack only via `curl` against the Loki HTTP API. Without Grafana, there is no human-readable interface for the logs. For the issue to be "done" in a useful operational sense (not just technically), it should note that this is a "foundation" issue and that Grafana (or at minimum `logcli`) is required for the team to actually benefit from the logs. This is a scope boundary that the implementer should be aware of.
- The retention period (`720h`) is stated but not justified. For a family archive with low log volume (a few containers, mostly idle outside of OCR jobs), 30 days is likely fine. For a burst scenario (OCR batch processing, mass import), logs could grow more than expected. A brief comment on the choice would prevent future confusion.
- A future Grafana will query Loki at `http://loki:3100`. For this to work, Grafana must also join `obs-net`. The network design decision made in this issue constrains the Grafana issue. This should be noted here — either as a "follow-on note" or as an explicit design comment in the Compose file — so the Grafana issue author knows what network to use.

Recommendations
- Add a follow-on note for the Grafana issue: "Grafana must join `obs-net` to query Loki. This network must be created as an external named network if Grafana lives in a separate compose file, or both services must be in the same file." This prevents the Grafana issue from discovering the network constraint at implementation time. A sketch follows below.
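For illustration, what the future Grafana compose file would need if `obs-net` stays in a separate file (standard Compose external-network syntax; names assumed):

```yaml
networks:
  obs-net:
    external: true
    # NOTE: Compose prefixes network names with the project name by default.
    # Set an explicit `name:` on obs-net in docker-compose.observability.yml
    # to make this external reference stable.
```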
Open Decisions (none)

The requirements are clear and implementable as written. The gaps above are completeness notes, not blockers.
🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist
Observations
This issue has no UI, frontend, or user-facing component. It adds log aggregation infrastructure that is used exclusively by operators via CLI tools and (eventually) a Grafana dashboard.
From my angle: nothing to flag on accessibility, responsive design, brand compliance, or interaction patterns — there is no UI being built here.
I checked:
- The `curl`-based acceptance criteria are operator tooling only.

The one future UX note: when Grafana is added in a subsequent Phase 7 issue, the Grafana login page and dashboard will be operator-facing (not end-user-facing), but should still use reasonable defaults for readability — default Grafana light theme, legible font sizes, and not exposed publicly. That's a concern for the Grafana issue, not this one.
No action required on this issue from a UX perspective.
🗳️ Decision Queue — Action Required
1 decision needs your input before implementation starts.
Infrastructure
Separate `docker-compose.observability.yml` vs. profile inside `docker-compose.prod.yml` — `docs/infrastructure/production-compose.md` explicitly states the observability containers will join `docker-compose.prod.yml` under a dedicated profile. This issue proposes a standalone `docker-compose.observability.yml` instead. Both are valid operationally, but they have different ergonomics:

- (a) Standalone file: requires `docker compose -f docker-compose.observability.yml -f docker-compose.prod.yml up -d` in production or manual file-switching; slightly more cognitive overhead when operating both stacks together.
- (b) Profile in prod compose: requires the `--profile monitoring` flag to enable observability; slightly harder to develop in isolation; keeps the production compose as the single source of truth.

Pick one and update
`docs/infrastructure/production-compose.md` to match before the PR is opened. (Raised by: Tobias)

✅ Implementation complete — feat/issue-574-loki-promtail
What was implemented
Commit: `22e1b253` — devops(observability): add Loki + Promtail for centralised container log aggregation
1. `docker-compose.observability.yml` — two new services

- `obs-loki` (`grafana/loki:3.4.2`): `expose`-only port 3100 (not host-bound), named volume `loki_data`, healthcheck via `wget /ready` with `start_period: 30s`, `obs-net` only
- `obs-promtail` (`grafana/promtail:3.4.2`): bridges `archiv-net` + `obs-net`, `depends_on: loki: condition: service_healthy`, `docker.sock:ro` plus the container-log volume, and a `promtail_positions` named volume to persist `/tmp/positions.yaml` across restarts (avoids duplicate log ingestion on restart — raised by Tobias)
- Added the `promtail_positions` named volume to the volumes section
2. `infra/observability/loki/loki-config.yml`

Single-node Loki: TSDB schema v13, 30-day retention (`720h`), `auth_enabled: false` with a comment explaining the trust boundary, `analytics.reporting_enabled: false`, `ring.kvstore.store: inmemory` (correct for single-node — no etcd/consul needed)
3. `infra/observability/promtail/promtail-config.yml`

Docker SD scrape config with labels: `container_name`, `compose_service`, `compose_project`, `logstream`. Comments explain: gRPC port 0 (clustering disabled), the Docker socket risk (named per Nora's recommendation), and the `container_name` vs `compose_service` label difference between dev and prod environments (raised by Felix in review).
4. `docs/DEPLOYMENT.md` §4

Updated the observability section with the full service table (Prometheus, node-exporter, cAdvisor, Loki, Promtail), startup commands, Loki quick-check `docker exec` commands, and the note to prefer `compose_service` over `container_name` in LogQL queries.
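For reference, the environment-stable form of the acceptance query (labels as extracted by the Promtail config):

```bash
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={compose_service="backend"}' \
  --data-urlencode 'limit=5'
```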
Reviewer concerns addressed

- Tobias: `promtail_positions` named volume; Loki healthcheck added; `depends_on: service_healthy`; docker.sock risk comment added; DEPLOYMENT.md updated
- Markus: `l2-containers.puml` already has a Loki placeholder — no additional change needed as Promtail is a new addition; DEPLOYMENT.md §4 fully updated
- Nora: `docker.sock:ro` (read-only mount enforced); `auth_enabled: false` comment added; docker.sock risk comment in compose YAML
- Felix: `grpc_listen_port: 0` comment added; `container_name` vs `compose_service` documented in both promtail config and DEPLOYMENT.md

Verification