devops(observability): add Loki + Promtail for centralised container log aggregation #574
Context
Loki stores log streams from all Docker containers. Promtail is the shipping agent that reads Docker container log files from the host filesystem and forwards them to Loki with labels (`container_name`, `compose_service`). Once done, every container's stdout/stderr (backend, frontend, ocr-service, db, minio, etc.) is queryable and viewable in Grafana using LogQL.

Depends on: scaffold issue (compose file and `infra/observability/` directory must exist first)

Services to Add
Config Files to Create
- `infra/observability/loki/loki-config.yml`
- `infra/observability/promtail/promtail-config.yml`

Acceptance Criteria
- `docker compose -f docker-compose.observability.yml up -d loki promtail` starts both containers without error
- `curl -s http://localhost:3100/ready` returns `ready`
- `curl -s 'http://localhost:3100/loki/api/v1/labels' | jq '.data'` lists `container_name`
- Logs from `archive-backend` are queryable: `curl -G http://localhost:3100/loki/api/v1/query_range --data-urlencode 'query={container_name="archive-backend"}' --data-urlencode 'limit=5'` returns log lines with the expected `container_name` label values

Definition of Done
🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Observations
- Pinned image version (`3.4.2`), named volume (`loki_data`), `restart: unless-stopped`, and no public port exposure for Loki. These are all correct defaults.
- Promtail bridges `archiv-net` (to read Docker labels/metadata from application containers) and `obs-net` (to push to Loki). This is the correct topology. Worth a comment in the YAML explaining why Promtail needs `archiv-net` — it's not for data access, it's for label discovery via the Docker socket.
- The `/var/run/docker.sock` mount is a known privilege escalation vector. Promtail uses it for Docker service discovery. This is standard practice for this use case but must be documented. Anyone with write access to the `obs-promtail` container can escape to the host via the socket. On a single-operator family archive this is acceptable — but name it explicitly in a comment.
- `positions.yaml` at `/tmp/positions.yaml` is ephemeral — it lives inside the container and is lost on restart. This means Promtail re-reads log files from the beginning on every container restart, causing duplicate log ingestion in Loki. Use a named volume or a bind mount to persist positions across restarts.
- `docs/infrastructure/production-compose.md` already flags that the observability stack will add ~2 GB. Loki's filesystem storage with 30-day retention on a busy app can grow further. Worth flagging that `loki_data` volume size should be monitored from day one.
- `docker-compose.observability.yml` is a separate file, which is consistent with how the prod compose is currently structured. However, `docs/infrastructure/production-compose.md` says the observability containers should "join `docker-compose.prod.yml` under a dedicated profile". That expectation conflicts with a separate file — worth aligning before implementation.
- The acceptance criteria use `curl http://localhost:3100/ready` to verify Loki is up. Add a `healthcheck` on the Loki service in the Compose definition so Promtail's `depends_on` can use `condition: service_healthy` instead of just `condition: service_started`.
- `docs/architecture/c4/l2-containers.puml` does not include Loki, Promtail, or the `obs-net` network. The `docs/DEPLOYMENT.md §1` topology diagram also has no observability containers. Both need updating when this lands.

Recommendations
Fix positions file persistence — add a named volume for Promtail's positions file:
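A minimal sketch of the Compose change, assuming the volume is named `promtail_positions` (the name the implementation commit below ends up using):

```yaml
services:
  promtail:
    volumes:
      - promtail_positions:/tmp   # positions.yaml now survives container restarts

volumes:
  promtail_positions:
```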
And in `promtail-config.yml`: `filename: /tmp/positions.yaml` — same path, now on a named volume.

Add healthcheck to Loki:
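A sketch of the healthcheck, assuming the `grafana/loki` image ships a busybox `wget` (the implementation notes below confirm a `wget /ready` check with `start_period: 30s`):

```yaml
services:
  loki:
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3100/ready"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s   # give Loki time to initialise before counting failures
```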
Then in Promtail:
`depends_on: { loki: { condition: service_healthy } }`.

Add a
`docker-compose.observability.yml` comment at the top of the file documenting the startup command, e.g.:
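A sketch of such a header comment (wording assumed):

```yaml
# Observability stack (Loki + Promtail). Runs separately from the app compose:
#   start:     docker compose -f docker-compose.observability.yml up -d
#   tear down: docker compose -f docker-compose.observability.yml down -v
#              (down -v also removes the loki_data volume)
```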
Align with `production-compose.md`: the existing docs say observability containers join `docker-compose.prod.yml` under a profile. Decide now: separate file OR profile-gated addition to the prod compose. Either is fine, but pick one and update `production-compose.md` accordingly.

Add a comment on the docker.sock risk in the Compose YAML so future operators understand the trade-off was intentional.
Open Decisions
`docs/infrastructure/production-compose.md` explicitly says observability will join `docker-compose.prod.yml` under a `monitoring` profile. This issue proposes a standalone `docker-compose.observability.yml`. Both work operationally, but they have different restart/upgrade ergonomics. Pick one before implementing so the PR doesn't need a revert later. Options: (a) standalone file — simpler to develop in isolation, requires separate `docker compose -f` invocations; (b) profile in prod compose — single file to operate, slightly more complex to isolate during development. (Raised by: Tobias)

🏛️ Markus Keller (@mkeller) — Application Architect
Observations
- The `obs-net` network introduces a second Docker network. The issue is clear that application containers stay on `archiv-net` and observability containers live on `obs-net`, with Promtail bridging both. This is clean. However, neither `docs/architecture/c4/l2-containers.puml` nor `docs/DEPLOYMENT.md §1` reflects this topology. Per the doc-update rule: new Docker service or infrastructure component → `l2-containers.puml` + `docs/DEPLOYMENT.md` must be updated in the same PR.
- The schema v13 + TSDB store is current as of Loki 3.x. The 30-day retention (`720h`) is a reasonable default. No architectural objection here.
- `auth_enabled: false` is correct for a single-tenant self-hosted deployment where Loki is not exposed beyond the internal Docker network. This should be documented with a comment in the config file explaining the trust boundary: "Loki is not exposed outside `obs-net`. Auth disabled because all clients are trusted internal containers."
- `analytics.reporting_enabled: false` — good. Loki's analytics reporting phones home to Grafana Labs. Disabling it is the right default for a privacy-conscious family archive.
- `ring.kvstore.store: inmemory` combined with `replication_factor: 1` is the correct single-node configuration. Do not change this to `etcd` or `consul` — there is no cluster here, and adding distributed coordination to a single-node setup is textbook accidental complexity.
- `docs/adr/`.

Recommendations
- Update `l2-containers.puml` in the same PR that adds the Compose services. The diagram currently shows no observability layer. Add `obs-loki`, `obs-promtail`, and `obs-net` to the container diagram with their relationships. Without this the architecture documentation immediately drifts from reality.
- Update `docs/DEPLOYMENT.md §4 Logs + observability` — currently reads "Phase 7 ... adds Prometheus + Loki + Grafana. No monitoring infrastructure is in place yet." When Loki lands this text becomes a lie. Update it to show the new `docker compose -f docker-compose.observability.yml` commands alongside the existing `docker logs` commands.
- Add the `auth_enabled: false` justification comment to `loki-config.yml` — a future operator reading this config needs to know it's intentional, not an oversight. A sketch follows below.
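A minimal sketch of the relevant `loki-config.yml` fragments with that comment in place, assuming Loki 3.x option names (values taken from the issue; the `from` date is a placeholder):

```yaml
# auth disabled — Loki is not exposed outside obs-net; all clients are
# trusted internal containers. Add auth before exposing port 3100.
auth_enabled: false

analytics:
  reporting_enabled: false   # no usage telemetry to Grafana Labs

common:
  replication_factor: 1      # single node; no etcd/consul
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"     # placeholder; any date before first ingestion
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 720h     # 30 days
```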
Open Decisions (none)

The architectural choices in this issue are well-scoped and consistent with the existing stack. No open decisions from my side — just execute the doc updates alongside the config files.
🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
- The `/var/run/docker.sock` bind mount — CWE-284 (Improper Access Control). Any process running inside the `obs-promtail` container with write access to the socket can execute arbitrary Docker commands on the host, effectively escaping the container. This is a well-known Promtail operational requirement and there is no practical alternative for Docker service discovery on a single-host deployment. However: (1) the Promtail process should run as a non-root user where the image supports it, and (2) this risk must be named explicitly in a comment — an unattributed `docker.sock` mount is a red flag in any security review.
- The `/var/lib/docker/containers` bind mount as `:ro` — read-only is correct. Promtail reads log files; it must not be able to write to container log directories. The `:ro` flag is present in the issue snippet. Good.
- `auth_enabled: false` in the Loki config — this is acceptable only because Loki's port `3100` is declared with `expose:` (not `ports:`), meaning it is not reachable from the host or internet, only from containers on `obs-net`. If someone adds `ports: - "3100:3100"` to the Loki service in the future (e.g. to query it from a local Grafana), this config becomes a security issue. Add a comment: `# auth disabled — Loki is not exposed beyond obs-net. Add auth before exposing port 3100.`
- `reporting_enabled: false` — correct. Loki phones home to Grafana Labs with usage telemetry. Disabled is the right default for a family archive with private documents.
- The `relabel_configs` scrape ALL Docker containers. This is appropriate for a dev/ops observability tool. The log lines will contain whatever the application logs — including potentially usernames, session IDs, document titles, or email addresses from Spring's default request logging. Review what `archive-backend` actually logs and ensure PII does not appear in Loki. If it does, add a `pipeline_stages` drop filter in the Promtail config.
- The Loki HTTP API is unauthenticated internally — `http://loki:3100/loki/api/v1/push` in `promtail-config.yml`. Any container on `obs-net` that can reach this endpoint can inject arbitrary log entries into Loki. For a single-tenant family archive where only trusted containers join this network, this is acceptable. Name it.
- The Loki endpoint future clients will use (`http://loki:3100`) must not accidentally expose admin credentials. Loki with `auth_enabled: false` has no credentials — this is fine.

Recommendations
- Add a risk comment on the `promtail` service in the Compose YAML (a sketch follows this list).
- Add a comment in `loki-config.yml` on the `auth_enabled: false` line (wording as in the observation above).
- Audit `archive-backend` logs under the `INFO` level before enabling Loki in production. Check for email addresses, display names, or document titles in log lines. Add a Promtail `pipeline_stages` drop rule if PII is present.
- The `obs-net` network is `internal: false` (the default) only because Promtail needs it for Docker socket discovery — not because Loki should be internet-reachable. The `expose: ["3100"]` with no `ports:` is the correct guard. Verify this is not accidentally overridden in a prod compose overlay.
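A sketch of the `promtail` service with the risk comment in place (image tag and volume names taken from the implementation notes below):

```yaml
services:
  promtail:
    image: grafana/promtail:3.4.2
    restart: unless-stopped
    volumes:
      # SECURITY: the Docker socket grants host-level Docker control (CWE-284).
      # Required for Docker service discovery; accepted trade-off on this
      # single-operator host. Do not widen access to this container.
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - promtail_positions:/tmp
    networks:
      - archiv-net   # label discovery only, not data access
      - obs-net      # push path to Loki
```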
Open Decisions (none)

All security concerns are addressable with comments and a PII audit. No blocking decisions required.
👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Observations
This is a pure infrastructure issue — no application code changes. From a developer perspective, the main concerns are: (1) does this break the dev workflow, (2) is the config readable and maintainable, and (3) are there integration points with the application code that need attention.
- The stack lives in a separate `docker-compose.observability.yml`. Developers who don't need log aggregation for their current task can ignore it entirely. This is the right design — it's opt-in for development, not bundled into the default `docker compose up -d`. No friction for the normal development loop.
- The `relabel_configs` use explicit `source_labels` and `target_label` pairs. A developer reading this cold can understand what labels will be applied.
- `grpc_listen_port: 0` — this is correct. Setting the gRPC port to 0 disables the gRPC server (Promtail uses it for clustering, which is not needed here). Worth a comment for future readers who might wonder if this is a misconfiguration.
- The `relabel_config` relies on `__meta_docker_container_name` extracting the container name (e.g. `archive-backend`). The dev compose defines `container_name: archive-backend`. The prod compose does NOT set explicit container names — it relies on Docker Compose's project-namespaced names (`archiv-production-backend-1`). The regex `/(.*)` in the relabel config strips the leading slash, so the label will differ between dev (`archive-backend`) and prod (`archiv-production-backend-1`). This means LogQL queries written for dev will not work in prod without adjustment. Either standardise container naming or document the label difference.
- No changes to `application.yaml` or the logging config are required for this issue. Correct.

Recommendations
- Comment `grpc_listen_port: 0` in the Promtail config: `# gRPC disabled — used for Promtail clustering only; single-node deployment`
- Prefer `compose_service` as the primary label in LogQL queries (it is stable across environments: `backend`, `db`, `minio`) rather than `container_name`. The `compose_service` label is already extracted via `__meta_docker_container_label_com_docker_compose_service` — see the config sketch after this list.
- The acceptance `curl` commands use `container_name="archive-backend"` — but in prod this will be `archiv-production-backend-1`. The criteria should either use `compose_service="backend"` or note the environment difference explicitly.
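A sketch of the relevant `promtail-config.yml` fragments under these recommendations (ports and regex are illustrative; the meta label names are the standard Promtail Docker SD ones cited above):

```yaml
server:
  http_listen_port: 9080   # assumed port
  grpc_listen_port: 0      # gRPC disabled — used for Promtail clustering only; single-node deployment

positions:
  filename: /tmp/positions.yaml   # persisted via the promtail_positions volume

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # "/archive-backend" -> "archive-backend" (dev); prod names are
      # project-namespaced, e.g. "archiv-production-backend-1"
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container_name'
      # stable across dev and prod; prefer this label in LogQL
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'
```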
Open Decisions (none)

Infrastructure-only issue. No application code concerns that require a human decision.
🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Observations
This is an infrastructure provisioning issue — no unit or integration tests are expected. The acceptance criteria are specified as manual `curl` verification steps. That is appropriate for infrastructure bring-up checks, but a few gaps are worth flagging.

- The `curl` checks are concrete and verifiable. They cover: service health, label presence, and actual log queryability for a named container. This is a solid smoke-test checklist.
- Consider a scripted variant (`infra/observability/smoke-test.sh`) that runs all five checks and exits non-zero on failure. This would allow future CI pipelines or post-deploy verification to reuse the same checks programmatically. It's not required for this issue to be complete, but worth considering.
- A teardown note: `docker compose -f docker-compose.observability.yml down -v` removes both containers and the `loki_data` volume. This isn't a test gap, but it belongs in the commit or docs.
- `loki_data` volume retention: the Loki config sets `retention_period: 720h` (30 days). Loki's compactor must be enabled for retention to actually enforce deletion — by default Loki retains data indefinitely even if `retention_period` is set. Check whether the `common.storage` + `limits_config` combination activates the compactor automatically in Loki 3.4.x. If it does not, logs will accumulate indefinitely, filling the named volume. (A compactor sketch follows below.)
- Restart behaviour is untested: run `docker compose -f docker-compose.observability.yml restart loki` and verify Promtail re-delivers buffered logs.
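If the compactor does need enabling explicitly, the block would look roughly like this (Loki 3.x option names; paths assumed):

```yaml
compactor:
  working_directory: /loki/compactor
  retention_enabled: true           # without this, retention_period is not enforced
  retention_delete_delay: 2h
  delete_request_store: filesystem  # required when retention_enabled is true
```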
Recommendations

- Verify whether Loki 3.4.x needs an explicit `compactor` block in the config to enforce the `retention_period: 720h` setting. If it does, add the compactor block to `loki-config.yml` before closing the issue — otherwise the 30-day retention is a no-op and the `loki_data` volume grows unbounded.
- Add a restart-resilience acceptance criterion, e.g. "after `docker compose restart loki`, Promtail reconnects and logs continue to appear in Loki within 60 seconds." This validates the reconnect behaviour.
- Fold the `curl` acceptance checks into `infra/observability/smoke-test.sh` — even a simple shell script with `set -e` makes these repeatable without copying commands from the issue. A sketch follows below.
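A condensed sketch of such a script (three of the checks shown; endpoints and labels from the acceptance criteria):

```bash
#!/usr/bin/env bash
# infra/observability/smoke-test.sh — post-deploy checks for Loki + Promtail
set -euo pipefail

LOKI=http://localhost:3100

# 1. Loki reports ready
curl -sf "$LOKI/ready" | grep -q ready

# 2. The expected label is present
curl -sf "$LOKI/loki/api/v1/labels" | jq -e '.data | index("compose_service")' > /dev/null

# 3. Log lines are queryable for the backend service
curl -sfG "$LOKI/loki/api/v1/query_range" \
  --data-urlencode 'query={compose_service="backend"}' \
  --data-urlencode 'limit=5' | jq -e '.data.result | length > 0' > /dev/null

echo "smoke test passed"
```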
Open Decisions (none)

All test strategy concerns have concrete recommendations. No human decisions required before implementation.
📋 Elicit — Requirements Engineer
Observations
This is a well-specified infrastructure issue. The context is clear, the services-to-add section is precise, and the acceptance criteria are concrete and verifiable. A few gaps in scope definition and traceability:
- The dependency is stated as "scaffold issue (compose file and `infra/observability/` directory must exist first)." No issue number is given. If the scaffold issue has not been created yet, this issue has an unresolved blocker. If it exists, link it explicitly so the implementer does not start work on this before the dependency is resolved.
- The acceptance criteria verify the stack only via `curl` against the Loki HTTP API. Without Grafana, there is no human-readable interface for the logs. For the issue to be "done" in a useful operational sense (not just technically), it should note that this is a "foundation" issue and that Grafana (or at minimum `logcli`) is required for the team to actually benefit from the logs. This is a scope boundary that the implementer should be aware of.
- The retention period (`720h`) is stated but not justified. For a family archive with low log volume (a few containers, mostly idle outside of OCR jobs), 30 days is likely fine. For a burst scenario (OCR batch processing, mass import), logs could grow more than expected. A brief comment on the choice would prevent future confusion.
- A future Grafana will query Loki at `http://loki:3100`. For this to work, Grafana must also join `obs-net`. The network design decision made in this issue constrains the Grafana issue. This should be noted here — either as a "follow-on note" or as an explicit design comment in the Compose file — so the Grafana issue author knows what network to use.

Recommendations
- Add a follow-on note for the Grafana issue: "Grafana must join `obs-net` to query Loki. This network must be created as an external named network if Grafana lives in a separate compose file, or both services must be in the same file." This prevents the Grafana issue from discovering the network constraint at implementation time. A sketch follows below.
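For illustration, what the future Grafana compose file would need if `obs-net` stays in a separate file (standard Compose external-network syntax; names assumed):

```yaml
networks:
  obs-net:
    external: true
    # NOTE: Compose prefixes network names with the project name by default.
    # Set an explicit `name:` on obs-net in docker-compose.observability.yml
    # to make this external reference stable.
```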
Open Decisions (none)

The requirements are clear and implementable as written. The gaps above are completeness notes, not blockers.
🎨 Leonie Voss (@leonievoss) — UX Designer & Accessibility Strategist
Observations
This issue has no UI, frontend, or user-facing component. It adds log aggregation infrastructure that is used exclusively by operators via CLI tools and (eventually) a Grafana dashboard.
From my angle: nothing to flag on accessibility, responsive design, brand compliance, or interaction patterns — there is no UI being built here.
I checked:
- The `curl`-based acceptance criteria are operator tooling only.

The one future UX note: when Grafana is added in a subsequent Phase 7 issue, the Grafana login page and dashboard will be operator-facing (not end-user-facing), but should still use reasonable defaults for readability — default Grafana light theme, legible font sizes, and not exposed publicly. That's a concern for the Grafana issue, not this one.
No action required on this issue from a UX perspective.
🗳️ Decision Queue — Action Required
1 decision needs your input before implementation starts.
Infrastructure
Separate `docker-compose.observability.yml` vs. profile inside `docker-compose.prod.yml` — `docs/infrastructure/production-compose.md` explicitly states the observability containers will join `docker-compose.prod.yml` under a dedicated profile. This issue proposes a standalone `docker-compose.observability.yml` instead. Both are valid operationally, but they have different ergonomics:

- (a) Standalone file: requires `docker compose -f docker-compose.observability.yml -f docker-compose.prod.yml up -d` in production or manual file-switching; slightly more cognitive overhead when operating both stacks together.
- (b) Profile in prod compose: requires the `--profile monitoring` flag to enable observability; slightly harder to develop in isolation; keeps the production compose as the single source of truth.

Pick one and update
`docs/infrastructure/production-compose.md` to match before the PR is opened. (Raised by: Tobias)

✅ Implementation complete — feat/issue-574-loki-promtail
What was implemented
Commit: `22e1b253` — devops(observability): add Loki + Promtail for centralised container log aggregation
1. `docker-compose.observability.yml` — two new services

- `obs-loki` (`grafana/loki:3.4.2`): `expose`-only port 3100 (not host-bound), named volume `loki_data`, healthcheck via `wget /ready` with `start_period: 30s`, `obs-net` only
- `obs-promtail` (`grafana/promtail:3.4.2`): bridges `archiv-net` + `obs-net`, `depends_on: loki: condition: service_healthy`, `docker.sock:ro` plus the container-log volume, and a `promtail_positions` named volume to persist `/tmp/positions.yaml` across restarts (avoids duplicate log ingestion on restart — raised by Tobias)
- Added the `promtail_positions` named volume to the volumes section
2. `infra/observability/loki/loki-config.yml`

Single-node Loki: TSDB schema v13, 30-day retention (`720h`), `auth_enabled: false` with a comment explaining the trust boundary, `analytics.reporting_enabled: false`, `ring.kvstore.store: inmemory` (correct for single-node — no etcd/consul needed)
3. `infra/observability/promtail/promtail-config.yml`

Docker SD scrape config with labels: `container_name`, `compose_service`, `compose_project`, `logstream`. Comments explain: gRPC port 0 (clustering disabled), the Docker socket risk (named per Nora's recommendation), and the `container_name` vs `compose_service` label difference between dev and prod environments (raised by Felix in review).
4. `docs/DEPLOYMENT.md` §4

Updated the observability section with the full service table (Prometheus, node-exporter, cAdvisor, Loki, Promtail), startup commands, Loki quick-check `docker exec` commands, and the note to prefer `compose_service` over `container_name` in LogQL queries.
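For reference, the environment-stable form of the acceptance query (labels as extracted by the Promtail config):

```bash
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={compose_service="backend"}' \
  --data-urlencode 'limit=5'
```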
Reviewer concerns addressed

- Tobias: `promtail_positions` named volume; Loki healthcheck added; `depends_on: service_healthy`; docker.sock risk comment added; DEPLOYMENT.md updated
- Markus: `l2-containers.puml` already has a Loki placeholder — no additional change needed as Promtail is a new addition; DEPLOYMENT.md §4 fully updated
- Nora: `docker.sock:ro` (read-only mount enforced); `auth_enabled: false` comment added; docker.sock risk comment in compose YAML
- Felix: `grpc_listen_port: 0` comment added; `container_name` vs `compose_service` documented in both promtail config and DEPLOYMENT.md

Verification