fix(infra): deploy Ollama to prod/staging compose + fix broken model-init recipe #759

Merged
marcel merged 8 commits from fix/issue-758-ollama-prod-compose into main 2026-06-06 20:30:35 +02:00
Owner

Fixes #758.

Why

NL search returned 503 / "Intelligente Suche nicht verfügbar" on staging because Ollama was never reachable. Two defects, both downstream of #737:

  1. Ollama was only added to the dev docker-compose.yml. Staging/prod deploy from the self-contained docker-compose.prod.yml, which had no ollama service. The backend defaults to app.ollama.base-url: http://ollama:11434, so its client bean was active and hit a non-existent host → ResourceAccessException → 503.
  2. The model-init recipe merged in #737 never worked. The ollama/ollama image's ENTRYPOINT is ollama, so command: sh -c "..." ran as ollama sh -c "..." (unknown command "sh"), and the image ships no curl, so the readiness loop and the healthcheck could never pass.

Changes

  • docker-compose.prod.yml — add ollama-model-init + ollama services and the ollama-models volume, using the corrected recipe:
    • entrypoint: ["/bin/sh", "-c"] on the init container; readiness via until ollama list >/dev/null 2>&1 (no curl)
    • service healthcheck: ["CMD", "ollama", "list"] (no curl)
    • no container_name (prod namespaces by compose project); ADR-019 hardening preserved
  • docker-compose.yml (dev) — fix the same broken entrypoint/command and the curl healthcheck so the dev stack actually starts Ollama.

Verification

docker compose config valid for both files. Corrected recipe deployed to staging and verified end-to-end:

  • ollama-model-init exits 0; qwen2.5:7b-instruct-q4_K_M (4.7 GB) cached in ollama-models
  • ollama container healthy (via ollama list)
  • docker exec archiv-staging-backend-1 wget -qO- http://ollama:11434/api/tags → returns model list
  • one-shot inference via ollama run succeeds within the 8 GB limit

The staging host was hotfixed in place to restore service immediately; this PR is the durable fix so the next CI deploy keeps Ollama.

Out of scope (tracked in #758)

  • RestClientOllamaClient does not send an Authorization header, so OLLAMA_API_KEY is omitted from the prod service.
  • docs/DEPLOYMENT.md NL-search hardware tier / env-var rows, Prometheus ollama scrape job, Grafana latency dashboard (all from #737).

🤖 Generated with Claude Code

Fixes #758. ## Why NL search returned **503 / "Intelligente Suche nicht verfügbar"** on staging because Ollama was never reachable. Two defects, both downstream of #737: 1. **Ollama was only added to the dev `docker-compose.yml`.** Staging/prod deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama` service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its client bean was active and hit a non-existent host → `ResourceAccessException` → 503. 2. **The model-init recipe merged in #737 never worked.** The `ollama/ollama` image's `ENTRYPOINT` is `ollama`, so `command: sh -c "..."` ran as `ollama sh -c "..."` (`unknown command "sh"`), and the image ships **no curl**, so the readiness loop and the healthcheck could never pass. ## Changes - **`docker-compose.prod.yml`** — add `ollama-model-init` + `ollama` services and the `ollama-models` volume, using the corrected recipe: - `entrypoint: ["/bin/sh", "-c"]` on the init container; readiness via `until ollama list >/dev/null 2>&1` (no curl) - service `healthcheck: ["CMD", "ollama", "list"]` (no curl) - no `container_name` (prod namespaces by compose project); ADR-019 hardening preserved - **`docker-compose.yml`** (dev) — fix the same broken entrypoint/command and the curl healthcheck so the dev stack actually starts Ollama. ## Verification `docker compose config` valid for both files. Corrected recipe deployed to **staging** and verified end-to-end: - `ollama-model-init` exits **0**; `qwen2.5:7b-instruct-q4_K_M` (4.7 GB) cached in `ollama-models` - `ollama` container **healthy** (via `ollama list`) - `docker exec archiv-staging-backend-1 wget -qO- http://ollama:11434/api/tags` → returns model list - one-shot inference via `ollama run` succeeds within the 8 GB limit > The staging host was hotfixed in place to restore service immediately; this PR is the durable fix so the next CI deploy keeps Ollama. ## Out of scope (tracked in #758) - `RestClientOllamaClient` does not send an `Authorization` header, so `OLLAMA_API_KEY` is omitted from the prod service. - `docs/DEPLOYMENT.md` NL-search hardware tier / env-var rows, Prometheus `ollama` scrape job, Grafana latency dashboard (all from #737). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel added 1 commit 2026-06-06 19:20:42 +02:00
fix(infra): deploy Ollama to prod/staging compose + fix broken model-init recipe
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m0s
CI / OCR Service Tests (pull_request) Successful in 25s
CI / Backend Unit Tests (pull_request) Successful in 3m56s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 23s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
b665e1132d
NL search returned 503 (SMART_SEARCH_UNAVAILABLE / "Intelligente Suche
nicht verfügbar") on staging because Ollama was never reachable.

Two defects, both downstream of #737:

1. Ollama was added only to the dev docker-compose.yml. Staging/prod
   deploy from the self-contained docker-compose.prod.yml, which had no
   ollama service — so the backend (defaulting to http://ollama:11434)
   hit a non-existent host (ResourceAccessException -> 503).

2. The merged model-init recipe never worked: the ollama/ollama image
   ENTRYPOINT is `ollama` (so `command: sh -c ...` ran as `ollama sh ...`
   -> "unknown command sh"), and the image ships no curl (so both the
   readiness loop and the healthcheck could never pass).

- docker-compose.prod.yml: add ollama-model-init + ollama services and
  the ollama-models volume, with the corrected recipe (entrypoint
  override to /bin/sh -c, `ollama list` for readiness and healthcheck).
- docker-compose.yml: fix the same broken entrypoint/command and the
  curl healthcheck so the dev stack actually starts Ollama.

Verified on staging end-to-end: model-init exits 0, ollama healthy,
backend reaches /api/tags, inference succeeds within the 8g limit.

Refs #758

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
marcel added 1 commit 2026-06-06 19:27:05 +02:00
fix(search): pin Ollama model in memory + raise read timeout
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m18s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m55s
CI / fail2ban Regex (pull_request) Successful in 51s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m8s
9e97687d0f
NL search recovered after deploy but went 503 again after a few minutes:
Ollama unloads the model after its default ~5 min keep-alive, so the next
query cold-loads the 4.7 GB model and exceeds the backend's 30s read
timeout (ResourceAccessException -> SMART_SEARCH_UNAVAILABLE). Warm
inference is ~18s; the cold load after idle is what timed out.

- docker-compose.{prod,yml}: set OLLAMA_KEEP_ALIVE=-1 on the ollama
  service so the model stays resident and never pays a cold-load penalty
  during normal operation (verified on staging: `ollama ps` -> UNTIL
  "Forever"; host has 47 GB free).
- application.yaml: raise app.ollama.timeout-seconds 30 -> 60 so the one
  unavoidable cold load (first query after an Ollama restart, before the
  model is pinned) completes instead of timing out.

Refs #758

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

🏛️ Markus Keller — Application Architect

Verdict: ⚠️ Approved with concerns

The topology decision is sound: Ollama stays a single internal container on archiv-net, the backend degrades gracefully to 503 when it is absent (no hard depends_on from backend — verified, correct), and the init/serve split with condition: service_completed_successfully is the right ordering primitive. This is boring, self-hosted, monolith-aligned infrastructure — exactly what I want. No new transport, no new broker, no premature complexity.

Blockers

  1. Doc currency: docs/DEPLOYMENT.md is now stale on the timeout you just changed. This PR bumps app.ollama.timeout-seconds from 30 → 60 in application.yaml, but DEPLOYMENT.md:616 still documents the default as 30. Per my doc-currency rule, a config change that contradicts the docs does not merge until they match. One-line fix.

    - | `app.ollama.timeout-seconds` | `30` | Read timeout for inference calls |
    + | `app.ollama.timeout-seconds` | `60` | Read timeout for inference calls |
    

Concerns

  1. l2-containers.puml declares ollama twice. The container diagram (already present from #737, so not introduced here — credit for that) contains two Container(ollama, …) entries with the same alias inside System_Boundary(archiv, …): one as "Ollama LLM Service" (ollama/ollama:0.30.6 / port 11434 (internal only)) and a second as "Ollama" (Ollama / port 11434). That renders a duplicate box / is a PlantUML alias collision. Pre-existing, but since this PR is the durable Ollama-in-prod change, folding the de-dupe in (or tracking it) keeps the diagram honest.

  2. No ADR for the prod-LLM decision. Adding a CPU LLM inference container to production with an 8 GB memory envelope and OLLAMA_KEEP_ALIVE=-1 (pinned, never unloaded) is an architectural decision with lasting operational consequences (resource sizing, restart behaviour, single-node coupling). A short ADR — context, the keep-alive/pinning trade-off, the "degrade to 503" consequence — would capture the why so a future maintainer doesn't "optimize" the keep-alive away and reintroduce the cold-load 503. Not a merge blocker, but cheap memory for the codebase.

Resolve #1 and I'm green.

## 🏛️ Markus Keller — Application Architect **Verdict: ⚠️ Approved with concerns** The topology decision is sound: Ollama stays a single internal container on `archiv-net`, the backend degrades gracefully to 503 when it is absent (no hard `depends_on` from `backend` — verified, correct), and the init/serve split with `condition: service_completed_successfully` is the right ordering primitive. This is boring, self-hosted, monolith-aligned infrastructure — exactly what I want. No new transport, no new broker, no premature complexity. ### Blockers 1. **Doc currency: `docs/DEPLOYMENT.md` is now stale on the timeout you just changed.** This PR bumps `app.ollama.timeout-seconds` from 30 → 60 in `application.yaml`, but `DEPLOYMENT.md:616` still documents the default as `30`. Per my doc-currency rule, a config change that contradicts the docs does not merge until they match. One-line fix. ```diff - | `app.ollama.timeout-seconds` | `30` | Read timeout for inference calls | + | `app.ollama.timeout-seconds` | `60` | Read timeout for inference calls | ``` ### Concerns 2. **`l2-containers.puml` declares `ollama` twice.** The container diagram (already present from #737, so *not* introduced here — credit for that) contains two `Container(ollama, …)` entries with the same alias inside `System_Boundary(archiv, …)`: one as `"Ollama LLM Service"` (`ollama/ollama:0.30.6 / port 11434 (internal only)`) and a second as `"Ollama"` (`Ollama / port 11434`). That renders a duplicate box / is a PlantUML alias collision. Pre-existing, but since this PR is the durable Ollama-in-prod change, folding the de-dupe in (or tracking it) keeps the diagram honest. 3. **No ADR for the prod-LLM decision.** Adding a CPU LLM inference container to production with an 8 GB memory envelope and `OLLAMA_KEEP_ALIVE=-1` (pinned, never unloaded) is an architectural decision with lasting operational consequences (resource sizing, restart behaviour, single-node coupling). A short ADR — context, the keep-alive/pinning trade-off, the "degrade to 503" consequence — would capture the *why* so a future maintainer doesn't "optimize" the keep-alive away and reintroduce the cold-load 503. Not a merge blocker, but cheap memory for the codebase. Resolve #1 and I'm green.
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Almost entirely infra YAML, so there's little for me to TDD here — and the one code-adjacent change (application.yaml) is clean. The comments earn their place: they explain why (cold model load before OLLAMA_KEEP_ALIVE pins it; image ships no curl; ENTRYPOINT is ollama so it must be overridden) rather than what. That's exactly the comment discipline I'd want — these are non-obvious operational facts, not narration. The root-cause writeup in the PR body is excellent.

Concerns

  1. Doc drift on the value you changed. application.yaml now sets timeout-seconds: 60, but docs/DEPLOYMENT.md:616 still says 30. Same finding Markus raised — flagging because "update the doc when you change the code" is on my own pre-PR checklist. Quick fix.

  2. The defect class here was untested by construction. Two real bugs (ollama sh -c "…" from the un-overridden ENTRYPOINT, and a curl-based readiness/health probe in a curl-less image) shipped in #737 and only surfaced on staging. Neither is a Java/Svelte unit-test gap — it's that nothing ever executed the compose recipe in CI. I'll defer the concrete test design to Sara, but from a "red-before-green" standpoint: this fix has no failing-test-turned-green artifact, just manual staging verification. That's acceptable for an infra hotfix, but the durable guard belongs in CI.

Nits (non-blocking)

  1. The dev command: previously cleaned up its background ollama serve with kill $$SERVE_PID; the new one-shot drops that and relies on container exit to reap it. Fine for a one-shot init container — just noting the intentional behaviour change so it isn't mistaken for an omission.

Good, tight fix. Fix the doc row and I'm satisfied.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** Almost entirely infra YAML, so there's little for me to TDD here — and the one code-adjacent change (`application.yaml`) is clean. The comments earn their place: they explain *why* (cold model load before `OLLAMA_KEEP_ALIVE` pins it; image ships no curl; ENTRYPOINT is `ollama` so it must be overridden) rather than *what*. That's exactly the comment discipline I'd want — these are non-obvious operational facts, not narration. The root-cause writeup in the PR body is excellent. ### Concerns 1. **Doc drift on the value you changed.** `application.yaml` now sets `timeout-seconds: 60`, but `docs/DEPLOYMENT.md:616` still says `30`. Same finding Markus raised — flagging because "update the doc when you change the code" is on my own pre-PR checklist. Quick fix. 2. **The defect class here was untested by construction.** Two real bugs (`ollama sh -c "…"` from the un-overridden ENTRYPOINT, and a curl-based readiness/health probe in a curl-less image) shipped in #737 and only surfaced on staging. Neither is a Java/Svelte unit-test gap — it's that nothing ever *executed* the compose recipe in CI. I'll defer the concrete test design to Sara, but from a "red-before-green" standpoint: this fix has no failing-test-turned-green artifact, just manual staging verification. That's acceptable for an infra hotfix, but the durable guard belongs in CI. ### Nits (non-blocking) 3. The dev `command:` previously cleaned up its background `ollama serve` with `kill $$SERVE_PID`; the new one-shot drops that and relies on container exit to reap it. Fine for a one-shot init container — just noting the intentional behaviour change so it isn't mistaken for an omission. Good, tight fix. Fix the doc row and I'm satisfied.
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

This is my wheelhouse and most of it is done right. Image tag is pinned (ollama/ollama:0.30.6, not :latest). Named volume ollama-models for the weights. Full hardening preserved on both new services (read_only: true, cap_drop: ALL, no-new-privileges:true, tmpfs /tmp). expose: ["11434"] rather than ports: keeps it off the host — correct. depends_on: condition: service_completed_successfully gates serving on the pull. Resource limits are env-overridable. The ollama list health/readiness probe is the right call for a curl-less image. Good work.

Concerns

  1. ollama pull runs on every up/restart and is not resilient to a registry outage. The init command is ollama serve & until ollama list …; do ollama pull qwen2.5:…. Even with the model already on the volume, ollama pull contacts the registry to verify the manifest digest. If the host reboots during an Ollama-registry or upstream-network blip, the pull errors → init exits non-zero → service_completed_successfully is never met → the ollama service won't start → NL search is down until the registry is reachable again. DEPLOYMENT.md:284 claims re-deploy idempotency, but that's about skipping the blob download, not skipping the registry round-trip. Make it offline-safe:

    ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
      (ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
    

    Now a cached model means a clean exit without ever touching the network.

  2. Volume naming is inconsistent across files and the docs are wrong for prod. Prod names the volume ollama-models (hyphen); dev names it ollama_models (underscore). Worse, DEPLOYMENT.md:626-628 documents the prod volume as archiv-production_ollama_models (underscore) — with this PR the actual volume is archiv-production_ollama-models (hyphen), so the documented docker volume rm command for a model swap won't match. Either rename the prod volume to ollama_models for consistency with dev (my preference — one mental model) or fix the doc. Pick one.

  3. memswap_limit == mem_limit disables swap → hard OOM-kill if the 8 GB envelope is exceeded. That's a deliberate, defensible choice (swap-thrashing an LLM is worse), but qwen2.5-7B-q4 + KV cache under load sits close enough to 8 GB that I'd want a Prometheus alert on the ollama container's memory before this bites in prod. The scrape job is already noted as out-of-scope in #758 — just make sure the alert lands with it.

  4. Init mem_limit: 2g is tight for ollama serve + a 4.7 GB pull. The pull streams to disk so RAM stays low, but ollama serve plus extraction has been known to spike. It verified on staging, so this is a "watch it," not a blocker — if init ever OOM-loops, 2g is the first knob.

Fix the volume-name/doc mismatch (#2) and harden the pull (#1) and I'm green.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** This is my wheelhouse and most of it is done right. Image tag is pinned (`ollama/ollama:0.30.6`, not `:latest`). Named volume `ollama-models` for the weights. Full hardening preserved on both new services (`read_only: true`, `cap_drop: ALL`, `no-new-privileges:true`, tmpfs `/tmp`). `expose: ["11434"]` rather than `ports:` keeps it off the host — correct. `depends_on: condition: service_completed_successfully` gates serving on the pull. Resource limits are env-overridable. The `ollama list` health/readiness probe is the right call for a curl-less image. Good work. ### Concerns 1. **`ollama pull` runs on every `up`/restart and is not resilient to a registry outage.** The init command is `ollama serve & until ollama list …; do ollama pull qwen2.5:…`. Even with the model already on the volume, `ollama pull` contacts the registry to verify the manifest digest. If the host reboots during an Ollama-registry or upstream-network blip, the pull errors → init exits non-zero → `service_completed_successfully` is never met → the `ollama` service won't start → NL search is down until the registry is reachable again. `DEPLOYMENT.md:284` claims re-deploy idempotency, but that's about skipping the *blob download*, not skipping the registry round-trip. Make it offline-safe: ```sh ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \ (ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M) ``` Now a cached model means a clean exit without ever touching the network. 2. **Volume naming is inconsistent across files *and* the docs are wrong for prod.** Prod names the volume `ollama-models` (hyphen); dev names it `ollama_models` (underscore). Worse, `DEPLOYMENT.md:626-628` documents the prod volume as `archiv-production_ollama_models` (underscore) — with this PR the actual volume is `archiv-production_ollama-models` (hyphen), so the documented `docker volume rm` command for a model swap won't match. Either rename the prod volume to `ollama_models` for consistency with dev (my preference — one mental model) or fix the doc. Pick one. 3. **`memswap_limit == mem_limit` disables swap → hard OOM-kill if the 8 GB envelope is exceeded.** That's a deliberate, defensible choice (swap-thrashing an LLM is worse), but qwen2.5-7B-q4 + KV cache under load sits close enough to 8 GB that I'd want a Prometheus alert on the `ollama` container's memory before this bites in prod. The scrape job is already noted as out-of-scope in #758 — just make sure the alert lands with it. 4. **Init `mem_limit: 2g` is tight for `ollama serve` + a 4.7 GB pull.** The pull streams to disk so RAM stays low, but `ollama serve` plus extraction has been known to spike. It verified on staging, so this is a "watch it," not a blocker — if init ever OOM-loops, 2g is the first knob. Fix the volume-name/doc mismatch (#2) and harden the pull (#1) and I'm green.
Author
Owner

🔒 Nora Steiner ("NullX") — Application Security Engineer

Verdict: Approved with concerns

No injection surface, no secrets in the diff, no new credentials. The hardening posture is strong and consistent: read_only, cap_drop: ALL, no-new-privileges, tmpfs, and — most importantly — expose: ["11434"] instead of a host ports: mapping. The Ollama API is unauthenticated, but it is reachable only on archiv-net, so the network boundary is the control. That's a defensible defense-in-depth design for an internal inference service.

What I checked and cleared

  • No public exposure. expose (not ports) → not bound to the host, not internet-reachable. Confirm Caddy never adds an ollama upstream (it doesn't today — /api/* → backend, everything else → frontend). Keep it that way; an LLM /api/generate reachable from the proxy is an SSRF/DoS amplifier.
  • Secrets. Prod intentionally omits OLLAMA_API_KEY because RestClientOllamaClient sends no Authorization header (documented in the PR, tracked in #758). No dangling secret, no hardcoded fallback. Fine — and good that it's explicit rather than shipping a dead/empty key.

Concerns (non-blocking)

  1. Unauthenticated inference relies entirely on network isolation. That's acceptable here, but it's a single-layer control. The moment anything else lands on archiv-net that processes untrusted input, it can hit ollama:11434/api/generate with no auth — prompt-injection / resource-exhaustion vector. The Authorization-header work in #758 is the right durable fix; I'd treat it as a real backlog item, not a someday. No action this PR.

  2. OLLAMA_KEEP_ALIVE=-1 pins the model in memory indefinitely. No security downside, but combined with memswap_limit == mem_limit it means a memory-pressure event is a hard kill, not graceful degradation — worth a one-line note in the eventual ADR so it reads as intentional, not as an availability oversight.

No secure-coding blockers. Ship it once the others' doc fixes land.

## 🔒 Nora Steiner ("NullX") — Application Security Engineer **Verdict: ✅ Approved with concerns** No injection surface, no secrets in the diff, no new credentials. The hardening posture is strong and consistent: `read_only`, `cap_drop: ALL`, `no-new-privileges`, tmpfs, and — most importantly — `expose: ["11434"]` instead of a host `ports:` mapping. The Ollama API is unauthenticated, but it is reachable **only** on `archiv-net`, so the network boundary is the control. That's a defensible defense-in-depth design for an internal inference service. ### What I checked and cleared - **No public exposure.** `expose` (not `ports`) → not bound to the host, not internet-reachable. Confirm Caddy never adds an `ollama` upstream (it doesn't today — `/api/*` → backend, everything else → frontend). Keep it that way; an LLM `/api/generate` reachable from the proxy is an SSRF/DoS amplifier. - **Secrets.** Prod intentionally omits `OLLAMA_API_KEY` because `RestClientOllamaClient` sends no `Authorization` header (documented in the PR, tracked in #758). No dangling secret, no hardcoded fallback. Fine — and good that it's explicit rather than shipping a dead/empty key. ### Concerns (non-blocking) 1. **Unauthenticated inference relies entirely on network isolation.** That's acceptable here, but it's a single-layer control. The moment anything else lands on `archiv-net` that processes untrusted input, it can hit `ollama:11434/api/generate` with no auth — prompt-injection / resource-exhaustion vector. The Authorization-header work in #758 is the right durable fix; I'd treat it as a real backlog item, not a someday. No action this PR. 2. **`OLLAMA_KEEP_ALIVE=-1` pins the model in memory indefinitely.** No security downside, but combined with `memswap_limit == mem_limit` it means a memory-pressure event is a hard kill, not graceful degradation — worth a one-line note in the eventual ADR so it reads as intentional, not as an availability oversight. No secure-coding blockers. Ship it once the others' doc fixes land.
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

The fix itself is correct and the staging verification in the PR body is thorough and reproducible (init exits 0, model cached, container healthy, wget from backend returns the model list, one-shot inference within the 8 GB limit). My problem isn't this fix — it's that the bug it fixes was invisible to CI, and this PR doesn't change that.

Concern: the defect class is CI-catchable and currently uncaught

The two root causes — ollama sh -c "…" from an un-overridden ENTRYPOINT, and a curl probe in a curl-less image — are exactly the kind of failure a cheap pipeline step would have caught at #737 time, instead of surfacing live on staging. Manual verification is not a regression guard; the next refactor of this recipe has nothing stopping it from re-breaking.

Minimum gate I'd want (fast, no GPU, no model download):

  1. Config lintdocker compose -f docker-compose.yml config and docker compose -f docker-compose.yml -f docker-compose.prod.yml config in CI. Catches YAML/merge breakage on every PR. <5s.
  2. Init smoke — start only ollama-model-init against a tiny tag-pinned model (or assert it reaches the until ollama list ready state and the entrypoint resolves), with a hard timeout. This is the step that would have caught both the sh ENTRYPOINT bug and the curl-less probe. The full 4.7 GB qwen pull stays out of CI (correctly — DEPLOYMENT.md:282 already warns against --wait on first deploy).

E2E for NL search belongs at the integration layer with a stubbed Ollama, not a real model in CI — but that's a separate ticket.

What's good

  • Verification steps are concrete and re-runnable — that's the raw material for an automated smoke test; it just needs to move from the PR body into the pipeline.
  • The fix is appropriately not gating backend startup on ollama (degrade-to-503), so no new flaky cross-service health dependency enters the E2E stack. Good.

Not blocking the hotfix, but please open a follow-up for the compose-config + init-smoke CI step — this exact bug will recur otherwise.

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** The fix itself is correct and the staging verification in the PR body is thorough and reproducible (init exits 0, model cached, container healthy, `wget` from backend returns the model list, one-shot inference within the 8 GB limit). My problem isn't this fix — it's that **the bug it fixes was invisible to CI**, and this PR doesn't change that. ### Concern: the defect class is CI-catchable and currently uncaught The two root causes — `ollama sh -c "…"` from an un-overridden ENTRYPOINT, and a curl probe in a curl-less image — are exactly the kind of failure a cheap pipeline step would have caught at #737 time, instead of surfacing live on staging. Manual verification is not a regression guard; the next refactor of this recipe has nothing stopping it from re-breaking. Minimum gate I'd want (fast, no GPU, no model download): 1. **Config lint** — `docker compose -f docker-compose.yml config` and `docker compose -f docker-compose.yml -f docker-compose.prod.yml config` in CI. Catches YAML/merge breakage on every PR. <5s. 2. **Init smoke** — start *only* `ollama-model-init` against a tiny tag-pinned model (or assert it reaches the `until ollama list` ready state and the entrypoint resolves), with a hard timeout. This is the step that would have caught both the `sh` ENTRYPOINT bug and the curl-less probe. The full 4.7 GB qwen pull stays out of CI (correctly — `DEPLOYMENT.md:282` already warns against `--wait` on first deploy). E2E for NL search belongs at the integration layer with a stubbed Ollama, not a real model in CI — but that's a separate ticket. ### What's good - Verification steps are concrete and re-runnable — that's the raw material for an automated smoke test; it just needs to move from the PR body into the pipeline. - The fix is appropriately *not* gating `backend` startup on `ollama` (degrade-to-503), so no new flaky cross-service health dependency enters the E2E stack. Good. Not blocking the hotfix, but please open a follow-up for the compose-config + init-smoke CI step — this exact bug will recur otherwise.
Author
Owner

🎨 Leonie Voss — UX & Accessibility Lead

Verdict: Approved

No UI surface in this PR — it's compose topology and a backend timeout. Nothing in my lane (brand, contrast, touch targets, 320px layout, focus, ARIA) is touched.

One thing I'll note from the user's seat, since this fix is about NL search availability: the failure mode users hit was the 503 → "Intelligente Suche nicht verfügbar" message. That copy is the right pattern — a calm, localized, plain-language degradation rather than a raw error — and it's mapped via getErrorMessage(SMART_SEARCH_UNAVAILABLE), unchanged here. Good that the system degrades visibly and in German rather than failing silently; that matters most for the 60+ audience who won't dig into a console.

Nothing to fix. LGTM.

## 🎨 Leonie Voss — UX & Accessibility Lead **Verdict: ✅ Approved** No UI surface in this PR — it's compose topology and a backend timeout. Nothing in my lane (brand, contrast, touch targets, 320px layout, focus, ARIA) is touched. One thing I'll note from the user's seat, since this fix is about NL search availability: the failure mode users hit was the **503 → "Intelligente Suche nicht verfügbar"** message. That copy is the right pattern — a calm, localized, plain-language degradation rather than a raw error — and it's mapped via `getErrorMessage(SMART_SEARCH_UNAVAILABLE)`, unchanged here. Good that the system degrades visibly and in German rather than failing silently; that matters most for the 60+ audience who won't dig into a console. Nothing to fix. LGTM.
Author
Owner

📋 Elicit — Requirements Engineer & Business Analyst

Verdict: ⚠️ Approved with concerns

As a brownfield bug-fix this is well-formed: clear problem statement, named root causes, explicit Fixes #758 traceability, and a verification section that reads as acceptance evidence (model cached, container healthy, end-to-end reachability from backend, inference within the memory budget). The "Out of scope" section is exactly the scope-creep discipline I want to see — defects discovered mid-fix are deferred to a tracked issue instead of bloating this PR.

Concerns

  1. Requirements debt: the "out of scope" bullets are prose, not tracked work. Three real follow-ups live only as bullet points in this PR description — (a) RestClientOllamaClient sends no Authorization header / OLLAMA_API_KEY omitted, (b) DEPLOYMENT.md NL-search hardware-tier + env rows, (c) Prometheus ollama scrape job + Grafana latency dashboard. The PR says they're "tracked in #758," but #758 is the bug this PR closes. If #758 closes on merge, those items vanish. Confirm each has its own open issue (security-auth, docs, observability) before this merges, or they become invisible debt. Nora and Sara independently arrived at (a) and (c) — that's corroboration they're real backlog items, not nice-to-haves.

  2. Acceptance-criteria gap on the timeout change. Bumping timeout-seconds 30 → 60 is a non-functional change with a user-visible consequence (worst-case wait before a 503). The stated rationale — "~18s warm, higher ceiling absorbs cold load" — is good, but it's an implicit NFR. Make it explicit somewhere durable: "NL search shall return a result or a 503 within 60s; cold-start after an Ollama restart is the only path that approaches this ceiling." And it collides with DEPLOYMENT.md:616 still saying 30 (flagged by Markus/Felix) — a doc that contradicts the shipped value is a traceability defect.

Verdict rationale

The fix is sound and shippable. My concerns are about what happens to the deferred scope — make sure it's in the tracker, not just the PR prose — and closing the 30-vs-60 documentation contradiction. Neither blocks the hotfix value; both protect against silent requirements debt.

## 📋 Elicit — Requirements Engineer & Business Analyst **Verdict: ⚠️ Approved with concerns** As a brownfield bug-fix this is well-formed: clear problem statement, named root causes, explicit `Fixes #758` traceability, and a verification section that reads as acceptance evidence (model cached, container healthy, end-to-end reachability from backend, inference within the memory budget). The "Out of scope" section is exactly the scope-creep discipline I want to see — defects discovered mid-fix are deferred to a tracked issue instead of bloating this PR. ### Concerns 1. **Requirements debt: the "out of scope" bullets are prose, not tracked work.** Three real follow-ups live only as bullet points in this PR description — (a) `RestClientOllamaClient` sends no `Authorization` header / `OLLAMA_API_KEY` omitted, (b) `DEPLOYMENT.md` NL-search hardware-tier + env rows, (c) Prometheus `ollama` scrape job + Grafana latency dashboard. The PR says they're "tracked in #758," but #758 is the bug *this PR closes*. If #758 closes on merge, those items vanish. **Confirm each has its own open issue** (security-auth, docs, observability) before this merges, or they become invisible debt. Nora and Sara independently arrived at (a) and (c) — that's corroboration they're real backlog items, not nice-to-haves. 2. **Acceptance-criteria gap on the timeout change.** Bumping `timeout-seconds` 30 → 60 is a non-functional change with a user-visible consequence (worst-case wait before a 503). The stated rationale — "~18s warm, higher ceiling absorbs cold load" — is good, but it's an implicit NFR. Make it explicit somewhere durable: *"NL search shall return a result or a 503 within 60s; cold-start after an Ollama restart is the only path that approaches this ceiling."* And it collides with `DEPLOYMENT.md:616` still saying `30` (flagged by Markus/Felix) — a doc that contradicts the shipped value is a traceability defect. ### Verdict rationale The fix is sound and shippable. My concerns are about **what happens to the deferred scope** — make sure it's in the tracker, not just the PR prose — and closing the 30-vs-60 documentation contradiction. Neither blocks the hotfix value; both protect against silent requirements debt.
marcel added 6 commits 2026-06-06 20:17:41 +02:00
application.yaml sets app.ollama.timeout-seconds: 60 (raised from 30 to
absorb the cold model load on the first query after an Ollama restart),
but DEPLOYMENT.md still documented 30. A doc that contradicts the shipped
value is a traceability defect.

Addresses #759 review (Markus, Felix, Elicit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docker-compose.prod.yml declares the volume as `ollama-models` (hyphen),
so the compose-project-prefixed name is `archiv-production_ollama-models`,
not the underscored `archiv-production_ollama_models` the model-upgrade
guide documented. The documented `docker volume rm` would not have matched
the real volume.

Addresses #759 review (Tobias #2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The init command unconditionally ran `ollama pull`, which contacts the
registry to verify the manifest digest even when the model is already on
the volume. A host reboot during a registry/upstream-network blip would
then fail init non-zero, the `service_completed_successfully` gate would
never be met, and the ollama service (hence NL search) would stay down
until the registry was reachable again.

Guard the pull with `ollama list | grep -q <model>` so a cached model
exits clean without any registry round-trip.

Addresses #759 review (Tobias #1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the prod hardening in the dev stack: guard the model pull with
`ollama list | grep -q <model>` so an already-cached model exits clean
without a registry round-trip. Keeps dev and prod on one recipe.

Addresses #759 review (Tobias #1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The diagram declared Container(ollama, ...) twice — an alias collision that
renders a duplicate box. It also declared the backend->ollama relationship
twice. Keep the richer 'Ollama LLM Service' declaration and the more
specific 'NL query parsing (POST /api/generate)' relationship; drop the
duplicates.

Addresses #759 review (Markus #2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs(adr): record prod Ollama deployment + keep-alive decision (ADR-034)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m23s
CI / OCR Service Tests (pull_request) Successful in 24s
CI / Backend Unit Tests (pull_request) Successful in 3m52s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 25s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m7s
CI / Unit & Component Tests (push) Successful in 3m23s
CI / OCR Service Tests (push) Successful in 23s
CI / Backend Unit Tests (push) Successful in 3m52s
CI / fail2ban Regex (push) Successful in 46s
CI / Semgrep Security Scan (push) Successful in 23s
CI / Compose Bucket Idempotency (push) Successful in 1m4s
nightly / deploy-staging (push) Successful in 2m44s
ed98729f75
Capture the why behind deploying Ollama to prod/staging compose: the
corrected init recipe (supersedes ADR-028 §10's never-functional curl
loop), the OLLAMA_KEEP_ALIVE=-1 pin (so a future maintainer doesn't
optimize it away and reintroduce the post-idle cold-load 503), the
30->60s timeout NFR, and the memswap==mem hard-OOM trade-off.

Addresses #759 review (Markus #3, Nora #2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — review concerns addressed

Pushed 6 atomic commits to fix/issue-758-ollama-prod-compose addressing the review. Honest note up front: this is infra YAML + Markdown + a PlantUML diagram — there is no executable behaviour to red/green test, so each compose change is verified with docker compose config rather than a unit test (as I flagged in my own review).

Resolved

# Concern (reviewer) Fix Commit
1 Doc drift on the timeout you changedDEPLOYMENT.md:616 said 30, application.yaml says 60 (Markus blocker, Felix, Elicit) Corrected the doc row to 60 with the cold-load rationale 2a0863c
2 Volume name mismatch — docs said archiv-production_ollama_models (underscore); prod volume is ollama-models (hyphen) (Tobias #2) Per your decision, kept the hyphen, fixed the doc to archiv-production_ollama-models f22a1a1
3 ollama pull not resilient to a registry blip — unconditional pull does a registry round-trip even when cached (Tobias #1) Guarded with ollama list | grep -q <model> || ollama pull … in both prod and dev → cached model exits clean, offline-safe re-up a2f37f8 (prod), d7d6d06 (dev)
4 l2-containers.puml declares ollama twice — alias collision (Markus #2) Removed the duplicate Container(ollama, …) and the duplicate Rel(backend, ollama, …); kept the richer declaration db87a64
5 No ADR for the prod-LLM decision (Markus #3, Nora #2) Added ADR-034 — prod deployment, corrected init recipe (supersedes ADR-028 §10–§11), OLLAMA_KEEP_ALIVE=-1 pin rationale, 30→60s NFR, and the memswap==mem hard-OOM trade-off ed98729

Deferred scope — now tracked (Elicit #1)

Since #758 closes when this merges, the "out of scope" prose is now filed as standalone issues:

  • #761 — Ollama observability: Prometheus scrape + container-memory alert + Grafana NL-search latency dashboard (Tobias #3, corroborated by Nora & Sara)
  • #762 — DEPLOYMENT.md NL-search hardware tier + Ollama env-var rows (#737 backlog)

Explicitly not tracked (per maintainer decision this round)

  • CI compose-lint + init-smoke (Sara, Felix #2) — agreed valuable, but the maintainer chose not to file it now.
  • Authorization header on RestClientOllamaClient (Nora #1) — remains noted in the PR body; not filed as a separate issue this round.

Verification

  • docker compose -f docker-compose.yml configOK (exit 0), hardened init command renders as expected.
  • Edited docker-compose.prod.yml is well-formed YAML (yaml.safe_load passes).
  • l2-containers.puml now has exactly one Container(ollama, …) and one Rel(backend, ollama, …).

⚠️ Pre-existing, out of scope: docker compose -f docker-compose.yml -f docker-compose.prod.yml config fails on services.ocr-service.security_opt items at 0 and 1 are equal — both base and prod files define the same single-item security_opt for ocr-service, which a newer compose version flags as a duplicate on merge. This is untouched by this PR (present on main) and unrelated to Ollama; it's exactly the class of thing the deferred CI compose-lint (#761-adjacent / Sara's concern) would surface. Flagging rather than silently fixing.

🤖 Generated with Claude Code

## 👨‍💻 Felix Brandt — review concerns addressed Pushed 6 atomic commits to `fix/issue-758-ollama-prod-compose` addressing the review. Honest note up front: this is infra YAML + Markdown + a PlantUML diagram — there is no executable behaviour to red/green test, so each compose change is verified with `docker compose config` rather than a unit test (as I flagged in my own review). ### Resolved | # | Concern (reviewer) | Fix | Commit | |---|---|---|---| | 1 | **Doc drift on the timeout you changed** — `DEPLOYMENT.md:616` said `30`, `application.yaml` says `60` (Markus **blocker**, Felix, Elicit) | Corrected the doc row to `60` with the cold-load rationale | `2a0863c` | | 2 | **Volume name mismatch** — docs said `archiv-production_ollama_models` (underscore); prod volume is `ollama-models` (hyphen) (Tobias #2) | Per your decision, **kept the hyphen, fixed the doc** to `archiv-production_ollama-models` | `f22a1a1` | | 3 | **`ollama pull` not resilient to a registry blip** — unconditional pull does a registry round-trip even when cached (Tobias #1) | Guarded with `ollama list \| grep -q <model> \|\| ollama pull …` in **both** prod and dev → cached model exits clean, offline-safe re-up | `a2f37f8` (prod), `d7d6d06` (dev) | | 4 | **`l2-containers.puml` declares `ollama` twice** — alias collision (Markus #2) | Removed the duplicate `Container(ollama, …)` and the duplicate `Rel(backend, ollama, …)`; kept the richer declaration | `db87a64` | | 5 | **No ADR for the prod-LLM decision** (Markus #3, Nora #2) | Added **ADR-034** — prod deployment, corrected init recipe (supersedes ADR-028 §10–§11), `OLLAMA_KEEP_ALIVE=-1` pin rationale, 30→60s NFR, and the `memswap==mem` hard-OOM trade-off | `ed98729` | ### Deferred scope — now tracked (Elicit #1) Since #758 closes when this merges, the "out of scope" prose is now filed as standalone issues: - **#761** — Ollama observability: Prometheus scrape + container-memory alert + Grafana NL-search latency dashboard (Tobias #3, corroborated by Nora & Sara) - **#762** — DEPLOYMENT.md NL-search hardware tier + Ollama env-var rows (#737 backlog) ### Explicitly *not* tracked (per maintainer decision this round) - **CI compose-lint + init-smoke** (Sara, Felix #2) — agreed valuable, but the maintainer chose not to file it now. - **`Authorization` header on `RestClientOllamaClient`** (Nora #1) — remains noted in the PR body; not filed as a separate issue this round. ### Verification - `docker compose -f docker-compose.yml config` → **OK** (exit 0), hardened init command renders as expected. - Edited `docker-compose.prod.yml` is well-formed YAML (`yaml.safe_load` passes). - `l2-containers.puml` now has exactly one `Container(ollama, …)` and one `Rel(backend, ollama, …)`. ⚠️ **Pre-existing, out of scope:** `docker compose -f docker-compose.yml -f docker-compose.prod.yml config` fails on `services.ocr-service.security_opt items at 0 and 1 are equal` — both base and prod files define the same single-item `security_opt` for `ocr-service`, which a newer compose version flags as a duplicate on merge. This is untouched by this PR (present on `main`) and unrelated to Ollama; it's exactly the class of thing the deferred CI compose-lint (#761-adjacent / Sara's concern) would surface. Flagging rather than silently fixing. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel merged commit ed98729f75 into main 2026-06-06 20:30:35 +02:00
marcel deleted branch fix/issue-758-ollama-prod-compose 2026-06-06 20:30:35 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#759