fix(infra): deploy Ollama to prod/staging compose + fix broken model-init recipe #759
Fixes #758.
Why
NL search returned 503 / "Intelligente Suche nicht verfügbar" on staging because Ollama was never reachable. Two defects, both downstream of #737:
1. Ollama was only defined in the dev `docker-compose.yml`. Staging/prod deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama` service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its client bean was active and hit a non-existent host → `ResourceAccessException` → 503.
2. The model-init recipe was broken. The `ollama/ollama` image's `ENTRYPOINT` is `ollama`, so `command: sh -c "..."` ran as `ollama sh -c "..."` (unknown command "sh"), and the image ships no curl, so the readiness loop and the healthcheck could never pass.

Changes
- `docker-compose.prod.yml`: add `ollama-model-init` + `ollama` services and the `ollama-models` volume, using the corrected recipe:
  - `entrypoint: ["/bin/sh", "-c"]` on the init container; readiness via `until ollama list >/dev/null 2>&1` (no curl)
  - `healthcheck: ["CMD", "ollama", "list"]` (no curl)
  - no `container_name` (prod namespaces by compose project); ADR-019 hardening preserved
- `docker-compose.yml` (dev): fix the same broken entrypoint/command and the curl healthcheck so the dev stack actually starts Ollama.

Verification
`docker compose config` is valid for both files. Corrected recipe deployed to staging and verified end-to-end:
- `ollama-model-init` exits 0; `qwen2.5:7b-instruct-q4_K_M` (4.7 GB) cached in `ollama-models`
- `ollama` container healthy (via `ollama list`)
- `docker exec archiv-staging-backend-1 wget -qO- http://ollama:11434/api/tags` → returns model list
- `ollama run` succeeds within the 8 GB limit

Out of scope (tracked in #758)
- `RestClientOllamaClient` does not send an `Authorization` header, so `OLLAMA_API_KEY` is omitted from the prod service.
- `docs/DEPLOYMENT.md` NL-search hardware tier / env-var rows, Prometheus `ollama` scrape job, Grafana latency dashboard (all from #737).

🤖 Generated with Claude Code
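For orientation, the corrected recipe listed under Changes might look roughly like the fragment below. It is a sketch reconstructed from the description, not the file in the diff: the mount path `/root/.ollama`, the healthcheck timings and the `sleep 1` poll interval are assumptions, and the network and hardening options are left out for brevity.

```yaml
# Sketch only; values marked "assumed" are not taken from the PR diff.
services:
  ollama-model-init:
    image: ollama/ollama:0.30.6
    # The image's ENTRYPOINT is `ollama`, so it must be replaced for a shell to run.
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        # Readiness without curl: poll the CLI until the server answers.
        until ollama list >/dev/null 2>&1; do sleep 1; done
        ollama pull qwen2.5:7b-instruct-q4_K_M
    volumes:
      - ollama-models:/root/.ollama   # assumed model directory

  ollama:
    image: ollama/ollama:0.30.6
    depends_on:
      ollama-model-init:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "ollama", "list"]   # no curl in the image
      interval: 10s                     # assumed
      timeout: 5s                       # assumed
      retries: 5                        # assumed
    volumes:
      - ollama-models:/root/.ollama     # assumed model directory

volumes:
  ollama-models:
```

The single-item list form of `command` matters here: with `/bin/sh -c` as the entrypoint, the whole script has to arrive as one argument.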
NL search recovered after deploy but went 503 again after a few minutes: Ollama unloads the model after its default ~5 min keep-alive, so the next query cold-loads the 4.7 GB model and exceeds the backend's 30s read timeout (ResourceAccessException -> SMART_SEARCH_UNAVAILABLE). Warm inference is ~18s; the cold load after idle is what timed out.
- docker-compose.{prod,yml}: set OLLAMA_KEEP_ALIVE=-1 on the ollama service so the model stays resident and never pays a cold-load penalty during normal operation (verified on staging: `ollama ps` -> UNTIL "Forever"; host has 47 GB free).
- application.yaml: raise app.ollama.timeout-seconds 30 -> 60 so the one unavoidable cold load (first query after an Ollama restart, before the model is pinned) completes instead of timing out.
Refs #758
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

🏛️ Markus Keller — Application Architect
Verdict: ⚠️ Approved with concerns
The topology decision is sound: Ollama stays a single internal container on `archiv-net`, the backend degrades gracefully to 503 when it is absent (no hard `depends_on` from `backend`; verified, correct), and the init/serve split with `condition: service_completed_successfully` is the right ordering primitive. This is boring, self-hosted, monolith-aligned infrastructure, which is exactly what I want. No new transport, no new broker, no premature complexity.

Blockers
1. Doc currency: `docs/DEPLOYMENT.md` is now stale on the timeout you just changed. This PR bumps `app.ollama.timeout-seconds` from 30 → 60 in `application.yaml`, but `DEPLOYMENT.md:616` still documents the default as `30`. Per my doc-currency rule, a config change that contradicts the docs does not merge until they match. One-line fix.

Concerns
2. `l2-containers.puml` declares `ollama` twice. The container diagram (already present from #737, so not introduced here; credit for that) contains two `Container(ollama, …)` entries with the same alias inside `System_Boundary(archiv, …)`: one as `"Ollama LLM Service"` (`ollama/ollama:0.30.6` / port 11434 (internal only)) and a second as `"Ollama"` (Ollama / port 11434). That renders a duplicate box / is a PlantUML alias collision. Pre-existing, but since this PR is the durable Ollama-in-prod change, folding the de-dupe in (or tracking it) keeps the diagram honest.
3. No ADR for the prod-LLM decision. Adding a CPU LLM inference container to production with an 8 GB memory envelope and `OLLAMA_KEEP_ALIVE=-1` (pinned, never unloaded) is an architectural decision with lasting operational consequences (resource sizing, restart behaviour, single-node coupling). A short ADR (context, the keep-alive/pinning trade-off, the "degrade to 503" consequence) would capture the why so a future maintainer doesn't "optimize" the keep-alive away and reintroduce the cold-load 503. Not a merge blocker, but cheap memory for the codebase.

Resolve #1 and I'm green.
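The pinning and memory envelope that such an ADR would document could be expressed along these lines. This is a sketch: the literal `8g` values stand in for whatever limit expressions the compose file actually uses, which are not shown in this thread.

```yaml
# Sketch only; limits are written as literals here for readability.
services:
  ollama:
    image: ollama/ollama:0.30.6
    environment:
      # -1 keeps a loaded model resident instead of unloading it
      # after the default idle period (about 5 minutes).
      OLLAMA_KEEP_ALIVE: "-1"
    mem_limit: 8g
    # Equal to mem_limit, so the container gets no swap and is
    # OOM-killed if it exceeds the 8 GB envelope.
    memswap_limit: 8g
```

Removing the `OLLAMA_KEEP_ALIVE` entry would bring back the idle unload, and with it the cold load that exceeded the backend timeout.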
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Almost entirely infra YAML, so there's little for me to TDD here, and the one code-adjacent change (`application.yaml`) is clean. The comments earn their place: they explain why (cold model load before `OLLAMA_KEEP_ALIVE` pins it; image ships no curl; ENTRYPOINT is `ollama` so it must be overridden) rather than what. That's exactly the comment discipline I'd want: these are non-obvious operational facts, not narration. The root-cause writeup in the PR body is excellent.

Concerns
1. Doc drift on the value you changed. `application.yaml` now sets `timeout-seconds: 60`, but `docs/DEPLOYMENT.md:616` still says `30`. Same finding Markus raised; flagging because "update the doc when you change the code" is on my own pre-PR checklist. Quick fix.
2. The defect class here was untested by construction. Two real bugs (`ollama sh -c "…"` from the un-overridden ENTRYPOINT, and a curl-based readiness/health probe in a curl-less image) shipped in #737 and only surfaced on staging. Neither is a Java/Svelte unit-test gap; the problem is that nothing ever executed the compose recipe in CI. I'll defer the concrete test design to Sara, but from a "red-before-green" standpoint: this fix has no failing-test-turned-green artifact, just manual staging verification. That's acceptable for an infra hotfix, but the durable guard belongs in CI.

Nits (non-blocking)
- `command:` previously cleaned up its background `ollama serve` with `kill $$SERVE_PID`; the new one-shot drops that and relies on container exit to reap it. Fine for a one-shot init container; just noting the intentional behaviour change so it isn't mistaken for an omission.

Good, tight fix. Fix the doc row and I'm satisfied.
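For reference, the keys this review refers to would sit in `application.yaml` roughly as follows. The nesting is inferred from the property names `app.ollama.base-url` and `app.ollama.timeout-seconds`; any other keys under `app.ollama` are omitted.

```yaml
app:
  ollama:
    base-url: http://ollama:11434
    # Raised from 30 so the one cold model load after an Ollama restart
    # can finish before the read timeout fires.
    timeout-seconds: 60
```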
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is my wheelhouse and most of it is done right. Image tag is pinned (`ollama/ollama:0.30.6`, not `:latest`). Named volume `ollama-models` for the weights. Full hardening preserved on both new services (`read_only: true`, `cap_drop: ALL`, `no-new-privileges:true`, tmpfs `/tmp`). `expose: ["11434"]` rather than `ports:` keeps it off the host, which is correct. `depends_on: condition: service_completed_successfully` gates serving on the pull. Resource limits are env-overridable. The `ollama list` health/readiness probe is the right call for a curl-less image. Good work.

Concerns
1. `ollama pull` runs on every `up`/restart and is not resilient to a registry outage. The init command is `ollama serve & until ollama list …; do ollama pull qwen2.5:…`. Even with the model already on the volume, `ollama pull` contacts the registry to verify the manifest digest. If the host reboots during an Ollama-registry or upstream-network blip, the pull errors → init exits non-zero → `service_completed_successfully` is never met → the `ollama` service won't start → NL search is down until the registry is reachable again. `DEPLOYMENT.md:284` claims re-deploy idempotency, but that's about skipping the blob download, not skipping the registry round-trip. Make it offline-safe by pulling only when `ollama list` does not already show the model. Then a cached model means a clean exit without ever touching the network.
2. Volume naming is inconsistent across files and the docs are wrong for prod. Prod names the volume `ollama-models` (hyphen); dev names it `ollama_models` (underscore). Worse, `DEPLOYMENT.md:626-628` documents the prod volume as `archiv-production_ollama_models` (underscore). With this PR the actual volume is `archiv-production_ollama-models` (hyphen), so the documented `docker volume rm` command for a model swap won't match. Either rename the prod volume to `ollama_models` for consistency with dev (my preference: one mental model) or fix the doc. Pick one.
3. `memswap_limit == mem_limit` disables swap → hard OOM-kill if the 8 GB envelope is exceeded. That's a deliberate, defensible choice (swap-thrashing an LLM is worse), but qwen2.5-7B-q4 + KV cache under load sits close enough to 8 GB that I'd want a Prometheus alert on the `ollama` container's memory before this bites in prod. The scrape job is already noted as out-of-scope in #758; just make sure the alert lands with it.
4. Init `mem_limit: 2g` is tight for `ollama serve` + a 4.7 GB pull. The pull streams to disk so RAM stays low, but `ollama serve` plus extraction has been known to spike. It was verified on staging, so this is a "watch it," not a blocker; if init ever OOM-loops, 2g is the first knob.

Fix the volume-name/doc mismatch (#2) and harden the pull (#1) and I'm green.
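One way the offline-safe init from concern #1 might look, as a sketch. It assumes the model tag appears verbatim in the output of `ollama list`, and the `sleep 1` poll interval is illustrative.

```yaml
# Sketch only; volume, network and hardening options are omitted.
services:
  ollama-model-init:
    image: ollama/ollama:0.30.6
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        until ollama list >/dev/null 2>&1; do sleep 1; done
        # Pull only when the model is not already on the volume, so a
        # cached model never needs the registry.
        ollama list | grep -qF "qwen2.5:7b-instruct-q4_K_M" ||
          ollama pull qwen2.5:7b-instruct-q4_K_M
```

With the guard in place, the registry is contacted only on a first deploy or after the volume has been removed.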
🔒 Nora Steiner ("NullX") — Application Security Engineer
Verdict: ✅ Approved with concerns
No injection surface, no secrets in the diff, no new credentials. The hardening posture is strong and consistent: `read_only`, `cap_drop: ALL`, `no-new-privileges`, tmpfs, and, most importantly, `expose: ["11434"]` instead of a host `ports:` mapping. The Ollama API is unauthenticated, but it is reachable only on `archiv-net`, so the network boundary is the control. That's a defensible defense-in-depth design for an internal inference service.

What I checked and cleared
- `expose` (not `ports`) → not bound to the host, not internet-reachable. Confirm Caddy never adds an `ollama` upstream (it doesn't today: `/api/*` → backend, everything else → frontend). Keep it that way; an LLM `/api/generate` reachable from the proxy is an SSRF/DoS amplifier.
- No `OLLAMA_API_KEY`, because `RestClientOllamaClient` sends no `Authorization` header (documented in the PR, tracked in #758). No dangling secret, no hardcoded fallback. Fine, and good that it's explicit rather than shipping a dead/empty key.

Concerns (non-blocking)
1. Unauthenticated inference relies entirely on network isolation. That's acceptable here, but it's a single-layer control. The moment anything else lands on `archiv-net` that processes untrusted input, it can hit `ollama:11434/api/generate` with no auth, which is a prompt-injection / resource-exhaustion vector. The Authorization-header work in #758 is the right durable fix; I'd treat it as a real backlog item, not a someday. No action this PR.
2. `OLLAMA_KEEP_ALIVE=-1` pins the model in memory indefinitely. No security downside, but combined with `memswap_limit == mem_limit` it means a memory-pressure event is a hard kill, not graceful degradation. That is worth a one-line note in the eventual ADR so it reads as intentional, not as an availability oversight.

No secure-coding blockers. Ship it once the others' doc fixes land.
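The hardening posture reviewed above corresponds to compose options along these lines. This is a sketch assembled from the options named in the reviews, not a copy of the diff; the `archiv-net` network is assumed to be defined elsewhere in the file.

```yaml
# Sketch only; image, volumes and resource limits are trimmed.
services:
  ollama:
    image: ollama/ollama:0.30.6
    read_only: true
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    # `expose` only documents the port for containers on the same network;
    # unlike `ports`, it publishes nothing on the host.
    expose:
      - "11434"
    networks:
      - archiv-net
```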
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
The fix itself is correct and the staging verification in the PR body is thorough and reproducible (init exits 0, model cached, container healthy, `wget` from backend returns the model list, one-shot inference within the 8 GB limit). My problem isn't this fix. It's that the bug it fixes was invisible to CI, and this PR doesn't change that.

Concern: the defect class is CI-catchable and currently uncaught
The two root causes (`ollama sh -c "…"` from an un-overridden ENTRYPOINT, and a curl probe in a curl-less image) are exactly the kind of failure a cheap pipeline step would have caught at #737 time, instead of surfacing live on staging. Manual verification is not a regression guard; the next refactor of this recipe has nothing stopping it from re-breaking.

Minimum gate I'd want (fast, no GPU, no model download):
1. `docker compose -f docker-compose.yml config` and `docker compose -f docker-compose.yml -f docker-compose.prod.yml config` in CI. Catches YAML/merge breakage on every PR. <5s.
2. Run `ollama-model-init` against a tiny tag-pinned model (or assert it reaches the `until ollama list` ready state and the entrypoint resolves), with a hard timeout. This is the step that would have caught both the `sh` ENTRYPOINT bug and the curl-less probe. The full 4.7 GB qwen pull stays out of CI (correctly: `DEPLOYMENT.md:282` already warns against `--wait` on first deploy).

E2E for NL search belongs at the integration layer with a stubbed Ollama, not a real model in CI, but that's a separate ticket.

What's good
- No hard dependency of `backend` startup on `ollama` (degrade-to-503), so no new flaky cross-service health dependency enters the E2E stack. Good.

Not blocking the hotfix, but please open a follow-up for the compose-config + init-smoke CI step; this exact bug will recur otherwise.
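A follow-up CI step along these lines could serve as a starting point. Everything here is hypothetical: the thread does not say which CI system the project uses, so the GitHub/Gitea Actions-style syntax, the job name, the runner label and the three-minute timeout are all assumptions. The sketch validates each compose file on its own instead of the merged pair, and the smoke step exercises only the shell entrypoint and the readiness loop against the image, not the compose recipe itself.

```yaml
# Hypothetical workflow; CI system, names and timeouts are assumptions.
name: compose-lint
on: [pull_request]
jobs:
  compose:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate compose files
        run: |
          docker compose -f docker-compose.yml config --quiet
          docker compose -f docker-compose.prod.yml config --quiet
      - name: Init readiness smoke (no model download)
        timeout-minutes: 3
        run: |
          docker run --rm --entrypoint /bin/sh ollama/ollama:0.30.6 -c \
            'ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done'
```

If the prod file requires environment variables to render, the validation step would need placeholder values for them.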
🎨 Leonie Voss — UX & Accessibility Lead
Verdict: ✅ Approved
No UI surface in this PR — it's compose topology and a backend timeout. Nothing in my lane (brand, contrast, touch targets, 320px layout, focus, ARIA) is touched.
One thing I'll note from the user's seat, since this fix is about NL search availability: the failure mode users hit was the 503 → "Intelligente Suche nicht verfügbar" message. That copy is the right pattern (a calm, localized, plain-language degradation rather than a raw error), and it's mapped via `getErrorMessage(SMART_SEARCH_UNAVAILABLE)`, unchanged here. Good that the system degrades visibly and in German rather than failing silently; that matters most for the 60+ audience who won't dig into a console.

Nothing to fix. LGTM.
📋 Elicit — Requirements Engineer & Business Analyst
Verdict: ⚠️ Approved with concerns
As a brownfield bug-fix this is well-formed: clear problem statement, named root causes, explicit `Fixes #758` traceability, and a verification section that reads as acceptance evidence (model cached, container healthy, end-to-end reachability from backend, inference within the memory budget). The "Out of scope" section is exactly the scope-creep discipline I want to see: defects discovered mid-fix are deferred to a tracked issue instead of bloating this PR.

Concerns
1. Requirements debt: the "out of scope" bullets are prose, not tracked work. Three real follow-ups live only as bullet points in this PR description: (a) `RestClientOllamaClient` sends no `Authorization` header / `OLLAMA_API_KEY` omitted, (b) `DEPLOYMENT.md` NL-search hardware-tier + env rows, (c) Prometheus `ollama` scrape job + Grafana latency dashboard. The PR says they're "tracked in #758," but #758 is the bug this PR closes. If #758 closes on merge, those items vanish. Confirm each has its own open issue (security-auth, docs, observability) before this merges, or they become invisible debt. Nora and Sara independently arrived at (a) and (c); that's corroboration they're real backlog items, not nice-to-haves.
2. Acceptance-criteria gap on the timeout change. Bumping `timeout-seconds` 30 → 60 is a non-functional change with a user-visible consequence (worst-case wait before a 503). The stated rationale ("~18s warm, higher ceiling absorbs cold load") is good, but it's an implicit NFR. Make it explicit somewhere durable: "NL search shall return a result or a 503 within 60s; cold-start after an Ollama restart is the only path that approaches this ceiling." And it collides with `DEPLOYMENT.md:616` still saying `30` (flagged by Markus/Felix); a doc that contradicts the shipped value is a traceability defect.

Verdict rationale
The fix is sound and shippable. My concerns are about what happens to the deferred scope — make sure it's in the tracker, not just the PR prose — and closing the 30-vs-60 documentation contradiction. Neither blocks the hotfix value; both protect against silent requirements debt.
👨💻 Felix Brandt — review concerns addressed
Pushed 6 atomic commits to `fix/issue-758-ollama-prod-compose` addressing the review. Honest note up front: this is infra YAML + Markdown + a PlantUML diagram. There is no executable behaviour to red/green test, so each compose change is verified with `docker compose config` rather than a unit test (as I flagged in my own review).

Resolved

| Concern | Fix | Commit |
| --- | --- | --- |
| `DEPLOYMENT.md:616` said `30`, `application.yaml` says `60` (Markus blocker, Felix, Elicit) | Updated to `60` with the cold-load rationale | `2a0863c` |
| `DEPLOYMENT.md` documented `archiv-production_ollama_models` (underscore); prod volume is `ollama-models` (hyphen) (Tobias #2) | Doc corrected to `archiv-production_ollama-models` | `f22a1a1` |
| `ollama pull` not resilient to a registry blip: unconditional pull does a registry round-trip even when cached (Tobias #1) | `ollama list \| grep -q <model> \|\| ollama pull …` in both prod and dev → cached model exits clean, offline-safe re-up | `a2f37f8` (prod), `d7d6d06` (dev) |
| `l2-containers.puml` declares `ollama` twice: alias collision (Markus #2) | Removed the duplicate `Container(ollama, …)` and the duplicate `Rel(backend, ollama, …)`; kept the richer declaration | `db87a64` |
| No ADR for the prod-LLM decision | ADR covering the `OLLAMA_KEEP_ALIVE=-1` pin rationale, 30→60s NFR, and the `memswap==mem` hard-OOM trade-off | `ed98729` |

Deferred scope, now tracked (Elicit #1)
Since #758 closes when this merges, the "out of scope" prose is now filed as standalone issues:
Explicitly not tracked (per maintainer decision this round)
- `Authorization` header on `RestClientOllamaClient` (Nora #1): remains noted in the PR body; not filed as a separate issue this round.

Verification
- `docker compose -f docker-compose.yml config` → OK (exit 0), hardened init command renders as expected.
- `docker-compose.prod.yml` is well-formed YAML (`yaml.safe_load` passes).
- `l2-containers.puml` now has exactly one `Container(ollama, …)` and one `Rel(backend, ollama, …)`.

⚠️ Pre-existing, out of scope: `docker compose -f docker-compose.yml -f docker-compose.prod.yml config` fails on `services.ocr-service.security_opt items at 0 and 1 are equal`. Both base and prod files define the same single-item `security_opt` for `ocr-service`, which a newer compose version flags as a duplicate on merge. This is untouched by this PR (present on `main`) and unrelated to Ollama; it's exactly the class of thing the deferred CI compose-lint (#761-adjacent / Sara's concern) would surface. Flagging rather than silently fixing.

🤖 Generated with Claude Code