Capture the why behind deploying Ollama to prod/staging compose: the corrected init recipe (supersedes ADR-028 §10's never-functional curl loop), the OLLAMA_KEEP_ALIVE=-1 pin (so a future maintainer doesn't optimize it away and reintroduce the post-idle cold-load 503), the 30->60s timeout NFR, and the memswap==mem hard-OOM trade-off. Addresses #759 review (Markus #3, Nora #2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-034: Ollama in production — deployment, keep-alive pinning, and corrected init recipe
Date: 2026-06-06
Status: Accepted
Deciders: Marcel Raddatz
Relates to: #758 (bug), #759 (fix), #737 (NL search infrastructure)
Corrects: ADR-028 §10–§11 (init recipe and readiness probe)
Context
ADR-028 introduced Ollama as a Docker Compose service for NL search and documented its topology, graceful-degradation contract, and memory budget. Two defects survived that work and only surfaced once NL search reached staging (#758):
- Ollama was added only to the dev docker-compose.yml. Staging and production deploy from the self-contained docker-compose.prod.yml, which had no ollama service. The backend defaults to app.ollama.base-url: http://ollama:11434, so its client bean was active and resolved to a non-existent host → ResourceAccessException → HTTP 503 on every NL search.
- The init recipe documented in ADR-028 §10 never worked. The ollama/ollama image ENTRYPOINT is ollama, so a bare command: sh -c "…" ran as ollama sh -c "…" (unknown command "sh"), and the image ships no curl, so the curl-based readiness loop and the curl healthcheck could never pass.
This ADR records the production deployment decision and the corrected operational
contract. It is also the durable record of why OLLAMA_KEEP_ALIVE=-1 is set, so a
future maintainer does not "optimize" it away and reintroduce the cold-load 503.
Decisions
1. Ollama is a first-class production service
docker-compose.prod.yml now defines ollama + ollama-model-init + the
ollama-models volume, mirroring the dev stack. The graceful-degradation contract from
ADR-028 §3 is preserved: backend has no hard depends_on on ollama, so an absent
or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.
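The original defect was a service missing from the deployed compose file, so a quick pre-deploy check can guard against a regression. The following is a minimal sketch, assuming the Docker Compose v2 CLI is available on the host; `has_service` is a hypothetical helper name, not something defined in the repository.

```shell
#!/bin/sh
# has_service FILE NAME
# Succeeds when the compose file FILE defines a service called NAME.
# `docker compose config --services` prints one service name per line;
# grep -x matches the whole line and -F treats NAME as a literal string.
has_service() {
    docker compose -f "$1" config --services | grep -qxF "$2"
}
```

A possible use is to loop over `ollama` and `ollama-model-init`, call `has_service docker-compose.prod.yml "$name"` for each, and fail the deploy when either is missing.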
2. Corrected init recipe (supersedes ADR-028 §10)
The init container overrides the image entrypoint to a shell and probes readiness with
ollama list (not curl, which the image lacks):
ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
(ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
entrypoint: ["/bin/sh", "-c"]
The pull is guarded by a grep on the cached model list. When the model is already on the
volume, init exits cleanly without any registry round-trip. This makes re-up offline-safe: a host reboot
during a registry/network blip can no longer fail init (which, via
condition: service_completed_successfully, would otherwise block the ollama service
and take NL search down until the registry was reachable again). The same recipe is used
in dev and prod — one mental model.
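The guard can be easier to reason about when it is pulled out into a function. This is only an illustrative restatement of the one-liner above, not the text of the compose file; `ensure_model` and the `MODEL` variable are hypothetical names introduced for the sketch.

```shell
#!/bin/sh
# Model tag from the recipe above; MODEL is an illustrative variable name.
MODEL="${MODEL:-qwen2.5:7b-instruct-q4_K_M}"

# ensure_model TAG
# Pull TAG only when it is absent from the local model cache.
# grep -F matches the tag literally, so the dots in it are not
# treated as regex wildcards.
ensure_model() {
    if ollama list | grep -qF "$1"; then
        echo "cached: $1"     # already on the volume, no registry round-trip
    else
        ollama pull "$1"      # first run only: needs the registry
    fi
}
```

Calling `ensure_model "$MODEL"` after the readiness loop would behave like the guarded pull: with a cached model it never contacts the registry, which is what makes a re-up offline-safe.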
3. Healthcheck uses ollama list (supersedes ADR-028 §11 probe)
healthcheck:
test: ["CMD", "ollama", "list"]
ollama list hits the local API and exits non-zero when the server is down — the correct
probe for a curl-less image. The start_period: 60s rationale from ADR-028 §11 still holds.
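The same probe can be run by hand, for example from a shell inside the ollama container after a deploy. The sketch below is a bounded variant of the readiness loop, loosely mirroring how a healthcheck gives up after a number of retries; `wait_ready` and its arguments are hypothetical and not part of the compose file.

```shell
#!/bin/sh
# wait_ready TRIES [INTERVAL_SECONDS]
# Poll `ollama list` until it succeeds or TRIES attempts have failed.
# Returns 0 as soon as the server answers, 1 if it never does.
wait_ready() {
    tries="$1"
    interval="${2:-1}"
    while [ "$tries" -gt 0 ]; do
        if ollama list >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

For instance, `wait_ready 60 || echo "ollama not ready"` waits for up to roughly one minute, in line with the start_period: 60s window mentioned above.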
4. OLLAMA_KEEP_ALIVE=-1 — pin the model in memory
environment:
OLLAMA_KEEP_ALIVE: "-1"
By default Ollama evicts an idle model after ~5 minutes. The next query then pays a
cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after
any idle period. Pinning the model (-1 = never unload) keeps warm-path latency
predictable (~18 s on CPU). Do not remove this setting: doing so re-introduces the
post-idle cold-load 503.
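One way to confirm the pin after a deploy is to inspect the loaded models with `ollama ps`. The sketch below assumes that `ollama ps` reports `Forever` in its UNTIL column for a model loaded with keep-alive -1; that wording is an assumption about the CLI output and should be verified against the installed Ollama version. `is_pinned` is a hypothetical helper name.

```shell
#!/bin/sh
# is_pinned TAG
# Succeeds when `ollama ps` lists TAG and its row contains "Forever".
# ASSUMPTION: "Forever" is how the UNTIL column renders a model that is
# never unloaded; check this against your Ollama version.
is_pinned() {
    ollama ps | grep -F "$1" | grep -q 'Forever'
}
```

The model only appears in `ollama ps` once it has been loaded, so this check is meaningful after the first NL search query following an Ollama (re)start.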
5. Read timeout raised 30 → 60 s
app.ollama.timeout-seconds is raised from 30 to 60 (application.yaml, mirrored in
DEPLOYMENT.md). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold
model load on the first query after an Ollama (re)start, before §4's pin takes hold.
Implicit NFR made explicit: NL search shall return a result or a 503 within 60 s; the cold-start path immediately after an Ollama restart is the only path that approaches this ceiling.
6. Hard-OOM trade-off (refines ADR-028 §2)
memswap_limit == mem_limit (both ${OLLAMA_MEM_LIMIT:-8g}) disables swap for the
container. Combined with §4's pinned model, a memory-pressure event is a hard OOM-kill,
not graceful latency degradation. This is deliberate — swap-thrashing an LLM is worse
than a clean restart — but it means the 8 GB envelope is a real ceiling. qwen2.5-7B-q4
plus its KV cache under load sits close enough to 8 GB that this needs a Prometheus
memory alert on the ollama container before it bites in production (tracked as
observability follow-up, not in this PR).
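Because the no-swap behaviour depends on the two limits being equal, it may be worth checking the running container rather than trusting the compose file alone. This is a minimal sketch using `docker inspect` and the container's `HostConfig.Memory` and `HostConfig.MemorySwap` fields (both in bytes); `swap_disabled` is a hypothetical helper, and the container name passed to it depends on the compose project.

```shell
#!/bin/sh
# swap_disabled CONTAINER
# Succeeds when the container has a memory limit and its memory+swap
# limit equals it, i.e. the container gets no swap on top of its RAM.
# Returns 2 when the container cannot be inspected.
swap_disabled() {
    limits=$(docker inspect \
        --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}' "$1") || return 2
    mem=${limits% *}        # first field: memory limit in bytes
    memswap=${limits#* }    # second field: memory + swap limit in bytes
    [ "$mem" -gt 0 ] && [ "$mem" -eq "$memswap" ]
}
```

With the default of 8g, both fields would be expected to read 8589934592; a `MemorySwap` of -1 (unlimited swap) or a `Memory` of 0 (no limit) makes the check fail.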
Consequences
Positive
- NL search works on staging/production, not just dev — the actual deploy artifact now matches the documented architecture.
- Re-up is offline-safe: a cached model never depends on registry reachability.
- The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.
Risks and operational implications
- Hard OOM under memory pressure (§6): a Prometheus alert on ollama container memory is required before this is load-bearing in prod. Tracked as an observability follow-up.
- Unauthenticated inference relies entirely on archiv-net isolation (ADR-028 §7/§12, unchanged). Sending an Authorization header from RestClientOllamaClient is a separate durable hardening item, tracked outside this PR.
- ADR-028 §10–§11 describe a recipe that never functioned; this ADR is the authoritative init/healthcheck contract going forward.