fix(infra): deploy Ollama to prod/staging compose + fix broken model-init recipe #759
125
docs/adr/034-ollama-production-deployment-and-keep-alive.md
Normal file
@@ -0,0 +1,125 @@
# ADR-034: Ollama in production: deployment, keep-alive pinning, and corrected init recipe

**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #758 (bug), #759 (fix), #737 (NL search infrastructure)
**Corrects:** ADR-028 §10–§11 (init recipe and readiness probe)

---
## Context

ADR-028 introduced Ollama as a Docker Compose service for NL search and documented
its topology, graceful-degradation contract, and memory budget. Two defects survived
that work and only surfaced once NL search reached staging (#758):

1. **Ollama was added only to the dev `docker-compose.yml`.** Staging and production
   deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama`
   service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its
   client bean was active and resolved to a non-existent host → `ResourceAccessException`
   → HTTP 503 on every NL search.
2. **The init recipe documented in ADR-028 §10 never worked.** The `ollama/ollama` image
   `ENTRYPOINT` is `ollama`, so a bare `command: sh -c "…"` ran as `ollama sh -c "…"`
   (`unknown command "sh"`), and the image ships **no curl**, so the curl-based readiness
   loop and the curl healthcheck could never pass.

This ADR records the production deployment decision and the corrected operational
contract. It is also the durable record of *why* `OLLAMA_KEEP_ALIVE=-1` is set, so a
future maintainer does not "optimize" it away and reintroduce the cold-load 503.

---
## Decisions

### 1. Ollama is a first-class production service

`docker-compose.prod.yml` now defines `ollama` + `ollama-model-init` + the
`ollama-models` volume, mirroring the dev stack. The graceful-degradation contract from
ADR-028 §3 is preserved: `backend` has **no** hard `depends_on` on `ollama`, so an absent
or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.
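
The wiring described above might look roughly like the following. This is a minimal sketch, not the PR's actual file: the service and volume names and the `ollama/ollama` image come from this ADR, while the `backend` image name, the `/root/.ollama` mount path (the Ollama image's default model directory), and all omitted keys are assumptions.

```yaml
# Sketch of the docker-compose.prod.yml wiring; lines marked "assumed" are illustrative.
services:
  ollama-model-init:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # assumed mount path (image default model dir)

  ollama:
    image: ollama/ollama
    depends_on:
      ollama-model-init:
        # The init container must exit 0 before the server starts.
        condition: service_completed_successfully
    volumes:
      - ollama-models:/root/.ollama   # assumed; shares the model cache with init

  backend:
    image: example/backend:latest     # hypothetical image name
    # Deliberately no depends_on entry for ollama: the backend starts either way
    # and answers NL search with a 503 while Ollama is absent or unhealthy.

volumes:
  ollama-models:
```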

### 2. Corrected init recipe (supersedes ADR-028 §10)

The init container overrides the image entrypoint to a shell and probes readiness with
`ollama list` (not curl, which the image lacks):

```sh
# Start the server in the background and wait until the local API answers;
# pull the model only if it is not already in the cached model list.
ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
  (ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
```

```yaml
entrypoint: ["/bin/sh", "-c"]
```

The pull is **guarded by a grep on the cached model list**. When the model is already on
the volume, the init container exits cleanly without any registry round-trip. This makes
re-up offline-safe: a host reboot during a registry/network blip can no longer fail init
(which, via `condition: service_completed_successfully`, would otherwise block the
`ollama` service and take NL search down until the registry was reachable again). The
same recipe is used in dev and prod, so there is one mental model.
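
As a rough illustration of how the entrypoint override and the script fit together in one service definition, the sketch below passes the script as a single `command` list item. That matters because Compose splits a plain string `command` into separate words, and `sh -c` would then execute only the first one. The volume mount path is an assumption (the Ollama image's default model directory); the rest is taken from the recipe above.

```yaml
# Sketch only: init service combining the entrypoint override and the guarded pull.
services:
  ollama-model-init:
    image: ollama/ollama
    entrypoint: ["/bin/sh", "-c"]
    command:
      # One list item, so the whole script reaches `sh -c` as a single argument.
      - |
        ollama serve &
        until ollama list >/dev/null 2>&1; do sleep 1; done
        ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M
    volumes:
      - ollama-models:/root/.ollama   # assumed mount path

volumes:
  ollama-models:
```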

### 3. Healthcheck uses `ollama list` (supersedes ADR-028 §11 probe)

```yaml
healthcheck:
  test: ["CMD", "ollama", "list"]
```

`ollama list` hits the local API and exits non-zero when the server is down, which makes
it the correct probe for a curl-less image. The `start_period: 60s` rationale from
ADR-028 §11 still holds.
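
For orientation, a fuller healthcheck stanza could look like the sketch below. Only `test` and `start_period` are stated in this ADR; the `interval`, `timeout`, and `retries` values are placeholders chosen for illustration and are not taken from the actual compose file.

```yaml
# Sketch only: values marked "assumed" are illustrative, not from the PR.
healthcheck:
  test: ["CMD", "ollama", "list"]   # exits non-zero while the server is down
  start_period: 60s                 # rationale in ADR-028 §11
  interval: 10s                     # assumed
  timeout: 5s                       # assumed
  retries: 5                        # assumed
```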

### 4. `OLLAMA_KEEP_ALIVE=-1`: pin the model in memory

```yaml
environment:
  OLLAMA_KEEP_ALIVE: "-1"
```

By default Ollama evicts an idle model after ~5 minutes. The next query then pays a
cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after
any idle period. Pinning the model (`-1` = never unload) keeps warm-path latency
predictable (~18 s on CPU). **Do not remove this:** doing so reintroduces the post-idle
cold-load 503.

### 5. Read timeout raised 30 → 60 s

`app.ollama.timeout-seconds` is raised from 30 to 60 (`application.yaml`, mirrored in
`DEPLOYMENT.md`). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold
model load on the first query after an Ollama (re)start, before §4's pin takes hold.

**Implicit NFR made explicit:** NL search shall return a result or a 503 within 60 s; the
cold-start path immediately after an Ollama restart is the only path that approaches this
ceiling.
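
In `application.yaml`, the two properties named in this ADR would nest roughly as follows. This is a sketch showing only those two keys; any other `app.ollama` settings are omitted rather than guessed.

```yaml
# Sketch of the relevant application.yaml fragment (other keys omitted).
app:
  ollama:
    base-url: http://ollama:11434   # default; resolves to the compose service name
    timeout-seconds: 60             # raised from 30 to absorb one cold model load
```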

### 6. Hard-OOM trade-off (refines ADR-028 §2)

`memswap_limit == mem_limit` (both `${OLLAMA_MEM_LIMIT:-8g}`) disables swap for the
container. Combined with §4's pinned model, a memory-pressure event is a **hard OOM-kill,
not graceful latency degradation**. This is deliberate, because swap-thrashing an LLM is
worse than a clean restart, but it means the 8 GB envelope is a real ceiling.
`qwen2.5:7b-instruct-q4_K_M` plus its KV cache under load sits close enough to 8 GB that
this needs a Prometheus memory alert on the `ollama` container before it bites in
production (tracked as observability follow-up, not in this PR).
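
Expressed as compose keys, the limit pairing might look like the sketch below. The variable name and default come from this ADR; the placement under the `ollama` service is assumed.

```yaml
# Sketch only: memory envelope for the ollama service.
services:
  ollama:
    mem_limit: ${OLLAMA_MEM_LIMIT:-8g}
    # memswap_limit is the memory + swap total; setting it equal to mem_limit
    # leaves no swap, so exceeding the limit ends in an OOM-kill.
    memswap_limit: ${OLLAMA_MEM_LIMIT:-8g}
```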

---

## Consequences

### Positive

- NL search works on staging/production, not just dev: the actual deploy artifact now
  matches the documented architecture.
- Re-up is offline-safe: a cached model never depends on registry reachability.
- The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.

### Risks and operational implications

- **Hard OOM under memory pressure** (§6): a Prometheus alert on `ollama` container memory
  is required before this is load-bearing in prod. Tracked as an observability follow-up.
- **Unauthenticated inference** relies entirely on `archiv-net` isolation (ADR-028 §7/§12,
  unchanged). Sending an `Authorization` header from `RestClientOllamaClient` is a separate
  durable hardening item, tracked outside this PR.
- ADR-028 §10–§11 describe a recipe that never functioned; this ADR is the authoritative
  init/healthcheck contract going forward.