familienarchiv/docs/adr/034-ollama-production-deployment-and-keep-alive.md
Marcel ed98729f75
docs(adr): record prod Ollama deployment + keep-alive decision (ADR-034)
Capture the why behind deploying Ollama to prod/staging compose: the
corrected init recipe (supersedes ADR-028 §10's never-functional curl
loop), the OLLAMA_KEEP_ALIVE=-1 pin (so a future maintainer doesn't
optimize it away and reintroduce the post-idle cold-load 503), the
30->60s timeout NFR, and the memswap==mem hard-OOM trade-off.

Addresses #759 review (Markus #3, Nora #2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 20:16:03 +02:00

# ADR-034: Ollama in production — deployment, keep-alive pinning, and corrected init recipe
**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #758 (bug), #759 (fix), #737 (NL search infrastructure)
**Corrects:** ADR-028 §10/§11 (init recipe and readiness probe)
---
## Context
ADR-028 introduced Ollama as a Docker Compose service for NL search and documented
its topology, graceful-degradation contract, and memory budget. Two defects survived
that work and only surfaced once NL search reached staging (#758):
1. **Ollama was added only to the dev `docker-compose.yml`.** Staging and production
deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama`
service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its
client bean was active and resolved to a non-existent host → `ResourceAccessException`
→ HTTP 503 on every NL search.
2. **The init recipe documented in ADR-028 §10 never worked.** The `ollama/ollama` image
`ENTRYPOINT` is `ollama`, so a bare `command: sh -c "…"` ran as `ollama sh -c "…"`
(`unknown command "sh"`), and the image ships **no curl**, so the curl-based readiness
loop and the curl healthcheck could never pass.
This ADR records the production deployment decision and the corrected operational
contract. It is also the durable record of *why* `OLLAMA_KEEP_ALIVE=-1` is set, so a
future maintainer does not "optimize" it away and reintroduce the cold-load 503.
---
## Decisions
### 1. Ollama is a first-class production service
`docker-compose.prod.yml` now defines `ollama` + `ollama-model-init` + the
`ollama-models` volume, mirroring the dev stack. The graceful-degradation contract from
ADR-028 §3 is preserved: `backend` has **no** hard `depends_on` on `ollama`, so an absent
or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.
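The wiring described above can be sketched as a Compose fragment. This is an illustration under assumptions, not a copy of `docker-compose.prod.yml`: the backend image name is hypothetical, image tags are omitted, and `/root/.ollama` is assumed to be the model directory the `ollama/ollama` image uses by default.

```yaml
# Sketch only: not the actual docker-compose.prod.yml.
services:
  ollama-model-init:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # assumed default model directory
    networks: [archiv-net]
    # entrypoint/command: the init recipe from Decision 2 goes here

  ollama:
    image: ollama/ollama
    depends_on:
      ollama-model-init:
        # ollama starts only after init has exited with status 0
        condition: service_completed_successfully
    volumes:
      - ollama-models:/root/.ollama   # same volume, so the pulled model is visible
    networks: [archiv-net]

  backend:
    image: familienarchiv/backend     # hypothetical image name
    networks: [archiv-net]
    # Deliberately no depends_on entry for ollama: the backend starts even
    # when Ollama is absent, and NL search degrades to a clean 503.

volumes:
  ollama-models:

networks:
  archiv-net:
```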
### 2. Corrected init recipe (supersedes ADR-028 §10)
The init container overrides the image entrypoint to a shell and probes readiness with
`ollama list` (not curl, which the image lacks):
```sh
ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
(ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
```
```yaml
entrypoint: ["/bin/sh", "-c"]
```
The pull is **guarded by a grep on the cached model list**. When the model is already on
the volume, the init container exits cleanly without any registry round-trip. This makes re-up offline-safe: a host reboot
during a registry/network blip can no longer fail init (which, via
`condition: service_completed_successfully`, would otherwise block the `ollama` service
and take NL search down until the registry was reachable again). The same recipe is used
in dev and prod — one mental model.
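Put together, the entrypoint override and the guarded pull might sit in the init service roughly as follows. This is a sketch: the multi-line script is an equivalent rewrite of the one-liner above, and the volume mount path is an assumption about the image's default model directory.

```yaml
services:
  ollama-model-init:
    image: ollama/ollama
    # Replace the image's `ollama` entrypoint with a shell; the single
    # command item below is then passed to `sh -c` as the script.
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        # Start a temporary server in the background and wait until it answers.
        ollama serve &
        until ollama list >/dev/null 2>&1; do sleep 1; done
        # Pull only if the model is not already cached on the volume.
        ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' \
          || ollama pull qwen2.5:7b-instruct-q4_K_M
    volumes:
      - ollama-models:/root/.ollama   # assumed default model directory

volumes:
  ollama-models:
```

The script's exit status is that of its last command, so a cached model yields exit 0 without touching the registry, and a failed pull still fails the init container.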
### 3. Healthcheck uses `ollama list` (supersedes ADR-028 §11 probe)
```yaml
healthcheck:
  test: ["CMD", "ollama", "list"]
```
`ollama list` hits the local API and exits non-zero when the server is down — the correct
probe for a curl-less image. The `start_period: 60s` rationale from ADR-028 §11 still holds.
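For orientation, a fuller healthcheck stanza might look like the sketch below. Only the `test` command and `start_period: 60s` come from this ADR and ADR-028 §11; the `interval`, `timeout`, and `retries` values are placeholders chosen for illustration.

```yaml
services:
  ollama:
    healthcheck:
      test: ["CMD", "ollama", "list"]   # exits non-zero while the server is down
      interval: 30s                     # assumed value
      timeout: 10s                      # assumed value
      retries: 3                        # assumed value
      start_period: 60s                 # rationale in ADR-028 §11
```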
### 4. `OLLAMA_KEEP_ALIVE=-1` — pin the model in memory
```yaml
environment:
  OLLAMA_KEEP_ALIVE: "-1"
```
By default Ollama evicts an idle model after ~5 minutes. The next query then pays a
cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after
any idle period. Pinning the model (`-1` = never unload) keeps warm-path latency
predictable (~18 s on CPU). **Do not remove this:** doing so reintroduces the post-idle
cold-load 503.
### 5. Read timeout raised 30 → 60 s
`app.ollama.timeout-seconds` is raised from 30 to 60 (`application.yaml`, mirrored in
`DEPLOYMENT.md`). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold
model load on the first query after an Ollama (re)start, before §4's pin takes hold.
**Implicit NFR made explicit:** NL search shall return a result or a 503 within 60 s; the
cold-start path immediately after an Ollama restart is the only path that approaches this
ceiling.
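As a sketch, the two `app.ollama` properties named in this ADR would appear in `application.yaml` roughly like this. The nesting assumes the usual Spring Boot YAML layout for dotted property names; any other keys in the real file are not shown.

```yaml
app:
  ollama:
    base-url: http://ollama:11434   # default named in the Context section
    timeout-seconds: 60             # raised from 30; the NFR ceiling for NL search
```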
### 6. Hard-OOM trade-off (refines ADR-028 §2)
`memswap_limit == mem_limit` (both `${OLLAMA_MEM_LIMIT:-8g}`) disables swap for the
container. Combined with §4's pinned model, a memory-pressure event is a **hard OOM-kill,
not graceful latency degradation**. This is deliberate — swap-thrashing an LLM is worse
than a clean restart — but it means the 8 GB envelope is a real ceiling. `qwen2.5-7B-q4`
plus its KV cache under load sits close enough to 8 GB that this needs a Prometheus
memory alert on the `ollama` container before it bites in production (tracked as
observability follow-up, not in this PR).
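The follow-up alert could take a shape like the Prometheus rule below. This is a hedged sketch, not the tracked implementation: it assumes cAdvisor-style container metrics are being scraped, that the container carries the label `name="ollama"` (Compose-generated names usually include a project prefix, so the matcher would need adjusting), and the 90% threshold, `for` duration, and alert name are illustrative.

```yaml
groups:
  - name: ollama-memory
    rules:
      - alert: OllamaMemoryNearLimit   # hypothetical alert name
        # Ratio of working-set memory to the container's configured limit.
        # With memswap_limit == mem_limit there is no swap headroom, so
        # crossing the limit means an OOM-kill.
        expr: |
          container_memory_working_set_bytes{name="ollama"}
            / container_spec_memory_limit_bytes{name="ollama"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ollama container is above 90% of its memory limit"
```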
---
## Consequences
### Positive
- NL search works on staging/production, not just dev — the actual deploy artifact now
matches the documented architecture.
- Re-up is offline-safe: a cached model never depends on registry reachability.
- The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.
### Risks and operational implications
- **Hard OOM under memory pressure** (§6): a Prometheus alert on `ollama` container memory
is required before this is load-bearing in prod. Tracked as an observability follow-up.
- **Unauthenticated inference** relies entirely on `archiv-net` isolation (ADR-028 §7/§12,
unchanged). Sending an `Authorization` header from `RestClientOllamaClient` is a separate
durable hardening item, tracked outside this PR.
- ADR-028 §10/§11 describe a recipe that never functioned; this ADR is the authoritative
init/healthcheck contract going forward.