familienarchiv/docs/adr/034-ollama-production-deployment-and-keep-alive.md
Marcel ed98729f75
docs(adr): record prod Ollama deployment + keep-alive decision (ADR-034)
Capture the why behind deploying Ollama to prod/staging compose: the
corrected init recipe (supersedes ADR-028 §10's never-functional curl
loop), the OLLAMA_KEEP_ALIVE=-1 pin (so a future maintainer doesn't
optimize it away and reintroduce the post-idle cold-load 503), the
30->60s timeout NFR, and the memswap==mem hard-OOM trade-off.

Addresses #759 review (Markus #3, Nora #2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 20:16:03 +02:00


ADR-034: Ollama in production — deployment, keep-alive pinning, and corrected init recipe

Date: 2026-06-06
Status: Accepted
Deciders: Marcel Raddatz
Relates to: #758 (bug), #759 (fix), #737 (NL search infrastructure)
Corrects: ADR-028 §10 and §11 (init recipe and readiness probe)


Context

ADR-028 introduced Ollama as a Docker Compose service for NL search and documented its topology, graceful-degradation contract, and memory budget. Two defects survived that work and only surfaced once NL search reached staging (#758):

  1. Ollama was added only to the dev docker-compose.yml. Staging and production deploy from the self-contained docker-compose.prod.yml, which had no ollama service. The backend defaults to app.ollama.base-url: http://ollama:11434, so its client bean was active and resolved to a non-existent host → ResourceAccessException → HTTP 503 on every NL search.
  2. The init recipe documented in ADR-028 §10 never worked. The ollama/ollama image ENTRYPOINT is ollama, so a bare command: sh -c "…" ran as ollama sh -c "…" (unknown command "sh"), and the image ships no curl, so the curl-based readiness loop and the curl healthcheck could never pass.

This ADR records the production deployment decision and the corrected operational contract. It is also the durable record of why OLLAMA_KEEP_ALIVE=-1 is set, so a future maintainer does not "optimize" it away and reintroduce the cold-load 503.


Decisions

1. Ollama is a first-class production service

docker-compose.prod.yml now defines ollama + ollama-model-init + the ollama-models volume, mirroring the dev stack. The graceful-degradation contract from ADR-028 §3 is preserved: backend has no hard depends_on on ollama, so an absent or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.

2. Corrected init recipe (supersedes ADR-028 §10)

The init container overrides the image entrypoint to a shell and probes readiness with ollama list (not curl, which the image lacks):

ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
  (ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
entrypoint: ["/bin/sh", "-c"]

The pull is guarded by a grep on the cached model list. When the model is already on the volume, init exits cleanly without any registry round-trip. This makes re-up offline-safe: a host reboot during a registry/network blip can no longer fail init (which, via condition: service_completed_successfully, would otherwise block the ollama service and take NL search down until the registry was reachable again). The same recipe is used in dev and prod, so there is one mental model.
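As a rough sketch of how the pieces above could fit together in a compose file: the service names, the volume name, the model tag, and the depends_on condition come from this ADR, while the image reference, the /root/.ollama mount path, and the exact command layout are assumptions, not a copy of docker-compose.prod.yml.

```yaml
services:
  ollama-model-init:
    image: ollama/ollama                 # assumption: tag/pinning not shown in this ADR
    entrypoint: ["/bin/sh", "-c"]        # replace the image's `ollama` entrypoint with a shell
    command:
      - |
        # Start a temporary server, wait until it answers, then pull only if the model is not cached.
        ollama serve &
        until ollama list >/dev/null 2>&1; do sleep 1; done
        ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M
    volumes:
      - ollama-models:/root/.ollama      # assumption: default Ollama data directory in the image

  ollama:
    image: ollama/ollama                 # assumption
    depends_on:
      ollama-model-init:
        condition: service_completed_successfully
    volumes:
      - ollama-models:/root/.ollama      # assumption: same mount path as the init container

volumes:
  ollama-models:
```

Because the entrypoint is `/bin/sh -c`, the single command item is handed to the shell as one script, and the container's exit status is that of the last command: zero when the model is already cached or the pull succeeds.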

3. Healthcheck uses ollama list (supersedes ADR-028 §11 probe)

healthcheck:
  test: ["CMD", "ollama", "list"]

ollama list hits the local API and exits non-zero when the server is down — the correct probe for a curl-less image. The start_period: 60s rationale from ADR-028 §11 still holds.
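A fuller healthcheck fragment might look like the following. Only the test command and start_period are taken from this ADR and ADR-028 §11; the interval, timeout, and retries values are illustrative placeholders, not the values used in the repository.

```yaml
services:
  ollama:
    healthcheck:
      test: ["CMD", "ollama", "list"]   # exits non-zero while the server is down
      start_period: 60s                 # rationale carried over from ADR-028 §11
      interval: 10s                     # assumption: placeholder value
      timeout: 5s                       # assumption: placeholder value
      retries: 5                        # assumption: placeholder value
```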

4. OLLAMA_KEEP_ALIVE=-1 — pin the model in memory

environment:
  OLLAMA_KEEP_ALIVE: "-1"

By default Ollama evicts an idle model after ~5 minutes. The next query then pays a cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after any idle period. Pinning the model (-1 = never unload) keeps warm-path latency predictable (~18 s on CPU). Do not remove this setting: doing so re-introduces the post-idle cold-load 503.

5. Read timeout raised 30 → 60 s

app.ollama.timeout-seconds is raised from 30 to 60 (application.yaml, mirrored in DEPLOYMENT.md). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold model load on the first query after an Ollama (re)start, before §4's pin takes hold.

Implicit NFR made explicit: NL search shall return a result or a 503 within 60 s; the cold-start path immediately after an Ollama restart is the only path that approaches this ceiling.
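For orientation, the two backend properties named in this ADR would sit in application.yaml roughly as follows. The property names and values come from the text above; the nested layout is an assumption about how the file is organized.

```yaml
app:
  ollama:
    base-url: http://ollama:11434   # default noted in the Context section
    timeout-seconds: 60             # raised from 30; covers ~18 s warm inference plus one cold load
```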

6. Hard-OOM trade-off (refines ADR-028 §2)

memswap_limit == mem_limit (both ${OLLAMA_MEM_LIMIT:-8g}) disables swap for the container. Combined with §4's pinned model, a memory-pressure event is a hard OOM-kill, not graceful latency degradation. This is deliberate — swap-thrashing an LLM is worse than a clean restart — but it means the 8 GB envelope is a real ceiling. qwen2.5-7B-q4 plus its KV cache under load sits close enough to 8 GB that this needs a Prometheus memory alert on the ollama container before it bites in production (tracked as observability follow-up, not in this PR).
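A minimal compose fragment expressing this trade-off, assuming the limits are set directly on the ollama service (the variable name and default are from this ADR; the placement is a sketch):

```yaml
services:
  ollama:
    mem_limit: ${OLLAMA_MEM_LIMIT:-8g}       # hard memory ceiling for the container
    memswap_limit: ${OLLAMA_MEM_LIMIT:-8g}   # memory + swap total; equal to mem_limit means no swap
```

Since memswap_limit is the combined memory-plus-swap budget, setting it equal to mem_limit leaves zero swap, which is why exceeding the envelope ends in an OOM-kill and not in paging.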


Consequences

Positive

  • NL search works on staging/production, not just dev — the actual deploy artifact now matches the documented architecture.
  • Re-up is offline-safe: a cached model never depends on registry reachability.
  • The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.

Risks and operational implications

  • Hard OOM under memory pressure (§6): a Prometheus alert on ollama container memory is required before this is load-bearing in prod. Tracked as an observability follow-up.
  • Unauthenticated inference relies entirely on archiv-net isolation (ADR-028 §7/§12, unchanged). Sending an Authorization header from RestClientOllamaClient is a separate durable hardening item, tracked outside this PR.
  • ADR-028 §10 and §11 describe a recipe that never functioned; this ADR is the authoritative init/healthcheck contract going forward.