Marcel c0d034c85d docs(adr): add ADR-028 — Ollama Docker Compose service for NL search (#737 )

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-06 14:58:49 +02:00

ADR-028: Ollama Docker Compose service for NL search

Date: 2026-06-06
Status: Accepted
Deciders: Marcel Raddatz
Relates to: #737 (infrastructure), #735 (NL search epic)

Context

Issue #735 introduces natural-language document search, requiring a local LLM to generate embeddings and/or run inference at query time. The family archive stores personal family history — data privacy is non-negotiable, so cloud-based inference APIs are excluded. The production target is a Hetzner CX42 (16 GB RAM, 8 vCPUs, CPU-only, ~32 EUR/month).

Alternatives considered:

| Option | Assessment |
| --- | --- |
| llama.cpp | Rejected: no HTTP API out of the box; requires custom wrapper; higher ops burden |
| vLLM | Rejected: GPU-first; significant overhead on CPU-only hardware; overkill for this scale |
| Cloud APIs (OpenAI, Gemini, etc.) | Rejected: vendor lock-in; per-token cost at scale; data leaves the server, which is unacceptable for a private family archive |
| Ollama | Chosen: self-contained Docker image; built-in HTTP REST API; actively maintained; CPU-compatible; zero egress |

Decision: run Ollama as a Docker Compose service alongside the existing stack.

Decisions

1. Hardware minimums and CPU-only constraint

All inference runs on CPU. The target is the Hetzner CX42 (16 GB RAM, 8 vCPUs).

| Tier | RAM | NL search |
| --- | --- | --- |
| CX42 | 16 GB | Supported: full stack including Ollama |
| CX32 | 8 GB | Disabled: set `APP_OLLAMA_BASE_URL=` (empty) to skip Ollama entirely |
| CX22 | 4 GB | Unsupported for NL search |

2. Memory budget on CX42

| Component | `mem_limit` | Typical active RSS |
| --- | --- | --- |
| OCR service | 12g (hard ceiling) | ~6 GB |
| Ollama | 8g | ~8 GB |
| Total | | ~14 GB active |

memswap_limit on the Ollama service is set to 8g, equal to mem_limit, so the container has no swap allowance and Linux cannot swap model weights out under OCR memory pressure. Swapped-out weights do not crash the container, but they silently degrade inference latency. This mirrors the pattern already applied to the OCR service.

Operational constraint: do NOT run docker-compose.observability.yml continuously alongside both OCR and Ollama on a CX42. The observability stack adds ~2 GB, which leaves no headroom.
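The budget above is simple arithmetic, sketched here in Java so the "no headroom" conclusion can be checked. The class and constant names are hypothetical; the figures are the typical active RSS values from the table and the ~2 GB observability estimate, not measurements.

```java
// Hypothetical helper, not part of the codebase: recomputes the CX42 budget.
final class MemoryBudget {
    static final int HOST_GB = 16;          // CX42 RAM
    static final int OCR_ACTIVE_GB = 6;     // typical active RSS, OCR service
    static final int OLLAMA_ACTIVE_GB = 8;  // typical active RSS, Ollama
    static final int OBSERVABILITY_GB = 2;  // approximate cost of the observability stack

    /** RAM left for the OS, database, and backend after the listed components. */
    static int headroomGb(boolean withObservability) {
        int used = OCR_ACTIVE_GB + OLLAMA_ACTIVE_GB;
        if (withObservability) {
            used += OBSERVABILITY_GB;
        }
        return HOST_GB - used;
    }

    public static void main(String[] args) {
        System.out.println("OCR + Ollama: " + headroomGb(false) + " GB headroom");
        System.out.println("With observability: " + headroomGb(true) + " GB headroom");
    }
}
```

With these numbers the first case leaves 2 GB and the second leaves 0 GB, which is why the observability stack should only run for short diagnostic sessions.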

3. Graceful-degradation contract

app.ollama.base-url absent OR blank → Ollama bean NOT registered → NL search returns HTTP 503 with ErrorCode: NL_SEARCH_UNAVAILABLE.

All unavailability scenarios surface to the caller as this same 503 response: base-url unset, service unreachable, health check failed, and request timeout.

Why not `@ConditionalOnProperty`

@ConditionalOnProperty registers the bean when the property is present but blank (APP_OLLAMA_BASE_URL=). This produces a RestClient with an empty base URL that fails at runtime with an opaque error rather than a clean 503.

Correct condition expression

@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")

When the property is absent, the placeholder resolves to ''; .isBlank() returns true; negation makes the condition false; the bean is not registered. Same result for an explicit empty string (APP_OLLAMA_BASE_URL=).
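The evaluation described above can be reproduced without Spring. The following is a plain-Java sketch of the same logic, not the actual condition class; `BlankCheckDemo`, `registersClient`, and the URL `http://ollama:11434` are illustrative assumptions.

```java
// Plain-Java sketch of how the blank check evaluates; no Spring involved.
final class BlankCheckDemo {
    /**
     * Mirrors "!'${app.ollama.base-url:}'.isBlank()": a missing property
     * falls back to the empty default, then the blank test is negated.
     */
    static boolean registersClient(String rawValue) {
        // The ":" in the placeholder supplies "" when the property is absent.
        String resolved = (rawValue == null) ? "" : rawValue;
        return !resolved.isBlank();
    }

    public static void main(String[] args) {
        System.out.println(registersClient(null));                  // property absent
        System.out.println(registersClient(""));                    // APP_OLLAMA_BASE_URL=
        System.out.println(registersClient("   "));                 // whitespace only
        System.out.println(registersClient("http://ollama:11434")); // hypothetical URL
    }
}
```

Only the last call yields true. A whitespace-only value is treated the same as an empty one, which `@ConditionalOnProperty` would not do.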

4. Backend configuration pattern

Use a @ConfigurationProperties record, not separate @Value injections:

@ConfigurationProperties("app.ollama")
record OllamaProperties(String baseUrl, String apiKey) {}

OllamaProperties is registered unconditionally — it is a plain value holder with no side effects.

@ConditionalOnExpression belongs only on RestClientOllamaClient (the bean that creates a live network client).

Deliberate divergence from the OCR pattern: the OCR service uses @Value-with-default because OCR is always-on and http://ocr-service:8000 is a safe default. Ollama is truly optional — a missing URL means "feature disabled", not "use this default server". There is no safe default Ollama URL.

5. Optional injection

The NL search service uses constructor injection with Optional<OllamaClient>:

private final Optional<OllamaClient> ollamaClient;

When empty (bean not registered), the service method returns 503 immediately:

var client = ollamaClient.orElseThrow(
    () -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "Ollama not configured"));

Prefer this over @Autowired(required = false) with a null check — the null-check pattern is noisy when the service already uses @RequiredArgsConstructor.
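A self-contained sketch of the pattern, with Spring and Lombok removed so it compiles on its own. The interface shape, the `complete` method, and `NlSearchUnavailableException` are stand-ins assumed for illustration; the real code uses `DomainException` and `ErrorCode.NL_SEARCH_UNAVAILABLE` as shown above.

```java
import java.util.Optional;

// Stand-ins for the project's types; names and shapes here are assumptions.
interface OllamaClient {
    String complete(String prompt);
}

final class NlSearchUnavailableException extends RuntimeException {
    NlSearchUnavailableException(String message) {
        super(message);
    }
}

final class NlSearchService {
    private final Optional<OllamaClient> ollamaClient;

    // With Spring, an empty Optional is injected when no OllamaClient bean exists.
    NlSearchService(Optional<OllamaClient> ollamaClient) {
        this.ollamaClient = ollamaClient;
    }

    String search(String query) {
        // Fail fast when the client bean was never registered.
        OllamaClient client = ollamaClient.orElseThrow(
            () -> new NlSearchUnavailableException("Ollama not configured"));
        return client.complete(query);
    }
}
```

Constructing the service with `Optional.empty()` makes every `search` call throw before any network work is attempted, which is the behavior the 503 mapping relies on.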

6. Empty API key guard

RestClientOllamaClient omits the Authorization header entirely when apiKey is blank:

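// apiKey may be null when app.ollama.api-key is unset, so check before isBlank()
if (apiKey != null && !apiKey.isBlank()) {
    request.header("Authorization", "Bearer " + apiKey);
}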

Sending Authorization: Bearer (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the trainingToken guard in RestClientOcrClient.java:107.
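The guard can be exercised in isolation. This sketch uses the JDK's `java.net.http.HttpRequest` in place of Spring's RestClient so that it runs without dependencies; `AuthHeaderGuard` and `build` are hypothetical names, and the null check assumes the key may be unset.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of the empty-key guard using the JDK HTTP client as a stand-in.
final class AuthHeaderGuard {
    static HttpRequest build(URI uri, String apiKey) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        // Attach the header only when there is a real token to send.
        if (apiKey != null && !apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```

Inspecting `build(uri, "").headers()` shows no Authorization entry at all, rather than a header with an empty token.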

7. OLLAMA_API_KEY empty-string behavior

TBD: Empirical verification pending on Ollama 0.6.5.

Unknown: whether OLLAMA_API_KEY= (explicit empty string) is treated as "no auth" (unauthenticated requests accepted) or "invalid key" (all requests rejected). Both the empty-string and fully-unset cases must be tested.

If empty-string rejects requests, the .env.example comment "Leave empty to run unauthenticated" must be corrected and this ADR updated.

Action item: run empirical test (OLLAMA_API_KEY= vs # OLLAMA_API_KEY in env) and record result before merging PR.

8. read_only: true feasibility

TBD: Investigation pending on Ollama 0.6.5.

Test command:

# The image's entrypoint is the ollama binary, so override it to run a shell
docker run --rm --read-only \
  --entrypoint sh \
  -v ollama_models:/root/.ollama \
  --tmpfs /tmp \
  ollama/ollama:0.6.5 \
  -c "ollama serve & sleep 3 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"

All three operations (serve, pull, list) must pass to confirm no hidden write paths. Ollama may write to /root/.config/ollama, /var/run, or /tmp/ollama*.

If test succeeds: add read_only: true to the ollama service; document the tmpfs size needed.
If test fails: document which paths require writes and why read_only cannot be applied.

Action item: run investigation before merging PR.

9. Peak RSS of init container during pull

TBD: Investigation pending.

The ollama-model-init container currently has mem_limit: 2g. If peak RSS during qwen2.5:7b-instruct-q4_K_M pull exceeds 2 GB, bump to 4 GB.

Action item: capture docker stats output during pull and record peak RSS here before merging PR.

10. Init container pull mechanism

The ollama-model-init container uses a curl-based readiness loop with captured PID:

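# Start the server in the background and remember its PID
ollama serve & SERVE_PID=$!
# Poll until the API answers; discard the JSON body to keep logs quiet
until curl -sf -o /dev/null http://localhost:11434/api/tags; do sleep 1; done
ollama pull qwen2.5:7b-instruct-q4_K_M
# Stop the background server by PID
kill $SERVE_PID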

kill %1 (job-control syntax) is unreliable in non-interactive sh -c contexts. Capturing the PID via SERVE_PID=$! is reliable.

The same endpoint (/api/tags) is used for both the init container readiness loop and the main service healthcheck.
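If the backend or an integration test ever needs the same readiness check, the loop might be expressed in Java roughly as follows. This is a sketch under assumptions: `ReadinessLoop`, `waitUntilReady`, and `httpProbe` are hypothetical names, only 2xx responses count as ready, and unlike the shell loop it gives up after a bounded number of attempts.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class ReadinessLoop {
    /** Polls the probe until it succeeds or the attempts run out. */
    static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return false;
    }

    /** Probe that reports ready when a GET on the given URI returns a 2xx status. */
    static BooleanSupplier httpProbe(URI uri) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
        return () -> {
            try {
                int status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
                return status >= 200 && status < 300;
            } catch (IOException e) {
                return false; // connection refused or timed out: not ready yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
    }
}
```

A call such as `waitUntilReady(httpProbe(URI.create("http://localhost:11434/api/tags")), 60, 1000)` would approximate the shell loop with a one-minute ceiling. Keeping the probe separate from the loop lets the retry logic be tested without a running server.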

11. start_period: 60s rationale

The model is pre-pulled by ollama-model-init before the main service starts (via condition: service_completed_successfully). At main service startup, Ollama only loads model weights from the named volume and binds port 11434.

60 seconds is appropriate for this cold-start profile. 300 seconds was considered — that would be appropriate if the service pulled the model itself — but overstates actual startup time when the model is already present on the volume.

12. Security threat model

Primary control: archiv-net network isolation. Ollama has no externally exposed port (expose: only, not ports:). The Caddyfile must not route any path to the Ollama service.

Defense-in-depth: OLLAMA_API_KEY guards against lateral movement from a compromised backend container.

Both ollama and ollama-model-init receive the ADR-019 hardening baseline:

cap_drop: [ALL]
security_opt: [no-new-privileges:true]

13. CI exclusion strategy

Docker Compose profiles are not used — they would add developer friction (requiring --profile ... for all local dev commands).

CI uses explicit service selection in docker-compose.ci.yml:

docker compose -f docker-compose.ci.yml up -d db minio create-buckets

Ollama is simply not listed and is never started in CI. A YAML comment on the ollama service block documents this:

# Not started in CI — CI uses explicit service selection
# (docker-compose.ci.yml: db minio create-buckets)

14. ollama_models volume operational note

The ollama_models named volume holds model weights only — fully reproducible by re-pull. No backup is needed.

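If the volume fills after a model upgrade, remove the containers that reference it first (Docker refuses to delete a volume that a container still uses), then delete the volume and start the stack again. Compose normally prefixes volume names with the project name, so confirm the exact name with docker volume ls.

docker compose rm -sf ollama ollama-model-init && docker volume rm ollama_models && docker compose up -d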

The init container re-pulls the model on next startup.

Consequences

Positive

NL search runs entirely on-premises; no data leaves the server and no per-token cloud cost.
Graceful degradation is a first-class concern: smaller or budget-constrained instances can run the app without Ollama with a single env var change.
The init container pattern keeps model pull out of the critical startup path for the main service, giving accurate healthcheck timings.
@ConditionalOnExpression with a blank-check is more correct than @ConditionalOnProperty for optional features with no safe default URL.

Risks and operational implications

Memory pressure: OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with docker stats.
CPU inference latency: qwen2.5:7b-instruct-q4_K_M is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
Three TBD items (OLLAMA_API_KEY empty-string behavior, read_only feasibility, init container peak RSS) must be resolved before the PR is merged. See Decisions §7, §8, §9.
Model upgrades require a docker volume rm to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.
