merge(search): resolve DEPLOYMENT.md conflict — keep setup + upgrade sections

Both the first-time model pull runbook (from this branch) and the model upgrade procedure (from main) belong in DEPLOYMENT.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 16:47:49 +02:00
parent 4c620619d4 7679596c70
commit 9a9e1c4c40
7 changed files with 599 additions and 8 deletions
--- a/docs/adr/028-ollama-docker-compose-service.md
+++ b/docs/adr/028-ollama-docker-compose-service.md
@@ -0,0 +1,239 @@
+# ADR-028: Ollama Docker Compose service for NL search
+
+**Date:** 2026-06-06
+**Status:** Accepted
+**Deciders:** Marcel Raddatz
+**Relates to:** #737 (infrastructure), #735 (NL search epic)
+
+---
+
+## Context
+
+Issue #735 introduces natural-language document search, requiring a local LLM to generate embeddings and/or run inference at query time. The family archive stores personal family history — data privacy is non-negotiable, so cloud-based inference APIs are excluded. The production target is a Hetzner CX42 (16 GB RAM, 8 vCPUs, CPU-only, ~32 EUR/month).
+
+Alternatives considered:
+
+| Option | Assessment |
+|---|---|
+| **llama.cpp** | No HTTP API out of the box; requires custom wrapper; higher ops burden |
+| **vLLM** | GPU-first; significant overhead on CPU-only hardware; overkill for this scale |
+| **Cloud APIs** (OpenAI, Gemini, etc.) | Vendor lock-in; per-token cost at scale; data leaves the server — unacceptable for a private family archive |
+| **Ollama** (chosen) | Self-contained Docker image; built-in HTTP REST API; actively maintained; CPU-compatible; zero egress |
+
+**Decision:** run Ollama as a Docker Compose service alongside the existing stack.
+
+---
+
+## Decisions
+
+### 1. Hardware minimums and CPU-only constraint
+
+All inference runs on CPU. The target is the Hetzner CX42 (16 GB RAM, 8 vCPUs).
+
+| Tier | RAM | NL search |
+|---|---|---|
+| CX42 | 16 GB | Supported — full stack including Ollama |
+| CX32 | 8 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) to skip Ollama entirely |
+| CX22 | 4 GB | Unsupported for NL search |
+
+### 2. Memory budget on CX42
+
+| Component | `mem_limit` | Typical active RSS |
+|---|---|---|
+| OCR service | 12g (hard ceiling) | ~6 GB |
+| Ollama | 8g | ~8 GB |
+| **Total** | | **~14 GB active** |
+
+`memswap_limit` on the Ollama service is set to `8g` (matching `mem_limit`), which gives the container no swap allowance, so Linux cannot page model weights out to swap under OCR memory pressure. Swapped-out model weights do not crash the container but silently degrade inference latency. This mirrors the pattern already applied to the OCR service.
+
+**Operational constraint:** do NOT run `docker-compose.observability.yml` continuously alongside both OCR and Ollama on a CX42. The observability stack adds ~2 GB, which leaves no headroom.
+
+### 3. Graceful-degradation contract
+
+`app.ollama.base-url` absent OR blank → Ollama bean NOT registered → NL search returns HTTP 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE`.
+
+Every unavailability scenario surfaces to the caller as this same 503 response: base-url unset, service unreachable, health check failed, and request timeout.
+
+#### Why not `@ConditionalOnProperty`
+
+`@ConditionalOnProperty` registers the bean when the property is present but blank (`APP_OLLAMA_BASE_URL=`). This produces a `RestClient` with an empty base URL that fails at runtime with an opaque error rather than a clean 503.
+
+#### Correct condition expression
+
+```java
+@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")
+```
+
+When the property is absent, the placeholder resolves to `''`; `.isBlank()` returns `true`; negation makes the condition `false`; the bean is not registered. Same result for an explicit empty string (`APP_OLLAMA_BASE_URL=`).
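+
+The difference between the two conditions can be checked with a small plain-Java model. This is an illustrative sketch, not Spring code: `ConditionModel`, `onProperty`, and `onExpression` are hypothetical names, `http://ollama:11434` is an assumed example URL, and `onProperty` assumes `@ConditionalOnProperty` is used with its default attributes (it matches any present value other than `false`).
+
+```java
+import java.util.Arrays;
+
+class ConditionModel {
+
+    // Models @ConditionalOnProperty with default attributes:
+    // the property must be present and must not equal "false".
+    static boolean onProperty(String value) {
+        return value != null && !"false".equalsIgnoreCase(value);
+    }
+
+    // Models "!'${app.ollama.base-url:}'.isBlank()":
+    // an absent property resolves to the empty default, then the blank check runs.
+    static boolean onExpression(String value) {
+        String resolved = (value == null) ? "" : value;
+        return !resolved.isBlank();
+    }
+
+    public static void main(String[] args) {
+        // null stands for "property absent"
+        for (String v : Arrays.asList(null, "", "   ", "http://ollama:11434")) {
+            String shown = (v == null) ? "<absent>" : "'" + v + "'";
+            System.out.printf("%-24s onProperty=%-5b onExpression=%b%n",
+                    shown, onProperty(v), onExpression(v));
+        }
+    }
+}
+```
+
+In this model the two conditions disagree only for present-but-blank values, which is exactly the `APP_OLLAMA_BASE_URL=` case described above.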
+
+### 4. Backend configuration pattern
+
+Use a `@ConfigurationProperties` record, not separate `@Value` injections:
+
+```java
+@ConfigurationProperties("app.ollama")
+record OllamaProperties(String baseUrl, String apiKey) {}
+```
+
+`OllamaProperties` is registered unconditionally — it is a plain value holder with no side effects.
+
+`@ConditionalOnExpression` belongs **only** on `RestClientOllamaClient` (the bean that creates a live network client).
+
+**Deliberate divergence from the OCR pattern:** the OCR service uses `@Value`-with-default because OCR is always-on and `http://ocr-service:8000` is a safe default. Ollama is truly optional — a missing URL means "feature disabled", not "use this default server". There is no safe default Ollama URL.
+
+### 5. `Optional<OllamaClient>` injection
+
+The NL search service uses constructor injection with `Optional<OllamaClient>`:
+
+```java
+private final Optional<OllamaClient> ollamaClient;
+```
+
+When empty (bean not registered), the service method returns 503 immediately:
+
+```java
+var client = ollamaClient.orElseThrow(
+    () -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "Ollama not configured"));
+```
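+
+Put together, the pattern might look like the following self-contained sketch. It runs without Spring, so several names are stand-ins and should be read as assumptions: the `generate` method on `OllamaClient`, the `NlSearchService` class, and `NlSearchUnavailableException` (used here in place of the project's `DomainException` and `ErrorCode`) are hypothetical.
+
+```java
+import java.util.Optional;
+
+// Hypothetical client interface; the real OllamaClient API is not shown in this ADR.
+interface OllamaClient {
+    String generate(String prompt);
+}
+
+// Stand-in for DomainException with ErrorCode.NL_SEARCH_UNAVAILABLE.
+class NlSearchUnavailableException extends RuntimeException {
+    NlSearchUnavailableException(String message) {
+        super(message);
+    }
+}
+
+class NlSearchService {
+    private final Optional<OllamaClient> ollamaClient;
+
+    // With Spring, the container passes Optional.empty() when no OllamaClient bean exists.
+    NlSearchService(Optional<OllamaClient> ollamaClient) {
+        this.ollamaClient = ollamaClient;
+    }
+
+    String search(String query) {
+        // Fail fast before doing any work when the feature is disabled.
+        OllamaClient client = ollamaClient.orElseThrow(
+                () -> new NlSearchUnavailableException("Ollama not configured"));
+        return client.generate(query);
+    }
+}
+
+class NlSearchDemo {
+    public static void main(String[] args) {
+        OllamaClient stub = prompt -> "echo:" + prompt;
+        System.out.println(new NlSearchService(Optional.of(stub)).search("letters from 1923"));
+        try {
+            new NlSearchService(Optional.empty()).search("letters from 1923");
+        } catch (NlSearchUnavailableException e) {
+            System.out.println("unavailable: " + e.getMessage());
+        }
+    }
+}
+```
+
+The exception-to-503 mapping is assumed to live in the project's existing error handling and is not part of this sketch.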
+
+Prefer this over `@Autowired(required = false)` with a null check — the null-check pattern is noisy when the service already uses `@RequiredArgsConstructor`.
+
+### 6. Empty API key guard
+
+`RestClientOllamaClient` omits the `Authorization` header entirely when `apiKey` is blank:
+
+```java
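+// apiKey may be null when app.ollama.api-key is not set at all, so check for null first
+if (apiKey != null && !apiKey.isBlank()) {
+    request.header("Authorization", "Bearer " + apiKey);
+}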
+```
+
+Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
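+
+The guard can be exercised in isolation with the JDK's `java.net.http.HttpRequest` builder, swapped in here for Spring's `RestClient` so the example runs without dependencies. The class name `AuthHeaderGuard`, the method `buildTagsRequest`, and the base URL are assumptions made for illustration; the sketch also treats a `null` key like a blank one.
+
+```java
+import java.net.URI;
+import java.net.http.HttpRequest;
+
+class AuthHeaderGuard {
+
+    // Builds a GET /api/tags request and adds the Authorization header
+    // only when a non-blank API key is configured.
+    static HttpRequest buildTagsRequest(String baseUrl, String apiKey) {
+        HttpRequest.Builder builder =
+                HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags")).GET();
+        if (apiKey != null && !apiKey.isBlank()) {
+            builder.header("Authorization", "Bearer " + apiKey);
+        }
+        return builder.build();
+    }
+
+    public static void main(String[] args) {
+        String baseUrl = "http://ollama:11434"; // assumed in-network address
+        for (String key : new String[] {null, "", "   ", "testkey99"}) {
+            HttpRequest request = buildTagsRequest(baseUrl, key);
+            System.out.println("key=" + key + " -> Authorization present: "
+                    + request.headers().firstValue("Authorization").isPresent());
+        }
+    }
+}
+```
+
+Only the last key in the loop produces a header; the blank and missing keys leave the request without any `Authorization` line at all.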
+
+### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5 and 0.30.6
+
+**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `OLLAMA_API_KEY` does **not** enforce request authentication in either version.
+
+Test matrix run against `/api/tags`:
+
+| Configuration | No auth header | `Authorization: Bearer ` (empty) | `Authorization: Bearer wrongkey` | `Authorization: Bearer correctkey` |
+|---|---|---|---|---|
+| `OLLAMA_API_KEY=` (empty) | 200 | 200 | — | — |
+| `OLLAMA_API_KEY` unset | 200 | — | — | — |
+| `OLLAMA_API_KEY=testkey99` | 200 | 200 | 200 | 200 |
+
+**Finding:** The `OLLAMA_API_KEY` environment variable is not listed in Ollama's startup config dump and does not gate any HTTP request in either tested version. All configurations — empty string, fully unset, and a real key — accept all requests without authentication.
+
+**Practical implication:** `OLLAMA_API_KEY` provides no defense-in-depth in the tested versions. `archiv-net` network isolation is the only effective security control. The env var is retained in the Compose definition and `.env.example` for forward compatibility if Ollama enables enforcement in a future version, but operators must not rely on it for access control.
+
+**Backend guard still valid:** the `RestClientOllamaClient` code-level guard (omit `Authorization` header when `apiKey.isBlank()`) remains correct behavior regardless — it prevents a malformed `Authorization: Bearer ` header from being sent.
+
+### 8. read_only: true feasibility
+
+**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `read_only: true` works with Ollama. All three operations — `ollama serve`, `ollama pull qwen2.5:7b-instruct-q4_K_M`, and `ollama list` — succeeded with exit code 0 in both versions.
+
+Test run:
+```bash
+docker run --rm --read-only \
+  -v ollama_models:/root/.ollama \
+  --tmpfs /tmp \
+  --entrypoint sh ollama/ollama:0.30.6 \
+  -c "ollama serve & sleep 5 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
+```
+
+**Note:** the entrypoint must be overridden to `sh` for the test command — the container's default entrypoint is `/bin/ollama` and does not accept `sh` as a subcommand. This is a Docker invocation detail; the Compose service definition uses the image's default entrypoint and `command:` override for the init container, which works correctly.
+
+**Result:** `read_only: true` and `tmpfs: - /tmp:size=512m` are applied to both `ollama` and `ollama-model-init`. The `ollama_models` volume handles all persistent writes; no other paths require write access during normal operation.
+
+### 9. Peak RSS of init container during pull
+
+**Empirically verified (2026-06-06):** Peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull was **~108 MiB**.
+
+`docker stats` samples during the pull (15-second intervals):
+
+| Sample | MEM |
+|---|---|
+| 1 | 54.89 MiB |
+| 2 | 66.3 MiB |
+| 5 | 97.25 MiB |
+| 9 | **107.8 MiB** (peak) |
+
+`mem_limit: 2g` is adequate — the model weights stream directly to the named volume; RSS is dominated by the Ollama server process alone (~100 MB), not the model data. No bump to 4 GB needed.
+
+### 10. Init container pull mechanism
+
+The `ollama-model-init` container uses a curl-based readiness loop with captured PID:
+
+```sh
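+# Start the server in the background and remember its PID
+ollama serve & SERVE_PID=$!
+# Wait until the API answers; discard the JSON body
+until curl -sf http://localhost:11434/api/tags >/dev/null; do sleep 1; done
+# Keep the pull's exit status so a failed pull fails the init container
+ollama pull qwen2.5:7b-instruct-q4_K_M; PULL_STATUS=$?
+kill "$SERVE_PID"
+exit "$PULL_STATUS"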
+```
+
+`kill %1` (job-control syntax) is unreliable in non-interactive `sh -c` contexts. Capturing the PID via `SERVE_PID=$!` is reliable.
+
+The same endpoint (`/api/tags`) is used for both the init container readiness loop and the main service `healthcheck`.
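+
+The shell loop above waits indefinitely. Where a bounded wait is preferable, for example in a backend-side startup check, the same poll-and-sleep idea can be sketched in Java. Everything here is hypothetical: `ReadinessPoller` and `waitUntilReady` are illustrative names, and the probe is a plain `BooleanSupplier` rather than a real HTTP call to `/api/tags`.
+
+```java
+import java.time.Duration;
+import java.util.function.BooleanSupplier;
+
+class ReadinessPoller {
+
+    // Calls the probe up to maxAttempts times, sleeping between failed attempts.
+    // Returns true as soon as the probe succeeds, false when attempts run out
+    // or the waiting thread is interrupted.
+    static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, Duration delay) {
+        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
+            if (probe.getAsBoolean()) {
+                return true;
+            }
+            if (attempt < maxAttempts) {
+                try {
+                    Thread.sleep(delay.toMillis());
+                } catch (InterruptedException e) {
+                    Thread.currentThread().interrupt(); // preserve the interrupt flag
+                    return false;
+                }
+            }
+        }
+        return false;
+    }
+
+    public static void main(String[] args) {
+        int[] calls = {0};
+        // Fake probe that becomes ready on the third call.
+        boolean ready = waitUntilReady(() -> ++calls[0] >= 3, 10, Duration.ofMillis(10));
+        System.out.println("ready=" + ready + " after " + calls[0] + " probes");
+    }
+}
+```
+
+A real probe would issue a GET against `/api/tags` and return whether the response status was 200.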
+
+### 11. start_period: 60s rationale
+
+The model is pre-pulled by `ollama-model-init` before the main service starts (via `condition: service_completed_successfully`). At main service startup, Ollama only loads model weights from the named volume and binds port 11434.
+
+60 seconds is appropriate for this cold-start profile. 300 seconds was considered — that would be appropriate if the service pulled the model itself — but overstates actual startup time when the model is already present on the volume.
+
+### 12. Security threat model
+
+**Primary control:** `archiv-net` network isolation. Ollama has no externally exposed port (`expose:` only, not `ports:`). The Caddyfile must not route any path to the Ollama service.
+
+**Note on `OLLAMA_API_KEY`:** Per §7, `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 and provides no authentication barrier against a compromised backend container. `archiv-net` network isolation is the sole effective security control. The env var is retained for forward compatibility only — do not rely on it for access control.
+
+Both `ollama` and `ollama-model-init` receive the ADR-019 hardening baseline:
+
+```yaml
+cap_drop: [ALL]
+security_opt: [no-new-privileges:true]
+```
+
+### 13. CI exclusion strategy
+
+Docker Compose profiles are not used — they would add developer friction (requiring `--profile ...` for all local dev commands).
+
+CI uses explicit service selection in `docker-compose.ci.yml`:
+```bash
+docker compose -f docker-compose.ci.yml up -d db minio create-buckets
+```
+
+Ollama is simply not listed and is never started in CI. A YAML comment on the `ollama` service block documents this:
+
+```yaml
+# Not started in CI — CI uses explicit service selection
+# (docker-compose.ci.yml: db minio create-buckets)
+```
+
+### 14. ollama_models volume operational note
+
+The `ollama_models` named volume holds model weights only — fully reproducible by re-pull. No backup is needed.
+
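+If the volume fills after a model upgrade, remove the Ollama containers first (Docker refuses to delete a volume that a container still references), then drop the volume and bring the stack back up:
+```bash
+docker compose rm -sf ollama ollama-model-init
+docker volume rm ollama_models && docker compose up -d
+```
+The init container re-pulls the model on next startup. Compose normally prefixes volume names with the project name, so check the exact name with `docker volume ls` if `ollama_models` is not found.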
+
+---
+
+## Consequences
+
+### Positive
+
+- NL search runs entirely on-premises; no data leaves the server and no per-token cloud cost.
+- Graceful degradation is a first-class concern: smaller or budget-constrained instances can run the app without Ollama with a single env var change.
+- The init container pattern keeps model pull out of the critical startup path for the main service, giving accurate healthcheck timings.
+- `@ConditionalOnExpression` with a blank-check is more correct than `@ConditionalOnProperty` for optional features with no safe default URL.
+
+### Risks and operational implications
+
+- **Memory pressure:** OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with `docker stats`.
+- **CPU inference latency:** `qwen2.5:7b-instruct-q4_K_M` is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
+- All three empirical TBD items from the original issue spec were resolved — see §7 (OLLAMA_API_KEY not enforced), §8 (`read_only: true` works), §9 (peak RSS ~108 MiB).
+- Model upgrades require a `docker volume rm` to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.