feat(infra): Ollama service in Docker Compose for NL search #737

Closed
opened 2026-06-06 12:15:09 +02:00 by marcel · 1 comment
Owner

Part of epic #735.

Goal

Add Ollama as a new Docker Compose service so the NL search backend can call Qwen 2.5 7B over the internal Docker network. The backend degrades gracefully (503) if Ollama is absent, so this service is optional for environments that don't need NL search (CI, staging without the feature).

Hardware requirement: Running Ollama alongside OCR requires a minimum CX42 (16 GB RAM, ~32 EUR/month). On a CX32 (8 GB), leave APP_OLLAMA_BASE_URL empty to disable NL search. docs/DEPLOYMENT.md hardware table must be updated with an "NL Search" tier row.

Decisions

  • CPU limit default: OLLAMA_CPU_LIMIT=4.0 — safe default for both CX32 and CX42. Operators on CX42 (8 vCPUs) can raise to 7.5 via the env var. The .env.example comment documents both values.
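  • Init container pull mechanism: Use a curl-based readiness loop with a captured PID; kill %1 relies on job control, which is unreliable in a non-interactive sh -c. The init container runs: ollama serve & SERVE_PID=$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $SERVE_PID. Uses the same endpoint as the main service healthcheck. In Compose YAML, each $ in this command must be written as $$ so Compose does not interpolate it, and the image's default /bin/ollama entrypoint must be overridden so the command runs through sh.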
  • CI exclusion: No Docker Compose profiles. docker-compose.ci.yml already excludes Ollama via explicit service selection (up -d db minio create-buckets). Profiles would add unnecessary dev friction. Add a compose comment instead.
  • OLLAMA_API_KEY default: Empty string — consistent with the "empty = disabled/unconfigured" pattern used by SENTRY_DSN and APP_OLLAMA_BASE_URL. Empirically verify Ollama's behavior when OLLAMA_API_KEY= (empty string) AND when OLLAMA_API_KEY is fully unset before shipping — document in ADR-028 whether both are treated as "no key" (unauthenticated) or "invalid key" (rejects all requests). If empty-string rejects requests, the .env.example comment "Leave empty to run unauthenticated" must be corrected.
  • @ConditionalOnProperty vs empty string: @ConditionalOnProperty(matchIfMissing = false) registers the bean when the property is present but blank (APP_OLLAMA_BASE_URL=), producing a RestClient with an empty base URL that fails at runtime. Use @ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()") instead — treats both absent and empty-string as disabled. (Explanation: when absent, the placeholder resolves to ''; .isBlank() returns true; the negation makes the condition false; bean not registered. Same for empty-string.) ADR-028 must document why @Value with a hardcoded default (the OCR pattern) is not appropriate here.
  • start_period for main service: 60s, not 300s. The model is pre-pulled by ollama-model-init before the main service starts (service_completed_successfully), so the main service only has to start and bind port 11434; Ollama reads model weights from the named volume into memory lazily, on the first inference request. 300s was appropriate if the service pulled the model itself; with the init container pattern it overstates the actual cold-start time.
  • memswap_limit: Add memswap_limit: "${OLLAMA_MEM_LIMIT:-8g}" to the main service, matching the OCR service pattern (memswap_limit: 12g). Without this, Linux may swap Ollama model weights under memory pressure from OCR, causing inference to silently degrade to unacceptable latency.
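The two bean conditions compared above can be modeled in plain Java to make the truth table concrete. This is a minimal sketch that does not use Spring: resolvePlaceholder stands in for Spring's ${app.ollama.base-url:} placeholder resolution (empty default), and OllamaConditionModel and its methods are hypothetical names. The contrast method mirrors @ConditionalOnProperty's default rule without havingValue: any present value other than "false" matches.

```java
import java.util.Map;

/**
 * Plain-Java model (not Spring) of
 *   @ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")
 * versus @ConditionalOnProperty("app.ollama.base-url") with no havingValue.
 */
final class OllamaConditionModel {

    /** Models ${app.ollama.base-url:}: an absent property resolves to "". */
    static String resolvePlaceholder(Map<String, String> props, String key) {
        return props.getOrDefault(key, "");
    }

    /** Models the SpEL expression; true means RestClientOllamaClient would be registered. */
    static boolean registersClientBean(Map<String, String> props) {
        String resolved = resolvePlaceholder(props, "app.ollama.base-url");
        return !resolved.isBlank(); // absent and "" (or whitespace) both give false
    }

    /** Models @ConditionalOnProperty's default: present and not "false" matches. */
    static boolean conditionalOnPropertyMatches(Map<String, String> props) {
        String value = props.get("app.ollama.base-url");
        return value != null && !"false".equalsIgnoreCase(value);
    }

    public static void main(String[] args) {
        Map<String, String> absent = Map.of();
        Map<String, String> blank = Map.of("app.ollama.base-url", "");
        Map<String, String> set = Map.of("app.ollama.base-url", "http://ollama:11434");
        System.out.println(registersClientBean(absent));         // false
        System.out.println(registersClientBean(blank));          // false
        System.out.println(registersClientBean(set));            // true
        System.out.println(conditionalOnPropertyMatches(blank)); // true: the broken-RestClient case
    }
}
```

The last line shows why @ConditionalOnProperty alone is insufficient: an empty APP_OLLAMA_BASE_URL= still counts as "present".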

Tasks

Docker Compose service

ollama:
  image: ollama/ollama:0.6.5   # pin — Renovate will bump
  container_name: archive-ollama
  restart: unless-stopped
  # Not started in CI — CI uses explicit service selection (docker-compose.ci.yml: db minio create-buckets)
  expose:
    - "11434"
  networks:
    - archiv-net
  volumes:
    - ollama_models:/root/.ollama
  environment:
    OLLAMA_API_KEY: "${OLLAMA_API_KEY}"
  cpus: "${OLLAMA_CPU_LIMIT:-4.0}"
  mem_limit: "${OLLAMA_MEM_LIMIT:-8g}"
  memswap_limit: "${OLLAMA_MEM_LIMIT:-8g}"
  cap_drop: [ALL]
  security_opt: [no-new-privileges:true]
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 60s  # model is pre-pulled by ollama-model-init; service only needs to start and bind the port
  depends_on:
    ollama-model-init:
      condition: service_completed_successfully

Add ollama_models: to the top-level volumes: block alongside ocr_models: and ocr_cache:.

Pre-PR prerequisite — read_only: true investigation: Before opening the PR, run:

docker run --rm --read-only --entrypoint sh -v ollama_models:/root/.ollama --tmpfs /tmp ollama/ollama:0.6.5 -c "ollama serve & sleep 3 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"

Test all three operations (serve, pull, list) to find all write paths (not just startup). Ollama may write to /root/.config/ollama, /var/run, or /tmp/ollama*. If the test succeeds, add read_only: true and tmpfs: - /tmp:size=... to match the OCR service pattern. Document the result in ADR-028 either way.

Also capture docker stats peak RSS during the pull to verify mem_limit: 2g on the init container is sufficient. If peak exceeds 2 GB, bump to 4 GB.

Init container (model pull)

ollama-model-init:
  image: ollama/ollama:0.6.5
  restart: "no"
  networks:
    - archiv-net
  volumes:
    - ollama_models:/root/.ollama
  mem_limit: 2g
  cap_drop: [ALL]
  security_opt: [no-new-privileges:true]
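  entrypoint: ["/bin/sh", "-c"]  # the image's default entrypoint is /bin/ollama
  # $$ escapes Compose interpolation so the shell receives $! and $SERVE_PID
  command:
    - "ollama serve & SERVE_PID=$$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $$SERVE_PID"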

No container_name — consistent with ocr-volume-init. Network access is required for the model pull; archiv-net is correct. mem_limit: 2g prevents the one-time pull from starving the rest of the stack (verify peak RSS in pre-PR investigation — bump to 4 GB if exceeded).

Environment variables

Add to docker-compose.yml backend environment block:

APP_OLLAMA_BASE_URL: "${APP_OLLAMA_BASE_URL:-}"
APP_OLLAMA_API_KEY: "${OLLAMA_API_KEY}"

Both are required: APP_OLLAMA_BASE_URL enables the conditional bean; APP_OLLAMA_API_KEY is injected as Authorization: Bearer <key> on every request to Ollama (omitted entirely when blank — see Security).

Add to .env.example:

# NL search — leave empty to disable smart search
APP_OLLAMA_BASE_URL=http://ollama:11434

# CPU limit: 4.0 is safe on CX32 (4 vCPUs) and CX42 (8 vCPUs). Raise to 7.5 on CX42 for full throughput.
OLLAMA_CPU_LIMIT=4.0

# Memory limit: requires CX42 (16 GB) to run alongside OCR. Reduce or unset APP_OLLAMA_BASE_URL on smaller hosts.
OLLAMA_MEM_LIMIT=8g

# Ollama API key — restricts inference API to authenticated callers on archiv-net.
# Generate with: openssl rand -hex 32. Leave empty to run unauthenticated (not recommended for production).
# NOTE: Empirically verify OLLAMA_API_KEY= (empty string) vs unset behavior before using this in production.
OLLAMA_API_KEY=

Security

  • Set OLLAMA_API_KEY env var on the Ollama service; require the backend to pass Authorization: Bearer ${OLLAMA_API_KEY} on every request when app.ollama.api-key is non-blank. When blank, the header must be omitted entirely — sending Authorization: Bearer (empty token) has undefined behavior and may lock the backend out. Pattern: if (!apiKey.isBlank()) { request.header("Authorization", "Bearer " + apiKey); } — mirrors RestClientOcrClient.java:107's trainingToken guard.
  • Both ollama and ollama-model-init get ADR-019 hardening: cap_drop: [ALL], security_opt: [no-new-privileges:true]. Investigate read_only: true as above (pre-PR prerequisite).
  • Caddyfile review gate: grep for ollama in the Caddyfile before merging — expose:-only is correct and must not gain a public route.
  • archiv-net isolation is the primary security control (Ollama is not internet-facing); OLLAMA_API_KEY is defense-in-depth against lateral movement from a compromised backend container.
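The blank-key guard described in the Security list can be sketched with the JDK's java.net.http.HttpRequest builder instead of Spring's RestClient, so it runs without dependencies. OllamaRequestFactory is a hypothetical name; only the guard itself (header attached when the key is non-blank, omitted entirely otherwise) comes from this issue.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/**
 * Sketch of the empty-API-key guard: "Authorization: Bearer <key>" is added
 * only when the configured key is non-blank. Hypothetical class, JDK-only.
 */
final class OllamaRequestFactory {
    private final String baseUrl;
    private final String apiKey;

    OllamaRequestFactory(String baseUrl, String apiKey) {
        this.baseUrl = baseUrl;
        // Treat an unset key exactly like a blank one.
        this.apiKey = apiKey == null ? "" : apiKey;
    }

    /** Builds GET <baseUrl>/api/tags, with the auth header only when a key is configured. */
    HttpRequest tagsRequest() {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags")).GET();
        if (!apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest withKey = new OllamaRequestFactory("http://ollama:11434", "s3cret").tagsRequest();
        HttpRequest withoutKey = new OllamaRequestFactory("http://ollama:11434", "").tagsRequest();
        System.out.println(withKey.headers().firstValue("Authorization"));    // Optional[Bearer s3cret]
        System.out.println(withoutKey.headers().firstValue("Authorization")); // Optional.empty
    }
}
```

The unit tests required in the sibling issue can assert on the header value itself, as above, rather than only on its presence.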

Observability — prometheus.yml + Grafana provisioning

No change to docker-compose.observability.yml itself — Prometheus already joins archiv-net and can reach Ollama by service name.

  • Add Prometheus scrape job to infra/observability/prometheus/prometheus.yml:
    - job_name: ollama
      static_configs:
        - targets: ['ollama:11434']
      metrics_path: /metrics
    
    Also fix a pre-existing bug in the same PR: ocr:8000 → ocr-service:8000 (service name mismatch; the current OCR scrape target does not resolve on archiv-net).
  • Grafana panel: inference latency p50/p95 — add as a provisioned JSON dashboard file at infra/observability/grafana/provisioning/dashboards/ollama.json so it's available automatically without manual import.

ADR

Write ADR-028 (next in sequence after ADR-027) before implementation starts. Must cover:

  • Why Ollama over alternatives (llama.cpp, vLLM, cloud API)
  • CPU-only constraint and hardware minimums (CX42 for full-stack with OCR)
  • Memory budget — on CX42: OCR (6 GB prod) + Ollama (8 GB) = 14 GB; the observability stack should not run continuously alongside both
  • Graceful-degradation contract: app.ollama.base-url absent or blank = disabled. The backend uses @ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()"), not @ConditionalOnProperty alone, which activates on an empty string and creates a broken RestClient. SpEL evaluation: absent → placeholder resolves to '' → .isBlank() true → negation false → bean not registered; the same holds for an empty string. When Ollama is unavailable at runtime, return HTTP 503 with ErrorCode: NL_SEARCH_UNAVAILABLE; this applies to ALL unavailability paths (base-url unset, service unreachable, health check failed, request timeout).
  • Backend config: use @ConfigurationProperties("app.ollama") record OllamaProperties(String baseUrl, String apiKey) — not two @Value injections. OllamaProperties is registered unconditionally (it is a simple value holder). The @ConditionalOnExpression belongs only on RestClientOllamaClient — not on the properties record. Documents the deliberate divergence from the OCR @Value-with-default pattern: OCR is always-on with a safe default; Ollama is truly optional with no safe default URL.
  • Optional<OllamaClient> injection: the NL search service injects Optional<OllamaClient>. When empty (bean not registered), the service method returns 503 with NL_SEARCH_UNAVAILABLE without making a network call.
  • Empty API key guard: the client omits the Authorization header entirely when apiKey.isBlank(). Document this alongside the empirical OLLAMA_API_KEY= behavior so both the backend guard and the Ollama server behavior are on record.
  • OLLAMA_API_KEY empty-string behavior: empirically verified result of whether Ollama 0.6.5 treats OLLAMA_API_KEY= (empty) as "no auth" or "invalid key"; document the exact finding for both empty-string and fully-unset cases
  • read_only: true feasibility finding for the Ollama container (result of pre-PR investigation above); peak RSS of init container during pull (to validate or revise mem_limit: 2g)
  • Security threat model: archiv-net isolation as primary control; OLLAMA_API_KEY as defense-in-depth for lateral movement
  • Operational note: if ollama_models volume fills (e.g., after a model upgrade), run docker volume rm ollama_models and re-pull. Volume holds model weights only — fully reproducible, no backup needed.
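The memory budget in the ADR list can be checked with a small helper. This is a hypothetical sketch, not project code: it parses Compose-style limits such as 6g and 8g as GiB and treats the CX42's 16 GB and the CX32's 8 GB as GiB, which is an approximation.

```java
/**
 * Back-of-envelope check of the OCR + Ollama memory budget.
 * Hypothetical helper; sizes use Compose-style suffixes (g = GiB, m = MiB).
 */
final class MemoryBudget {

    /** Parses "8g" or "512m" into MiB. */
    static long toMiB(String limit) {
        String s = limit.trim().toLowerCase();
        long n = Long.parseLong(s.substring(0, s.length() - 1));
        return switch (s.charAt(s.length() - 1)) {
            case 'g' -> n * 1024;
            case 'm' -> n;
            default -> throw new IllegalArgumentException("unsupported unit: " + limit);
        };
    }

    /** MiB left for db, minio, backend and the OS after reserving the OCR and Ollama limits. */
    static long headroomMiB(long hostGiB, String ocrLimit, String ollamaLimit) {
        return hostGiB * 1024 - toMiB(ocrLimit) - toMiB(ollamaLimit);
    }

    public static void main(String[] args) {
        System.out.println(headroomMiB(16, "6g", "8g")); // 2048: CX42 leaves about 2 GiB for everything else
        System.out.println(headroomMiB(8, "6g", "8g"));  // -6144: CX32 cannot hold both
    }
}
```

The roughly 2 GiB left on a CX42 is why the observability stack should not run continuously alongside both services.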

Sibling issue (prerequisite to close this one)

Create or link the sibling issue for the Spring Boot Ollama client (graceful degradation, @ConditionalOnExpression bean, NL_SEARCH_UNAVAILABLE ErrorCode) before this issue is closed. The sibling issue spec must include explicit TDD requirements:

  • Unit test: Ollama client absent → service returns 503 with ErrorCode: NL_SEARCH_UNAVAILABLE (red before any implementation)
  • Unit test: OLLAMA_API_KEY is passed as Authorization: Bearer <key> on every request, verified by asserting the header value matches the configured key — not just that a header exists (red before any implementation)
  • Unit test: when app.ollama.api-key is blank, no Authorization header is present on the outgoing request — mirrors the trainingToken guard in RestClientOcrClient.java:107 (red before any implementation)
  • Spring context slice test: bean is NOT registered when APP_OLLAMA_BASE_URL= (empty string) and when unset — not just unit tests at the service layer
  • Define OllamaClient interface before RestClientOllamaClient implementation, mirroring the OcrClient / RestClientOcrClient pattern. Unit tests mock the interface.
  • Service injects Optional<OllamaClient>; the absent case returns 503 with NL_SEARCH_UNAVAILABLE without a network call. Use .orElseThrow(() -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "...")) — not @Autowired(required = false) with a null check, which is noisier in a @RequiredArgsConstructor service.
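The Optional<OllamaClient> requirement can be sketched as follows. ErrorCode, DomainException (here with an explicit HTTP status field), and the search(String) method are hypothetical stand-ins; the project's real DomainException factory and its status mapping may differ, so confirm that the factory actually used maps NL_SEARCH_UNAVAILABLE to 503.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the project's ErrorCode enum. */
enum ErrorCode { NL_SEARCH_UNAVAILABLE }

/** Hypothetical stand-in for the project's DomainException; carries the HTTP status explicitly. */
final class DomainException extends RuntimeException {
    final ErrorCode code;
    final int httpStatus;

    DomainException(ErrorCode code, int httpStatus, String message) {
        super(message);
        this.code = code;
        this.httpStatus = httpStatus;
    }
}

/** Interface defined before any RestClient implementation, so unit tests can supply a fake. */
interface OllamaClient {
    List<String> search(String query);
}

/** Service sketch: an empty Optional means the client bean was never registered. */
final class NlSearchService {
    private final Optional<OllamaClient> ollamaClient;

    NlSearchService(Optional<OllamaClient> ollamaClient) {
        this.ollamaClient = ollamaClient;
    }

    List<String> search(String query) {
        // Fails fast with 503 and makes no network call when NL search is disabled.
        OllamaClient client = ollamaClient.orElseThrow(() -> new DomainException(
                ErrorCode.NL_SEARCH_UNAVAILABLE, 503, "NL search is not configured"));
        return client.search(query);
    }

    public static void main(String[] args) {
        try {
            new NlSearchService(Optional.empty()).search("Briefe aus 1952");
        } catch (DomainException e) {
            System.out.println(e.httpStatus + " " + e.code); // 503 NL_SEARCH_UNAVAILABLE
        }
        OllamaClient fake = q -> List.of("doc-1");
        System.out.println(new NlSearchService(Optional.of(fake)).search("Briefe aus 1952")); // [doc-1]
    }
}
```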

The ErrorCode requires the standard four-step addition: ErrorCode.java → errors.ts → getErrorMessage() → i18n keys in de/en/es.

NL search UI (linked issue required)

The graceful-degradation frontend state must be tracked as a separate Gitea issue — not just this prose note — or explicitly added to the sibling issue scope. Create and link it before closing this issue.

When the frontend graceful-degradation state is implemented, the message must do three things for the 60+ audience: reassure (app is working), explain (smart search is temporarily off), and tell them what they got instead (keyword results).

Draft i18n messages for all three locales:

  • de: "Smart-Suche momentan nicht verfügbar — Ergebnisse aus normaler Stichwortsuche."
  • en: "Smart search temporarily unavailable — showing keyword results."
  • es: "Búsqueda inteligente temporalmente no disponible — mostrando resultados de palabras clave."

Documentation

  • Update docs/architecture/c4/l2-containers.puml — add Ollama container + ollama_models volume (l1-context.puml unchanged — Ollama is internal, not an external system)
  • Update docs/DEPLOYMENT.md:
    • Env var table rows: APP_OLLAMA_BASE_URL, APP_OLLAMA_API_KEY, OLLAMA_CPU_LIMIT, OLLAMA_MEM_LIMIT, OLLAMA_API_KEY
    • Hardware requirements row: NL Search tier requires CX42 minimum (16 GB RAM, ~32 EUR/month vs. CX32 at ~17 EUR/month)
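    • First-start model pull time (~4 GB: about 53 minutes at 10 Mbps, under 10 minutes at ≥60 Mbps). The pull runs in ollama-model-init, which has no healthcheck, so no start_period bounds it; the main service waits for service_completed_successfully, and its 60s start_period only begins once it launches. One-time manual verification on first deploy: record the measured time in the PR description.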
    • Memory budget note: on CX42, do not run docker-compose.observability.yml continuously alongside both OCR and Ollama active
    • ollama_models volume: model weights only — fully reproducible by re-pull, no backup needed. If volume fills after a model upgrade, run docker volume rm ollama_models and re-pull.

Acceptance Criteria

Automated checks:

  • docker-compose up -d starts the Ollama service alongside the existing stack
  • On subsequent starts (model already in ollama_models volume), the Ollama service reaches service_healthy within 60 seconds
  • Re-running docker compose up -d when the model is already in the volume does not trigger a re-download — verified via docker logs on the init container showing "up to date" or equivalent (not "pulling manifest")
  • The main ollama service does not start until ollama-model-init exits with code 0 (service_completed_successfully)
  • If ollama-model-init exits non-zero (e.g., network outage, disk full, bad model name), docker-compose up reports a service dependency failure for ollama
  • GET http://ollama:11434/api/tags is reachable from within the backend container (docker exec archive-backend curl -sf http://ollama:11434/api/tags)
  • Removing APP_OLLAMA_BASE_URL from the environment does not break any other service; the backend's /actuator/health returns 200 with APP_OLLAMA_BASE_URL unset

Manual verification gates (one-time, record result in PR description):

  • On first start, the model is pulled automatically without manual intervention; the pull completes within 10 minutes on a ≥60 Mbps connection (~4 GB in 10 minutes needs about 53 Mbps; at 10 Mbps expect roughly 53 minutes)
  • docker stats peak RSS during init container pull is within mem_limit: 2g; if exceeded, bump limit and note the measured value
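The pull-time gate above follows from simple arithmetic: download time is size × 8 bits divided by link speed. This hypothetical helper uses the ~4 GB figure from this issue (decimal GB) and ignores protocol overhead and registry throttling, so real pulls will be somewhat slower.

```java
import java.util.Locale;

/** Hypothetical helper for estimating model download time. */
final class PullTimeEstimate {

    /** Minutes to transfer sizeGB (10^9 bytes) over a linkMbps (10^6 bit/s) connection. */
    static double minutes(double sizeGB, double linkMbps) {
        double bits = sizeGB * 1e9 * 8;
        double seconds = bits / (linkMbps * 1e6);
        return seconds / 60;
    }

    /** Minimum link speed in Mbps needed to finish sizeGB within the given minutes. */
    static double requiredMbps(double sizeGB, double minutes) {
        return sizeGB * 1e9 * 8 / (minutes * 60) / 1e6;
    }

    public static void main(String[] args) {
        // Locale.ROOT keeps the decimal point stable regardless of host locale.
        System.out.printf(Locale.ROOT, "%.1f min%n", minutes(4, 10));       // 53.3 min at 10 Mbps
        System.out.printf(Locale.ROOT, "%.1f Mbps%n", requiredMbps(4, 10)); // 53.3 Mbps for 10 minutes
    }
}
```

The two printed values coincide only because both inputs happen to be 10; at 60 Mbps the same 4 GB takes just under 9 minutes.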

Observability stack required:

  • After deploying the observability stack, the Prometheus target ollama:11434 appears as Up
  • The Grafana dashboard shows Ollama inference latency after at least one inference request
Part of epic #735. ## Goal Add Ollama as a new Docker Compose service so the NL search backend can call Qwen 2.5 7B over the internal Docker network. The backend degrades gracefully (503) if Ollama is absent, so this service is optional for environments that don't need NL search (CI, staging without the feature). **Hardware requirement:** Running Ollama alongside OCR requires a minimum CX42 (16 GB RAM, ~32 EUR/month). On a CX32 (8 GB), leave `APP_OLLAMA_BASE_URL` empty to disable NL search. `docs/DEPLOYMENT.md` hardware table must be updated with an "NL Search" tier row. ## Decisions - **CPU limit default:** `OLLAMA_CPU_LIMIT=4.0` — safe default for both CX32 and CX42. Operators on CX42 (8 vCPUs) can raise to `7.5` via the env var. The `.env.example` comment documents both values. - **Init container pull mechanism:** Use a `curl`-based readiness loop with a captured PID — `kill %1` is job-control syntax unreliable in non-interactive `sh -c`. The init container runs: `ollama serve & SERVE_PID=$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $SERVE_PID`. Uses the same endpoint as the main service healthcheck. - **CI exclusion:** No Docker Compose profiles. `docker-compose.ci.yml` already excludes Ollama via explicit service selection (`up -d db minio create-buckets`). Profiles would add unnecessary dev friction. Add a compose comment instead. - **`OLLAMA_API_KEY` default:** Empty string — consistent with the "empty = disabled/unconfigured" pattern used by `SENTRY_DSN` and `APP_OLLAMA_BASE_URL`. Empirically verify Ollama's behavior when `OLLAMA_API_KEY=` (empty string) AND when `OLLAMA_API_KEY` is fully unset before shipping — document in ADR-028 whether both are treated as "no key" (unauthenticated) or "invalid key" (rejects all requests). If empty-string rejects requests, the `.env.example` comment "Leave empty to run unauthenticated" must be corrected. 
- **`@ConditionalOnProperty` vs empty string:** `@ConditionalOnProperty(matchIfMissing = false)` registers the bean when the property is present but blank (`APP_OLLAMA_BASE_URL=`), producing a `RestClient` with an empty base URL that fails at runtime. Use `@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")` instead — treats both absent and empty-string as disabled. (Explanation: when absent, the placeholder resolves to `''`; `.isBlank()` returns true; the negation makes the condition false; bean not registered. Same for empty-string.) ADR-028 must document why `@Value` with a hardcoded default (the OCR pattern) is not appropriate here. - **`start_period` for main service:** `60s` — not 300s. The model is pre-pulled by `ollama-model-init` before the main service starts (`service_completed_successfully`), so the main service only loads model weights from the named volume and binds port 11434. 300s was appropriate if the service pulled the model itself; with the init container pattern it overstates the actual cold-start time. - **`memswap_limit`:** Add `memswap_limit: "${OLLAMA_MEM_LIMIT:-8g}"` to the main service, matching the OCR service pattern (`memswap_limit: 12g`). Without this, Linux may swap Ollama model weights under memory pressure from OCR, causing inference to silently degrade to unacceptable latency. 
## Tasks ### Docker Compose service ```yaml ollama: image: ollama/ollama:0.6.5 # pin — Renovate will bump container_name: archive-ollama restart: unless-stopped # Not started in CI — CI uses explicit service selection (docker-compose.ci.yml: db minio create-buckets) expose: - "11434" networks: - archiv-net volumes: - ollama_models:/root/.ollama environment: OLLAMA_API_KEY: "${OLLAMA_API_KEY}" cpus: "${OLLAMA_CPU_LIMIT:-4.0}" mem_limit: "${OLLAMA_MEM_LIMIT:-8g}" memswap_limit: "${OLLAMA_MEM_LIMIT:-8g}" cap_drop: [ALL] security_opt: [no-new-privileges:true] healthcheck: test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"] interval: 30s timeout: 10s retries: 5 start_period: 60s # model weights are pre-loaded by ollama-model-init; service only needs to bind port depends_on: ollama-model-init: condition: service_completed_successfully ``` Add `ollama_models:` to the top-level `volumes:` block alongside `ocr_models:` and `ocr_cache:`. **Pre-PR prerequisite — `read_only: true` investigation:** Before opening the PR, run: ```sh docker run --rm --read-only -v ollama_models:/root/.ollama --tmpfs /tmp ollama/ollama:0.6.5 sh -c "ollama serve & sleep 3 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list" ``` Test all three operations (`serve`, `pull`, `list`) to find all write paths (not just startup). Ollama may write to `/root/.config/ollama`, `/var/run`, or `/tmp/ollama*`. If the test succeeds, add `read_only: true` and `tmpfs: - /tmp:size=...` to match the OCR service pattern. Document the result in ADR-028 either way. Also capture `docker stats` peak RSS during the pull to verify `mem_limit: 2g` on the init container is sufficient. If peak exceeds 2 GB, bump to 4 GB. ### Init container (model pull) ```yaml ollama-model-init: image: ollama/ollama:0.6.5 restart: "no" networks: - archiv-net volumes: - ollama_models:/root/.ollama mem_limit: 2g cap_drop: [ALL] security_opt: [no-new-privileges:true] command: > sh -c "ollama serve & SERVE_PID=$! 
&& until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $SERVE_PID" ``` No `container_name` — consistent with `ocr-volume-init`. Network access is required for the model pull; `archiv-net` is correct. `mem_limit: 2g` prevents the one-time pull from starving the rest of the stack (verify peak RSS in pre-PR investigation — bump to 4 GB if exceeded). ### Environment variables Add to `docker-compose.yml` backend environment block: ``` APP_OLLAMA_BASE_URL: http://ollama:11434 APP_OLLAMA_API_KEY: "${OLLAMA_API_KEY}" ``` Both are required: `APP_OLLAMA_BASE_URL` enables the conditional bean; `APP_OLLAMA_API_KEY` is injected as `Authorization: Bearer <key>` on every request to Ollama (omitted entirely when blank — see Security). Add to `.env.example`: ``` # NL search — leave empty to disable smart search APP_OLLAMA_BASE_URL=http://ollama:11434 # CPU limit: 4.0 is safe on CX32 (4 vCPUs) and CX42 (8 vCPUs). Raise to 7.5 on CX42 for full throughput. OLLAMA_CPU_LIMIT=4.0 # Memory limit: requires CX42 (16 GB) to run alongside OCR. Reduce or unset APP_OLLAMA_BASE_URL on smaller hosts. OLLAMA_MEM_LIMIT=8g # Ollama API key — restricts inference API to authenticated callers on archiv-net. # Generate with: openssl rand -hex 32. Leave empty to run unauthenticated (not recommended for production). # NOTE: Empirically verify OLLAMA_API_KEY= (empty string) vs unset behavior before using this in production. OLLAMA_API_KEY= ``` ### Security - Set `OLLAMA_API_KEY` env var on the Ollama service; require the backend to pass `Authorization: Bearer ${OLLAMA_API_KEY}` on every request **when `app.ollama.api-key` is non-blank**. When blank, the header must be omitted entirely — sending `Authorization: Bearer ` (empty token) has undefined behavior and may lock the backend out. Pattern: `if (!apiKey.isBlank()) { request.header("Authorization", "Bearer " + apiKey); }` — mirrors `RestClientOcrClient.java:107`'s `trainingToken` guard. 
- Both `ollama` and `ollama-model-init` get ADR-019 hardening: `cap_drop: [ALL]`, `security_opt: [no-new-privileges:true]`. Investigate `read_only: true` as above (pre-PR prerequisite). - Caddyfile review gate: grep for `ollama` in the Caddyfile before merging — `expose:`-only is correct and must not gain a public route. - `archiv-net` isolation is the primary security control (Ollama is not internet-facing); `OLLAMA_API_KEY` is defense-in-depth against lateral movement from a compromised backend container. ### Observability — prometheus.yml + Grafana provisioning No change to `docker-compose.observability.yml` itself — Prometheus already joins `archiv-net` and can reach Ollama by service name. - [ ] Add Prometheus scrape job to `infra/observability/prometheus/prometheus.yml`: ```yaml - job_name: ollama static_configs: - targets: ['ollama:11434'] metrics_path: /metrics ``` Also fix pre-existing bug in the same PR: `ocr:8000` → `ocr-service:8000` (service name mismatch — current OCR scrape does not resolve on `archiv-net`). - [ ] Grafana panel: inference latency p50/p95 — add as a provisioned JSON dashboard file at `infra/observability/grafana/provisioning/dashboards/ollama.json` so it's available automatically without manual import. ### ADR Write ADR-028 (next in sequence after ADR-027) **before implementation starts**. 
Must cover: - Why Ollama over alternatives (llama.cpp, vLLM, cloud API) - CPU-only constraint and hardware minimums (CX42 for full-stack with OCR) - Memory budget — on CX42: OCR (6 GB prod) + Ollama (8 GB) = 14 GB; the observability stack should not run continuously alongside both - Graceful-degradation contract: `app.ollama.base-url` absent or blank = disabled; backend uses `@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")` (not `@ConditionalOnProperty` alone — it activates on empty-string, creating a broken `RestClient`); SpEL expression explanation: absent → resolves to `''` → `.isBlank()` true → negation false → bean not registered; same for empty-string; when Ollama is unavailable at runtime, return HTTP 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE` — this applies to ALL unavailability paths (base-url unset, service unreachable, health check failed, request timeout) - Backend config: use `@ConfigurationProperties("app.ollama") record OllamaProperties(String baseUrl, String apiKey)` — not two `@Value` injections. `OllamaProperties` is registered unconditionally (it is a simple value holder). The `@ConditionalOnExpression` belongs only on `RestClientOllamaClient` — not on the properties record. Documents the deliberate divergence from the OCR `@Value`-with-default pattern: OCR is always-on with a safe default; Ollama is truly optional with no safe default URL. - `Optional<OllamaClient>` injection: the NL search service injects `Optional<OllamaClient>`. When empty (bean not registered), the service method returns 503 with `NL_SEARCH_UNAVAILABLE` without making a network call. - Empty API key guard: the client omits the `Authorization` header entirely when `apiKey.isBlank()`. Document this alongside the empirical `OLLAMA_API_KEY=` behavior so both the backend guard and the Ollama server behavior are on record. 
- `OLLAMA_API_KEY` empty-string behavior: empirically verified result of whether Ollama 0.6.5 treats `OLLAMA_API_KEY=` (empty) as "no auth" or "invalid key"; document the exact finding for both empty-string and fully-unset cases - `read_only: true` feasibility finding for the Ollama container (result of pre-PR investigation above); peak RSS of init container during pull (to validate or revise `mem_limit: 2g`) - Security threat model: `archiv-net` isolation as primary control; `OLLAMA_API_KEY` as defense-in-depth for lateral movement - Operational note: if `ollama_models` volume fills (e.g., after a model upgrade), run `docker volume rm ollama_models` and re-pull. Volume holds model weights only — fully reproducible, no backup needed. ### Sibling issue (prerequisite to close this one) Create or link the sibling issue for the Spring Boot Ollama client (graceful degradation, `@ConditionalOnExpression` bean, `NL_SEARCH_UNAVAILABLE` ErrorCode) before this issue is closed. The sibling issue spec must include explicit TDD requirements: - Unit test: Ollama client absent → service returns 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE` (red before any implementation) - Unit test: `OLLAMA_API_KEY` is passed as `Authorization: Bearer <key>` on every request, verified by asserting the header value matches the configured key — not just that a header exists (red before any implementation) - Unit test: when `app.ollama.api-key` is blank, no `Authorization` header is present on the outgoing request — mirrors the `trainingToken` guard in `RestClientOcrClient.java:107` (red before any implementation) - Spring context slice test: bean is NOT registered when `APP_OLLAMA_BASE_URL=` (empty string) and when unset — not just unit tests at the service layer - Define `OllamaClient` interface before `RestClientOllamaClient` implementation, mirroring the `OcrClient` / `RestClientOcrClient` pattern. Unit tests mock the interface. 
- Service injects `Optional<OllamaClient>`; the absent case returns 503 with `NL_SEARCH_UNAVAILABLE` without making a network call. Use `.orElseThrow(() -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "..."))` rather than `@Autowired(required = false)` with a null check, which is noisier in a `@RequiredArgsConstructor` service. The ErrorCode requires the standard four-step addition: `ErrorCode.java` → `errors.ts` → `getErrorMessage()` → i18n keys in de/en/es.

### NL search UI (linked issue required)

The graceful-degradation frontend state must be tracked as a separate Gitea issue (not just this prose note) or explicitly added to the sibling issue's scope. Create and link it before closing this issue.

When the frontend graceful-degradation state is implemented, the message must do three things for the 60+ audience: reassure (the app is working), explain (smart search is temporarily off), and tell users what they are seeing instead (keyword results). Draft i18n messages for all three locales:

- **de:** `"Smart Suche momentan nicht verfügbar — Ergebnisse aus normaler Stichwortsuche."`
- **en:** `"Smart search temporarily unavailable — showing keyword results."`
- **es:** `"Búsqueda inteligente temporalmente no disponible — mostrando resultados de palabras clave."`

### Documentation

- [ ] Update `docs/architecture/c4/l2-containers.puml`: add the Ollama container and the `ollama_models` volume (`l1-context.puml` stays unchanged because Ollama is internal, not an external system)
- [ ] Update `docs/DEPLOYMENT.md`:
  - Env var table rows: `APP_OLLAMA_BASE_URL`, `APP_OLLAMA_API_KEY`, `OLLAMA_CPU_LIMIT`, `OLLAMA_MEM_LIMIT`, `OLLAMA_API_KEY`
  - Hardware requirements row: the NL Search tier requires a CX42 minimum (16 GB RAM, ~32 EUR/month vs. CX32 at ~17 EUR/month)
  - First-start model pull time: ~4 GB (about 32 Gbit), assuming ≥10 Mbps, where the pull can take up to roughly 53 minutes; it must complete within the init container's `start_period`. One-time manual verification on first deploy; record the measured time in the PR description.
  - Memory budget note: on CX42, do not run `docker-compose.observability.yml` continuously while both OCR and Ollama are active
  - `ollama_models` volume: holds model weights only, fully reproducible by re-pull, so no backup is needed. If the volume fills after a model upgrade, run `docker volume rm ollama_models` and re-pull.

## Acceptance Criteria

**Automated checks:**

- `docker compose up -d` starts the Ollama service alongside the existing stack
- On subsequent starts (model already in the `ollama_models` volume), the Ollama service reaches `service_healthy` within 60 seconds
- Re-running `docker compose up -d` when the model is already in the volume does not trigger a re-download, verified via `docker logs` on the init container showing "up to date" or equivalent (not "pulling manifest")
- The main `ollama` service does not start until `ollama-model-init` exits with code 0 (`service_completed_successfully`)
- If `ollama-model-init` exits non-zero (e.g., network outage, disk full, bad model name), `docker compose up` reports a service dependency failure for `ollama`
- `GET http://ollama:11434/api/tags` is reachable from within the backend container (`docker exec archive-backend curl -sf http://ollama:11434/api/tags`)
- Removing `APP_OLLAMA_BASE_URL` from the environment does not break any other service; the backend's `/actuator/health` returns 200 with `APP_OLLAMA_BASE_URL` unset

**Manual verification gates (one-time, record the result in the PR description):**

- On first start, the model is pulled automatically without manual intervention; the pull completes within 10 minutes on a ≥55 Mbps connection (~4 GB is about 32 Gbit, so at 10 Mbps it takes roughly 53 minutes)
- `docker stats` peak RSS during the init container pull stays within `mem_limit: 2g`; if exceeded, raise the limit and note the measured value

**Observability stack required:**

- After deploying the observability stack, the Prometheus target `ollama:11434` appears as `Up`
- The Grafana dashboard shows Ollama inference latency after at least one inference request
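The manual pull-time gate depends heavily on link speed, so a quick estimate helps set expectations before the first deploy. The sketch below assumes decimal units (1 GB = 8,000 Mbit) and a link that stays saturated for the entire download, so real pulls will run somewhat longer. The class name `PullTimeEstimate` is hypothetical and not part of the codebase.

```java
/**
 * Back-of-the-envelope estimate for the one-time model pull.
 * Assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bit/s) and a
 * fully saturated link; protocol overhead and throttling are ignored.
 */
public final class PullTimeEstimate {

    private static final double MEGABITS_PER_GIGABYTE = 8_000.0;

    private PullTimeEstimate() {}

    /** Minutes needed to download {@code gigabytes} at {@code megabitsPerSecond}. */
    public static double minutes(double gigabytes, double megabitsPerSecond) {
        if (gigabytes < 0 || megabitsPerSecond <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and bandwidth > 0");
        }
        double megabits = gigabytes * MEGABITS_PER_GIGABYTE;
        return megabits / megabitsPerSecond / 60.0; // seconds to minutes
    }

    /** Minimum bandwidth in Mbps that downloads {@code gigabytes} within {@code minutes}. */
    public static double requiredMbps(double gigabytes, double minutes) {
        if (gigabytes < 0 || minutes <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and time > 0");
        }
        return gigabytes * MEGABITS_PER_GIGABYTE / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // ~4 GB is the model size stated in the issue for qwen2.5:7b-instruct-q4_K_M
        System.out.printf("4 GB at 10 Mbps: %.1f min%n", minutes(4.0, 10.0));
        System.out.printf("4 GB in 10 min needs: %.1f Mbps%n", requiredMbps(4.0, 10.0));
    }
}
```

Running `main` prints `53.3 min` for 4 GB at 10 Mbps and `53.3 Mbps` as the bandwidth needed to finish in 10 minutes, which is why a 10 Mbps link cannot meet a 10-minute target.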
marcel added this to the Archive Intelligence — NL Search milestone 2026-06-06 12:15:09 +02:00
marcel added the P2-medium, devops, feature labels 2026-06-06 12:16:35 +02:00

## Implementation complete on branch `feat/issue-737-ollama-docker-compose`

All tasks from the issue spec have been implemented and committed, and the sibling issues have been created.

### Commits (9)

| SHA | Description |
|---|---|
| `d3d92931` | `docs(adr): add ADR-028 — Ollama Docker Compose service for NL search` |
| `741ddfa9` | `fix(observability): fix OCR target name + add Ollama scrape job` |
| `94d5b711` | `feat(observability): add Grafana Ollama inference latency dashboard` |
| `604e7883` | `docs(arch): add Ollama container to C4 level-2 container diagram` |
| `9949e8b9` | `docs(deploy): document Ollama hardware requirements, env vars, and ops notes` |
| `8b6e3888` | `feat(infra): add Ollama Docker Compose services for NL search` |
| `ce3f5de6` | `feat(infra): add Ollama env vars to .env.example` |
| `98df26e0` | `docs(adr): update ADR-028 with 0.30.6 verified findings for API key + read_only` |
| `ff588ba5` | `fix(infra): escape $$SERVE_PID in compose command to prevent interpolation` |

### Pre-PR investigations completed (ADR-028)

**`read_only: true`:** ✅ Works on both `0.6.5` and `0.30.6`. `ollama serve`, `ollama pull qwen2.5:7b-instruct-q4_K_M`, and `ollama list` all succeed. Applied to both `ollama` and `ollama-model-init`.

**Peak RSS during pull:** ~108 MiB. `mem_limit: 2g` on `ollama-model-init` is adequate because the model weights stream to the volume rather than being held in RAM.

**`OLLAMA_API_KEY` enforcement:** ❌ Not enforced in either `0.6.5` or `0.30.6`. All requests return 200 regardless of the auth header or key value, and the env var does not appear in Ollama's startup config dump. Network isolation on `archiv-net` is therefore the only effective security control. The variable is retained for forward compatibility and documented in ADR-028 §7.

### Image version

Pinned to `ollama/ollama:0.30.6` (current stable as of 2026-06-06). Renovate will bump it.

### Bug fixed

The OCR scrape target in `prometheus.yml` was `ocr:8000`, but the Docker service DNS name is `ocr-service`, so the target has never resolved. Fixed in `741ddfa9`.
### Sibling issues created

- **#747**: `feat(search): Spring Boot Ollama client with graceful degradation`. Scope: `OllamaClient` interface, `RestClientOllamaClient`, `@ConditionalOnExpression`, `NL_SEARCH_UNAVAILABLE` ErrorCode, `Optional<OllamaClient>` injection, 5 TDD requirements
- **#748**: `feat(search): graceful-degradation UI when NL search is unavailable`. Scope: frontend message (de/en/es), three-part UX spec for the 60+ audience

### Next step

Run `/review-pr` after opening the PR from `feat/issue-737-ollama-docker-compose`.
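The degradation contract handed to #747 (no client at all when `APP_OLLAMA_BASE_URL` is empty or unset, and a 503 `NL_SEARCH_UNAVAILABLE` raised before any network call) can be illustrated in plain Java without Spring. This is a sketch under assumptions: `OllamaClient`, `NlSearchUnavailableException`, and `clientFor` are hypothetical stand-ins for the planned interface, for `DomainException`/`ErrorCode`, and for the `@ConditionalOnExpression` bean condition; the stub client returns a canned string instead of calling Ollama.

```java
import java.util.Optional;

public final class NlSearchDegradationSketch {

    /** Hypothetical stand-in for the OllamaClient interface planned in #747. */
    public interface OllamaClient {
        String generate(String prompt);
    }

    /** Hypothetical stand-in for DomainException carrying an HTTP status and error code. */
    public static final class NlSearchUnavailableException extends RuntimeException {
        public final int httpStatus = 503;
        public final String errorCode = "NL_SEARCH_UNAVAILABLE";

        public NlSearchUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Mirrors the "empty = disabled" convention: an unset or blank base URL
     * yields no client, just as the conditional bean would not be created.
     */
    public static Optional<OllamaClient> clientFor(String baseUrl) {
        if (baseUrl == null || baseUrl.isBlank()) {
            return Optional.empty();
        }
        // Placeholder; a real RestClientOllamaClient would send requests to baseUrl.
        OllamaClient stub = prompt -> "stub answer for: " + prompt;
        return Optional.of(stub);
    }

    private final Optional<OllamaClient> ollama;

    public NlSearchDegradationSketch(Optional<OllamaClient> ollama) {
        this.ollama = ollama;
    }

    /** Fails fast with 503 semantics, before any network call, when the client is absent. */
    public String search(String question) {
        OllamaClient client = ollama.orElseThrow(
                () -> new NlSearchUnavailableException("NL search is not configured"));
        return client.generate(question);
    }

    public static void main(String[] args) {
        var enabled = new NlSearchDegradationSketch(clientFor("http://ollama:11434"));
        System.out.println(enabled.search("Briefe aus 1923"));

        var disabled = new NlSearchDegradationSketch(clientFor(""));
        try {
            disabled.search("Briefe aus 1923");
        } catch (NlSearchUnavailableException e) {
            System.out.println(e.httpStatus + " " + e.errorCode);
        }
    }
}
```

Using `Optional.orElseThrow` keeps the absent case explicit at the call site, which is the same reason the spec prefers it over `@Autowired(required = false)` plus a null check.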
Reference: marcel/familienarchiv#737