merge(search): resolve DEPLOYMENT.md conflict — keep setup + upgrade sections

Both the first-time model pull runbook (from this branch) and the model upgrade procedure (from main) belong in DEPLOYMENT.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 16:47:49 +02:00
parent 4c620619d4 7679596c70
commit 9a9e1c4c40
7 changed files with 599 additions and 8 deletions
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -50,15 +50,17 @@ graph TD

 The OCR service requires significant RAM for model loading. The dev compose sets `mem_limit: 12g`.

-| Production target | RAM | Recommended OCR limit | Notes |
-|---|---|---|---|
-| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Default `mem_limit: 12g` works comfortably |
-| ≥ 16 GB RAM | 16+ GB | 12 GB | Default works |
-| 8 GB RAM | 8 GB | 6 GB | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
-| 4 GB RAM | 4 GB | — | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |
+| Production target | RAM | Recommended OCR limit | NL Search | Notes |
+|---|---|---|---|---|
+| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; plenty of headroom for Ollama |
+| ≥ 16 GB RAM | 16+ GB | 12 GB | Supported | Default works |
+| 8 GB RAM | 8 GB | 6 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
+| 4 GB RAM | 4 GB | — | Unsupported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |

 On servers with less than 16 GB RAM the default `mem_limit: 12g` cannot be honoured — set the `OCR_MEM_LIMIT` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow). The prod compose interpolates this var with a 12g default.

+> **Memory budget:** OCR (~6 GB active) + Ollama (~8 GB) = ~14 GB. On servers with less than 16 GB RAM, do not run `docker-compose.observability.yml` continuously alongside both OCR and Ollama.
+
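The memory budget in the note above can be checked with a small sketch like the following. The function name `check_memory_budget` is hypothetical and not part of the repository; the 6 GB and 8 GB figures are the approximate values quoted in the note.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a server's total RAM with the documented budget.
# The footprints below are the approximate figures from the memory-budget note.
OCR_GB=6      # OCR, active
OLLAMA_GB=8   # Ollama

# Usage: check_memory_budget <total RAM in whole GB>
check_memory_budget() {
  local total_gb=$1
  local needed_gb=$(( OCR_GB + OLLAMA_GB ))
  if (( total_gb > needed_gb )); then
    echo "ok: ${total_gb} GB total, ${needed_gb} GB budgeted"
  else
    echo "tight: ${total_gb} GB total, ${needed_gb} GB budgeted"
  fi
}

check_memory_budget 64   # current server
check_memory_budget 8    # small server: NL search should stay disabled
```

The check only covers OCR and Ollama; it does not account for the backend, the database, or the observability stack, so a result of "ok" on a 16 GB host still leaves little headroom.
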
 ### Dev vs production differences

 | Concern | Dev (`docker-compose.yml`) | Prod (`docker-compose.prod.yml`) |
@@ -145,6 +147,16 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `XDG_CACHE_HOME` | XDG cache base dir — redirects Matplotlib and other XDG-aware libraries away from the read-only `HOME` (`/home/ocr`) to the writable cache volume | `/app/cache` | — | — |
 | `TORCH_HOME` | PyTorch model cache — redirects `~/.cache/torch` to the writable models volume | `/app/models/torch` | — | — |

+### Ollama (NL search) service
+
+| Variable | Purpose | Default | Required? | Sensitive? |
+|---|---|---|---|---|
+| `APP_OLLAMA_BASE_URL` | Base URL for the Ollama service. Leave empty to disable NL search. | `http://ollama:11434` | — | — |
+| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 (see ADR-028). | — | — | YES |
+| `OLLAMA_CPU_LIMIT` | Docker CPU quota for the Ollama container. On a CX42 (8 vCPUs) it can be raised to `7.5`. | `4.0` | — | — |
+| `OLLAMA_MEM_LIMIT` | Memory limit for the Ollama container. Requires CX42 (16 GB RAM). | `8g` | — | — |
+| `OLLAMA_API_KEY` | API key set on the Ollama service itself. Must be the same value as `APP_OLLAMA_API_KEY`. Leave empty for unauthenticated access. | — | — | YES |
+
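As an illustration of the table above, a `.env` fragment that spells out the documented defaults might look like this. It is a sketch, not a copy of `.env.example`; the two API keys are left empty here, which per the table means unauthenticated access.

```shell
# Sketch of the Ollama-related entries in .env (values are the documented defaults).

# Leave empty to disable NL search.
APP_OLLAMA_BASE_URL=http://ollama:11434

# Both keys must carry the same value; empty means unauthenticated access.
APP_OLLAMA_API_KEY=
OLLAMA_API_KEY=

# Container resource limits.
OLLAMA_CPU_LIMIT=4.0
OLLAMA_MEM_LIMIT=8g
```
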
 ### Observability stack (`docker-compose.observability.yml`)

 | Variable | Purpose | Default | Required? | Sensitive? |
@@ -265,6 +277,19 @@ git.raddatz.cloud      A   <server IP>

 ### 3.4 First deploy

+> **First start (Ollama model pull):** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The download is a one-time operation: on subsequent starts the model is already on the `ollama_models` volume and is not fetched again. Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
+>
+> **Do not use `--wait` on first deploy.** `docker compose up -d --wait` waits for all services to reach their health/completion target, including `ollama-model-init`. On first pull this blocks for the full download (60–90 minutes at 10 Mbps) and will time out any CI/deploy script that uses `--wait`.
+>
+> **Re-deploy idempotency:** on subsequent `docker compose up -d` runs (including `--force-recreate`), `ollama-model-init` re-executes but exits in seconds — Ollama's CLI skips the download when the model digest already matches what is on the volume.
+>
+> **Verify NL search is active** after enabling Ollama (`APP_OLLAMA_BASE_URL=http://ollama:11434`):
+> ```bash
+> curl -s 'http://localhost:8080/api/nl-search?q=brief+von+grossmutter'
+> # Returns 200 with results → NL search is active
+> # Returns 503 NL_SEARCH_UNAVAILABLE → Ollama is not reachable or APP_OLLAMA_BASE_URL is unset
+> ```
+
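The pull-time figures quoted above follow from simple arithmetic, sketched below. The helper name `estimate_pull_minutes` is hypothetical; it computes the line-rate lower bound and ignores protocol overhead and registry throttling, which is why real pulls tend toward the upper end of the documented ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate a download time in whole minutes (rounded).
# Usage: estimate_pull_minutes <size in GB> <bandwidth in Mbps>
estimate_pull_minutes() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN {
    seconds = gb * 8 * 1000 / mbps        # GB -> megabits, divided by Mbps
    printf "%d\n", (seconds / 60) + 0.5   # round to the nearest minute
  }'
}

estimate_pull_minutes 4.7 10    # prints 63
estimate_pull_minutes 4.7 100   # prints 6
```
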
 ```bash
 # 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
 #    Expected: docker compose up -d --wait succeeds for archiv-staging, then
@@ -591,6 +616,24 @@ Expected output includes `qwen2.5:7b-instruct-q4_K_M`.
 | `app.ollama.timeout-seconds` | `30` | Read timeout for inference calls |
 | `app.nl-search.rate-limit.max-requests-per-minute` | `5` | Per-user rate limit |

+### Upgrade the Ollama model
+
+To switch to a newer model version (e.g. a future release of `qwen2.5`):
+
+1. Update the model name in the `ollama-model-init` `command:` in `docker-compose.yml`.
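+2. Remove the existing model volume to free the old weights. Docker refuses to remove a volume that any container (running or stopped) still references, so stop and remove the stack's containers first (for example with `docker compose down`):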
+   ```bash
+   docker volume rm familienarchiv_ollama_models
+   ```
+   (In production the volume name is prefixed with the compose project: `archiv-production_ollama_models`.)
+3. Restart the stack:
+   ```bash
+   docker compose up -d
+   ```
+   The `ollama-model-init` container pulls the new model weights on first start (~4–8 GB download depending on the model). The `ollama` inference server will not start until the pull completes (`condition: service_completed_successfully`).
+
+> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed.
+
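The upgrade steps above can be sketched as one script. This is a minimal illustration, not a script shipped in the repository: the `run` and `upgrade_ollama_model` names are hypothetical, and the initial `docker compose down` is an assumption added because Docker does not remove a volume that a container still references. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the model upgrade, assuming the model name in docker-compose.yml
# has already been edited (step 1).
set -eu

DRY_RUN="${DRY_RUN:-1}"
# Dev volume name; in production it is archiv-production_ollama_models.
VOLUME="${VOLUME:-familienarchiv_ollama_models}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_ollama_model() {
  run docker compose down          # assumption: release the volume first
  run docker volume rm "$VOLUME"   # step 2: drop the old weights
  run docker compose up -d         # step 3: ollama-model-init pulls the new model
}

upgrade_ollama_model
```

Setting `DRY_RUN=0` executes the commands for real; `docker compose down` stops the whole stack, so plan for downtime until the new model has been pulled.
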
 ### Trigger a canonical import

 The importer no longer parses the raw spreadsheet. It consumes the **canonical artifacts**