From 884c1156bdc32e05e5425939ce70cc9b90fdf7be Mon Sep 17 00:00:00 2001
From: Marcel
Date: Sun, 7 Jun 2026 16:41:46 +0200
Subject: [PATCH] docs(deployment): replace Ollama with nlp-service in DEPLOYMENT.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- §1: update memory table (nlp-service ~256 MB vs Ollama ~8 GB); update memory budget note; add nlp-service to topology diagram
- §2: replace 'Ollama (NL search) service' env var table with 'NLP service' table (APP_NLP_BASE_URL, NLP_FUZZY_THRESHOLD); add credential-rotation restart note
- §3.4: replace Ollama model-pull first-deploy warning with nlp-service startup note (no download, --wait safe)
- §6: replace Ollama operational section (model pull, ollama list, upgrade guide) with nlp-service health check and tuning guide

Co-Authored-By: Claude Sonnet 4.6
---
 docs/DEPLOYMENT.md | 90 ++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 52 deletions(-)

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index f8523515..e8bc6b67 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -33,6 +33,7 @@ graph TD
     Backend -->|JDBC :5432| DB[(PostgreSQL 16)]
     Backend -->|S3 API :9000| MinIO[(MinIO)]
     Backend -->|HTTP :8000 internal| OCR["OCR Service\nPython FastAPI"]
+    Backend -->|HTTP :8001 internal| NLP["NLP Service\nPython FastAPI"]
     OCR -->|presigned URL| MinIO
     Caddy -->|SSE proxy_pass| Backend
 ```
@@ -40,7 +41,7 @@ graph TD
 **Key facts:**
 - Caddy terminates TLS and reverse-proxies to frontend (`:3000`) and backend (`:8080`). The Caddyfile is committed at [`infra/caddy/Caddyfile`](../infra/caddy/Caddyfile) and is installed on the host as `/etc/caddy/Caddyfile` (symlink).
 - The host binds all docker-published ports to `127.0.0.1` only; Caddy is the sole external entry point.
-- The OCR service has **no published port** — reachable only on the internal Docker network from the backend.
+- The OCR service and NLP service have **no published ports** — reachable only on the internal Docker network from the backend.
 - SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
 - The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
 - Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
@@ -52,14 +53,14 @@ The OCR service requires significant RAM for model loading. The dev compose sets

 | Production target | RAM | Recommended OCR limit | NL Search | Notes |
 |---|---|---|---|---|
-| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; plenty of headroom for Ollama |
+| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; nlp-service adds only ~256 MB |
 | ≥ 16 GB RAM | 16+ GB | 12 GB | Supported | Default works |
-| 8 GB RAM | 8 GB | 6 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
-| 4 GB RAM | 4 GB | — | Unsupported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |
+| 8 GB RAM | 8 GB | 6 GB | Supported | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes; nlp-service is lightweight |
+| 4 GB RAM | 4 GB | — | Supported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only; nlp-service still runs |

 On servers with less than 16 GB RAM the default `mem_limit: 12g` cannot be honoured — set the `OCR_MEM_LIMIT` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow). The prod compose interpolates this var with a 12g default.
-> **Memory budget:** OCR (~6 GB active) + Ollama (~8 GB) = ~14 GB. On servers with less than 16 GB RAM, do not run `docker-compose.observability.yml` continuously alongside both OCR and Ollama.
+> **Memory budget:** OCR (~6 GB active) + nlp-service (~256 MB) = ~6.25 GB. The previous Ollama LLM (~8 GB) has been replaced by the rule-based nlp-service — significant memory headroom freed on all server tiers.

 ### Dev vs production differences

@@ -147,15 +148,18 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `XDG_CACHE_HOME` | XDG cache base dir — redirects Matplotlib and other XDG-aware libraries away from the read-only `HOME` (`/home/ocr`) to the writable cache volume | `/app/cache` | — | — |
 | `TORCH_HOME` | PyTorch model cache — redirects `~/.cache/torch` to the writable models volume | `/app/models/torch` | — | — |

-### Ollama (NL search) service
+### NLP service (NL search)

 | Variable | Purpose | Default | Required? | Sensitive? |
 |---|---|---|---|---|
-| `APP_OLLAMA_BASE_URL` | Base URL for the Ollama service. Leave empty to disable NL search. | `http://ollama:11434` | — | — |
-| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 (see ADR-028). | — | — | YES |
-| `OLLAMA_CPU_LIMIT` | Docker CPU quota for the Ollama container. On CX42 (8 vCPUs) can be raised to `7.5`. | `4.0` | — | — |
-| `OLLAMA_MEM_LIMIT` | Memory limit for the Ollama container. Requires CX42 (16 GB RAM). | `8g` | — | — |
-| `OLLAMA_API_KEY` | API key set on the Ollama service itself. Same value as `APP_OLLAMA_API_KEY`. Leave empty for unauthenticated. | — | — | YES |
+| `APP_NLP_BASE_URL` | Internal URL of the nlp-service container. Wired automatically in compose via `http://nlp-service:8001`. | `http://nlp-service:8001` | YES | — |
+| `NLP_FUZZY_THRESHOLD` | Rapidfuzz similarity floor for person-name matching (0–100). Lower values match more aggressively; raise if false positives appear. | `80` | — | — |
+
+The nlp-service reads `DATABASE_URL` at startup (composed from `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`). Any credential rotation that touches those three vars must be followed by a restart of **both** `backend` and `nlp-service`:
+
+```bash
+docker compose restart nlp-service backend
+```

 ### Observability stack (`docker-compose.observability.yml`)

@@ -277,17 +281,13 @@ git.raddatz.cloud A

 ### 3.4 First deploy

-> **First start — Ollama model pull:** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The pull is a one-time operation — subsequent restarts skip it (model already on the `ollama_models` volume). Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
+> **NL search startup:** `nlp-service` loads person names from the database at startup (single query, ~1–2 s). No model weights to download. The backend waits for `nlp-service` to pass its healthcheck (`/health` returns `{"status":"ok","persons_loaded":N}`) before starting, so `docker compose up -d --wait` is safe to use on first deploy.
 >
-> **Do not use `--wait` on first deploy** — `docker compose up -d --wait` waits for all services to reach their health/completion target, including `ollama-model-init`. On first pull this blocks for 60–90 minutes and will time out any CI/deploy script that uses `--wait`.
 >
-> **Re-deploy idempotency:** on subsequent `docker compose up -d` runs (including `--force-recreate`), `ollama-model-init` re-executes but exits in seconds — Ollama's CLI skips the download when the model digest already matches what is on the volume.
->
-> **Verify NL search is active** after enabling Ollama (`APP_OLLAMA_BASE_URL=http://ollama:11434`):
+> **Verify NL search is active:**
 > ```bash
-> curl -s http://localhost:8080/api/nl-search?q=brief+von+grossmutter
-> # Returns 200 with results → NL search is active
-> # Returns 503 NL_SEARCH_UNAVAILABLE → Ollama is not reachable or APP_OLLAMA_BASE_URL is unset
+> curl -s http://localhost:8001/health
+> # Returns {"status":"ok","persons_loaded":N} with N > 0 → person matching enabled
+> # Returns {"status":"ok","persons_loaded":0} → DB not reachable or persons table empty
 > ```

 ```bash
@@ -328,7 +328,7 @@ docker compose logs --follow
 # Single snapshot
 docker compose logs --tail=200

-# services: frontend, backend, db, minio, ocr-service
+# services: frontend, backend, db, minio, ocr-service, nlp-service
 ```

 ### Log locations
@@ -585,54 +585,40 @@ bash scripts/download-kraken-models.sh

 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.

-### Ollama — natural-language search (NL Search)
+### NLP service — natural-language search (NL Search)

-NL search uses a local Ollama instance for query parsing. The `ollama` service is defined in `docker-compose.yml` alongside the main stack.
+NL search uses the rule-based `nlp-service` FastAPI container for query parsing. It has no model weights — it loads person names from the database at startup and applies regex + fuzzy matching. See ADR-035.

-**First-time model pull** (required before the feature works):
+**Health check:**

 ```bash
-docker compose exec ollama ollama pull qwen2.5:7b-instruct-q4_K_M
+curl -s http://localhost:8001/health
+# {"status":"ok","persons_loaded":1247}
 ```

-This downloads ~4.4 GB. The model is stored in the `ollama_data` Docker volume and persists across container restarts.
+`persons_loaded: 0` means the service started but could not reach the database (check `DATABASE_URL` and that `db` is healthy).
-**Verify the model is available:**
+If `POST /api/search/nl` returns HTTP 503 `SMART_SEARCH_UNAVAILABLE`, the backend cannot reach `nlp-service`. Check with:

 ```bash
-docker compose exec ollama ollama list
+docker compose logs nlp-service --tail=50
+docker compose ps nlp-service
 ```

-Expected output includes `qwen2.5:7b-instruct-q4_K_M`.
-
-**Health check** — the backend polls `GET /api/tags` on Ollama at startup and before inference. If Ollama is absent, `POST /api/search/nl` returns HTTP 503 with `SMART_SEARCH_UNAVAILABLE`.
-
-**Configuration** (see `application.yaml` under `app.ollama`):
+**Configuration** (see `application.yaml` under `app.nlp`):

 | Property | Default | Description |
 |---|---|---|
-| `app.ollama.base-url` | `http://ollama:11434` | Ollama service URL (dev: `http://localhost:11434`) |
-| `app.ollama.model` | `qwen2.5:7b-instruct-q4_K_M` | Model to use for inference |
-| `app.ollama.timeout-seconds` | `60` | Read timeout for inference calls (absorbs cold model load on the first query after an Ollama restart) |
-| `app.nl-search.rate-limit.max-requests-per-minute` | `5` | Per-user rate limit |
+| `app.nlp.base-url` | `http://nlp-service:8001` | nlp-service URL; set via `APP_NLP_BASE_URL` env var |
+| `app.nl-search.rate-limit.max-requests-per-minute` | `20` | Per-user rate limit |

-### Upgrade the Ollama model
+**Tuning person matching:**

-To switch to a newer model version (e.g. a future release of `qwen2.5`):
+Set `NLP_FUZZY_THRESHOLD` in `.env` (default: `80`, range: `0–100`). Lower values match more aggressively at the cost of false positives. Restart nlp-service after changing:

-1. Update the model name in the `ollama-model-init` `command:` in `docker-compose.yml`.
-2. Remove the existing model volume to free the old weights:
-   ```bash
-   docker volume rm familienarchiv_ollama_models
-   ```
-   (In production the volume name is prefixed with the compose project: `archiv-production_ollama-models`.)
-3. Restart the stack:
-   ```bash
-   docker compose up -d
-   ```
-   The `ollama-model-init` container pulls the new model weights on first start (~4–8 GB download depending on the model). The `ollama` inference server will not start until the pull completes (`condition: service_completed_successfully`).
-
-> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed.
+```bash
+docker compose restart nlp-service
+```

 ### Trigger a canonical import