docs(deployment): replace Ollama with nlp-service in DEPLOYMENT.md

- §1: update memory table (nlp-service ~256 MB vs Ollama ~8 GB); update memory budget note; add nlp-service to topology diagram
- §2: replace 'Ollama (NL search) service' env var table with 'NLP service' table (APP_NLP_BASE_URL, NLP_FUZZY_THRESHOLD); add credential-rotation restart note
- §3.4: replace Ollama model-pull first-deploy warning with nlp-service startup note (no download, --wait safe)
- §6: replace Ollama operational section (model pull, ollama list, upgrade guide) with nlp-service health check and tuning guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 16:41:46 +02:00
parent 960f1c171a
commit 884c1156bd
1 changed files with 38 additions and 52 deletions
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -33,6 +33,7 @@ graph TD
    Backend -->|JDBC :5432| DB[(PostgreSQL 16)]
    Backend -->|S3 API :9000| MinIO[(MinIO)]
    Backend -->|HTTP :8000 internal| OCR["OCR Service\nPython FastAPI"]
+    Backend -->|HTTP :8001 internal| NLP["NLP Service\nPython FastAPI"]
    OCR -->|presigned URL| MinIO
    Caddy -->|SSE proxy_pass| Backend
 ```
@@ -40,7 +41,7 @@ graph TD
 **Key facts:**
 - Caddy terminates TLS and reverse-proxies to frontend (`:3000`) and backend (`:8080`). The Caddyfile is committed at [`infra/caddy/Caddyfile`](../infra/caddy/Caddyfile) and is installed on the host as `/etc/caddy/Caddyfile` (symlink).
 - The host binds all docker-published ports to `127.0.0.1` only; Caddy is the sole external entry point.
- The OCR service has **no published port** — reachable only on the internal Docker network from the backend.
+- The OCR service and NLP service have **no published ports** — reachable only on the internal Docker network from the backend.
 - SSE notifications transit Caddy (browser → Caddy → backend); the backend is never reachable directly from the public internet. The SvelteKit SSR layer is bypassed for SSE, but Caddy is not.
 - The Caddyfile responds `404` on `/actuator/*` (defense in depth). Internal monitoring scrapes the backend on the docker network, not through Caddy.
 - Production and staging cohabit on the same host via docker compose project names: `archiv-production` (ports 8080/3000) and `archiv-staging` (ports 8081/3001).
@@ -52,14 +53,14 @@ The OCR service requires significant RAM for model loading. The dev compose sets

 | Production target | RAM | Recommended OCR limit | NL Search | Notes |
 |---|---|---|---|---|
-| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; plenty of headroom for Ollama |
+| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; nlp-service adds only ~256 MB |
 | ≥ 16 GB RAM | 16+ GB | 12 GB | Supported | Default works |
-| 8 GB RAM | 8 GB | 6 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
-| 4 GB RAM | 4 GB | — | Unsupported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |
+| 8 GB RAM | 8 GB | 6 GB | Supported | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes; nlp-service is lightweight |
+| 4 GB RAM | 4 GB | — | Supported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only; nlp-service still runs |

 On servers with less than 16 GB RAM the default `mem_limit: 12g` cannot be honoured — set the `OCR_MEM_LIMIT` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow). The prod compose interpolates this var with a 12g default.

-> **Memory budget:** OCR (~6 GB active) + Ollama (~8 GB) = ~14 GB. On servers with less than 16 GB RAM, do not run `docker-compose.observability.yml` continuously alongside both OCR and Ollama.
+> **Memory budget:** OCR (~6 GB active) + nlp-service (~256 MB) = ~6.25 GB. The rule-based nlp-service replaces the previous Ollama LLM (~8 GB), which frees close to 8 GB of memory on every server tier.
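
The budget figures in this note can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are illustrative, and the footprints are the approximate values quoted in the diff, not measurements.

```python
# Approximate footprints quoted in the memory-budget note (illustrative names).
OCR_ACTIVE_MB = 6 * 1024   # OCR service, ~6 GB while active
NLP_SERVICE_MB = 256       # nlp-service, ~256 MB
OLLAMA_MB = 8 * 1024       # previous Ollama container, ~8 GB


def budget_gb(*components_mb: int) -> float:
    """Sum component footprints given in MB and return the total in GB."""
    return sum(components_mb) / 1024


new_budget = budget_gb(OCR_ACTIVE_MB, NLP_SERVICE_MB)
old_budget = budget_gb(OCR_ACTIVE_MB, OLLAMA_MB)
freed = old_budget - new_budget

print(f"new: {new_budget:.2f} GB, old: {old_budget:.2f} GB, freed: {freed:.2f} GB")
```

On the quoted numbers the new budget is 6.25 GB against 14 GB before.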

 ### Dev vs production differences

@@ -147,15 +148,18 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `XDG_CACHE_HOME` | XDG cache base dir — redirects Matplotlib and other XDG-aware libraries away from the read-only `HOME` (`/home/ocr`) to the writable cache volume | `/app/cache` | — | — |
 | `TORCH_HOME` | PyTorch model cache — redirects `~/.cache/torch` to the writable models volume | `/app/models/torch` | — | — |

-### Ollama (NL search) service
+### NLP service (NL search)

 | Variable | Purpose | Default | Required? | Sensitive? |
 |---|---|---|---|---|
-| `APP_OLLAMA_BASE_URL` | Base URL for the Ollama service. Leave empty to disable NL search. | `http://ollama:11434` | — | — |
-| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 (see ADR-028). | — | — | YES |
-| `OLLAMA_CPU_LIMIT` | Docker CPU quota for the Ollama container. On CX42 (8 vCPUs) can be raised to `7.5`. | `4.0` | — | — |
-| `OLLAMA_MEM_LIMIT` | Memory limit for the Ollama container. Requires CX42 (16 GB RAM). | `8g` | — | — |
-| `OLLAMA_API_KEY` | API key set on the Ollama service itself. Same value as `APP_OLLAMA_API_KEY`. Leave empty for unauthenticated. | — | — | YES |
+| `APP_NLP_BASE_URL` | Internal URL of the nlp-service container. Wired automatically in compose via `http://nlp-service:8001`. | `http://nlp-service:8001` | YES | — |
+| `NLP_FUZZY_THRESHOLD` | Rapidfuzz similarity floor for person-name matching (0–100). Lower values match more aggressively; raise if false positives appear. | `80` | — | — |
+
+The nlp-service reads `DATABASE_URL` at startup (composed from `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`). Any credential rotation that touches those three vars must be followed by a restart of **both** `backend` and `nlp-service`:
+
+```bash
+docker compose restart nlp-service backend
+```
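
The restart is needed because a URL composed from the three `POSTGRES_*` variables is fixed at process start, so a running container keeps the old credentials. A minimal sketch of such a composition follows. The `build_database_url` helper, the `postgresql://` scheme, the `db` host name and port `5432` are assumptions for illustration; the actual wiring lives in the compose file and is not shown in this diff.

```python
from urllib.parse import quote


def build_database_url(env: dict, host: str = "db", port: int = 5432) -> str:
    """Compose a PostgreSQL URL from the three POSTGRES_* variables.

    Hypothetical helper: scheme, host and port are assumed, not taken
    from the real compose file.
    """
    # Percent-encode each part so characters such as '@' or '/' in a
    # rotated password cannot break the URL structure.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example values only; real credentials come from .env.
example_env = {
    "POSTGRES_USER": "archiv",
    "POSTGRES_PASSWORD": "p@ss/word",
    "POSTGRES_DB": "archiv",
}
print(build_database_url(example_env))
```

Changing any of the three inputs yields a different URL, but a process that has already read the old value only sees the new one after it is restarted.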

 ### Observability stack (`docker-compose.observability.yml`)

@@ -277,17 +281,13 @@ git.raddatz.cloud      A   <server IP>

 ### 3.4 First deploy

-> **First start — Ollama model pull:** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The pull is a one-time operation — subsequent restarts skip it (model already on the `ollama_models` volume). Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
+> **NL search startup:** `nlp-service` loads person names from the database at startup (single query, ~1–2 s). No model weights to download. The backend waits for `nlp-service` to pass its healthcheck (`/health` returns `{"status":"ok","persons_loaded":N}`) before starting, so `docker compose up -d --wait` is safe to use on first deploy.
 >
-> **Do not use `--wait` on first deploy** — `docker compose up -d --wait` waits for all services to reach their health/completion target, including `ollama-model-init`. On first pull this blocks for 60–90 minutes and will time out any CI/deploy script that uses `--wait`.
->
-> **Re-deploy idempotency:** on subsequent `docker compose up -d` runs (including `--force-recreate`), `ollama-model-init` re-executes but exits in seconds — Ollama's CLI skips the download when the model digest already matches what is on the volume.
->
-> **Verify NL search is active** after enabling Ollama (`APP_OLLAMA_BASE_URL=http://ollama:11434`):
+> **Verify NL search is active:**
 > ```bash
-> curl -s http://localhost:8080/api/nl-search?q=brief+von+grossmutter
-> # Returns 200 with results → NL search is active
-> # Returns 503 NL_SEARCH_UNAVAILABLE → Ollama is not reachable or APP_OLLAMA_BASE_URL is unset
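+> # Where nlp-service has no published host port, run this inside its container (e.g. via `docker compose exec nlp-service`, assuming curl is installed there)
+> curl -s http://localhost:8001/health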
+> # Returns {"status":"ok","persons_loaded":N} with N > 0 → person matching enabled
+> # Returns {"status":"ok","persons_loaded":0} → DB not reachable or persons table empty
 > ```

 ```bash
@@ -328,7 +328,7 @@ docker compose logs --follow

 # Single snapshot
 docker compose logs --tail=200 <service>
-# services: frontend, backend, db, minio, ocr-service
+# services: frontend, backend, db, minio, ocr-service, nlp-service
 ```

 ### Log locations
@@ -585,54 +585,40 @@ bash scripts/download-kraken-models.sh

 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.

-### Ollama — natural-language search (NL Search)
+### NLP service — natural-language search (NL Search)

-NL search uses a local Ollama instance for query parsing. The `ollama` service is defined in `docker-compose.yml` alongside the main stack.
+NL search uses the rule-based `nlp-service` FastAPI container for query parsing. It has no model weights — it loads person names from the database at startup and applies regex + fuzzy matching. See ADR-035.

-**First-time model pull** (required before the feature works):
+**Health check:**

 ```bash
-docker compose exec ollama ollama pull qwen2.5:7b-instruct-q4_K_M
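+# Where nlp-service has no published host port, run this inside its container (assumes curl is installed there)
+curl -s http://localhost:8001/health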
+# {"status":"ok","persons_loaded":1247}
 ```

-This downloads ~4.4 GB. The model is stored in the `ollama_data` Docker volume and persists across container restarts.
+`persons_loaded: 0` means the service started but loaded no person names: either it could not reach the database (check `DATABASE_URL` and that `db` is healthy) or the persons table is empty.
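
For scripted checks, the `/health` payload can be classified in a few lines. This sketch assumes only the two fields shown in the example response (`status` and `persons_loaded`); the `interpret_health` function and its return strings are hypothetical, not part of nlp-service.

```python
import json


def interpret_health(body: str) -> str:
    """Classify a /health response body from nlp-service.

    Hypothetical helper: relies only on the `status` and
    `persons_loaded` fields shown in the documented response.
    """
    payload = json.loads(body)
    if payload.get("status") != "ok":
        return "unhealthy"
    if int(payload.get("persons_loaded", 0)) > 0:
        return "person matching enabled"
    # Service is up, but no names were loaded at startup.
    return "no persons loaded"


print(interpret_health('{"status":"ok","persons_loaded":1247}'))
print(interpret_health('{"status":"ok","persons_loaded":0}'))
```

A deploy script could treat "no persons loaded" as a warning rather than a failure, since the service itself is still up.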

-**Verify the model is available:**
+If `POST /api/search/nl` returns HTTP 503 `SMART_SEARCH_UNAVAILABLE`, the backend cannot reach `nlp-service`. Check with:

 ```bash
-docker compose exec ollama ollama list
+docker compose logs nlp-service --tail=50
+docker compose ps nlp-service
 ```

-Expected output includes `qwen2.5:7b-instruct-q4_K_M`.
-
-**Health check** — the backend polls `GET /api/tags` on Ollama at startup and before inference. If Ollama is absent, `POST /api/search/nl` returns HTTP 503 with `SMART_SEARCH_UNAVAILABLE`.
-
-**Configuration** (see `application.yaml` under `app.ollama`):
+**Configuration** (see `application.yaml` under `app.nlp`):

 | Property | Default | Description |
 |---|---|---|
-| `app.ollama.base-url` | `http://ollama:11434` | Ollama service URL (dev: `http://localhost:11434`) |
-| `app.ollama.model` | `qwen2.5:7b-instruct-q4_K_M` | Model to use for inference |
-| `app.ollama.timeout-seconds` | `60` | Read timeout for inference calls (absorbs cold model load on the first query after an Ollama restart) |
-| `app.nl-search.rate-limit.max-requests-per-minute` | `5` | Per-user rate limit |
+| `app.nlp.base-url` | `http://nlp-service:8001` | nlp-service URL; set via `APP_NLP_BASE_URL` env var |
+| `app.nl-search.rate-limit.max-requests-per-minute` | `20` | Per-user rate limit |

-### Upgrade the Ollama model
+**Tuning person matching:**

-To switch to a newer model version (e.g. a future release of `qwen2.5`):
+Set `NLP_FUZZY_THRESHOLD` in `.env` (default: `80`, range: `0–100`). Lower values match more aggressively at the cost of more false positives; raise the value if false positives appear. Restart `nlp-service` after changing it:

-1. Update the model name in the `ollama-model-init` `command:` in `docker-compose.yml`.
-2. Remove the existing model volume to free the old weights:
-   ```bash
-   docker volume rm familienarchiv_ollama_models
-   ```
-   (In production the volume name is prefixed with the compose project: `archiv-production_ollama-models`.)
-3. Restart the stack:
-   ```bash
-   docker compose up -d
-   ```
-   The `ollama-model-init` container pulls the new model weights on first start (~4–8 GB download depending on the model). The `ollama` inference server will not start until the pull completes (`condition: service_completed_successfully`).
-
-> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed.
+```bash
+docker compose restart nlp-service
+```
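
To see how a similarity floor behaves, the sketch below filters candidate names against a threshold on a 0–100 scale. It uses the standard library's `difflib.SequenceMatcher` as a stand-in, so the scores will not be identical to what RapidFuzz produces inside nlp-service. The function names and the sample person names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stand-in, not RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_persons(query: str, names: list[str], threshold: float = 80.0) -> list[str]:
    """Keep the names whose score reaches the threshold."""
    return [name for name in names if similarity(query, name) >= threshold]


# Hypothetical names; the real list is loaded from the database.
names = ["Anna Schmidt", "Anne Schmitt", "Hans Meier"]

loose = match_persons("Anna Schmidt", names, threshold=80)
strict = match_persons("Anna Schmidt", names, threshold=95)
print(loose)
print(strict)
```

With these sample names, the near-miss spelling passes at 80 but is rejected at 95, which is the trade-off the threshold controls.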

 ### Trigger a canonical import