From 6f5497c7bf12b6fcfd22ecd5bf3d4db4b9edd879 Mon Sep 17 00:00:00 2001 From: Marcel Date: Sat, 6 Jun 2026 15:31:53 +0200 Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR-028=20=E2=80=94=20NL=20search?= =?UTF-8?q?=20via=20Ollama?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- docs/adr/028-nl-search-ollama.md | 65 ++++++++++++++++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 docs/adr/028-nl-search-ollama.md diff --git a/docs/adr/028-nl-search-ollama.md b/docs/adr/028-nl-search-ollama.md new file mode 100644 index 00000000..41459992 --- /dev/null +++ b/docs/adr/028-nl-search-ollama.md @@ -0,0 +1,65 @@ +# ADR-028 — Natural language search is powered by Ollama (Qwen 2.5 7B), not a cloud API + +**Date:** 2026-06-06 +**Status:** Accepted +**Issue:** #738 (NL search backend); part of epic #735 +**Milestone:** Archive Intelligence — NL Search + +--- + +## Context + +Family members write their search intent in plain German ("Was hat Walter im Krieg an Emma geschrieben?"), not in structured filter forms. Issue #735 defines NL search as a core product goal. Three delivery options were evaluated: + +**Option A — extend the OCR service.** The OCR Python microservice already runs on the same host. Adding LLM inference there avoids a new container. Rejected: the OCR service is a single-purpose, CPU-bound pipeline optimised for Kraken; bundling a 4.5 GB LLM weight into the same image would bloat it, complicate model lifecycle management, and create an unrelated failure domain (OOM on large OCR batches vs. LLM load time). ADR-001 was explicit about keeping OCR single-purpose. + +**Option B — call an external API (OpenAI, Anthropic, etc.).** Cloud inference is instant and requires no local hardware. Rejected: the archive contains real person names and private family correspondence from 1899–1950 — sending query content to a third party violates the project's data-residency principle (family data stays on the family server). Additionally, API cost and availability are outside the operator's control; the system must work air-gapped. + +**Option C — local Ollama service (chosen).** Ollama is a purpose-built LLM runtime with a simple REST API, model lifecycle management (`ollama pull`), and support for grammar-constrained JSON output. It runs entirely on the existing server (i7-6700, 64 GB RAM) with no cloud dependency. + +**Model selection:** Qwen 2.5 7B Q4_K_M (`qwen2.5:7b-instruct-q4_K_M`) was chosen over larger models because: +- Quantised weight is ~4.5 GB — fits comfortably in 64 GB RAM alongside PostgreSQL and the JVM. +- Instruction-tuned variant follows the structured JSON schema reliably without fine-tuning. +- CPU-only inference at Q4_K_M takes 2–15 seconds per query, acceptable for a search that replaces a multi-step filter form. + +**Prompt injection mitigation:** The backend sends the raw user query to Ollama. To prevent the model from being prompted to return schema-breaking output, the API call uses Ollama's `format` parameter with a grammar-constrained JSON schema. Output length is further bounded by `maxLength` constraints in the schema (names ≤ 200 chars, keywords ≤ 100 chars). `NlQueryParserService` enforces these limits in code before any LLM-extracted fragment is passed to `PersonRepository.searchByName()` — defence in depth. + +**DB-blind name resolution:** The Ollama prompt stays small (the raw query only); person database records are never sent to the model. Name resolution happens as a cheap SQL query after the model returns. This keeps the prompt short, avoids data leakage, and means adding 1,000 new persons requires no prompt change. + +**Graceful degradation:** `RestClientOllamaClient.isHealthy()` is called inline before each inference request (calls `GET /api/tags` on a 2-second connect-timeout client). If Ollama is absent or times out, `NlQueryParserService` throws `DomainException` with `SMART_SEARCH_UNAVAILABLE` (HTTP 503). The regular structured search (`GET /api/documents/search`) is unaffected — it never calls Ollama. + +**Expected inference latency:** 2–15 seconds on the current CPU-only hardware. The frontend issue must show a persistent "Suche läuft…" indicator for the full duration (see `aria-live="polite"` requirement in issue #738 frontend notes). The backend timeout is 30 seconds (`app.ollama.timeout-seconds=30`) — chosen as a safe upper bound for Q4_K_M on the i7-6700 with a realistic 500-character query under modest concurrent load. + +**NL query logging policy:** Only metadata is logged — query length, resolved person count, latency in milliseconds. The raw query is never written to the log file. Rationale: queries contain real family names (PII); log files persist to disk and may be shipped to Loki. Structured metadata is sufficient for debugging latency regressions. + +**Prompt-amplification abuse:** A malicious user could submit a long or crafted query to cause slow Ollama inference, consuming CPU. Mitigated by `NlSearchRateLimiter` (5 requests per user per minute, Bucket4j + Caffeine) and by `@Size(max=500)` on the request body. The rate limiter is node-local; in multi-replica deployments the effective limit multiplies by replica count — acceptable at the current single-node deployment scale. + +**Ollama model pre-pull requirement:** The Docker image contains only the Ollama binary, not the model weights. The operator must run `ollama pull qwen2.5:7b-instruct-q4_K_M` (≈4.5 GB download, 10–30 minutes) before the backend starts inference. If skipped, every NL search request returns 503 until the pull completes. The deployment runbook in `docs/DEPLOYMENT.md` covers this explicitly. + +**Startup dependency:** The `backend` Compose service declares `depends_on: ollama: condition: service_healthy`. The Ollama healthcheck polls `GET http://localhost:11434/api/tags`; `start_period: 120s` provides margin for weight loading (20–60 s on SSD). Note: `service_healthy` confirms the API is responding, not that the model is downloaded — if the pull was skipped, inference still returns 404. + +**Multi-name resolution heuristic:** For 2-name queries (e.g. "Was hat Walter an Emma geschrieben?"), the first extracted name is treated as sender and the second as receiver. Per-name role annotation (e.g. `{name: "Walter", role: "sender"}`) was rejected because it would require a combinatorially complex Ollama schema and the most natural German phrasing strongly implies sender→receiver order. For single-name queries, a `personRole` field (`sender`/`receiver`/`any`) is returned. + +**`personRole: "any"` keyword limitation:** When `personRole` is `"any"` and the name resolves to exactly one person, `DocumentService.searchDocumentsByPersonId()` is called (OR semantics: person as sender or receiver). Keyword filtering is not applied on this path — only person identity and date range. `keywordsApplied = false` is returned in the response. Rationale: the JPQL for OR-semantics person queries has no text predicate; adding FTS would require a native query or a separate pass, adding complexity for a case that is already well-narrowed by person identity. + +**`search/` → `person/` + `document/` dependency direction:** `NlQueryParserService` calls `PersonService.findByDisplayNameContaining()` and `DocumentService.searchDocuments()` — both are legitimate cross-domain service calls, not repository leaks. The `search/` package has no JPA entities of its own and never accesses `PersonRepository` or `DocumentRepository` directly. + +## Decision + +**Introduce a new `search/` domain package** with a local Ollama integration via `RestClientOllamaClient`. The Ollama service runs as a separate Docker container, reachable only on the internal Docker network (`expose: ["11434"]`, not `ports:`). The backend calls Ollama's `/api/generate` endpoint with grammar-constrained JSON output. Name resolution and document search are performed by existing services after the model returns. + +Key component structure: +- `OllamaClient` / `OllamaHealthClient` interfaces — mockable for tests, modelled on `OcrClient`/`OcrHealthClient` +- `RestClientOllamaClient` — two `RestClient` instances (30 s inference, 2 s health-check) +- `NlQueryParserService` — orchestrates Ollama → name resolution → document search +- `NlSearchRateLimiter` — Bucket4j + Caffeine, 5 req/min per user +- `NlSearchController` — `POST /api/search/nl`, `@RequirePermission(READ_ALL)` + +## Consequences + +- Family members can query in natural German without learning filter UI. Expected search satisfaction improvement for the 60+ age cohort (primary transcription audience) is significant. +- NL search is unavailable when Ollama is down or the model pull is not complete. The regular search is unaffected. The 503 response includes a CTA directing users to the regular search. +- Operator responsibility: run `ollama pull` on first deploy and after model updates. The backup runbook must exclude `ollama_models` volume (model weights are re-downloadable, not user data). +- Inference takes 2–15 seconds. The frontend loading indicator is a hard requirement (see issue #738 frontend notes). +- The rate limiter is node-local. At the current single-node deployment scale this is correct. If the service is ever scaled horizontally, the rate limiter must be moved to Redis (same caveat as `LoginRateLimiter`). +- The `search/` package introduces a new cross-domain dependency direction (`search` → `person`, `search` → `document`). This is intentional and documented in `docs/architecture/c4/l3-backend-search.puml`.