ADR-028 — Natural language search is powered by Ollama (Qwen 2.5 7B), not a cloud API
Date: 2026-06-06
Status: Accepted
Issue: #738 (NL search backend); part of epic #735
Milestone: Archive Intelligence — NL Search
Context
Family members write their search intent in plain German ("Was hat Walter im Krieg an Emma geschrieben?", i.e. "What did Walter write to Emma during the war?"), not in structured filter forms. Issue #735 defines NL search as a core product goal. Three delivery options were evaluated:
Option A — extend the OCR service. The OCR Python microservice already runs on the same host. Adding LLM inference there avoids a new container. Rejected: the OCR service is a single-purpose, CPU-bound pipeline optimised for Kraken; bundling a 4.5 GB LLM weight into the same image would bloat it, complicate model lifecycle management, and create an unrelated failure domain (OOM on large OCR batches vs. LLM load time). ADR-001 was explicit about keeping OCR single-purpose.
Option B — call an external API (OpenAI, Anthropic, etc.). Cloud inference is instant and requires no local hardware. Rejected: the archive contains real person names and private family correspondence from 1899–1950 — sending query content to a third party violates the project's data-residency principle (family data stays on the family server). Additionally, API cost and availability are outside the operator's control; the system must work air-gapped.
Option C — local Ollama service (chosen). Ollama is a purpose-built LLM runtime with a simple REST API, model lifecycle management (ollama pull), and support for grammar-constrained JSON output. It runs entirely on the existing server (i7-6700, 64 GB RAM) with no cloud dependency.
Model selection: Qwen 2.5 7B Q4_K_M (qwen2.5:7b-instruct-q4_K_M) was chosen over larger models because:
- Quantised weight is ~4.5 GB — fits comfortably in 64 GB RAM alongside PostgreSQL and the JVM.
- Instruction-tuned variant follows the structured JSON schema reliably without fine-tuning.
- CPU-only inference at Q4_K_M takes 2–15 seconds per query, acceptable for a search that replaces a multi-step filter form.
Prompt injection mitigation: The backend sends the raw user query to Ollama. To prevent the model from being prompted to return schema-breaking output, the API call uses Ollama's format parameter with a grammar-constrained JSON schema. Output length is further bounded by maxLength constraints in the schema (names ≤ 200 chars, keywords ≤ 100 chars). NlQueryParserService enforces these limits in code before any LLM-extracted fragment is passed to PersonRepository.searchByName() — defence in depth.
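The in-code half of this defence could look roughly like the sketch below. The class and method names are hypothetical and are not taken from NlQueryParserService; only the two limits (200 and 100 characters) come from the text. The text does not say whether over-long fragments are truncated or rejected, so dropping them here is an assumption.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper; not the project's actual NlQueryParserService code.
final class FragmentLimits {
    static final int MAX_NAME_LENGTH = 200;    // limit for extracted names, from the text
    static final int MAX_KEYWORD_LENGTH = 100; // limit for extracted keywords, from the text

    private FragmentLimits() {}

    // Keeps only non-blank fragments within maxLength.
    // Over-long fragments are dropped rather than truncated (assumption).
    static List<String> sanitize(List<String> fragments, int maxLength) {
        if (fragments == null) {
            return List.of();
        }
        return fragments.stream()
                .filter(Objects::nonNull)
                .map(String::strip)
                .filter(s -> !s.isEmpty() && s.length() <= maxLength)
                .toList();
    }
}
```

Running the same check after the model returns means a schema bypass or an Ollama version that ignores maxLength still cannot push an oversized string into the name lookup.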
DB-blind name resolution: The Ollama prompt stays small (the raw query only); person database records are never sent to the model. Name resolution happens as a cheap SQL query after the model returns. This keeps the prompt short, avoids data leakage, and means adding 1,000 new persons requires no prompt change.
Graceful degradation: In-path Ollama failures surface via OllamaClient.parse() — any IOException, read timeout, or non-2xx response is caught by RestClientOllamaClient and re-thrown as DomainException(SMART_SEARCH_UNAVAILABLE, HTTP 503). isHealthy() has no callers inside search/; it is reserved for the ops/health-endpoint polling path only (e.g. a future /api/health/ollama endpoint). The regular structured search (GET /api/documents/search) is unaffected — it never calls Ollama.
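As a rough illustration of that failure mapping, the sketch below wraps a call and converts I/O failures and non-2xx statuses into a 503 domain error. DomainException here is a simplified stand-in whose constructor is an assumption, and OllamaCall and OllamaFailureMapper are hypothetical names; the real RestClientOllamaClient is built on Spring's RestClient and will differ.

```java
import java.io.IOException;

// Simplified stand-in for the project's DomainException; the constructor shape is an assumption.
final class DomainException extends RuntimeException {
    final String code;
    final int httpStatus;

    DomainException(String code, int httpStatus, Throwable cause) {
        super(code, cause);
        this.code = code;
        this.httpStatus = httpStatus;
    }
}

// Hypothetical functional interface for one blocking Ollama call.
interface OllamaCall<T> {
    T execute() throws IOException;
}

final class OllamaFailureMapper {
    private OllamaFailureMapper() {}

    // Runs the call; any IOException (read timeouts included) becomes a 503 domain error.
    static <T> T callOrUnavailable(OllamaCall<T> call) {
        try {
            return call.execute();
        } catch (IOException e) {
            throw new DomainException("SMART_SEARCH_UNAVAILABLE", 503, e);
        }
    }

    // Treats every non-2xx status from Ollama (404 for a missing model included) as unavailable.
    static void requireSuccess(int status) {
        if (status < 200 || status >= 300) {
            throw new DomainException("SMART_SEARCH_UNAVAILABLE", 503, null);
        }
    }
}
```

Funnelling every failure into one error code keeps the controller free of transport details and gives the frontend a single condition to handle.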
Expected inference latency: 2–15 seconds on the current CPU-only hardware. The frontend must show a persistent "Suche läuft…" ("Search in progress…") indicator for the full duration (see the aria-live="polite" requirement in the issue #738 frontend notes). The backend timeout is 30 seconds (app.ollama.timeout-seconds=30), chosen as a safe upper bound for Q4_K_M on the i7-6700 with a realistic 500-character query under modest concurrent load.
NL query logging policy: Only metadata is logged — query length, resolved person count, latency in milliseconds. The raw query is never written to the log file. Rationale: queries contain real family names (PII); log files persist to disk and may be shipped to Loki. Structured metadata is sufficient for debugging latency regressions.
Prompt-amplification abuse: A malicious user could submit a long or crafted query to cause slow Ollama inference, consuming CPU. Mitigated by NlSearchRateLimiter (5 requests per user per minute, Bucket4j + Caffeine) and by @Size(max=500) on the request body. The rate limiter is node-local; in multi-replica deployments the effective limit multiplies by replica count — acceptable at the current single-node deployment scale.
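To make the per-user limit concrete, here is a plain-JDK sketch of a node-local limiter. It is not Bucket4j: it uses a simple fixed-window counter instead of a token bucket, has no eviction (which Caffeine provides in the real NlSearchRateLimiter), and the class name and injectable clock are illustrative assumptions. Only the 5-requests-per-minute figure comes from the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Illustrative fixed-window limiter; the project uses Bucket4j + Caffeine instead.
final class SimpleNlRateLimiter {
    private static final int LIMIT = 5;                // requests per user per window
    private static final long WINDOW_MILLIS = 60_000L; // one minute

    private record Window(long start, int count) {}

    // Node-local state: each replica would keep its own map, hence the multiplied limit.
    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final LongSupplier clock;

    SimpleNlRateLimiter(LongSupplier clock) {
        this.clock = clock;
    }

    // Returns true if the request is allowed, false once the user's window is exhausted.
    boolean tryAcquire(String userId) {
        long now = clock.getAsLong();
        Window updated = windows.compute(userId, (id, w) -> {
            if (w == null || now - w.start() >= WINDOW_MILLIS) {
                return new Window(now, 1);             // start a fresh window
            }
            return new Window(w.start(), w.count() + 1);
        });
        return updated.count() <= LIMIT;
    }
}
```

In production code the clock would be System::currentTimeMillis; passing it in keeps the limiter testable without sleeping.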
Ollama model pre-pull requirement: The Docker image contains only the Ollama binary, not the model weights. The operator must run ollama pull qwen2.5:7b-instruct-q4_K_M (≈4.5 GB download, 10–30 minutes) before the backend starts inference. If skipped, every NL search request returns 503 until the pull completes. The deployment runbook in docs/DEPLOYMENT.md covers this explicitly.
Startup dependency: The backend Compose service declares depends_on: ollama: condition: service_healthy. The Ollama healthcheck polls GET http://localhost:11434/api/tags; start_period: 120s provides margin for weight loading (20–60 s on SSD). Note: service_healthy confirms only that the API is responding, not that the model has been downloaded. If the pull was skipped, Ollama still answers inference calls with 404, which the backend surfaces as 503 (see Graceful degradation).
Multi-name resolution heuristic: For 2-name queries (e.g. "Was hat Walter an Emma geschrieben?"), the first extracted name is treated as sender and the second as receiver. Per-name role annotation (e.g. {name: "Walter", role: "sender"}) was rejected because it would require a combinatorially complex Ollama schema and the most natural German phrasing strongly implies sender→receiver order. For single-name queries, a personRole field (sender/receiver/any) is returned.
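The ordering heuristic can be sketched as follows. NameRoles and NameRoleHeuristic are hypothetical names, and the text does not say what happens with zero or more than two names, so this sketch simply leaves all roles unset in those cases (an assumption).

```java
import java.util.List;

// Hypothetical result type; unset roles are null.
record NameRoles(String sender, String receiver, String either) {}

final class NameRoleHeuristic {
    private NameRoleHeuristic() {}

    // names: extracted names in query order; personRole: "sender", "receiver" or "any".
    static NameRoles assign(List<String> names, String personRole) {
        if (names.size() == 2) {
            // Two names: the first is treated as sender, the second as receiver.
            return new NameRoles(names.get(0), names.get(1), null);
        }
        if (names.size() == 1) {
            String name = names.get(0);
            String role = personRole == null ? "any" : personRole;
            return switch (role) {
                case "sender" -> new NameRoles(name, null, null);
                case "receiver" -> new NameRoles(null, name, null);
                default -> new NameRoles(null, null, name); // "any": sender or receiver
            };
        }
        // Zero or more than two names: not covered by the text, left unresolved here.
        return new NameRoles(null, null, null);
    }
}
```

For the example query, assign(List.of("Walter", "Emma"), null) yields Walter as sender and Emma as receiver without the model having to annotate either name.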
personRole: "any" keyword limitation: When personRole is "any" and the name resolves to exactly one person, DocumentService.searchDocumentsByPersonId() is called (OR semantics: person as sender or receiver). Keyword filtering is not applied on this path — only person identity and date range. keywordsApplied = false is returned in the response. Rationale: the JPQL for OR-semantics person queries has no text predicate; adding FTS would require a native query or a separate pass, adding complexity for a case that is already well-narrowed by person identity.
search/ → person/ + document/ dependency direction: NlQueryParserService calls PersonService.findByDisplayNameContaining() and DocumentService.searchDocuments() — both are legitimate cross-domain service calls, not repository leaks. The search/ package has no JPA entities of its own and never accesses PersonRepository or DocumentRepository directly.
Keyword→tag resolution (issue #743): After Ollama extracts the keywords list, NlQueryParserService calls TagService.findByNameContaining() for each keyword. Keywords that match one or more tags are removed from the FTS text list and added as OR-union tag filters; keywords with no tag match remain as FTS text. Resolved tags are returned to the frontend as TagHint objects in NlQueryInterpretation.resolvedTags and rendered as removable "Thema: X" chips. The tagsApplied flag signals whether the OR-union filter was actually passed to DocumentService.searchDocuments() — it is false when the personRole:any single-person path is taken, because that path has no tag filter slot. See ADR-033 for the tag name resolution and case-collision rules that TagService.findByNameContaining() relies on.
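The keyword split described above might be sketched like this. KeywordSplit and KeywordTagResolver are hypothetical names, and the lookup function only stands in for TagService.findByNameContaining(); its real return type (and the TagHint mapping) is not shown in this ADR, so plain tag names are used as an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical result type: tags become OR-union filters, the rest stays FTS text.
record KeywordSplit(Set<String> tagFilters, List<String> ftsKeywords) {}

final class KeywordTagResolver {
    private KeywordTagResolver() {}

    // tagLookup stands in for TagService.findByNameContaining(); returning names is an assumption.
    static KeywordSplit split(List<String> keywords, Function<String, List<String>> tagLookup) {
        Set<String> tagFilters = new LinkedHashSet<>();
        List<String> ftsKeywords = new ArrayList<>();
        for (String keyword : keywords) {
            List<String> matches = tagLookup.apply(keyword);
            if (matches.isEmpty()) {
                ftsKeywords.add(keyword);   // no tag match: keep as full-text term
            } else {
                tagFilters.addAll(matches); // one or more matches: add to the OR-union filter
            }
        }
        return new KeywordSplit(tagFilters, ftsKeywords);
    }
}
```

A keyword is consumed by exactly one of the two paths, so a term never narrows the result set twice.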
Decision
Introduce a new search/ domain package with a local Ollama integration via RestClientOllamaClient. The Ollama service runs as a separate Docker container, reachable only on the internal Docker network (expose: ["11434"], not ports:). The backend calls Ollama's /api/generate endpoint with grammar-constrained JSON output. Name resolution and document search are performed by existing services after the model returns.
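For orientation, a request body for /api/generate with a schema-constrained format field could be assembled as in the sketch below. The top-level fields model, prompt, stream and format are part of Ollama's API; the schema's property names (names, keywords, personRole) and the helper class are illustrative assumptions, and the real backend most likely serialises the body with a JSON library rather than by hand.

```java
// Sketch of an /api/generate request body with a JSON-schema "format" field.
final class OllamaRequestBodies {
    private OllamaRequestBodies() {}

    // Property names are illustrative; the maxLength values mirror the limits in this ADR.
    private static final String SCHEMA = """
            {"type":"object",
             "properties":{
               "names":{"type":"array","items":{"type":"string","maxLength":200}},
               "keywords":{"type":"array","items":{"type":"string","maxLength":100}},
               "personRole":{"type":"string","enum":["sender","receiver","any"]}},
             "required":["names","keywords"]}""";

    // Builds the body; the prompt is JSON-escaped, and no person records are included.
    static String generateBody(String model, String prompt) {
        return "{\"model\":" + quote(model)
                + ",\"prompt\":" + quote(prompt)
                + ",\"stream\":false"
                + ",\"format\":" + SCHEMA + "}";
    }

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.append('"').toString();
    }
}
```

Setting stream to false makes Ollama return one complete JSON response, which suits a blocking call with a fixed timeout.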
Key component structure:
- OllamaClient / OllamaHealthClient interfaces: mockable for tests, modelled on OcrClient / OcrHealthClient
- RestClientOllamaClient: two RestClient instances (30 s inference, 2 s health-check)
- NlQueryParserService: orchestrates Ollama → name resolution → document search
- NlSearchRateLimiter: Bucket4j + Caffeine, 5 req/min per user
- NlSearchController: POST /api/search/nl, @RequirePermission(READ_ALL)
Consequences
- Family members can query in natural German without learning the filter UI. The improvement in search satisfaction is expected to be significant for the 60+ age cohort (the primary transcription audience).
- NL search is unavailable when Ollama is down or the model pull is not complete. The regular search is unaffected. The 503 response includes a CTA directing users to the regular search.
- Operator responsibility: run ollama pull on first deploy and after model updates. The backup runbook must exclude the ollama_models volume (model weights are re-downloadable, not user data).
- Inference takes 2–15 seconds. The frontend loading indicator is a hard requirement (see issue #738 frontend notes).
- The rate limiter is node-local. At the current single-node deployment scale this is correct. If the service is ever scaled horizontally, the rate limiter must be moved to Redis (same caveat as LoginRateLimiter).
- The search/ package introduces a new cross-domain dependency direction (search → person, search → document). This is intentional and documented in docs/architecture/c4/l3-backend-search.puml.