
Marcel 0fe0ae5235 docs(search): ADR-028 fix + glossary + C4 diagram for tag resolution (#743 )

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-07 08:47:47 +02:00


ADR-028 — Natural language search is powered by Ollama (Qwen 2.5 7B), not a cloud API

Date: 2026-06-06
Status: Accepted
Issue: #738 (NL search backend); part of epic #735
Milestone: Archive Intelligence — NL Search

Context

Family members write their search intent in plain German ("Was hat Walter im Krieg an Emma geschrieben?"), not in structured filter forms. Issue #735 defines NL search as a core product goal. Three delivery options were evaluated:

Option A — extend the OCR service. The OCR Python microservice already runs on the same host. Adding LLM inference there avoids a new container. Rejected: the OCR service is a single-purpose, CPU-bound pipeline optimised for Kraken; bundling a 4.5 GB LLM weight into the same image would bloat it, complicate model lifecycle management, and create an unrelated failure domain (OOM on large OCR batches vs. LLM load time). ADR-001 was explicit about keeping OCR single-purpose.

Option B — call an external API (OpenAI, Anthropic, etc.). Cloud inference is instant and requires no local hardware. Rejected: the archive contains real person names and private family correspondence from 1899–1950 — sending query content to a third party violates the project's data-residency principle (family data stays on the family server). Additionally, API cost and availability are outside the operator's control; the system must work air-gapped.

Option C — local Ollama service (chosen). Ollama is a purpose-built LLM runtime with a simple REST API, model lifecycle management (ollama pull), and support for grammar-constrained JSON output. It runs entirely on the existing server (i7-6700, 64 GB RAM) with no cloud dependency.

Model selection: Qwen 2.5 7B Q4_K_M (qwen2.5:7b-instruct-q4_K_M) was chosen over larger models because:

Quantised weight is ~4.5 GB — fits comfortably in 64 GB RAM alongside PostgreSQL and the JVM.
Instruction-tuned variant follows the structured JSON schema reliably without fine-tuning.
CPU-only inference at Q4_K_M takes 2–15 seconds per query, acceptable for a search that replaces a multi-step filter form.

Prompt injection mitigation: The backend sends the raw user query to Ollama. To prevent the model from being prompted to return schema-breaking output, the API call uses Ollama's format parameter with a grammar-constrained JSON schema. Output length is further bounded by maxLength constraints in the schema (names ≤ 200 chars, keywords ≤ 100 chars). NlQueryParserService enforces these limits in code before any LLM-extracted fragment is passed to PersonRepository.searchByName() — defence in depth.
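The in-code half of this defence in depth could look roughly like the sketch below. The class and method names are hypothetical, and dropping over-long fragments (rather than truncating them) is an assumption; only the 200 and 100 character limits come from this ADR.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical helper: re-checks the schema length limits on LLM output in code. */
final class LlmFragmentLimits {
    static final int MAX_NAME_LENGTH = 200;    // limit stated in this ADR
    static final int MAX_KEYWORD_LENGTH = 100; // limit stated in this ADR

    private LlmFragmentLimits() {}

    /**
     * Trims each fragment and drops null, blank, or over-long ones.
     * Dropping instead of truncating is an assumption of this sketch.
     */
    static List<String> enforce(List<String> fragments, int maxLength) {
        if (fragments == null) {
            return List.of();
        }
        return fragments.stream()
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(s -> !s.isEmpty() && s.length() <= maxLength)
                .toList();
    }
}
```

A caller would run extracted names through `enforce(names, MAX_NAME_LENGTH)` before any lookup, so a model response that somehow bypasses the grammar constraint still cannot push an oversized string into a query.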

DB-blind name resolution: The Ollama prompt stays small (the raw query only); person database records are never sent to the model. Name resolution happens as a cheap SQL query after the model returns. This keeps the prompt short, avoids data leakage, and means adding 1,000 new persons requires no prompt change.

Graceful degradation: In-path Ollama failures surface via OllamaClient.parse() — any IOException, read timeout, or non-2xx response is caught by RestClientOllamaClient and re-thrown as DomainException(SMART_SEARCH_UNAVAILABLE, HTTP 503). isHealthy() has no callers inside search/; it is reserved for the ops/health-endpoint polling path only (e.g. a future /api/health/ollama endpoint). The regular structured search (GET /api/documents/search) is unaffected — it never calls Ollama.
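The failure mapping described above can be sketched without Spring as follows. `SmartSearchUnavailableException` is a stand-in for the project's `DomainException(SMART_SEARCH_UNAVAILABLE, HTTP 503)`, whose real shape is not shown in this ADR; `OllamaCall` and `RawResponse` are hypothetical types introduced only for the illustration.

```java
import java.io.IOException;

/** Sketch: every in-path Ollama failure becomes one 503-style domain error. */
final class OllamaFailureMapping {

    /** Stand-in for the project's DomainException; not the real class. */
    static final class SmartSearchUnavailableException extends RuntimeException {
        final int httpStatus = 503;

        SmartSearchUnavailableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Minimal response shape for this sketch. */
    record RawResponse(int status, String body) {}

    /** Hypothetical abstraction over the HTTP call to Ollama. */
    @FunctionalInterface
    interface OllamaCall {
        RawResponse execute() throws IOException;
    }

    private OllamaFailureMapping() {}

    static String parse(OllamaCall call) {
        RawResponse response;
        try {
            response = call.execute();
        } catch (IOException e) {
            // Read timeouts such as SocketTimeoutException are IOExceptions too.
            throw new SmartSearchUnavailableException("Ollama unreachable", e);
        }
        if (response.status() < 200 || response.status() >= 300) {
            throw new SmartSearchUnavailableException(
                    "Ollama returned HTTP " + response.status(), null);
        }
        return response.body();
    }
}
```

Funnelling all three failure kinds into one exception type keeps the controller layer simple: it only has to translate a single domain error into the 503 response.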

Expected inference latency: 2–15 seconds on the current CPU-only hardware. The frontend must show a persistent "Suche läuft…" indicator for the full duration (see the aria-live="polite" requirement in the issue #738 frontend notes). The backend timeout is 30 seconds (app.ollama.timeout-seconds=30), chosen as a safe upper bound for Q4_K_M on the i7-6700 with a realistic 500-character query under modest concurrent load.

NL query logging policy: Only metadata is logged — query length, resolved person count, latency in milliseconds. The raw query is never written to the log file. Rationale: queries contain real family names (PII); log files persist to disk and may be shipped to Loki. Structured metadata is sufficient for debugging latency regressions.
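As an illustration of the policy, a log line could be assembled from the three metadata fields only. The class name, the `nl_search` prefix, and the key names are assumptions of this sketch, not the project's actual log format.

```java
/** Sketch: builds a log line from metadata only; the raw query text never appears in it. */
final class NlSearchLogLine {
    private NlSearchLogLine() {}

    static String format(String rawQuery, int resolvedPersonCount, long latencyMillis) {
        // Only the length of the query is recorded, never its content (PII).
        int queryLength = rawQuery == null ? 0 : rawQuery.length();
        return "nl_search queryLength=" + queryLength
                + " resolvedPersons=" + resolvedPersonCount
                + " latencyMs=" + latencyMillis;
    }
}
```

Passing the raw query in and reducing it to a length inside the helper keeps the PII boundary in one place, so call sites cannot accidentally log the text.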

Prompt-amplification abuse: A malicious user could submit a long or crafted query to cause slow Ollama inference, consuming CPU. Mitigated by NlSearchRateLimiter (5 requests per user per minute, Bucket4j + Caffeine) and by @Size(max=500) on the request body. The rate limiter is node-local; in multi-replica deployments the effective limit multiplies by replica count — acceptable at the current single-node deployment scale.
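The per-user limit can be illustrated with a plain-Java token bucket. This is a swapped-in sketch, not the Bucket4j + Caffeine implementation named above: it has no eviction of idle users (the role Caffeine plays), and the class name and injected clock are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Sketch: node-local token bucket, 5 requests per user per minute. */
final class SimpleNlSearchRateLimiter {
    private static final int CAPACITY = 5;
    private static final long WINDOW_NANOS = 60_000_000_000L;

    private static final class Bucket {
        double tokens = CAPACITY;
        long lastRefillNanos;

        Bucket(long now) {
            this.lastRefillNanos = now;
        }
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final LongSupplier nanoClock; // injectable so tests can control time

    SimpleNlSearchRateLimiter(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Returns true if the user may issue one more NL search now. */
    boolean tryConsume(String userId) {
        long now = nanoClock.getAsLong();
        Bucket bucket = buckets.computeIfAbsent(userId, id -> new Bucket(now));
        synchronized (bucket) {
            // Refill continuously at CAPACITY tokens per window, capped at CAPACITY.
            double refill = (now - bucket.lastRefillNanos) * (double) CAPACITY / WINDOW_NANOS;
            bucket.tokens = Math.min(CAPACITY, bucket.tokens + refill);
            bucket.lastRefillNanos = now;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

In production the clock would be `System::nanoTime`. Because the map lives in one JVM, two replicas would each grant five requests per minute, which is the multiplication effect noted above.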

Ollama model pre-pull requirement: The Docker image contains only the Ollama binary, not the model weights. The operator must run ollama pull qwen2.5:7b-instruct-q4_K_M (≈4.5 GB download, 10–30 minutes) before the backend starts inference. If skipped, every NL search request returns 503 until the pull completes. The deployment runbook in docs/DEPLOYMENT.md covers this explicitly.

Startup dependency: The backend Compose service declares depends_on: ollama: condition: service_healthy. The Ollama healthcheck polls GET http://localhost:11434/api/tags; start_period: 120s provides margin for weight loading (20–60 s on SSD). Note: service_healthy confirms only that the API is responding, not that the model is downloaded. If the pull was skipped, Ollama still answers inference calls with 404, which the backend surfaces as 503 like any other non-2xx response.

Multi-name resolution heuristic: For 2-name queries (e.g. "Was hat Walter an Emma geschrieben?"), the first extracted name is treated as sender and the second as receiver. Per-name role annotation (e.g. {name: "Walter", role: "sender"}) was rejected because it would require a combinatorially complex Ollama schema and the most natural German phrasing strongly implies sender→receiver order. For single-name queries, a personRole field (sender/receiver/any) is returned.
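The ordering heuristic is small enough to sketch directly. The types below are hypothetical, and returning an empty assignment for zero or more than two names is an assumption, since this ADR only specifies the one-name and two-name cases.

```java
import java.util.List;

/** Sketch: maps extracted names to roles by position (two names) or by personRole (one name). */
final class NameRoleHeuristic {
    enum PersonRole { SENDER, RECEIVER, ANY }

    /** Unused slots are null; 'either' holds a name that may be sender or receiver. */
    record RoleAssignment(String sender, String receiver, String either) {}

    private NameRoleHeuristic() {}

    static RoleAssignment assign(List<String> names, PersonRole singleNameRole) {
        if (names.size() == 2) {
            // First extracted name is the sender, second the receiver.
            return new RoleAssignment(names.get(0), names.get(1), null);
        }
        if (names.size() == 1) {
            String name = names.get(0);
            return switch (singleNameRole) {
                case SENDER -> new RoleAssignment(name, null, null);
                case RECEIVER -> new RoleAssignment(null, name, null);
                case ANY -> new RoleAssignment(null, null, name);
            };
        }
        // Assumption: other counts are not specified in this ADR, so nothing is assigned.
        return new RoleAssignment(null, null, null);
    }
}
```

For "Was hat Walter an Emma geschrieben?" with extracted names `["Walter", "Emma"]`, the sketch assigns Walter as sender and Emma as receiver regardless of the `personRole` value.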

personRole: "any" keyword limitation: When personRole is "any" and the name resolves to exactly one person, DocumentService.searchDocumentsByPersonId() is called (OR semantics: person as sender or receiver). Keyword filtering is not applied on this path — only person identity and date range. keywordsApplied = false is returned in the response. Rationale: the JPQL for OR-semantics person queries has no text predicate; adding FTS would require a native query or a separate pass, adding complexity for a case that is already well-narrowed by person identity.
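The path selection and its effect on the response flags might be modelled as below. All names are hypothetical; the sketch also covers the tagsApplied flag, which behaves the same way on this path, and it assumes that on the regular path a flag is true whenever the corresponding filter list is non-empty.

```java
/** Sketch: chooses the search path and derives the keywordsApplied / tagsApplied flags. */
final class SearchPathSelector {
    enum Path { PERSON_ID_OR_SEMANTICS, FILTERED_SEARCH }

    record Plan(Path path, boolean keywordsApplied, boolean tagsApplied) {}

    private SearchPathSelector() {}

    static Plan plan(String personRole, int resolvedPersonCount,
                     boolean hasKeywords, boolean hasTags) {
        boolean anySinglePerson = "any".equals(personRole) && resolvedPersonCount == 1;
        if (anySinglePerson) {
            // OR-semantics person query: no text predicate and no tag filter slot.
            return new Plan(Path.PERSON_ID_OR_SEMANTICS, false, false);
        }
        // Assumption: on the regular path, filters are applied whenever they exist.
        return new Plan(Path.FILTERED_SEARCH, hasKeywords, hasTags);
    }
}
```

Returning the flags alongside the path makes the limitation visible to the frontend, which can then tell the user that keywords were ignored for this result set.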

search/ → person/ + document/ dependency direction: NlQueryParserService calls PersonService.findByDisplayNameContaining() and DocumentService.searchDocuments() — both are legitimate cross-domain service calls, not repository leaks. The search/ package has no JPA entities of its own and never accesses PersonRepository or DocumentRepository directly.

Keyword→tag resolution (issue #743): After Ollama extracts the keywords list, NlQueryParserService calls TagService.findByNameContaining() for each keyword. Keywords that match one or more tags are removed from the FTS text list and added as OR-union tag filters; keywords with no tag match remain as FTS text. Resolved tags are returned to the frontend as TagHint objects in NlQueryInterpretation.resolvedTags and rendered as removable "Thema: X" chips. The tagsApplied flag signals whether the OR-union filter was actually passed to DocumentService.searchDocuments() — it is false when the personRole:any single-person path is taken, because that path has no tag filter slot. See ADR-033 for the tag name resolution and case-collision rules that TagService.findByNameContaining() relies on.
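The keyword split described above amounts to a partition over the extracted list. In this sketch the `tagLookup` function stands in for `TagService.findByNameContaining()` and returns plain tag names; the real return type and the `TagHint` mapping are not shown in this ADR, so both are simplified here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sketch: splits LLM keywords into tag filters (OR-union) and remaining FTS text. */
final class KeywordTagPartition {
    record Result(List<String> ftsKeywords, Set<String> tagFilters) {}

    private KeywordTagPartition() {}

    static Result partition(List<String> keywords,
                            Function<String, List<String>> tagLookup) {
        List<String> fts = new ArrayList<>();
        Set<String> tags = new LinkedHashSet<>(); // keeps first-seen order, removes duplicates
        for (String keyword : keywords) {
            List<String> matches = tagLookup.apply(keyword);
            if (matches.isEmpty()) {
                fts.add(keyword);      // no tag match: stays full-text search input
            } else {
                tags.addAll(matches);  // one or more matches: joins the OR-union tag filter
            }
        }
        return new Result(List.copyOf(fts), Collections.unmodifiableSet(tags));
    }
}
```

With a lookup that maps "Krieg" to the tags "Krieg" and "Kriegsgefangenschaft" and finds nothing for "Heimweh", the result has `ftsKeywords = [Heimweh]` and both tags in `tagFilters`.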

Decision

Introduce a new search/ domain package with a local Ollama integration via RestClientOllamaClient. The Ollama service runs as a separate Docker container, reachable only on the internal Docker network (expose: ["11434"], not ports:). The backend calls Ollama's /api/generate endpoint with grammar-constrained JSON output. Name resolution and document search are performed by existing services after the model returns.
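A request body for /api/generate with a schema in the format field could be assembled as sketched below. The schema shown is illustrative: the field names (names, keywords, personRole) and length limits are taken from this ADR, but the exact schema, the `required` list, and any system instructions added to the prompt are assumptions. The hand-rolled JSON escaping is for the sketch only; a real client would more likely serialise with a JSON library.

```java
/** Sketch: builds the JSON body for Ollama's /api/generate with a schema-constrained format. */
final class OllamaGenerateRequest {
    static final String MODEL = "qwen2.5:7b-instruct-q4_K_M";

    // Illustrative schema; the project's actual schema is not shown in this ADR.
    static final String FORMAT_SCHEMA = """
            {"type":"object",
             "properties":{
               "names":{"type":"array","items":{"type":"string","maxLength":200}},
               "keywords":{"type":"array","items":{"type":"string","maxLength":100}},
               "personRole":{"type":"string","enum":["sender","receiver","any"]}},
             "required":["names","keywords"]}""";

    private OllamaGenerateRequest() {}

    static String body(String prompt) {
        return "{\"model\":\"" + MODEL + "\",\"prompt\":\"" + escape(prompt)
                + "\",\"stream\":false,\"format\":" + FORMAT_SCHEMA + "}";
    }

    /** Escapes quotes, backslashes, and control characters for a JSON string value. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                default -> {
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.toString();
    }
}
```

Setting `stream` to false asks Ollama for a single JSON response instead of a chunked stream, which suits a parse-then-search flow.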

Key component structure:

OllamaClient / OllamaHealthClient interfaces — mockable for tests, modelled on OcrClient/OcrHealthClient
RestClientOllamaClient — two RestClient instances (30 s inference, 2 s health-check)
NlQueryParserService — orchestrates Ollama → name resolution → document search
NlSearchRateLimiter — Bucket4j + Caffeine, 5 req/min per user
NlSearchController — POST /api/search/nl, @RequirePermission(READ_ALL)
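The two timeout profiles can be illustrated with the JDK's own `java.net.http` API in place of Spring's RestClient, so this is a swapped-in sketch rather than the project's client code. The host name `ollama` (taken from the Compose service name) and the use of /api/tags for the in-app health check are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Sketch: separate request profiles for inference (30 s) and health checks (2 s). */
final class OllamaRequests {
    // Assumption: "ollama" resolves on the internal Docker network.
    static final URI BASE = URI.create("http://ollama:11434");
    static final Duration INFERENCE_TIMEOUT = Duration.ofSeconds(30);
    static final Duration HEALTH_TIMEOUT = Duration.ofSeconds(2);

    private OllamaRequests() {}

    static HttpRequest inference(String jsonBody) {
        return HttpRequest.newBuilder(BASE.resolve("/api/generate"))
                .timeout(INFERENCE_TIMEOUT)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    static HttpRequest health() {
        return HttpRequest.newBuilder(BASE.resolve("/api/tags"))
                .timeout(HEALTH_TIMEOUT)
                .GET()
                .build();
    }
}
```

Keeping the health-check timeout short means a hung Ollama process is reported as unhealthy quickly, without the health probe itself waiting out the long inference budget.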

Consequences

Family members can query in natural German without learning the filter UI. A significant improvement in search satisfaction is expected for the 60+ age cohort (the primary transcription audience).
NL search is unavailable when Ollama is down or the model pull is not complete. The regular search is unaffected. The 503 response includes a CTA directing users to the regular search.
Operator responsibility: run ollama pull on first deploy and after model updates. The backup runbook must exclude the ollama_models volume (model weights are re-downloadable, not user data).
Inference takes 2–15 seconds. The frontend loading indicator is a hard requirement (see issue #738 frontend notes).
The rate limiter is node-local. At the current single-node deployment scale this is correct. If the service is ever scaled horizontally, the rate limiter must be moved to Redis (same caveat as LoginRateLimiter).
The search/ package introduces a new cross-domain dependency direction (search → person, search → document). This is intentional and documented in docs/architecture/c4/l3-backend-search.puml.
