From 3d2b907fb4328b6b135ca0c30a47c7fe867c8ff7 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sun, 7 Jun 2026 16:13:58 +0200
Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR-035=20=E2=80=94=20replace=20Ol?=
 =?UTF-8?q?lama=20with=20rule-based=20nlp-service?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/adr/035-rule-based-nlp-service.md | 105 +++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/adr/035-rule-based-nlp-service.md

diff --git a/docs/adr/035-rule-based-nlp-service.md b/docs/adr/035-rule-based-nlp-service.md
new file mode 100644
index 00000000..3955e423
--- /dev/null
+++ b/docs/adr/035-rule-based-nlp-service.md
@@ -0,0 +1,105 @@
+# ADR-035: Replace Ollama with a rule-based NLP service for smart search
+
+**Date:** 2026-06-07
+**Status:** Accepted
+**Deciders:** Marcel Raddatz
+**Supersedes:** ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
+**Relates to:** #771 (implementation)
+
+---
+
+## Context
+
+ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured
+extractions (person names, date ranges, person role, keywords). After deploying to
+staging (ADR-034) the approach showed three problems:
+
+1. **Latency:** even with `OLLAMA_KEEP_ALIVE=-1` keeping the model resident, a single Qwen
+   inference on CPU takes ~18 s. That blows the UX budget for a search feature and forces
+   a 60 s timeout.
+2. **Resource cost:** 8 GB resident RAM and a 4 vCPU cap for an LLM whose only job is
+   regex-level entity extraction from short (≤ 500 char) German family-history queries.
+3. **Fragility:** model-weight downloads, version pinning, and init-container orchestration
+   add operational surface area with no quality benefit over a deterministic parser.
+
+The query set is narrow and well-understood: person names are all in the PostgreSQL
+`persons` table; date patterns are a fixed repertoire of German/English/Spanish formats;
+person role (sender vs. receiver) is reliably signalled by a handful of prepositions
+("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
+extractors.
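+
+A minimal sketch of how that preposition signal could be read, for German only. The
+function name `detect_role` and the role labels `sender`, `receiver`, and `both` are
+assumptions for illustration, not the actual `extractor.py` API:
+
+```python
+import re
+from typing import Optional
+
+# Hypothetical single-language preposition table (German only).
+SENDER_PREPOSITION = "von"
+RECEIVER_PREPOSITION = "an"
+
+
+def detect_role(query: str) -> Optional[str]:
+    """Return 'sender', 'receiver', 'both', or None if no preposition is present."""
+    # Tokenise on word characters so "Anna" is never mistaken for "an".
+    tokens = re.findall(r"\w+", query.lower())
+    has_sender = SENDER_PREPOSITION in tokens
+    has_receiver = RECEIVER_PREPOSITION in tokens
+    if has_sender and has_receiver:
+        return "both"  # the "von ... an" pattern
+    if has_sender:
+        return "sender"
+    if has_receiver:
+        return "receiver"
+    return None
+```
+
+Matching whole tokens rather than substrings is the important design choice here: a
+substring test would fire on any name that happens to contain "an" or "von".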
+
+---
+
+## Decision
+
+Replace Ollama with a lightweight, rule-based Python FastAPI service (`nlp-service`).
+
+### Architecture
+
+```
+POST /api/search/nl (NlSearchController)
+  → NlQueryParserService
+    → RestClientNlpClient.parse(query, lang)
+      → POST http://nlp-service:8001/parse
+        ← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
+```
+
+The response contract is identical to the old `OllamaExtraction`; only the transport
+and implementation change. Java callers see `NlpExtraction` (renamed, same shape).
+
+### Implementation
+
+- **`nlp-service/`** — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
+  - `extractor.py` — pipeline: person extraction → role detection → date parsing → keywords
+  - `person_matcher.py` — two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table;
+    loaded at startup, no live DB queries during extraction
+  - `models.py` — Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
+  - `main.py` — lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`
+
+- **`backend/search/`** — `OllamaClient` / `OllamaExtraction` renamed to `NlpClient` /
+  `NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces
+  `OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.
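+
+The two-pass lookup could look roughly like the sketch below. It assumes the first pass
+is an exact, case-insensitive match and the second a fuzzy match above the threshold;
+the ADR does not spell out the passes, so that reading is a guess. The class name
+`PersonMatcherSketch` is hypothetical, and the standard library's `difflib` stands in
+for rapidfuzz so the example runs without dependencies (its scores will differ from
+rapidfuzz's):
+
+```python
+from difflib import SequenceMatcher
+from typing import Optional
+
+
+class PersonMatcherSketch:
+    """In-memory name lookup; names are loaded once, no DB access per query."""
+
+    def __init__(self, names: list[str], threshold: float = 80.0) -> None:
+        # Key by case-folded name so pass 1 is a plain dict lookup.
+        self._by_key = {name.casefold(): name for name in names}
+        self._threshold = threshold
+
+    def match(self, candidate: str) -> Optional[str]:
+        key = candidate.casefold().strip()
+        # Pass 1: exact, case-insensitive match.
+        if key in self._by_key:
+            return self._by_key[key]
+        # Pass 2: best fuzzy score on a 0-100 scale, accepted only above the floor.
+        best_name, best_score = None, 0.0
+        for known_key, name in self._by_key.items():
+            score = 100.0 * SequenceMatcher(None, key, known_key).ratio()
+            if score > best_score:
+                best_name, best_score = name, score
+        return best_name if best_score >= self._threshold else None
+```
+
+With the made-up names `["Anna Schmidt", "Karl Meier"]`, the misspelling
+`"Ana Schmidt"` resolves in the fuzzy pass, while an unrelated name returns `None`.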
+
+### Tunable parameters
+
+| Env var | Default | Effect |
+|---|---|---|
+| `DATABASE_URL` | — | PostgreSQL DSN; unset → person matching disabled |
+| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0–100) |
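+
+As an illustration of these defaults, the settings could be read as in the sketch below.
+`NlpSettings` and `load_settings` are hypothetical names, and the range check is an
+added assumption rather than documented behaviour:
+
+```python
+import os
+from dataclasses import dataclass
+from typing import Mapping, Optional
+
+
+@dataclass(frozen=True)
+class NlpSettings:
+    database_url: Optional[str]
+    fuzzy_threshold: int
+
+    @property
+    def person_matching_enabled(self) -> bool:
+        # An unset DATABASE_URL disables person matching.
+        return self.database_url is not None
+
+
+def load_settings(env: Mapping[str, str] = os.environ) -> NlpSettings:
+    threshold = int(env.get("NLP_FUZZY_THRESHOLD", "80"))
+    if not 0 <= threshold <= 100:
+        raise ValueError("NLP_FUZZY_THRESHOLD must be between 0 and 100")
+    # Treat an empty string the same as an unset variable.
+    return NlpSettings(database_url=env.get("DATABASE_URL") or None,
+                       fuzzy_threshold=threshold)
+```
+
+Passing the environment as a mapping keeps the function testable without touching the
+real process environment.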
+
+### Graceful degradation
+
+The backend's `RestClientNlpClient` wraps all HTTP errors and timeouts in
+`DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE)`, so the client receives
+HTTP 503, the same behaviour as the Ollama path. Because rule-based extraction completes
+in < 50 ms instead of ~18 s, the rate limiter is relaxed from 5 to 20 requests/min.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Latency:** < 50 ms per extraction vs. ~18 s — smart search is now interactive.
+- **Memory:** ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
+- **No model downloads:** the image ships no weights; startup is a single DB query.
+- **Deterministic:** same query always produces the same result; no temperature/sampling.
+- **Testable without infrastructure:** pytest with a seeded `PersonMatcher` fixture; no
+  WireMock stubs needed for most unit tests.
+
+### Trade-offs
+
+- **No semantic generalisation.** The LLM could handle novel phrasing; the rule-based
+  parser only handles the preposition patterns it was written for. Edge cases that fall
+  outside the pattern produce an empty extraction rather than a best-effort result.
+- **Person matching depends on DB content.** A person not yet in the archive will never
+  match, even if the user types their exact name. The LLM could surface the name as a
+  raw string; this service surfaces nothing. This is acceptable for the current archive
+  size and query patterns.
+- **Language support is fixed at de/en/es** (Paraglide locales). Adding a fourth locale
+  requires adding its stopword list and preposition table to `extractor.py`.
+
+### Superseded ADRs
+
+ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
+memory budget. All of that is now moot. The `ollama` and `ollama-model-init` services
+and the `ollama_models` volume are removed from `docker-compose.yml`.