From 3d2b907fb4328b6b135ca0c30a47c7fe867c8ff7 Mon Sep 17 00:00:00 2001
From: Marcel
Date: Sun, 7 Jun 2026 16:13:58 +0200
Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR-035=20=E2=80=94=20replace=20Ol?=
 =?UTF-8?q?lama=20with=20rule-based=20nlp-service?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Sonnet 4.6
---
 docs/adr/035-rule-based-nlp-service.md | 105 +++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/adr/035-rule-based-nlp-service.md

diff --git a/docs/adr/035-rule-based-nlp-service.md b/docs/adr/035-rule-based-nlp-service.md
new file mode 100644
index 00000000..3955e423
--- /dev/null
+++ b/docs/adr/035-rule-based-nlp-service.md
@@ -0,0 +1,105 @@
+# ADR-035: Replace Ollama with a rule-based NLP service for smart search
+
+**Date:** 2026-06-07
+**Status:** Accepted
+**Deciders:** Marcel Raddatz
+**Supersedes:** ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
+**Relates to:** #771 (implementation)
+
+---
+
+## Context
+
+ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured
+extractions (person names, date ranges, person role, keywords). After deploying to
+staging (ADR-034) the approach showed three problems:
+
+1. **Cold-start latency:** even with `OLLAMA_KEEP_ALIVE=-1` a Qwen inference on CPU takes
+   ~18 s. This blows the UX budget for a search feature and requires a 60 s timeout.
+2. **Resource cost:** 8 GB resident RAM + 4 vCPU cap for an LLM whose only job is
+   regex-level entity extraction from short (< 500 char) German family-history queries.
+3. **Fragility:** model-weight downloads, version pinning, and init-container orchestration
+   add operational surface area with no quality benefit over a deterministic parser.
+
+The query set is narrow and well-understood: person names are all in the PostgreSQL
+`persons` table; date patterns are a fixed repertoire of German/English/Spanish formats;
+person role (sender vs. receiver) is reliably signalled by a handful of prepositions
+("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
+extractors.
+
+---
+
+## Decision
+
+Replace Ollama with a lightweight, rule-based Python FastAPI service (`nlp-service`).
+
+### Architecture
+
+```
+POST /api/search/nl  (NlSearchController)
+  → NlQueryParserService
+      → RestClientNlpClient.parse(query, lang)
+          → POST http://nlp-service:8001/parse
+          ← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
+```
+
+The response contract is identical to the old `OllamaExtraction`; only the transport
+and implementation change. Java callers see `NlpExtraction` (renamed, same shape).
+
+### Implementation
+
+- **`nlp-service/`** — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
+  - `extractor.py` — pipeline: person extraction → role detection → date parsing → keywords
+  - `person_matcher.py` — two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table;
+    loaded at startup, no live DB queries during extraction
+  - `models.py` — Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
+  - `main.py` — lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`
+
+- **`backend/search/`** — `OllamaClient` / `OllamaExtraction` renamed to `NlpClient` /
+  `NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces
+  `OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.
+
+### Tunable parameters
+
+| Env var | Default | Effect |
+|---|---|---|
+| `DATABASE_URL` | — | PostgreSQL DSN; unset → person matching disabled |
+| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0–100) |
+
+### Graceful degradation
+
+The backend's `RestClientNlpClient` wraps all HTTP errors and timeouts in
+`DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE)`, returning HTTP 503 to
+the client — identical behaviour to the Ollama path. The rate limiter is relaxed from
+5 to 20 requests/min (rule-based extraction completes in < 50 ms vs. ~18 s for LLM).
+
+---
+
+## Consequences
+
+### Positive
+
+- **Latency:** < 50 ms per extraction vs. ~18 s — smart search is now interactive.
+- **Memory:** ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
+- **No model downloads:** the image ships no weights; startup is a single DB query.
+- **Deterministic:** same query always produces the same result; no temperature/sampling.
+- **Testable without infrastructure:** pytest with a seeded `PersonMatcher` fixture; no
+  WireMock stubs needed for most unit tests.
+
+### Trade-offs
+
+- **No semantic generalisation.** The LLM could handle novel phrasing; the rule-based
+  parser only handles the preposition patterns it was written for. Edge cases that fall
+  outside the pattern produce an empty extraction rather than a best-effort result.
+- **Person matching depends on DB content.** A person not yet in the archive will never
+  match, even if the user types their exact name. The LLM could surface the name as a
+  raw string; this service surfaces nothing. This is acceptable for the current archive
+  size and query patterns.
+- **Language support is fixed at de/en/es** (Paraglide locales). Adding a fourth locale
+  requires adding its stopword list and preposition table to `extractor.py`.
+
+### Superseded ADRs
+
+ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
+memory budget. All of that is now moot.
+The `ollama` and `ollama-model-init` services and the `ollama_models` volume are removed from `docker-compose.yml`.
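
The extraction order named under Implementation (person extraction → role detection → date parsing → keywords) can be sketched roughly as follows. This is an illustration only, not the contents of `extractor.py` or `person_matcher.py`: the person list, stopword set, role labels, and the `parse`, `match_persons`, and `similarity` helpers are all hypothetical. The standard-library `difflib.SequenceMatcher` stands in for rapidfuzz so the block has no dependencies, and date parsing is reduced to a bare four-digit year.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional, Set, Tuple

# Hypothetical stand-ins: the real service loads names from the `persons` table
# at startup and keeps per-locale stopword and preposition tables.
PERSONS = ["Anna Raddatz", "Karl Schmidt"]
STOPWORDS = {"von", "an", "aus", "dem", "der", "die", "das", "im", "jahr"}
ROLE_BY_PREPOSITION = {"von": "sender", "an": "receiver"}
FUZZY_THRESHOLD = 80.0  # same 0-100 scale as NLP_FUZZY_THRESHOLD
YEAR = re.compile(r"1[5-9]\d{2}|20\d{2}")


@dataclass
class Extraction:
    raw_query: str
    person_names: List[str] = field(default_factory=list)
    person_role: Optional[str] = None
    date_from: Optional[str] = None
    date_to: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_persons(
    tokens: List[str], persons: List[str], threshold: float
) -> Tuple[List[Tuple[str, int]], Set[int]]:
    """Two passes over token windows: exact matches first, then fuzzy ones."""
    hits: List[Tuple[str, int]] = []  # (canonical name, index of first token)
    consumed: Set[int] = set()        # token indices claimed by a person match
    for required in (100.0, threshold):  # pass 1: exact, pass 2: fuzzy
        for name in persons:
            if any(name == found for found, _ in hits):
                continue
            width = len(name.split())
            best_score, best_start = 0.0, None
            for start in range(len(tokens) - width + 1):
                if consumed & set(range(start, start + width)):
                    continue
                score = similarity(" ".join(tokens[start:start + width]), name)
                if score > best_score:
                    best_score, best_start = score, start
            if best_start is not None and best_score >= required:
                hits.append((name, best_start))
                consumed.update(range(best_start, best_start + width))
    return hits, consumed


def parse(query: str) -> Extraction:
    tokens = re.findall(r"\w+", query)
    result = Extraction(raw_query=query)

    # 1. Person extraction against the preloaded name list.
    hits, consumed = match_persons(tokens, PERSONS, FUZZY_THRESHOLD)
    hits.sort(key=lambda hit: hit[1])
    result.person_names = [name for name, _ in hits]

    # 2. Role detection: a known preposition directly before a matched name.
    for _, start in hits:
        role = ROLE_BY_PREPOSITION.get(tokens[start - 1].lower()) if start > 0 else None
        if role:
            result.person_role = role
            break

    # 3. Date parsing: this sketch only recognises a bare four-digit year.
    for index, token in enumerate(tokens):
        if index not in consumed and YEAR.fullmatch(token):
            result.date_from, result.date_to = f"{token}-01-01", f"{token}-12-31"
            consumed.add(index)
            break

    # 4. Keywords: whatever the earlier steps did not consume, minus stopwords.
    result.keywords = [
        token for index, token in enumerate(tokens)
        if index not in consumed and token.lower() not in STOPWORDS
    ]
    return result
```

With the sample data above, `parse("Briefe von Anna Radatz aus dem Jahr 1923 Hochzeit")` resolves the misspelled name to `Anna Raddatz` in the fuzzy pass, assigns the `sender` role from the preceding "von", and leaves `Briefe` and `Hochzeit` as keywords. Running the exact pass before the fuzzy pass keeps a near-miss spelling from claiming tokens that belong to an exactly typed name.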