ADR-035: Replace Ollama with a rule-based NLP service for smart search
Date: 2026-06-07
Status: Accepted
Deciders: Marcel Raddatz
Supersedes: ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
Relates to: #771 (implementation)
Context
ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured extractions (person names, date ranges, person role, keywords). After deploying to staging (ADR-034) the approach showed three problems:
- Cold-start latency: even with `OLLAMA_KEEP_ALIVE=-1`, a Qwen inference on CPU takes ~18 s. This blows the UX budget for a search feature and requires a 60 s timeout.
- Resource cost: 8 GB resident RAM + 4 vCPU cap for an LLM whose only job is regex-level entity extraction from short (< 500 char) German family-history queries.
- Fragility: model-weight downloads, version pinning, and init-container orchestration add operational surface area with no quality benefit over a deterministic parser.
The query set is narrow and well-understood: person names are all in the PostgreSQL
persons table; date patterns are a fixed repertoire of German/English/Spanish formats;
person role (sender vs. receiver) is reliably signalled by a handful of prepositions
("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
extractors.
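As a rough illustration of how little machinery the preposition cue needs, the sketch below detects the role from "von" and "an" alone. It is a minimal sketch, not the service's code: the function name `detect_role`, the role labels, and the choice to report "both" for the "von … an" pattern are assumptions made for this example.

```python
import re
from typing import Optional

# Assumption: German-only cues taken from the text ("von", "an", "von … an").
# The labels "sender" / "receiver" / "both" are illustrative names.
_VON = re.compile(r"\bvon\s+\S", re.IGNORECASE)
_AN = re.compile(r"\ban\s+\S", re.IGNORECASE)


def detect_role(query: str) -> Optional[str]:
    """Return a role label based on which preposition cues occur in the query."""
    has_von = bool(_VON.search(query))
    has_an = bool(_AN.search(query))
    if has_von and has_an:
        return "both"      # "von … an": a sender and a receiver are named
    if has_von:
        return "sender"    # "von <person>"
    if has_an:
        return "receiver"  # "an <person>"
    return None            # no cue found
```

The word-boundary anchors keep names such as "Anna" from being mistaken for the preposition "an".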
Decision
Replace Ollama with a lightweight, rule-based Python FastAPI service (nlp-service).
Architecture
POST /api/search/nl (NlSearchController)
→ NlQueryParserService
→ RestClientNlpClient.parse(query, lang)
→ POST http://nlp-service:8001/parse
← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
The response contract is identical to the old OllamaExtraction; only the transport
and implementation change. Java callers see NlpExtraction (renamed, same shape).
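For orientation, the response shape can be written down as a plain data class. This is a sketch under assumptions: only the six field names come from the contract above; the field types (ISO date strings, optional role) and the helper `extraction_from_json` are hypothetical, and the real service uses Pydantic models rather than `dataclasses`.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NlpExtraction:
    # Field names follow the JSON contract; the types are assumptions.
    personNames: List[str]
    personRole: Optional[str]
    dateFrom: Optional[str]   # assumed ISO-8601 date string or null
    dateTo: Optional[str]     # assumed ISO-8601 date string or null
    keywords: List[str]
    rawQuery: str


def extraction_from_json(body: str) -> NlpExtraction:
    """Decode a /parse response; raises TypeError on missing or unexpected keys."""
    return NlpExtraction(**json.loads(body))
```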
Implementation
- `nlp-service/`: standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
  - `extractor.py`: pipeline of person extraction → role detection → date parsing → keywords
  - `person_matcher.py`: two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table; loaded at startup, no live DB queries during extraction
  - `models.py`: Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
  - `main.py`: lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`
- `backend/search/`: `OllamaClient`/`OllamaExtraction` renamed to `NlpClient`/`NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces `OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.
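One plausible reading of the "two-pass" lookup is an exact pass followed by a fuzzy pass over token windows. The sketch below illustrates that reading only: it substitutes the standard library's `difflib.SequenceMatcher` for rapidfuzz so that it runs without dependencies, and the class layout, the window strategy, and the sample names are assumptions, not the contents of `person_matcher.py`.

```python
from difflib import SequenceMatcher
from typing import List


class PersonMatcher:
    """In-memory matcher; names are loaded once, no DB access during match()."""

    def __init__(self, names: List[str], threshold: float = 80.0) -> None:
        self._names = list(names)
        self._threshold = threshold  # similarity floor on a 0-100 scale

    def match(self, query: str) -> List[str]:
        lowered = query.lower()
        # Pass 1: exact, case-insensitive substring hits.
        exact = [n for n in self._names if n.lower() in lowered]
        if exact:
            return exact
        # Pass 2: fuzzy comparison of token windows against each known name.
        tokens = lowered.split()
        hits = []
        for name in self._names:
            width = len(name.split())
            windows = [" ".join(tokens[i:i + width])
                       for i in range(len(tokens) - width + 1)]
            best = max((SequenceMatcher(None, name.lower(), w).ratio() * 100
                        for w in windows), default=0.0)
            if best >= self._threshold:
                hits.append(name)
        return hits
```

`SequenceMatcher.ratio()` and rapidfuzz scores are not numerically identical, so a threshold tuned against one should not be assumed to carry over to the other.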
Tunable parameters
| Env var | Default | Effect |
|---|---|---|
| `DATABASE_URL` | (none) | PostgreSQL DSN; unset → person matching disabled |
| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0–100) |
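Reading these two variables could look roughly like the following. This is a sketch under assumptions: the `Settings` class, the `load_settings` helper, and the range check are hypothetical; only the variable names, the default of 80, and the "unset disables person matching" behaviour come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: Optional[str]
    fuzzy_threshold: int

    @property
    def person_matching_enabled(self) -> bool:
        # Without a DSN there are no persons to load, so matching is off.
        return self.database_url is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    threshold = int(env.get("NLP_FUZZY_THRESHOLD", "80"))
    if not 0 <= threshold <= 100:  # assumed validation, not confirmed by the ADR
        raise ValueError("NLP_FUZZY_THRESHOLD must be between 0 and 100")
    return Settings(database_url=env.get("DATABASE_URL") or None,
                    fuzzy_threshold=threshold)
```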
Graceful degradation
The backend's RestClientNlpClient wraps all HTTP errors and timeouts in
DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE), returning HTTP 503 to
the client — identical behaviour to the Ollama path. The rate limiter is relaxed from
5 to 20 requests/min (rule-based extraction completes in < 50 ms vs. ~18 s for LLM).
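The wrapping pattern itself is language-independent. The Python analogue below shows the idea only; the backend is Java, and the exception class, the function name, and the injected `call_nlp` transport are hypothetical stand-ins for `DomainException` and `RestClientNlpClient`.

```python
from typing import Callable, Dict


class SmartSearchUnavailable(Exception):
    """Illustrative stand-in for DomainException.serviceUnavailable(...)."""
    status = 503
    code = "SMART_SEARCH_UNAVAILABLE"


def parse_or_unavailable(call_nlp: Callable[[str, str], Dict],
                         query: str, lang: str) -> Dict:
    """Call the NLP service and map any transport failure to one 503 error."""
    try:
        return call_nlp(query, lang)
    except OSError as exc:
        # OSError covers connection failures and timeouts
        # (TimeoutError and ConnectionError are subclasses).
        raise SmartSearchUnavailable(str(exc)) from exc
```

Collapsing every transport failure into one error code keeps the HTTP-facing behaviour independent of which extraction backend sits behind the client.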
Consequences
Positive
- Latency: < 50 ms per extraction vs. ~18 s — smart search is now interactive.
- Memory: ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
- No model downloads: the image ships no weights; startup is a single DB query.
- Deterministic: same query always produces the same result; no temperature/sampling.
- Testable without infrastructure: pytest with a seeded `PersonMatcher` fixture; no WireMock stubs needed for most unit tests.
Trade-offs
- No semantic generalisation. The LLM could handle novel phrasing; the rule-based parser only handles the preposition patterns it was written for. Edge cases that fall outside the pattern produce an empty extraction rather than a best-effort result.
- Person matching depends on DB content. A person not yet in the archive will never match, even if the user types their exact name. The LLM could surface the name as a raw string; this service surfaces nothing. This is acceptable for the current archive size and query patterns.
- Language support is fixed at de/en/es (Paraglide locales). Adding a fourth locale requires adding its stopword list and preposition table to `extractor.py`.
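To make the cost of a new locale concrete, per-locale tables might be organised as below. Everything here is hypothetical: the table names, the abbreviated word lists, the English and Spanish prepositions, and the `extract_keywords` helper are assumptions for illustration, and the filter ignores the noun/proper-noun restriction described in the Context section.

```python
import re
from typing import Dict, FrozenSet, List

# Hypothetical, heavily abbreviated tables keyed by locale.
STOPWORDS: Dict[str, FrozenSet[str]] = {
    "de": frozenset({"von", "an", "der", "die", "das", "und", "über"}),
    "en": frozenset({"from", "to", "the", "and", "about"}),
    "es": frozenset({"de", "a", "el", "la", "y", "sobre"}),
}
# "von"/"an" come from the ADR; the en/es entries are assumptions.
PREPOSITIONS: Dict[str, Dict[str, str]] = {
    "de": {"von": "sender", "an": "receiver"},
    "en": {"from": "sender", "to": "receiver"},
    "es": {"de": "sender", "a": "receiver"},
}


def extract_keywords(query: str, lang: str,
                     consumed: FrozenSet[str] = frozenset()) -> List[str]:
    """Return tokens that are neither stopwords nor used by another extractor."""
    if lang not in STOPWORDS or lang not in PREPOSITIONS:
        raise ValueError(f"unsupported locale: {lang}")
    tokens = re.findall(r"\w+", query.lower())
    stop = STOPWORDS[lang]
    return [t for t in tokens if t not in stop and t not in consumed]
```

Under this layout, supporting a fourth locale would mean adding one entry to each table; a locale missing from either table is rejected instead of being parsed with another language's rules.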
Superseded ADRs
ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
memory budget. All of that is now moot. The `ollama` and `ollama-model-init` services and the
`ollama_models` volume are removed from `docker-compose.yml`.