# ADR-035: Replace Ollama with a rule-based NLP service for smart search

**Date:** 2026-06-07
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Supersedes:** ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
**Relates to:** #771 (implementation)

---

## Context

ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured
extractions (person names, date ranges, person role, keywords). After deploying to
staging (ADR-034) the approach showed three problems:

1. **Cold-start latency:** even with `OLLAMA_KEEP_ALIVE=-1` a Qwen inference on CPU takes
   ~18 s. This blows the UX budget for a search feature and requires a 60 s timeout.
2. **Resource cost:** 8 GB resident RAM + 4 vCPU cap for an LLM whose only job is
   regex-level entity extraction from short (< 500 char) German family-history queries.
3. **Fragility:** model-weight downloads, version pinning, and init-container orchestration
   add operational surface area with no quality benefit over a deterministic parser.

The query set is narrow and well-understood: person names are all in the PostgreSQL
`persons` table; date patterns are a fixed repertoire of German/English/Spanish formats;
person role (sender vs. receiver) is reliably signalled by a handful of prepositions
("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
extractors.

---

## Decision

Replace Ollama with a lightweight, rule-based Python FastAPI service (`nlp-service`).

### Architecture

```
POST /api/search/nl (NlSearchController)
  → NlQueryParserService
    → RestClientNlpClient.parse(query, lang)
      → POST http://nlp-service:8001/parse
      ← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
```

The response contract is identical to the old `OllamaExtraction`; only the transport
and implementation change. Java callers see `NlpExtraction` (renamed, same shape).

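
The shape of the `/parse` contract can be sketched with plain dataclasses. This is a minimal illustration, not the service's `models.py`: it uses the standard library in place of Pydantic, the field names and the 500-character limit come from this ADR, and the field types, the `lang` default, and the helper `empty_response` are assumptions.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

MAX_QUERY_CHARS = 500  # limit stated for ParseRequest in this ADR


@dataclass
class ParseRequest:
    query: str
    lang: str = "de"  # assumed default; the ADR does not state one

    def __post_init__(self) -> None:
        # Reject over-long queries up front, as the Pydantic model is said to do.
        if len(self.query) > MAX_QUERY_CHARS:
            raise ValueError(f"query exceeds {MAX_QUERY_CHARS} characters")


@dataclass
class ParseResponse:
    rawQuery: str
    personNames: List[str] = field(default_factory=list)
    personRole: Optional[str] = None  # assumed values: "sender" / "receiver"
    dateFrom: Optional[str] = None    # assumed to be an ISO-8601 date string
    dateTo: Optional[str] = None      # assumed to be an ISO-8601 date string
    keywords: List[str] = field(default_factory=list)


def empty_response(request: ParseRequest) -> ParseResponse:
    """Hypothetical helper: the response when no extractor matched anything."""
    return ParseResponse(rawQuery=request.query)
```

Calling `asdict(empty_response(ParseRequest("Briefe von Anna")))` yields a dictionary with exactly the six keys shown in the diagram above, which is the property the Java-side `NlpExtraction` relies on.
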
### Implementation

- **`nlp-service/`** — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
  - `extractor.py` — pipeline: person extraction → role detection → date parsing → keywords
  - `person_matcher.py` — two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table;
    loaded at startup, no live DB queries during extraction
  - `models.py` — Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
  - `main.py` — lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`

- **`backend/search/`** — `OllamaClient` / `OllamaExtraction` renamed to `NlpClient` /
  `NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces
  `OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.

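
To make the "loaded at startup, two-pass lookup" idea concrete, here is a minimal sketch of an in-memory matcher. It is not the contents of `person_matcher.py`: the ADR does not say what the two passes are, so the split into an exact pass followed by a fuzzy pass is an assumption, as are the method names and the three-word window. It also swaps rapidfuzz for the standard library's `difflib.SequenceMatcher`, so the scores will not be identical to rapidfuzz scores.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


class PersonMatcher:
    """In-memory name index: built once, then queried without touching the DB."""

    def __init__(self, names: Iterable[str], threshold: float = 80.0) -> None:
        # Map lower-cased name -> canonical spelling from the persons table.
        self._names = {name.lower(): name for name in names}
        self._threshold = threshold  # similarity floor on a 0-100 scale

    @staticmethod
    def _candidates(query: str, max_words: int = 3) -> List[str]:
        """All runs of 1..max_words consecutive words, longest runs first."""
        words = query.lower().split()
        spans = []
        for size in range(min(max_words, len(words)), 0, -1):
            for start in range(len(words) - size + 1):
                spans.append(" ".join(words[start:start + size]))
        return spans

    def match(self, query: str) -> List[str]:
        candidates = self._candidates(query)
        # Pass 1 (assumed): exact, case-insensitive hits.
        found = [self._names[c] for c in candidates if c in self._names]
        if found:
            return list(dict.fromkeys(found))  # de-duplicate, keep order
        # Pass 2 (assumed): fuzzy hits at or above the similarity floor.
        for cand in candidates:
            for key, canonical in self._names.items():
                score = SequenceMatcher(None, cand, key).ratio() * 100
                if score >= self._threshold and canonical not in found:
                    found.append(canonical)
        return found
```

With `PersonMatcher(["Anna Schmidt", "Karl Weber"])`, a query containing the misspelling "Anna Schmitt" falls through the exact pass and is recovered by the fuzzy pass, while a query with no name-like span returns an empty list.
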
### Tunable parameters

| Env var | Default | Effect |
|---|---|---|
| `DATABASE_URL` | — | PostgreSQL DSN; unset → person matching disabled |
| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0–100) |

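
The table above could be read at startup roughly as follows. This is a sketch only: the two variable names and the default of `80` come from the table, while `NlpSettings`, `load_settings`, the range check, and the choice to treat an empty `DATABASE_URL` like an unset one are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NlpSettings:
    database_url: Optional[str]  # None means person matching is disabled
    fuzzy_threshold: float       # similarity floor on a 0-100 scale

    @property
    def person_matching_enabled(self) -> bool:
        return self.database_url is not None


def load_settings(env: Mapping[str, str] = os.environ) -> NlpSettings:
    """Build settings from an environment mapping (os.environ by default)."""
    raw = env.get("NLP_FUZZY_THRESHOLD", "80")
    threshold = float(raw)
    if not 0 <= threshold <= 100:
        raise ValueError(f"NLP_FUZZY_THRESHOLD must be within 0-100, got {raw!r}")
    # Assumption of this sketch: an empty string counts as unset.
    database_url = env.get("DATABASE_URL") or None
    return NlpSettings(database_url=database_url, fuzzy_threshold=threshold)
```

Passing the environment as a mapping keeps the function testable without mutating the process environment, for example `load_settings({"NLP_FUZZY_THRESHOLD": "90"})`.
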
### Graceful degradation

The backend's `RestClientNlpClient` wraps all HTTP errors and timeouts in
`DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE)`, returning HTTP 503 to
the client — identical behaviour to the Ollama path. The rate limiter is relaxed from
5 to 20 requests/min (rule-based extraction completes in < 50 ms vs. ~18 s for LLM).

---

## Consequences

### Positive

- **Latency:** < 50 ms per extraction vs. ~18 s — smart search is now interactive.
- **Memory:** ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
- **No model downloads:** the image ships no weights; startup is a single DB query.
- **Deterministic:** same query always produces the same result; no temperature/sampling.
- **Testable without infrastructure:** pytest with a seeded `PersonMatcher` fixture; no
  WireMock stubs needed for most unit tests.

### Trade-offs

- **No semantic generalisation.** The LLM could handle novel phrasing; the rule-based
  parser only handles the preposition patterns it was written for. Edge cases that fall
  outside the pattern produce an empty extraction rather than a best-effort result.
- **Person matching depends on DB content.** A person not yet in the archive will never
  match, even if the user types their exact name. The LLM could surface the name as a
  raw string; this service surfaces nothing. This is acceptable for the current archive
  size and query patterns.
- **Language support is fixed at de/en/es** (Paraglide locales). Adding a fourth locale
  requires adding its stopword list and preposition table to `extractor.py`.

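
The per-locale tables mentioned in the last trade-off might look like the sketch below, which also shows why an unrecognised phrasing yields no role at all. Only the German prepositions "von" and "an" and the sender/receiver roles come from this ADR; the English and Spanish entries, every stopword, and the function names are illustrative assumptions, and the combined "von … an" pattern is not modelled.

```python
from typing import Dict, List, Optional, Set

# Preposition -> role, per locale. Only the "de" entries are taken from the ADR.
PREPOSITIONS: Dict[str, Dict[str, str]] = {
    "de": {"von": "sender", "an": "receiver"},
    "en": {"from": "sender", "to": "receiver"},  # assumed
    "es": {"de": "sender", "a": "receiver"},     # assumed
}

# Function words that never count as keywords (all entries are assumed).
STOPWORDS: Dict[str, Set[str]] = {
    "de": {"der", "die", "das", "dem", "und", "aus"},
    "en": {"the", "and", "of"},
    "es": {"el", "la", "los", "las", "y"},
}


def detect_role(tokens: List[str], name_tokens: List[str], lang: str) -> Optional[str]:
    """Role signalled by the word directly before the first name token, if any."""
    lowered = [t.lower() for t in tokens]
    first = name_tokens[0].lower()
    if first in lowered:
        idx = lowered.index(first)
        if idx > 0:
            # Unknown words before the name give None: no best-effort guess.
            return PREPOSITIONS[lang].get(lowered[idx - 1])
    return None


def extract_keywords(tokens: List[str], consumed: List[str], lang: str) -> List[str]:
    """Tokens left over after names, prepositions, and stopwords are removed."""
    skip = STOPWORDS[lang] | set(PREPOSITIONS[lang]) | {c.lower() for c in consumed}
    return [t for t in tokens if t.lower() not in skip]
```

Under this layout, supporting a fourth locale means adding one entry to each dictionary, for instance `PREPOSITIONS["fr"]` and `STOPWORDS["fr"]`, with no change to the functions themselves.
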
### Superseded ADRs

ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
memory budget. All of that is now moot. The `ollama` and `ollama-model-init` services
and the `ollama_models` volume are removed from `docker-compose.yml`.