2026-06-08 10:56:32 +02:00


ADR-035: Replace Ollama with a rule-based NLP service for smart search

Date: 2026-06-07
Status: Accepted
Deciders: Marcel Raddatz
Supersedes: ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
Relates to: #771 (implementation)


Context

ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured extractions (person names, date ranges, person role, keywords). After deploying to staging (ADR-034), the approach showed three problems:

  1. Latency: even with OLLAMA_KEEP_ALIVE=-1 keeping the model resident, a single Qwen inference on CPU takes ~18 s. That is far outside the latency budget for an interactive search feature and forces a 60 s request timeout.
  2. Resource cost: 8 GB resident RAM and a 4 vCPU cap for an LLM whose only job is regex-level entity extraction from short (< 500 char) German family-history queries.
  3. Fragility: model-weight downloads, version pinning, and init-container orchestration add operational surface area with no quality benefit over a deterministic parser.

The query set is narrow and well-understood: person names are all in the PostgreSQL persons table; date patterns are a fixed repertoire of German/English/Spanish formats; person role (sender vs. receiver) is reliably signalled by a handful of prepositions ("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other extractors.
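The preposition signal can be sketched as follows. This is a minimal, hypothetical illustration. The function name detect_role, the "sender"/"receiver" labels and the rule of looking only at the word directly before a matched name are assumptions, not the documented behaviour of extractor.py. Only the German prepositions named above are filled in.

```python
from typing import Optional

# Hypothetical preposition tables; this ADR only names German "von" and "an".
SENDER_PREPS = {"de": {"von"}}
RECEIVER_PREPS = {"de": {"an"}}


def detect_role(query: str, person: str, lang: str = "de") -> Optional[str]:
    """Return "sender", "receiver", or None for a person already found in the query.

    Tokenises on whitespace only (punctuation is not stripped) and inspects
    the single word immediately before each occurrence of the person's name.
    """
    tokens = query.lower().split()
    name_tokens = person.lower().split()
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens and i > 0:
            prev = tokens[i - 1]
            if prev in SENDER_PREPS.get(lang, set()):
                return "sender"
            if prev in RECEIVER_PREPS.get(lang, set()):
                return "receiver"
    # No recognised preposition in front of the name: no role extracted.
    return None
```

With this sketch, a query such as "Briefe von Anna Müller an Karl" tags Anna Müller as sender and Karl as receiver. This ADR does not specify how per-person results fold into the single personRole field of the response.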


Decision

Replace Ollama with a lightweight, rule-based Python FastAPI service (nlp-service).

Architecture

POST /api/search/nl (NlSearchController)
  → NlQueryParserService
    → RestClientNlpClient.parse(query, lang)
      → POST http://nlp-service:8001/parse
        ← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }

The response contract is identical to the old OllamaExtraction; only the transport and implementation change. Java callers see NlpExtraction (renamed, same shape).
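For reference, the response shape can be mirrored as a plain Python dataclass. This is a sketch only. The real service defines ParseResponse with Pydantic in models.py. The class name ParseResponseSketch and the field types (ISO-8601 date strings, a free-form role string) are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ParseResponseSketch:
    """Hypothetical mirror of the /parse response; keys match the diagram above."""
    rawQuery: str
    personNames: List[str] = field(default_factory=list)
    personRole: Optional[str] = None
    dateFrom: Optional[str] = None  # assumed ISO-8601, e.g. "1942-01-01"
    dateTo: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


def to_json(extraction: ParseResponseSketch) -> str:
    # camelCase keys are kept verbatim so the JSON matches the documented shape.
    return json.dumps(asdict(extraction), ensure_ascii=False, sort_keys=True)


def empty_extraction(query: str) -> ParseResponseSketch:
    # An out-of-pattern query: nothing extracted, raw query echoed back.
    return ParseResponseSketch(rawQuery=query)
```

Because only the transport changed, any object serialising to these six keys satisfies the Java side's NlpExtraction.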

Implementation

  • nlp-service/ — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)

    • extractor.py — pipeline: person extraction → role detection → date parsing → keywords
    • person_matcher.py — two-pass fuzzy lookup (rapidfuzz 3.x) against the persons DB table; loaded at startup, no live DB queries during extraction
    • models.py — Pydantic ParseRequest (max 500 chars), ParseResponse
    • main.py — lifespan loads persons from DATABASE_URL; /health reports persons_loaded
  • backend/search/: OllamaClient and OllamaExtraction are renamed to NlpClient and NlpExtraction; NlpProperties (@ConfigurationProperties("app.nlp")) replaces OllamaProperties; a lang parameter is added to /parse and threaded through the stack.
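The date-parsing stage of the pipeline can be sketched as a small, ordered table of patterns. The ADR names the languages (German/English/Spanish) but not the actual pattern list, so the following are illustrative only: the regexes, the truncated month table, the assumed year range and the name parse_dates. The sketch maps a year range ("1940-1945", "1940 bis 1945") to an inclusive span, "Monat Jahr" to that month, and a bare year to that whole year.

```python
import calendar
import re
from datetime import date
from typing import Optional, Tuple

# Illustrative month table for the three locales (remaining months omitted).
MONTHS = {
    "januar": 1, "january": 1, "enero": 1,
    "februar": 2, "february": 2, "febrero": 2,
    "märz": 3, "march": 3, "marzo": 3,
}

YEAR = r"(1[5-9]\d{2}|20\d{2})"  # assumption: archive years 1500-2099
RANGE_RE = re.compile(r"\b" + YEAR + r"\s*(?:-|bis|to|until|hasta)\s*" + YEAR + r"\b")
MONTH_YEAR_RE = re.compile(r"(\w+)\s+" + YEAR)
YEAR_RE = re.compile(r"\b" + YEAR + r"\b")

DateRange = Tuple[Optional[date], Optional[date]]


def parse_dates(query: str) -> DateRange:
    """Return (dateFrom, dateTo); most specific pattern families are tried first."""
    q = query.lower()
    m = RANGE_RE.search(q)
    if m:
        return date(int(m.group(1)), 1, 1), date(int(m.group(2)), 12, 31)
    for m in MONTH_YEAR_RE.finditer(q):
        month = MONTHS.get(m.group(1))
        if month:
            year = int(m.group(2))
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last_day)
    m = YEAR_RE.search(q)
    if m:
        year = int(m.group(1))
        return date(year, 1, 1), date(year, 12, 31)
    return None, None  # no recognised date pattern
```

Trying ranges before single years matters. Otherwise, "1940-1945" would be read as just 1940.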

Tunable parameters

| Env var | Default | Effect |
|---|---|---|
| DATABASE_URL | (none) | PostgreSQL DSN; unset → person matching disabled |
| NLP_FUZZY_THRESHOLD | 80 | rapidfuzz similarity floor (0–100) |
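The following is a minimal sketch of how these two variables might drive the person matcher, under stated assumptions. The service uses rapidfuzz; this sketch swaps in the standard library's difflib.SequenceMatcher (scaled to 0–100) so it runs without dependencies, and its scores will differ from rapidfuzz's. The "two passes" are assumed to be an exact, case-insensitive lookup followed by a fuzzy match at or above the threshold. PersonMatcher's internals and build_matcher are hypothetical, and a plain list of names stands in for the persons table loaded at startup.

```python
import os
from difflib import SequenceMatcher
from typing import List, Optional


def fuzzy_threshold() -> int:
    # NLP_FUZZY_THRESHOLD defaults to 80; clamp to the 0-100 similarity scale.
    raw = os.environ.get("NLP_FUZZY_THRESHOLD", "80")
    return max(0, min(100, int(raw)))


def similarity(a: str, b: str) -> float:
    # Stand-in for a rapidfuzz scorer: difflib ratio scaled to 0-100.
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


class PersonMatcher:
    """Hypothetical in-memory matcher over names loaded once at startup."""

    def __init__(self, names: List[str], threshold: int) -> None:
        self.names = names
        self.threshold = threshold
        self._by_lower = {n.lower(): n for n in names}

    def match(self, candidate: str) -> Optional[str]:
        # Pass 1: exact, case-insensitive lookup.
        exact = self._by_lower.get(candidate.lower())
        if exact is not None:
            return exact
        # Pass 2: best fuzzy score, accepted only at or above the threshold.
        best = max(self.names, key=lambda n: similarity(candidate, n), default=None)
        if best is not None and similarity(candidate, best) >= self.threshold:
            return best
        return None


def build_matcher(names: List[str]) -> Optional[PersonMatcher]:
    # DATABASE_URL unset -> person matching disabled (no matcher at all).
    if not os.environ.get("DATABASE_URL"):
        return None
    return PersonMatcher(names, fuzzy_threshold())
```

Raising the threshold trades recall for precision. With difflib, "Anna Muller" against "Anna Müller" scores about 91, so it still matches at the default of 80.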

Graceful degradation

The backend's RestClientNlpClient wraps all HTTP errors and timeouts in DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE), which returns HTTP 503 to the client, the same behaviour as the Ollama path. The rate limiter is relaxed from 5 to 20 requests/min, since rule-based extraction completes in < 50 ms versus ~18 s for the LLM.
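The ADR does not describe how the backend enforces its limit. The following Python sketch of a per-client sliding window only illustrates what a cap of 20 requests per minute means; the real limiter lives in the Java backend. The class name, the per-client keying and the injectable clock are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window` seconds."""

    def __init__(self, limit: int = 20, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = {}

    def allow(self, client: str) -> bool:
        now = self.clock()
        hits = self._hits.setdefault(client, deque())
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject this request
        hits.append(now)
        return True
```

At < 50 ms per extraction, even 20 requests in a burst add up to about one second of service time. Under the old ~18 s inference, a cap of 5 was already generous.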


Consequences

Positive

  • Latency: < 50 ms per extraction vs. ~18 s — smart search is now interactive.
  • Memory: ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
  • No model downloads: the image ships no weights; startup is a single DB query.
  • Deterministic: same query always produces the same result; no temperature/sampling.
  • Testable without infrastructure: pytest with a seeded PersonMatcher fixture; no WireMock stubs needed for most unit tests.

Trade-offs

  • No semantic generalisation. The LLM could handle novel phrasing; the rule-based parser only handles the preposition patterns it was written for. Queries that fall outside these patterns produce an empty extraction rather than a best-effort result.
  • Person matching depends on DB content. A person not yet in the archive will never match, even if the user types their exact name. The LLM could surface the name as a raw string; this service surfaces nothing. This is acceptable for the current archive size and query patterns.
  • Language support is fixed at de/en/es (Paraglide locales). Adding a fourth locale requires adding its stopword list and preposition table to extractor.py.
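The locale coupling in the last trade-off can be illustrated with a keyword stage driven by per-locale tables. This sketch makes several assumptions:
- keywords are the query tokens left over after removing stopwords, prepositions, four-digit year tokens and tokens already consumed by earlier stages;
- the tables are hypothetical and far shorter than real stopword lists, and only the German prepositions come from this ADR;
- unknown locales are rejected, since the ADR does not say how the service treats them.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical, deliberately tiny tables, one entry per supported locale.
STOPWORDS: Dict[str, Set[str]] = {
    "de": {"der", "die", "das", "und", "in", "im", "aus", "mit"},
    "en": {"the", "and", "in", "of", "with"},
    "es": {"el", "la", "y", "en", "con"},
}
PREPOSITIONS: Dict[str, Set[str]] = {
    "de": {"von", "an"},
    "en": {"from", "to"},   # assumption
    "es": {"de", "a"},      # assumption
}

TOKEN_RE = re.compile(r"\w+")
YEAR_TOKEN_RE = re.compile(r"^\d{4}$")


def extract_keywords(query: str, lang: str, consumed: Iterable[str] = ()) -> List[str]:
    """Return lowercased leftover tokens, in order of first appearance."""
    if lang not in STOPWORDS:
        raise ValueError(f"unsupported locale: {lang}")
    skip = STOPWORDS[lang] | PREPOSITIONS[lang] | {t.lower() for t in consumed}
    keywords: List[str] = []
    for token in TOKEN_RE.findall(query):
        low = token.lower()
        if low in skip or YEAR_TOKEN_RE.match(low) or low in keywords:
            continue
        keywords.append(low)
    return keywords
```

Under this structure, supporting a fourth locale means adding one entry to each table. A locale with no entry fails loudly instead of silently producing poor keywords.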

Superseded ADRs

ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and memory budget; all of that is now moot. The ollama and ollama-model-init services and the ollama_models volume are removed from docker-compose.yml.