refactor(search): delete nlp-service microservice and Ollama ADRs

Marcel
2026-06-07 19:04:00 +02:00
parent 2231744e6a
commit f20521b6fb
15 changed files with 0 additions and 1799 deletions


@@ -1,67 +0,0 @@
# ADR-028 — Natural language search is powered by Ollama (Qwen 2.5 7B), not a cloud API
**Date:** 2026-06-06
**Status:** Accepted
**Issue:** #738 (NL search backend); part of epic #735
**Milestone:** Archive Intelligence — NL Search
---
## Context
Family members write their search intent in plain German ("Was hat Walter im Krieg an Emma geschrieben?"), not in structured filter forms. Issue #735 defines NL search as a core product goal. Three delivery options were evaluated:
**Option A — extend the OCR service.** The OCR Python microservice already runs on the same host. Adding LLM inference there avoids a new container. Rejected: the OCR service is a single-purpose, CPU-bound pipeline optimised for Kraken; bundling a 4.5 GB LLM weight into the same image would bloat it, complicate model lifecycle management, and create an unrelated failure domain (OOM on large OCR batches vs. LLM load time). ADR-001 was explicit about keeping OCR single-purpose.
**Option B — call an external API (OpenAI, Anthropic, etc.).** Cloud inference is instant and requires no local hardware. Rejected: the archive contains real person names and private family correspondence from 1899 to 1950; sending query content to a third party violates the project's data-residency principle (family data stays on the family server). Additionally, API cost and availability are outside the operator's control; the system must work air-gapped.
**Option C — local Ollama service (chosen).** Ollama is a purpose-built LLM runtime with a simple REST API, model lifecycle management (`ollama pull`), and support for grammar-constrained JSON output. It runs entirely on the existing server (i7-6700, 64 GB RAM) with no cloud dependency.
**Model selection:** Qwen 2.5 7B Q4_K_M (`qwen2.5:7b-instruct-q4_K_M`) was chosen over larger models because:
- Quantised weight is ~4.5 GB — fits comfortably in 64 GB RAM alongside PostgreSQL and the JVM.
- Instruction-tuned variant follows the structured JSON schema reliably without fine-tuning.
- CPU-only inference at Q4_K_M takes 2 to 15 seconds per query, acceptable for a search that replaces a multi-step filter form.
**Prompt injection mitigation:** The backend sends the raw user query to Ollama. To prevent the model from being prompted to return schema-breaking output, the API call uses Ollama's `format` parameter with a grammar-constrained JSON schema. Output length is further bounded by `maxLength` constraints in the schema (names ≤ 200 chars, keywords ≤ 100 chars). `NlQueryParserService` enforces these limits in code before any LLM-extracted fragment is passed to `PersonRepository.searchByName()` — defence in depth.
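A minimal sketch of the two layers described above, written in Python for brevity (the backend itself is Java). It assumes Ollama's `/api/generate` accepts a JSON schema in its `format` field; the field names follow the extraction contract used elsewhere in these ADRs (`personNames`, `personRole`, `keywords`). The helper names `build_generate_payload` and `enforce_limits`, the exact schema shape, and the choice to drop over-long fragments rather than truncate them are illustrative assumptions, not the project's code.

```python
MAX_NAME_LEN = 200     # names <= 200 chars (limit stated in this ADR)
MAX_KEYWORD_LEN = 100  # keywords <= 100 chars (limit stated in this ADR)

# Schema handed to Ollama's `format` parameter. The exact shape is an assumption.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "personNames": {
            "type": "array",
            "items": {"type": "string", "maxLength": MAX_NAME_LEN},
        },
        "personRole": {"type": "string", "enum": ["sender", "receiver", "any"]},
        "keywords": {
            "type": "array",
            "items": {"type": "string", "maxLength": MAX_KEYWORD_LEN},
        },
    },
    "required": ["personNames", "personRole", "keywords"],
}


def build_generate_payload(query: str) -> dict:
    """Build the request body for POST /api/generate (prompt template omitted)."""
    return {
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "prompt": query,
        "format": EXTRACTION_SCHEMA,
        "stream": False,
    }


def enforce_limits(extraction: dict) -> dict:
    """Second layer: re-check lengths in code, whatever the model returned."""
    names = [
        n for n in extraction.get("personNames", [])
        if isinstance(n, str) and len(n) <= MAX_NAME_LEN
    ]
    keywords = [
        k for k in extraction.get("keywords", [])
        if isinstance(k, str) and len(k) <= MAX_KEYWORD_LEN
    ]
    return {**extraction, "personNames": names, "keywords": keywords}
```

The point of the second function is that the schema constraint is treated as untrusted: even if a future model or Ollama version ignored `maxLength`, nothing over the limit would reach the name lookup.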
**DB-blind name resolution:** The Ollama prompt stays small (the raw query only); person database records are never sent to the model. Name resolution happens as a cheap SQL query after the model returns. This keeps the prompt short, avoids data leakage, and means adding 1,000 new persons requires no prompt change.
**Graceful degradation:** In-path Ollama failures surface via `OllamaClient.parse()` — any `IOException`, read timeout, or non-2xx response is caught by `RestClientOllamaClient` and re-thrown as `DomainException(SMART_SEARCH_UNAVAILABLE, HTTP 503)`. `isHealthy()` has no callers inside `search/`; it is reserved for the ops/health-endpoint polling path only (e.g. a future `/api/health/ollama` endpoint). The regular structured search (`GET /api/documents/search`) is unaffected — it never calls Ollama.
**Expected inference latency:** 2 to 15 seconds on the current CPU-only hardware. The frontend must show a persistent "Suche läuft…" indicator for the full duration (see `aria-live="polite"` requirement in issue #738 frontend notes). The backend timeout is 30 seconds (`app.ollama.timeout-seconds=30`), chosen as a safe upper bound for Q4_K_M on the i7-6700 with a realistic 500-character query under modest concurrent load.
**NL query logging policy:** Only metadata is logged — query length, resolved person count, latency in milliseconds. The raw query is never written to the log file. Rationale: queries contain real family names (PII); log files persist to disk and may be shipped to Loki. Structured metadata is sufficient for debugging latency regressions.
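As a rough illustration of the policy (in Python, not the backend's Java), the log record can be built from derived values only, so the raw query never reaches the logger. The function name `search_log_fields` and the key names are hypothetical.

```python
import logging

log = logging.getLogger("nl_search")


def search_log_fields(query: str, resolved_person_count: int,
                      started_at: float, finished_at: float) -> dict:
    """Return loggable metadata only; the query text itself is not included."""
    return {
        "query_length": len(query),
        "resolved_person_count": resolved_person_count,
        "latency_ms": round((finished_at - started_at) * 1000),
    }


def log_search(query: str, resolved_person_count: int,
               started_at: float, finished_at: float) -> None:
    # Only the metadata dict is formatted into the log line.
    log.info("nl_search %s", search_log_fields(
        query, resolved_person_count, started_at, finished_at))
```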
**Prompt-amplification abuse:** A malicious user could submit a long or crafted query to cause slow Ollama inference, consuming CPU. Mitigated by `NlSearchRateLimiter` (5 requests per user per minute, Bucket4j + Caffeine) and by `@Size(max=500)` on the request body. The rate limiter is node-local; in multi-replica deployments the effective limit multiplies by replica count — acceptable at the current single-node deployment scale.
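The node-local caveat is easiest to see in a small model. The sketch below is a plain sliding-window limiter in Python, used here as a stand-in; it is not Bucket4j's token-bucket algorithm and not the project's `NlSearchRateLimiter`. The class name and the injectable `now` argument are assumptions made for illustration.

```python
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(self, limit: int = 5, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        # State lives in this process only, which is what "node-local" means.
        self._hits = defaultdict(deque)

    def try_acquire(self, user_id: str, now: float) -> bool:
        hits = self._hits[user_id]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Because each instance keeps its own counters, two replicas behind a load balancer would each admit 5 requests per minute for the same user, which is the multiplication effect noted above.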
**Ollama model pre-pull requirement:** The Docker image contains only the Ollama binary, not the model weights. The operator must run `ollama pull qwen2.5:7b-instruct-q4_K_M` (≈4.5 GB download, 10 to 30 minutes) before the backend starts inference. If skipped, every NL search request returns 503 until the pull completes. The deployment runbook in `docs/DEPLOYMENT.md` covers this explicitly.
**Startup dependency:** The `backend` Compose service declares `depends_on: ollama: condition: service_healthy`. The Ollama healthcheck polls `GET http://localhost:11434/api/tags`; `start_period: 120s` provides margin for weight loading (20 to 60 s on SSD). Note: `service_healthy` confirms the API is responding, not that the model is downloaded; if the pull was skipped, inference still returns 404.
**Multi-name resolution heuristic:** For 2-name queries (e.g. "Was hat Walter an Emma geschrieben?"), the first extracted name is treated as sender and the second as receiver. Per-name role annotation (e.g. `{name: "Walter", role: "sender"}`) was rejected because it would require a combinatorially complex Ollama schema and the most natural German phrasing strongly implies sender→receiver order. For single-name queries, a `personRole` field (`sender`/`receiver`/`any`) is returned.
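The heuristic is small enough to state as code. This Python sketch is illustrative only: `assign_roles` is a hypothetical name, and leaving all roles empty for zero or more than two names is an assumption, since the ADR only defines the one-name and two-name cases.

```python
def assign_roles(names: list, person_role: str = "any") -> dict:
    """Map extracted names to roles using positional order for two names."""
    roles = {"sender": None, "receiver": None, "any": None}
    if len(names) == 2:
        # First extracted name is the sender, second the receiver.
        roles["sender"], roles["receiver"] = names
    elif len(names) == 1:
        if person_role not in roles:
            raise ValueError(f"unknown personRole: {person_role!r}")
        roles[person_role] = names[0]
    # Other counts are not covered by the ADR; leave every role unset.
    return roles
```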
**`personRole: "any"` keyword limitation:** When `personRole` is `"any"` and the name resolves to exactly one person, `DocumentService.searchDocumentsByPersonId()` is called (OR semantics: person as sender or receiver). Keyword filtering is not applied on this path — only person identity and date range. `keywordsApplied = false` is returned in the response. Rationale: the JPQL for OR-semantics person queries has no text predicate; adding FTS would require a native query or a separate pass, adding complexity for a case that is already well-narrowed by person identity.
**`search/` → `person/` + `document/` dependency direction:** `NlQueryParserService` calls `PersonService.findByDisplayNameContaining()` and `DocumentService.searchDocuments()`; both are legitimate cross-domain service calls, not repository leaks. The `search/` package has no JPA entities of its own and never accesses `PersonRepository` or `DocumentRepository` directly.
**Keyword→tag resolution** (issue #743): After Ollama extracts the `keywords` list, `NlQueryParserService` calls `TagService.findByNameContaining()` for each keyword. Keywords that match one or more tags are removed from the FTS text list and added as OR-union tag filters; keywords with no tag match remain as FTS text. Resolved tags are returned to the frontend as `TagHint` objects in `NlQueryInterpretation.resolvedTags` and rendered as removable "Thema: X" chips. The `tagsApplied` flag signals whether the OR-union filter was actually passed to `DocumentService.searchDocuments()` — it is `false` when the `personRole:any` single-person path is taken, because that path has no tag filter slot. See ADR-033 for the tag name resolution and case-collision rules that `TagService.findByNameContaining()` relies on.
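The split between tag filters and FTS text can be sketched as follows, in Python for illustration. `split_keywords`, `tags_applied`, and the `find_tags` callback (standing in for `TagService.findByNameContaining()`) are hypothetical names; the real matching rules are those of ADR-033, not the naive lookup a caller might pass in here.

```python
from typing import Callable, List, Tuple


def split_keywords(keywords: List[str],
                   find_tags: Callable[[str], List[str]]) -> Tuple[List[str], List[str]]:
    """Return (tag_filters, fts_terms) for the extracted keywords."""
    tag_filters: List[str] = []
    fts_terms: List[str] = []
    for keyword in keywords:
        matches = find_tags(keyword)
        if matches:
            # Matched keywords leave the FTS list and become OR-union tag filters.
            for tag in matches:
                if tag not in tag_filters:
                    tag_filters.append(tag)
        else:
            fts_terms.append(keyword)
    return tag_filters, fts_terms


def tags_applied(tag_filters: List[str], single_person_any_path: bool) -> bool:
    """False when no tags resolved or when the personRole:any path is taken."""
    return bool(tag_filters) and not single_person_any_path
```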
## Decision
**Introduce a new `search/` domain package** with a local Ollama integration via `RestClientOllamaClient`. The Ollama service runs as a separate Docker container, reachable only on the internal Docker network (`expose: ["11434"]`, not `ports:`). The backend calls Ollama's `/api/generate` endpoint with grammar-constrained JSON output. Name resolution and document search are performed by existing services after the model returns.
Key component structure:
- `OllamaClient` / `OllamaHealthClient` interfaces — mockable for tests, modelled on `OcrClient`/`OcrHealthClient`
- `RestClientOllamaClient` — two `RestClient` instances (30 s inference, 2 s health-check)
- `NlQueryParserService` — orchestrates Ollama → name resolution → document search
- `NlSearchRateLimiter` — Bucket4j + Caffeine, 5 req/min per user
- `NlSearchController`: `POST /api/search/nl`, `@RequirePermission(READ_ALL)`
## Consequences
- Family members can query in natural German without learning filter UI. Expected search satisfaction improvement for the 60+ age cohort (primary transcription audience) is significant.
- NL search is unavailable when Ollama is down or the model pull is not complete. The regular search is unaffected. The 503 response includes a CTA directing users to the regular search.
- Operator responsibility: run `ollama pull` on first deploy and after model updates. The backup runbook must exclude `ollama_models` volume (model weights are re-downloadable, not user data).
- Inference takes 2 to 15 seconds. The frontend loading indicator is a hard requirement (see issue #738 frontend notes).
- The rate limiter is node-local. At the current single-node deployment scale this is correct. If the service is ever scaled horizontally, the rate limiter must be moved to Redis (same caveat as `LoginRateLimiter`).
- The `search/` package introduces a new cross-domain dependency direction (`search` → `person`, `search` → `document`). This is intentional and documented in `docs/architecture/c4/l3-backend-search.puml`.


@@ -1,239 +0,0 @@
# ADR-028: Ollama Docker Compose service for NL search
**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #737 (infrastructure), #735 (NL search epic)
---
## Context
Issue #735 introduces natural-language document search, requiring a local LLM to generate embeddings and/or run inference at query time. The family archive stores personal family history — data privacy is non-negotiable, so cloud-based inference APIs are excluded. The production target is a Hetzner CX42 (16 GB RAM, 8 vCPUs, CPU-only, ~32 EUR/month).
Alternatives considered:
| Option | Reason rejected |
|---|---|
| **llama.cpp** | No HTTP API out of the box; requires custom wrapper; higher ops burden |
| **vLLM** | GPU-first; significant overhead on CPU-only hardware; overkill for this scale |
| **Cloud APIs** (OpenAI, Gemini, etc.) | Vendor lock-in; per-token cost at scale; data leaves the server — unacceptable for a private family archive |
| **Ollama** | Self-contained Docker image; built-in HTTP REST API; actively maintained; CPU-compatible; zero egress |
**Decision:** run Ollama as a Docker Compose service alongside the existing stack.
---
## Decisions
### 1. Hardware minimums and CPU-only constraint
All inference runs on CPU. The target is the Hetzner CX42 (16 GB RAM, 8 vCPUs).
| Tier | RAM | NL search |
|---|---|---|
| CX42 | 16 GB | Supported — full stack including Ollama |
| CX32 | 8 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) to skip Ollama entirely |
| CX22 | 4 GB | Unsupported for NL search |
### 2. Memory budget on CX42
| Component | `mem_limit` | Typical active RSS |
|---|---|---|
| OCR service | 12g (hard ceiling) | ~6 GB |
| Ollama | 8g | ~8 GB |
| **Total** | | **~14 GB active** |
`memswap_limit` on the Ollama service is set to `8g` (matching `mem_limit`) to prevent Linux from paging model weights out to swap under OCR memory pressure. Swapping model weights does not crash the container but silently degrades inference latency. This mirrors the pattern already applied to the OCR service.
**Operational constraint:** do NOT run `docker-compose.observability.yml` continuously alongside both OCR and Ollama on a CX42. The observability stack adds ~2 GB, which leaves no headroom.
### 3. Graceful-degradation contract
`app.ollama.base-url` absent OR blank → Ollama bean NOT registered → NL search returns HTTP 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE`.
This single code path covers all unavailability scenarios: base-url unset, service unreachable, health check failed, and request timeout.
#### Why not `@ConditionalOnProperty`
`@ConditionalOnProperty` registers the bean when the property is present but blank (`APP_OLLAMA_BASE_URL=`). This produces a `RestClient` with an empty base URL that fails at runtime with an opaque error rather than a clean 503.
#### Correct condition expression
```java
@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")
```
When the property is absent, the placeholder resolves to `''`; `.isBlank()` returns `true`; negation makes the condition `false`; the bean is not registered. Same result for an explicit empty string (`APP_OLLAMA_BASE_URL=`).
### 4. Backend configuration pattern
Use a `@ConfigurationProperties` record, not separate `@Value` injections:
```java
@ConfigurationProperties("app.ollama")
record OllamaProperties(String baseUrl, String apiKey) {}
```
`OllamaProperties` is registered unconditionally — it is a plain value holder with no side effects.
`@ConditionalOnExpression` belongs **only** on `RestClientOllamaClient` (the bean that creates a live network client).
**Deliberate divergence from the OCR pattern:** the OCR service uses `@Value`-with-default because OCR is always-on and `http://ocr-service:8000` is a safe default. Ollama is truly optional — a missing URL means "feature disabled", not "use this default server". There is no safe default Ollama URL.
### 5. `Optional<OllamaClient>` injection
The NL search service uses constructor injection with `Optional<OllamaClient>`:
```java
private final Optional<OllamaClient> ollamaClient;
```
When empty (bean not registered), the service method returns 503 immediately:
```java
var client = ollamaClient.orElseThrow(
        () -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "Ollama not configured"));
```
Prefer this over `@Autowired(required = false)` with a null check — the null-check pattern is noisy when the service already uses `@RequiredArgsConstructor`.
### 6. Empty API key guard
`RestClientOllamaClient` omits the `Authorization` header entirely when `apiKey` is blank:
```java
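// apiKey is null when the property is unset, so guard against that as well as blank.
if (apiKey != null && !apiKey.isBlank()) {
    request.header("Authorization", "Bearer " + apiKey);
}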
```
Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5 and 0.30.6
**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `OLLAMA_API_KEY` does **not** enforce request authentication in either version.
Test matrix run against `/api/tags`:
| Configuration | No auth header | `Authorization: Bearer ` (empty) | `Authorization: Bearer wrongkey` | `Authorization: Bearer correctkey` |
|---|---|---|---|---|
| `OLLAMA_API_KEY=` (empty) | 200 | 200 | — | — |
| `OLLAMA_API_KEY` unset | 200 | — | — | — |
| `OLLAMA_API_KEY=testkey99` | 200 | 200 | 200 | 200 |
**Finding:** The `OLLAMA_API_KEY` environment variable is not listed in Ollama's startup config dump and does not gate any HTTP request in either tested version. All configurations — empty string, fully unset, and a real key — accept all requests without authentication.
**Practical implication:** `OLLAMA_API_KEY` provides no defense-in-depth in the tested versions. `archiv-net` network isolation is the only effective security control. The env var is retained in the Compose definition and `.env.example` for forward compatibility if Ollama enables enforcement in a future version, but operators must not rely on it for access control.
**Backend guard still valid:** the `RestClientOllamaClient` code-level guard (omit `Authorization` header when `apiKey.isBlank()`) remains correct behavior regardless — it prevents a malformed `Authorization: Bearer ` header from being sent.
### 8. read_only: true feasibility
**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `read_only: true` works with Ollama. All three operations — `ollama serve`, `ollama pull qwen2.5:7b-instruct-q4_K_M`, and `ollama list` — succeeded with exit code 0 in both versions.
Test run:
```bash
docker run --rm --read-only \
  -v ollama_models:/root/.ollama \
  --tmpfs /tmp \
  --entrypoint sh ollama/ollama:0.30.6 \
  -c "ollama serve & sleep 5 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
```
**Note:** the entrypoint must be overridden to `sh` for the test command — the container's default entrypoint is `/bin/ollama` and does not accept `sh` as a subcommand. This is a Docker invocation detail; the Compose service definition uses the image's default entrypoint and `command:` override for the init container, which works correctly.
**Result:** `read_only: true` and `tmpfs: - /tmp:size=512m` are applied to both `ollama` and `ollama-model-init`. The `ollama_models` volume handles all persistent writes; no other paths require write access during normal operation.
### 9. Peak RSS of init container during pull
**Empirically verified (2026-06-06):** Peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull was **~108 MiB**.
`docker stats` samples during the pull (15-second intervals):
| Sample | MEM |
|---|---|
| 1 | 54.89 MiB |
| 2 | 66.3 MiB |
| 5 | 97.25 MiB |
| 9 | **107.8 MiB** (peak) |
`mem_limit: 2g` is adequate — the model weights stream directly to the named volume; RSS is dominated by the Ollama server process alone (~100 MB), not the model data. No bump to 4 GB needed.
### 10. Init container pull mechanism
The `ollama-model-init` container uses a curl-based readiness loop with captured PID:
```sh
ollama serve & SERVE_PID=$!
until curl -sf http://localhost:11434/api/tags; do sleep 1; done
ollama pull qwen2.5:7b-instruct-q4_K_M
kill $SERVE_PID
```
`kill %1` (job-control syntax) is unreliable in non-interactive `sh -c` contexts. Capturing the PID via `SERVE_PID=$!` is reliable.
The same endpoint (`/api/tags`) is used for both the init container readiness loop and the main service `healthcheck`.
### 11. start_period: 60s rationale
The model is pre-pulled by `ollama-model-init` before the main service starts (via `condition: service_completed_successfully`). At main service startup, Ollama only loads model weights from the named volume and binds port 11434.
60 seconds is appropriate for this cold-start profile. 300 seconds was considered — that would be appropriate if the service pulled the model itself — but overstates actual startup time when the model is already present on the volume.
### 12. Security threat model
**Primary control:** `archiv-net` network isolation. Ollama has no externally exposed port (`expose:` only, not `ports:`). The Caddyfile must not route any path to the Ollama service.
**Note on `OLLAMA_API_KEY`:** Per §7, `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 and provides no authentication barrier against a compromised backend container. `archiv-net` network isolation is the sole effective security control. The env var is retained for forward compatibility only — do not rely on it for access control.
Both `ollama` and `ollama-model-init` receive the ADR-019 hardening baseline:
```yaml
cap_drop: [ALL]
security_opt: [no-new-privileges:true]
```
### 13. CI exclusion strategy
Docker Compose profiles are not used — they would add developer friction (requiring `--profile ...` for all local dev commands).
CI uses explicit service selection in `docker-compose.ci.yml`:
```bash
docker compose -f docker-compose.ci.yml up -d db minio create-buckets
```
Ollama is simply not listed and is never started in CI. A YAML comment on the `ollama` service block documents this:
```yaml
# Not started in CI — CI uses explicit service selection
# (docker-compose.ci.yml: db minio create-buckets)
```
### 14. ollama_models volume operational note
The `ollama_models` named volume holds model weights only — fully reproducible by re-pull. No backup is needed.
If the volume fills after a model upgrade:
```bash
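# Remove the containers first: `docker volume rm` refuses a volume that is still referenced.
docker compose down && docker volume rm ollama_models && docker compose up -d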
```
The init container re-pulls the model on next startup.
---
## Consequences
### Positive
- NL search runs entirely on-premises; no data leaves the server and no per-token cloud cost.
- Graceful degradation is a first-class concern: smaller or budget-constrained instances can run the app without Ollama with a single env var change.
- The init container pattern keeps model pull out of the critical startup path for the main service, giving accurate healthcheck timings.
- `@ConditionalOnExpression` with a blank-check is more correct than `@ConditionalOnProperty` for optional features with no safe default URL.
### Risks and operational implications
- **Memory pressure:** OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with `docker stats`.
- **CPU inference latency:** `qwen2.5:7b-instruct-q4_K_M` is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
- All three empirical TBD items from the original issue spec were resolved — see §7 (OLLAMA_API_KEY not enforced), §8 (`read_only: true` works), §9 (peak RSS ~108 MiB).
- Model upgrades require a `docker volume rm` to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.


@@ -1,125 +0,0 @@
# ADR-034: Ollama in production — deployment, keep-alive pinning, and corrected init recipe
**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #758 (bug), #759 (fix), #737 (NL search infrastructure)
**Corrects:** ADR-028 §10 and §11 (init recipe and readiness probe)
---
## Context
ADR-028 introduced Ollama as a Docker Compose service for NL search and documented
its topology, graceful-degradation contract, and memory budget. Two defects survived
that work and only surfaced once NL search reached staging (#758):
1. **Ollama was added only to the dev `docker-compose.yml`.** Staging and production
deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama`
service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its
client bean was active and resolved to a non-existent host → `ResourceAccessException`
→ HTTP 503 on every NL search.
2. **The init recipe documented in ADR-028 §10 never worked.** The `ollama/ollama` image
`ENTRYPOINT` is `ollama`, so a bare `command: sh -c "…"` ran as `ollama sh -c "…"`
(`unknown command "sh"`), and the image ships **no curl**, so the curl-based readiness
loop and the curl healthcheck could never pass.
This ADR records the production deployment decision and the corrected operational
contract. It is also the durable record of *why* `OLLAMA_KEEP_ALIVE=-1` is set, so a
future maintainer does not "optimize" it away and reintroduce the cold-load 503.
---
## Decisions
### 1. Ollama is a first-class production service
`docker-compose.prod.yml` now defines `ollama` + `ollama-model-init` + the
`ollama-models` volume, mirroring the dev stack. The graceful-degradation contract from
ADR-028 §3 is preserved: `backend` has **no** hard `depends_on` on `ollama`, so an absent
or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.
### 2. Corrected init recipe (supersedes ADR-028 §10)
The init container overrides the image entrypoint to a shell and probes readiness with
`ollama list` (not curl, which the image lacks):
```sh
ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
(ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
```
```yaml
entrypoint: ["/bin/sh", "-c"]
```
The pull is **guarded by a grep on the cached model list**. A model already on the volume
exits clean without any registry round-trip. This makes re-up offline-safe: a host reboot
during a registry/network blip can no longer fail init (which, via
`condition: service_completed_successfully`, would otherwise block the `ollama` service
and take NL search down until the registry was reachable again). The same recipe is used
in dev and prod — one mental model.
### 3. Healthcheck uses `ollama list` (supersedes ADR-028 §11 probe)
```yaml
healthcheck:
  test: ["CMD", "ollama", "list"]
```
`ollama list` hits the local API and exits non-zero when the server is down — the correct
probe for a curl-less image. The `start_period: 60s` rationale from ADR-028 §11 still holds.
### 4. `OLLAMA_KEEP_ALIVE=-1` — pin the model in memory
```yaml
environment:
  OLLAMA_KEEP_ALIVE: "-1"
```
By default Ollama evicts an idle model after ~5 minutes. The next query then pays a
cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after
any idle period. Pinning the model (`-1` = never unload) keeps warm-path latency
predictable (~18 s on CPU). **Do not remove this** without re-introducing the post-idle
cold-load 503.
### 5. Read timeout raised 30 → 60 s
`app.ollama.timeout-seconds` is raised from 30 to 60 (`application.yaml`, mirrored in
`DEPLOYMENT.md`). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold
model load on the first query after an Ollama (re)start, before §4's pin takes hold.
**Implicit NFR made explicit:** NL search shall return a result or a 503 within 60 s; the
cold-start path immediately after an Ollama restart is the only path that approaches this
ceiling.
### 6. Hard-OOM trade-off (refines ADR-028 §2)
`memswap_limit == mem_limit` (both `${OLLAMA_MEM_LIMIT:-8g}`) disables swap for the
container. Combined with §4's pinned model, a memory-pressure event is a **hard OOM-kill,
not graceful latency degradation**. This is deliberate — swap-thrashing an LLM is worse
than a clean restart — but it means the 8 GB envelope is a real ceiling. `qwen2.5-7B-q4`
plus its KV cache under load sits close enough to 8 GB that this needs a Prometheus
memory alert on the `ollama` container before it bites in production (tracked as
observability follow-up, not in this PR).
---
## Consequences
### Positive
- NL search works on staging/production, not just dev — the actual deploy artifact now
matches the documented architecture.
- Re-up is offline-safe: a cached model never depends on registry reachability.
- The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.
### Risks and operational implications
- **Hard OOM under memory pressure** (§6): a Prometheus alert on `ollama` container memory
is required before this is load-bearing in prod. Tracked as an observability follow-up.
- **Unauthenticated inference** relies entirely on `archiv-net` isolation (ADR-028 §7/§12,
unchanged). Sending an `Authorization` header from `RestClientOllamaClient` is a separate
durable hardening item, tracked outside this PR.
- ADR-028 §10 and §11 describe a recipe that never functioned; this ADR is the authoritative
init/healthcheck contract going forward.


@@ -1,105 +0,0 @@
# ADR-035: Replace Ollama with a rule-based NLP service for smart search
**Date:** 2026-06-07
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Supersedes:** ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
**Relates to:** #771 (implementation)
---
## Context
ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured
extractions (person names, date ranges, person role, keywords). After deploying to
staging (ADR-034) the approach showed three problems:
1. **Cold-start latency:** even with `OLLAMA_KEEP_ALIVE=-1` a Qwen inference on CPU takes
~18 s. This blows the UX budget for a search feature and requires a 60 s timeout.
2. **Resource cost:** 8 GB resident RAM + 4 vCPU cap for an LLM whose only job is
regex-level entity extraction from short (< 500 char) German family-history queries.
3. **Fragility:** model-weight downloads, version pinning, and init-container orchestration
add operational surface area with no quality benefit over a deterministic parser.
The query set is narrow and well-understood: person names are all in the PostgreSQL
`persons` table; date patterns are a fixed repertoire of German/English/Spanish formats;
person role (sender vs. receiver) is reliably signalled by a handful of prepositions
("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
extractors.
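To make the preposition signal concrete, here is a deliberately reduced, German-only sketch in Python. It is not the `extractor.py` implementation: `detect_roles` and its regular expressions are assumptions, it expects the person names to have been extracted already, and the English and Spanish preposition tables are left out.

```python
import re


def detect_roles(query: str, known_names: list) -> dict:
    """Assign sender/receiver from the preposition directly before each name."""
    roles = {"sender": None, "receiver": None}
    for name in known_names:
        escaped = re.escape(name)
        if re.search(rf"\bvon\s+{escaped}\b", query, flags=re.IGNORECASE):
            roles["sender"] = name      # "von <Name>" marks the sender
        elif re.search(rf"\ban\s+{escaped}\b", query, flags=re.IGNORECASE):
            roles["receiver"] = name    # "an <Name>" marks the receiver
    return roles
```

A query that uses neither preposition leaves both roles unset in this sketch, which mirrors the trade-off of a pattern-based parser: phrasing outside the known patterns yields no role.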
---
## Decision
Replace Ollama with a lightweight, rule-based Python FastAPI service (`nlp-service`).
### Architecture
```
POST /api/search/nl  (NlSearchController)
  → NlQueryParserService
    → RestClientNlpClient.parse(query, lang)
      → POST http://nlp-service:8001/parse
      ← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
```
The response contract is identical to the old `OllamaExtraction`; only the transport
and implementation change. Java callers see `NlpExtraction` (renamed, same shape).
### Implementation
- **`nlp-service/`** — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
- `extractor.py` — pipeline: person extraction → role detection → date parsing → keywords
- `person_matcher.py` — two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table;
loaded at startup, no live DB queries during extraction
- `models.py` — Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
- `main.py` — lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`
- **`backend/search/`** — `OllamaClient` / `OllamaExtraction` renamed to `NlpClient` /
`NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces
`OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.
### Tunable parameters
| Env var | Default | Effect |
|---|---|---|
| `DATABASE_URL` | — | PostgreSQL DSN; unset → person matching disabled |
| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0 to 100) |
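The effect of the threshold can be illustrated with a dependency-free sketch. Note the assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for rapidfuzz, so the scores are not the ones the real service computes; `match_person` is a hypothetical name; and treating the two passes as "exact first, then fuzzy" is a guess at what `person_matcher.py` does.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0 to 100 scale (difflib, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_person(token: str, persons: List[str],
                 threshold: float = 80.0) -> Optional[str]:
    """Return the best-matching person name, or None if below the threshold."""
    if not persons:
        # Corresponds to DATABASE_URL being unset: person matching is disabled.
        return None
    # Pass 1 (assumed): exact, case-insensitive match.
    for person in persons:
        if person.lower() == token.lower():
            return person
    # Pass 2 (assumed): best fuzzy candidate, accepted only at or above the floor.
    best = max(persons, key=lambda p: similarity(token, p))
    return best if similarity(token, best) >= threshold else None
```

Raising the threshold makes the matcher stricter about spelling variants such as "Walther" for "Walter"; lowering it admits more variants at the cost of more false matches.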
### Graceful degradation
The backend's `RestClientNlpClient` wraps all HTTP errors and timeouts in
`DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE)`, returning HTTP 503 to
the client — identical behaviour to the Ollama path. The rate limiter is relaxed from
5 to 20 requests/min (rule-based extraction completes in < 50 ms vs. ~18 s for LLM).
---
## Consequences
### Positive
- **Latency:** < 50 ms per extraction vs. ~18 s — smart search is now interactive.
- **Memory:** ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
- **No model downloads:** the image ships no weights; startup is a single DB query.
- **Deterministic:** same query always produces the same result; no temperature/sampling.
- **Testable without infrastructure:** pytest with a seeded `PersonMatcher` fixture; no
WireMock stubs needed for most unit tests.
### Trade-offs
- **No semantic generalisation.** The LLM could handle novel phrasing; the rule-based
parser only handles the preposition patterns it was written for. Edge cases that fall
outside the pattern produce an empty extraction rather than a best-effort result.
- **Person matching depends on DB content.** A person not yet in the archive will never
match, even if the user types their exact name. The LLM could surface the name as a
raw string; this service surfaces nothing. This is acceptable for the current archive
size and query patterns.
- **Language support is fixed at de/en/es** (Paraglide locales). Adding a fourth locale
requires adding its stopword list and preposition table to `extractor.py`.
### Superseded ADRs
ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
memory budget. All of that is now moot. The `ollama` and `ollama-model-init` services and
the `ollama_models` volume are removed from `docker-compose.yml`.