refactor(search): remove NLP/smart-search feature entirely #772

Merged
marcel merged 51 commits from worktree-feat+nlp-service into main 2026-06-08 10:57:01 +02:00
15 changed files with 0 additions and 1799 deletions
Showing only changes of commit 3b54175945

@@ -1,67 +0,0 @@
# ADR-028 — Natural language search is powered by Ollama (Qwen 2.5 7B), not a cloud API
**Date:** 2026-06-06
**Status:** Accepted
**Issue:** #738 (NL search backend); part of epic #735
**Milestone:** Archive Intelligence — NL Search
---
## Context
Family members write their search intent in plain German ("Was hat Walter im Krieg an Emma geschrieben?"), not in structured filter forms. Issue #735 defines NL search as a core product goal. Three delivery options were evaluated:
**Option A — extend the OCR service.** The OCR Python microservice already runs on the same host. Adding LLM inference there avoids a new container. Rejected: the OCR service is a single-purpose, CPU-bound pipeline optimised for Kraken; bundling a 4.5 GB LLM weight into the same image would bloat it, complicate model lifecycle management, and create an unrelated failure domain (OOM on large OCR batches vs. LLM load time). ADR-001 was explicit about keeping OCR single-purpose.
**Option B — call an external API (OpenAI, Anthropic, etc.).** Cloud inference is instant and requires no local hardware. Rejected: the archive contains real person names and private family correspondence from 1899–1950 — sending query content to a third party violates the project's data-residency principle (family data stays on the family server). Additionally, API cost and availability are outside the operator's control; the system must work air-gapped.
**Option C — local Ollama service (chosen).** Ollama is a purpose-built LLM runtime with a simple REST API, model lifecycle management (`ollama pull`), and support for grammar-constrained JSON output. It runs entirely on the existing server (i7-6700, 64 GB RAM) with no cloud dependency.
**Model selection:** Qwen 2.5 7B Q4_K_M (`qwen2.5:7b-instruct-q4_K_M`) was chosen over larger models because:
- Quantised weight is ~4.5 GB — fits comfortably in 64 GB RAM alongside PostgreSQL and the JVM.
- Instruction-tuned variant follows the structured JSON schema reliably without fine-tuning.
- CPU-only inference at Q4_K_M takes 2–15 seconds per query, acceptable for a search that replaces a multi-step filter form.
**Prompt injection mitigation:** The backend sends the raw user query to Ollama. To prevent the model from being prompted to return schema-breaking output, the API call uses Ollama's `format` parameter with a grammar-constrained JSON schema. Output length is further bounded by `maxLength` constraints in the schema (names ≤ 200 chars, keywords ≤ 100 chars). `NlQueryParserService` enforces these limits in code before any LLM-extracted fragment is passed to `PersonRepository.searchByName()` — defence in depth.
**DB-blind name resolution:** The Ollama prompt stays small (the raw query only); person database records are never sent to the model. Name resolution happens as a cheap SQL query after the model returns. This keeps the prompt short, avoids data leakage, and means adding 1,000 new persons requires no prompt change.
**Graceful degradation:** In-path Ollama failures surface via `OllamaClient.parse()` — any `IOException`, read timeout, or non-2xx response is caught by `RestClientOllamaClient` and re-thrown as `DomainException(SMART_SEARCH_UNAVAILABLE, HTTP 503)`. `isHealthy()` has no callers inside `search/`; it is reserved for the ops/health-endpoint polling path only (e.g. a future `/api/health/ollama` endpoint). The regular structured search (`GET /api/documents/search`) is unaffected — it never calls Ollama.
**Expected inference latency:** 2–15 seconds on the current CPU-only hardware. The frontend must show a persistent "Suche läuft…" indicator for the full duration (see `aria-live="polite"` requirement in issue #738 frontend notes). The backend timeout is 30 seconds (`app.ollama.timeout-seconds=30`) — chosen as a safe upper bound for Q4_K_M on the i7-6700 with a realistic 500-character query under modest concurrent load.
**NL query logging policy:** Only metadata is logged — query length, resolved person count, latency in milliseconds. The raw query is never written to the log file. Rationale: queries contain real family names (PII); log files persist to disk and may be shipped to Loki. Structured metadata is sufficient for debugging latency regressions.
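The metadata-only policy above can be illustrated with a minimal Python sketch. The backend itself is Java; the logger name, function name, and field names below are illustrative assumptions, not the project's code. The point is that only derived values (length, count, latency) are formatted into the log record, and the raw query string is never passed to the logger.

```python
import logging
import time

logger = logging.getLogger("nl_search")  # hypothetical logger name


def log_nl_query(query: str, resolved_person_count: int, started_at: float) -> dict:
    """Log metadata about an NL query; the raw query text is never logged."""
    metadata = {
        "query_length": len(query),  # length only, not the content
        "resolved_persons": resolved_person_count,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
    }
    logger.info("nl_search %s", metadata)
    return metadata
```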
**Prompt-amplification abuse:** A malicious user could submit a long or crafted query to cause slow Ollama inference, consuming CPU. Mitigated by `NlSearchRateLimiter` (5 requests per user per minute, Bucket4j + Caffeine) and by `@Size(max=500)` on the request body. The rate limiter is node-local; in multi-replica deployments the effective limit multiplies by replica count — acceptable at the current single-node deployment scale.
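As a rough illustration of the "5 requests per user per minute" limit, here is a small token-bucket sketch in Python. It is not Bucket4j and does not claim to match its refill semantics; the class name, the greedy refill strategy, and the injectable clock are assumptions made for the example. Idle buckets are never evicted here, and the state is node-local, which mirrors the caveat above.

```python
import time


class PerUserRateLimiter:
    """Token bucket per user: `capacity` requests per `period_s` seconds."""

    def __init__(self, capacity: int = 5, period_s: float = 60.0, clock=time.monotonic):
        self.capacity = capacity
        self.period_s = period_s
        self.clock = clock
        self._buckets = {}  # user id -> (tokens left, time of last update)

    def try_consume(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.capacity / self.period_s)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[user_id] = (tokens, now)
        return allowed
```

A caller would reject the request (for example with HTTP 429) whenever `try_consume` returns `False`; the status code is likewise an assumption, as the ADR does not name it.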
**Ollama model pre-pull requirement:** The Docker image contains only the Ollama binary, not the model weights. The operator must run `ollama pull qwen2.5:7b-instruct-q4_K_M` (≈4.5 GB download, 10–30 minutes) before the backend starts inference. If skipped, every NL search request returns 503 until the pull completes. The deployment runbook in `docs/DEPLOYMENT.md` covers this explicitly.
**Startup dependency:** The `backend` Compose service declares `depends_on: ollama: condition: service_healthy`. The Ollama healthcheck polls `GET http://localhost:11434/api/tags`; `start_period: 120s` provides margin for weight loading (20–60 s on SSD). Note: `service_healthy` confirms the API is responding, not that the model is downloaded — if the pull was skipped, Ollama still answers inference requests with 404, which the backend surfaces as 503.
**Multi-name resolution heuristic:** For 2-name queries (e.g. "Was hat Walter an Emma geschrieben?"), the first extracted name is treated as sender and the second as receiver. Per-name role annotation (e.g. `{name: "Walter", role: "sender"}`) was rejected because it would require a combinatorially complex Ollama schema and the most natural German phrasing strongly implies sender→receiver order. For single-name queries, a `personRole` field (`sender`/`receiver`/`any`) is returned.
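The order heuristic can be sketched in a few lines of Python. The backend is Java, so this is only an illustration; the function name and the returned dictionary shape are hypothetical. Ignoring names beyond the second is an assumption, since the text only describes the 1-name and 2-name cases.

```python
def assign_roles(names, person_role="any"):
    """Map extracted names to search roles.

    Two names: the first is the sender, the second the receiver.
    One name: the extracted personRole ('sender', 'receiver' or 'any') decides.
    """
    roles = {"sender": None, "receiver": None, "any": None}
    if len(names) >= 2:
        # Assumption: names after the second are ignored.
        roles["sender"], roles["receiver"] = names[0], names[1]
    elif len(names) == 1:
        if person_role not in roles:
            raise ValueError(f"unknown personRole: {person_role!r}")
        roles[person_role] = names[0]
    return roles
```

For "Was hat Walter an Emma geschrieben?" the extracted list `["Walter", "Emma"]` would yield Walter as sender and Emma as receiver.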
**`personRole: "any"` keyword limitation:** When `personRole` is `"any"` and the name resolves to exactly one person, `DocumentService.searchDocumentsByPersonId()` is called (OR semantics: person as sender or receiver). Keyword filtering is not applied on this path — only person identity and date range. `keywordsApplied = false` is returned in the response. Rationale: the JPQL for OR-semantics person queries has no text predicate; adding FTS would require a native query or a separate pass, adding complexity for a case that is already well-narrowed by person identity.
**`search/` → `person/` + `document/` dependency direction:** `NlQueryParserService` calls `PersonService.findByDisplayNameContaining()` and `DocumentService.searchDocuments()` — both are legitimate cross-domain service calls, not repository leaks. The `search/` package has no JPA entities of its own and never accesses `PersonRepository` or `DocumentRepository` directly.
**Keyword→tag resolution** (issue #743): After Ollama extracts the `keywords` list, `NlQueryParserService` calls `TagService.findByNameContaining()` for each keyword. Keywords that match one or more tags are removed from the FTS text list and added as OR-union tag filters; keywords with no tag match remain as FTS text. Resolved tags are returned to the frontend as `TagHint` objects in `NlQueryInterpretation.resolvedTags` and rendered as removable "Thema: X" chips. The `tagsApplied` flag signals whether the OR-union filter was actually passed to `DocumentService.searchDocuments()` — it is `false` when the `personRole:any` single-person path is taken, because that path has no tag filter slot. See ADR-033 for the tag name resolution and case-collision rules that `TagService.findByNameContaining()` relies on.
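The partitioning step described above (matched keywords become tag filters, unmatched ones stay as full-text terms) might look roughly like this in Python. The tag names, the `find_tags` lookup, and the case-insensitive containment rule are all assumptions for illustration; the real matching rules live in `TagService.findByNameContaining()` and ADR-033.

```python
TAGS = ["Krieg", "Kriegsgefangenschaft", "Hochzeit"]  # made-up tag names


def find_tags(fragment):
    """Hypothetical stand-in for a tag lookup: case-insensitive containment."""
    return [t for t in TAGS if fragment.lower() in t.lower()]


def split_keywords(keywords, find_tags_by_name):
    """Partition keywords into OR-union tag filters and remaining FTS terms."""
    tag_filters, fts_terms = [], []
    for kw in keywords:
        matches = find_tags_by_name(kw)
        if matches:
            # The keyword is consumed by the tag filter and leaves the FTS list.
            for tag in matches:
                if tag not in tag_filters:
                    tag_filters.append(tag)
        else:
            fts_terms.append(kw)
    return tag_filters, fts_terms
```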
## Decision
**Introduce a new `search/` domain package** with a local Ollama integration via `RestClientOllamaClient`. The Ollama service runs as a separate Docker container, reachable only on the internal Docker network (`expose: ["11434"]`, not `ports:`). The backend calls Ollama's `/api/generate` endpoint with grammar-constrained JSON output. Name resolution and document search are performed by existing services after the model returns.
Key component structure:
- `OllamaClient` / `OllamaHealthClient` interfaces — mockable for tests, modelled on `OcrClient`/`OcrHealthClient`
- `RestClientOllamaClient` — two `RestClient` instances (30 s inference, 2 s health-check)
- `NlQueryParserService` — orchestrates Ollama → name resolution → document search
- `NlSearchRateLimiter` — Bucket4j + Caffeine, 5 req/min per user
- `NlSearchController` – `POST /api/search/nl`, `@RequirePermission(READ_ALL)`
## Consequences
- Family members can query in natural German without learning filter UI. Expected search satisfaction improvement for the 60+ age cohort (primary transcription audience) is significant.
- NL search is unavailable when Ollama is down or the model pull is not complete. The regular search is unaffected. The 503 response includes a CTA directing users to the regular search.
- Operator responsibility: run `ollama pull` on first deploy and after model updates. The backup runbook must exclude `ollama_models` volume (model weights are re-downloadable, not user data).
- Inference takes 2–15 seconds. The frontend loading indicator is a hard requirement (see issue #738 frontend notes).
- The rate limiter is node-local. At the current single-node deployment scale this is correct. If the service is ever scaled horizontally, the rate limiter must be moved to Redis (same caveat as `LoginRateLimiter`).
- The `search/` package introduces a new cross-domain dependency direction (`search` → `person`, `search` → `document`). This is intentional and documented in `docs/architecture/c4/l3-backend-search.puml`.

@@ -1,239 +0,0 @@
# ADR-028: Ollama Docker Compose service for NL search
**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #737 (infrastructure), #735 (NL search epic)
---
## Context
Issue #735 introduces natural-language document search, requiring a local LLM to generate embeddings and/or run inference at query time. The family archive stores personal family history — data privacy is non-negotiable, so cloud-based inference APIs are excluded. The production target is a Hetzner CX42 (16 GB RAM, 8 vCPUs, CPU-only, ~32 EUR/month).
Alternatives considered:
| Option | Reason rejected |
|---|---|
| **llama.cpp** | No HTTP API out of the box; requires custom wrapper; higher ops burden |
| **vLLM** | GPU-first; significant overhead on CPU-only hardware; overkill for this scale |
| **Cloud APIs** (OpenAI, Gemini, etc.) | Vendor lock-in; per-token cost at scale; data leaves the server — unacceptable for a private family archive |
| **Ollama** | Self-contained Docker image; built-in HTTP REST API; actively maintained; CPU-compatible; zero egress |
**Decision:** run Ollama as a Docker Compose service alongside the existing stack.
---
## Decisions
### 1. Hardware minimums and CPU-only constraint
All inference runs on CPU. The target is the Hetzner CX42 (16 GB RAM, 8 vCPUs).
| Tier | RAM | NL search |
|---|---|---|
| CX42 | 16 GB | Supported — full stack including Ollama |
| CX32 | 8 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) to skip Ollama entirely |
| CX22 | 4 GB | Unsupported for NL search |
### 2. Memory budget on CX42
| Component | `mem_limit` | Typical active RSS |
|---|---|---|
| OCR service | 12g (hard ceiling) | ~6 GB |
| Ollama | 8g | ~8 GB |
| **Total** | | **~14 GB active** |
`memswap_limit` on the Ollama service is set to `8g` (matching `mem_limit`) to prevent Linux from swapping model weights out under OCR memory pressure. Swapping model weights does not crash the container but silently degrades inference latency. This mirrors the pattern already applied to the OCR service.
**Operational constraint:** do NOT run `docker-compose.observability.yml` continuously alongside both OCR and Ollama on a CX42. The observability stack adds ~2 GB, which leaves no headroom.
### 3. Graceful-degradation contract
`app.ollama.base-url` absent OR blank → Ollama bean NOT registered → NL search returns HTTP 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE`.
This single code path covers all unavailability scenarios: base-url unset, service unreachable, health check failed, and request timeout.
#### Why not `@ConditionalOnProperty`
`@ConditionalOnProperty` registers the bean when the property is present but blank (`APP_OLLAMA_BASE_URL=`). This produces a `RestClient` with an empty base URL that fails at runtime with an opaque error rather than a clean 503.
#### Correct condition expression
```java
@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")
```
When the property is absent, the placeholder resolves to `''`; `.isBlank()` returns `true`; negation makes the condition `false`; the bean is not registered. Same result for an explicit empty string (`APP_OLLAMA_BASE_URL=`).
### 4. Backend configuration pattern
Use a `@ConfigurationProperties` record, not separate `@Value` injections:
```java
@ConfigurationProperties("app.ollama")
record OllamaProperties(String baseUrl, String apiKey) {}
```
`OllamaProperties` is registered unconditionally — it is a plain value holder with no side effects.
`@ConditionalOnExpression` belongs **only** on `RestClientOllamaClient` (the bean that creates a live network client).
**Deliberate divergence from the OCR pattern:** the OCR service uses `@Value`-with-default because OCR is always-on and `http://ocr-service:8000` is a safe default. Ollama is truly optional — a missing URL means "feature disabled", not "use this default server". There is no safe default Ollama URL.
### 5. `Optional<OllamaClient>` injection
The NL search service uses constructor injection with `Optional<OllamaClient>`:
```java
private final Optional<OllamaClient> ollamaClient;
```
When empty (bean not registered), the service method returns 503 immediately:
```java
var client = ollamaClient.orElseThrow(
        () -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "Ollama not configured"));
```
Prefer this over `@Autowired(required = false)` with a null check — the null-check pattern is noisy when the service already uses `@RequiredArgsConstructor`.
### 6. Empty API key guard
`RestClientOllamaClient` omits the `Authorization` header entirely when `apiKey` is blank:
```java
if (!apiKey.isBlank()) {
    request.header("Authorization", "Bearer " + apiKey);
}
```
Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5 and 0.30.6
**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `OLLAMA_API_KEY` does **not** enforce request authentication in either version.
Test matrix run against `/api/tags`:
| Configuration | No auth header | `Authorization: Bearer ` (empty) | `Authorization: Bearer wrongkey` | `Authorization: Bearer correctkey` |
|---|---|---|---|---|
| `OLLAMA_API_KEY=` (empty) | 200 | 200 | — | — |
| `OLLAMA_API_KEY` unset | 200 | — | — | — |
| `OLLAMA_API_KEY=testkey99` | 200 | 200 | 200 | 200 |
**Finding:** The `OLLAMA_API_KEY` environment variable is not listed in Ollama's startup config dump and does not gate any HTTP request in either tested version. All configurations — empty string, fully unset, and a real key — accept all requests without authentication.
**Practical implication:** `OLLAMA_API_KEY` provides no defense-in-depth in the tested versions. `archiv-net` network isolation is the only effective security control. The env var is retained in the Compose definition and `.env.example` for forward compatibility if Ollama enables enforcement in a future version, but operators must not rely on it for access control.
**Backend guard still valid:** the `RestClientOllamaClient` code-level guard (omit `Authorization` header when `apiKey.isBlank()`) remains correct behavior regardless — it prevents a malformed `Authorization: Bearer ` header from being sent.
### 8. read_only: true feasibility
**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `read_only: true` works with Ollama. All three operations — `ollama serve`, `ollama pull qwen2.5:7b-instruct-q4_K_M`, and `ollama list` — succeeded with exit code 0 in both versions.
Test run:
```bash
docker run --rm --read-only \
-v ollama_models:/root/.ollama \
--tmpfs /tmp \
--entrypoint sh ollama/ollama:0.30.6 \
-c "ollama serve & sleep 5 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
```
**Note:** the entrypoint must be overridden to `sh` for the test command — the container's default entrypoint is `/bin/ollama` and does not accept `sh` as a subcommand. This is a Docker invocation detail; the Compose service definition uses the image's default entrypoint and `command:` override for the init container, which works correctly.
**Result:** `read_only: true` and `tmpfs: - /tmp:size=512m` are applied to both `ollama` and `ollama-model-init`. The `ollama_models` volume handles all persistent writes; no other paths require write access during normal operation.
### 9. Peak RSS of init container during pull
**Empirically verified (2026-06-06):** Peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull was **~108 MiB**.
`docker stats` samples during the pull (15-second intervals):
| Sample | MEM |
|---|---|
| 1 | 54.89 MiB |
| 2 | 66.3 MiB |
| 5 | 97.25 MiB |
| 9 | **107.8 MiB** (peak) |
`mem_limit: 2g` is adequate — the model weights stream directly to the named volume; RSS is dominated by the Ollama server process alone (~100 MB), not the model data. No bump to 4 GB needed.
### 10. Init container pull mechanism
The `ollama-model-init` container uses a curl-based readiness loop with captured PID:
```sh
ollama serve & SERVE_PID=$!
until curl -sf http://localhost:11434/api/tags; do sleep 1; done
ollama pull qwen2.5:7b-instruct-q4_K_M
kill $SERVE_PID
```
`kill %1` (job-control syntax) is unreliable in non-interactive `sh -c` contexts. Capturing the PID via `SERVE_PID=$!` is reliable.
The same endpoint (`/api/tags`) is used for both the init container readiness loop and the main service `healthcheck`.
### 11. start_period: 60s rationale
The model is pre-pulled by `ollama-model-init` before the main service starts (via `condition: service_completed_successfully`). At main service startup, Ollama only loads model weights from the named volume and binds port 11434.
60 seconds is appropriate for this cold-start profile. 300 seconds was considered — that would be appropriate if the service pulled the model itself — but overstates actual startup time when the model is already present on the volume.
### 12. Security threat model
**Primary control:** `archiv-net` network isolation. Ollama has no externally exposed port (`expose:` only, not `ports:`). The Caddyfile must not route any path to the Ollama service.
**Note on `OLLAMA_API_KEY`:** Per §7, `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 and provides no authentication barrier against a compromised backend container. `archiv-net` network isolation is the sole effective security control. The env var is retained for forward compatibility only — do not rely on it for access control.
Both `ollama` and `ollama-model-init` receive the ADR-019 hardening baseline:
```yaml
cap_drop: [ALL]
security_opt: [no-new-privileges:true]
```
### 13. CI exclusion strategy
Docker Compose profiles are not used — they would add developer friction (requiring `--profile ...` for all local dev commands).
CI uses explicit service selection in `docker-compose.ci.yml`:
```bash
docker compose -f docker-compose.ci.yml up -d db minio create-buckets
```
Ollama is simply not listed and is never started in CI. A YAML comment on the `ollama` service block documents this:
```yaml
# Not started in CI — CI uses explicit service selection
# (docker-compose.ci.yml: db minio create-buckets)
```
### 14. ollama_models volume operational note
The `ollama_models` named volume holds model weights only — fully reproducible by re-pull. No backup is needed.
If the volume fills after a model upgrade:
```bash
docker volume rm ollama_models && docker compose up -d
```
The init container re-pulls the model on next startup.
---
## Consequences
### Positive
- NL search runs entirely on-premises; no data leaves the server and no per-token cloud cost.
- Graceful degradation is a first-class concern: smaller or budget-constrained instances can run the app without Ollama with a single env var change.
- The init container pattern keeps model pull out of the critical startup path for the main service, giving accurate healthcheck timings.
- `@ConditionalOnExpression` with a blank-check is more correct than `@ConditionalOnProperty` for optional features with no safe default URL.
### Risks and operational implications
- **Memory pressure:** OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with `docker stats`.
- **CPU inference latency:** `qwen2.5:7b-instruct-q4_K_M` is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
- All three empirical TBD items from the original issue spec were resolved — see §7 (OLLAMA_API_KEY not enforced), §8 (`read_only: true` works), §9 (peak RSS ~108 MiB).
- Model upgrades require a `docker volume rm` to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.

@@ -1,125 +0,0 @@
# ADR-034: Ollama in production — deployment, keep-alive pinning, and corrected init recipe
**Date:** 2026-06-06
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Relates to:** #758 (bug), #759 (fix), #737 (NL search infrastructure)
**Corrects:** ADR-028 §10–§11 (init recipe and readiness probe)
---
## Context
ADR-028 introduced Ollama as a Docker Compose service for NL search and documented
its topology, graceful-degradation contract, and memory budget. Two defects survived
that work and only surfaced once NL search reached staging (#758):
1. **Ollama was added only to the dev `docker-compose.yml`.** Staging and production
deploy from the self-contained `docker-compose.prod.yml`, which had no `ollama`
service. The backend defaults to `app.ollama.base-url: http://ollama:11434`, so its
client bean was active and resolved to a non-existent host → `ResourceAccessException`
→ HTTP 503 on every NL search.
2. **The init recipe documented in ADR-028 §10 never worked.** The `ollama/ollama` image
`ENTRYPOINT` is `ollama`, so a bare `command: sh -c "…"` ran as `ollama sh -c "…"`
(`unknown command "sh"`), and the image ships **no curl**, so the curl-based readiness
loop and the curl healthcheck could never pass.
This ADR records the production deployment decision and the corrected operational
contract. It is also the durable record of *why* `OLLAMA_KEEP_ALIVE=-1` is set, so a
future maintainer does not "optimize" it away and reintroduce the cold-load 503.
---
## Decisions
### 1. Ollama is a first-class production service
`docker-compose.prod.yml` now defines `ollama` + `ollama-model-init` + the
`ollama-models` volume, mirroring the dev stack. The graceful-degradation contract from
ADR-028 §3 is preserved: `backend` has **no** hard `depends_on` on `ollama`, so an absent
or unhealthy Ollama still yields a clean 503 rather than blocking backend startup.
### 2. Corrected init recipe (supersedes ADR-028 §10)
The init container overrides the image entrypoint to a shell and probes readiness with
`ollama list` (not curl, which the image lacks):
```sh
ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && \
(ollama list | grep -q 'qwen2.5:7b-instruct-q4_K_M' || ollama pull qwen2.5:7b-instruct-q4_K_M)
```
```yaml
entrypoint: ["/bin/sh", "-c"]
```
The pull is **guarded by a grep on the cached model list**. A model already on the volume
exits clean without any registry round-trip. This makes re-up offline-safe: a host reboot
during a registry/network blip can no longer fail init (which, via
`condition: service_completed_successfully`, would otherwise block the `ollama` service
and take NL search down until the registry was reachable again). The same recipe is used
in dev and prod — one mental model.
### 3. Healthcheck uses `ollama list` (supersedes ADR-028 §11 probe)
```yaml
healthcheck:
test: ["CMD", "ollama", "list"]
```
`ollama list` hits the local API and exits non-zero when the server is down — the correct
probe for a curl-less image. The `start_period: 60s` rationale from ADR-028 §11 still holds.
### 4. `OLLAMA_KEEP_ALIVE=-1` — pin the model in memory
```yaml
environment:
OLLAMA_KEEP_ALIVE: "-1"
```
By default Ollama evicts an idle model after ~5 minutes. The next query then pays a
cold-load penalty that exceeds the backend read timeout, producing an NL search 503 after
any idle period. Pinning the model (`-1` = never unload) keeps warm-path latency
predictable (~18 s on CPU). **Do not remove this** without re-introducing the post-idle
cold-load 503.
### 5. Read timeout raised 30 → 60 s
`app.ollama.timeout-seconds` is raised from 30 to 60 (`application.yaml`, mirrored in
`DEPLOYMENT.md`). Warm CPU inference is ~18 s; the higher ceiling absorbs the one cold
model load on the first query after an Ollama (re)start, before §4's pin takes hold.
**Implicit NFR made explicit:** NL search shall return a result or a 503 within 60 s; the
cold-start path immediately after an Ollama restart is the only path that approaches this
ceiling.
### 6. Hard-OOM trade-off (refines ADR-028 §2)
`memswap_limit == mem_limit` (both `${OLLAMA_MEM_LIMIT:-8g}`) disables swap for the
container. Combined with §4's pinned model, a memory-pressure event is a **hard OOM-kill,
not graceful latency degradation**. This is deliberate — swap-thrashing an LLM is worse
than a clean restart — but it means the 8 GB envelope is a real ceiling. `qwen2.5-7B-q4`
plus its KV cache under load sits close enough to 8 GB that this needs a Prometheus
memory alert on the `ollama` container before it bites in production (tracked as
observability follow-up, not in this PR).
---
## Consequences
### Positive
- NL search works on staging/production, not just dev — the actual deploy artifact now
matches the documented architecture.
- Re-up is offline-safe: a cached model never depends on registry reachability.
- The keep-alive pin and timeout ceiling make NL search latency predictable on CPU.
### Risks and operational implications
- **Hard OOM under memory pressure** (§6): a Prometheus alert on `ollama` container memory
is required before this is load-bearing in prod. Tracked as an observability follow-up.
- **Unauthenticated inference** relies entirely on `archiv-net` isolation (ADR-028 §7/§12,
unchanged). Sending an `Authorization` header from `RestClientOllamaClient` is a separate
durable hardening item, tracked outside this PR.
- ADR-028 §10–§11 describe a recipe that never functioned; this ADR is the authoritative
init/healthcheck contract going forward.

@@ -1,105 +0,0 @@
# ADR-035: Replace Ollama with a rule-based NLP service for smart search
**Date:** 2026-06-07
**Status:** Accepted
**Deciders:** Marcel Raddatz
**Supersedes:** ADR-028 (Ollama for NL search), ADR-034 (Ollama production deployment)
**Relates to:** #771 (implementation)
---
## Context
ADR-028 introduced Ollama + qwen2.5-7B to parse free-text search queries into structured
extractions (person names, date ranges, person role, keywords). After deploying to
staging (ADR-034) the approach showed three problems:
1. **Cold-start latency:** even with `OLLAMA_KEEP_ALIVE=-1` a Qwen inference on CPU takes
~18 s. This blows the UX budget for a search feature and requires a 60 s timeout.
2. **Resource cost:** 8 GB resident RAM + 4 vCPU cap for an LLM whose only job is regex-
level entity extraction from short (< 500 char) German family-history queries.
3. **Fragility:** model-weight downloads, version pinning, and init-container orchestration
add operational surface area with no quality benefit over a deterministic parser.
The query set is narrow and well-understood: person names are all in the PostgreSQL
`persons` table; date patterns are a fixed repertoire of German/English/Spanish formats;
person role (sender vs. receiver) is reliably signalled by a handful of prepositions
("von", "an", "von … an"); keywords are nouns/proper nouns not consumed by the other
extractors.
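
To make the preposition signal concrete, here is a minimal Python sketch of role detection for German queries. It is an assumption-laden illustration, not the contents of `extractor.py`: the regex, the function name, and the "capitalised words after the preposition" rule are all hypothetical, and name particles such as "de" are not handled. A real extractor would also validate candidates against the `persons` table, since "an" can precede ordinary nouns.

```python
import re

PREPOSITION_ROLES = {"von": "sender", "an": "receiver"}

# A preposition (any case) followed by one or more capitalised words.
_ROLE_PATTERN = re.compile(r"\b(?i:(von|an))\s+([A-ZÄÖÜ]\w+(?:\s+[A-ZÄÖÜ]\w+)*)")


def detect_roles(query: str) -> dict:
    """Return {'sender': name, 'receiver': name} for the roles found in the query."""
    roles = {}
    for preposition, name in _ROLE_PATTERN.findall(query):
        # Keep the first candidate per role; later duplicates are ignored.
        roles.setdefault(PREPOSITION_ROLES[preposition.lower()], name)
    return roles
```

An empty result corresponds to the case where no preposition pattern applies and no role can be inferred.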
---
## Decision
Replace Ollama with a lightweight, rule-based Python FastAPI service (`nlp-service`).
### Architecture
```
POST /api/search/nl (NlSearchController)
→ NlQueryParserService
→ RestClientNlpClient.parse(query, lang)
→ POST http://nlp-service:8001/parse
← { personNames, personRole, dateFrom, dateTo, keywords, rawQuery }
```
The response contract is identical to the old `OllamaExtraction`; only the transport
and implementation change. Java callers see `NlpExtraction` (renamed, same shape).
### Implementation
- **`nlp-service/`** — standalone FastAPI app (Python 3.11.12-slim image, ~256 MB RAM)
- `extractor.py` — pipeline: person extraction → role detection → date parsing → keywords
- `person_matcher.py` — two-pass fuzzy lookup (rapidfuzz 3.x) against the `persons` DB table;
loaded at startup, no live DB queries during extraction
- `models.py` — Pydantic `ParseRequest` (max 500 chars), `ParseResponse`
- `main.py` — lifespan loads persons from `DATABASE_URL`; `/health` reports `persons_loaded`
- **`backend/search/`** — `OllamaClient` / `OllamaExtraction` renamed to `NlpClient` /
`NlpExtraction`; `NlpProperties` (`@ConfigurationProperties("app.nlp")`) replaces
`OllamaProperties`; `lang` parameter added to `/parse` and threaded through the stack.
### Tunable parameters
| Env var | Default | Effect |
|---|---|---|
| `DATABASE_URL` | — | PostgreSQL DSN; unset → person matching disabled |
| `NLP_FUZZY_THRESHOLD` | `80` | rapidfuzz similarity floor (0–100) |
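
The effect of the similarity floor can be sketched as follows. This example deliberately uses the standard library's `difflib` as a stand-in for rapidfuzz, so the scores will not be identical to the service's. What the "two passes" are is not spelled out in this ADR; the split into an exact pass followed by a fuzzy pass is an assumption, as are the function names and the sample person list.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_person(candidate, persons, threshold=80.0):
    """Return the best-matching known person, or None if below the threshold."""
    # Pass 1 (assumed): exact, case-insensitive match.
    for person in persons:
        if person.lower() == candidate.lower():
            return person
    # Pass 2 (assumed): best fuzzy score, accepted only at or above the floor.
    best = max(persons, key=lambda p: similarity(candidate, p), default=None)
    if best is not None and similarity(candidate, best) >= threshold:
        return best
    return None
```

Raising the threshold trades recall for precision: a misspelling such as "Klara Cram" still resolves at 80, while unrelated tokens fall through to `None`.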
### Graceful degradation
The backend's `RestClientNlpClient` wraps all HTTP errors and timeouts in
`DomainException.serviceUnavailable(SMART_SEARCH_UNAVAILABLE)`, returning HTTP 503 to
the client — identical behaviour to the Ollama path. The rate limiter is relaxed from
5 to 20 requests/min (rule-based extraction completes in < 50 ms vs. ~18 s for LLM).
---
## Consequences
### Positive
- **Latency:** < 50 ms per extraction vs. ~18 s — smart search is now interactive.
- **Memory:** ~256 MB vs. 8 GB — frees 7.75 GB on the production host.
- **No model downloads:** the image ships no weights; startup is a single DB query.
- **Deterministic:** same query always produces the same result; no temperature/sampling.
- **Testable without infrastructure:** pytest with a seeded `PersonMatcher` fixture; no
WireMock stubs needed for most unit tests.
### Trade-offs
- **No semantic generalisation.** The LLM could handle novel phrasing; the rule-based
parser only handles the preposition patterns it was written for. Edge cases that fall
outside the pattern produce an empty extraction rather than a best-effort result.
- **Person matching depends on DB content.** A person not yet in the archive will never
match, even if the user types their exact name. The LLM could surface the name as a
raw string; this service surfaces nothing. This is acceptable for the current archive
size and query patterns.
- **Language support is fixed at de/en/es** (Paraglide locales). Adding a fourth locale
requires adding its stopword list and preposition table to `extractor.py`.
### Superseded ADRs
ADR-028 and ADR-034 documented the Ollama topology, init recipe, keep-alive pin, and
memory budget. All of that is now moot. The `ollama` and `ollama-model-init` services and
the `ollama_models` volume are removed from `docker-compose.yml`.

@@ -1,6 +0,0 @@
venv/
.env
__pycache__/
.pytest_cache/
test_*.py
*.md

@@ -1,59 +0,0 @@
# NLP Service
Lightweight FastAPI service that parses free-text search queries into structured extractions,
replacing Ollama for the Familienarchiv NL search feature.
## Stack
- Python 3.11, FastAPI 0.115, rapidfuzz 3.x, psycopg2-binary
No ML models — persons are matched against the live DB via fuzzy lookup.
## Endpoints
- `POST /parse` — parse a free-text query, return extraction matching `NlpExtraction` contract
- `GET /health` — returns `{"status": "ok", "persons_loaded": N}`
## Running locally
```bash
pip install -r requirements.txt
# Without DB (empty person matcher — dates and keywords still work):
uvicorn main:app --reload --port 8001
# With DB (full person matching):
DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
uvicorn main:app --reload --port 8001
curl -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
```
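
The same request can be issued from Python with only the standard library. This is a sketch assuming the service is listening on `localhost:8001` as in the commands above; `build_parse_request` and `parse_query` are illustrative helper names, and the response is whatever the running service returns.

```python
import json
import urllib.request


def build_parse_request(
    query: str,
    lang: str,
    base_url: str = "http://localhost:8001",
) -> urllib.request.Request:
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str = "de", timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON extraction (needs a running service)."""
    request = build_parse_request(query, lang)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example call, with the service running locally:
#   parse_query("Briefe von Clara Cram an Walter de Gruyter vor 1920")
```

Splitting request construction from sending keeps the payload shape checkable without a live service.
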
## Testing
```bash
pytest -v
```
No DB required for tests — fixture pre-seeds the PersonMatcher with a small test corpus.
## Architecture
- `person_matcher.py` — DB-backed name lookup: loads all persons at startup, fuzzy-matches query tokens after person prepositions
- `extractor.py` — pipeline: persons → role → dates (regex) → keywords (stopword filter)
- `main.py` — FastAPI app; reads `DATABASE_URL` env var at startup
## Design spec
See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
## Notes
This service is fully wired into `docker-compose.yml` (container `archive-nlp`, port 8001
internal-only) and the Java search path (`RestClientNlpClient` → `NlQueryParserService` →
`NlSearchController`). The extraction contract matches `NlpExtraction` in
`backend/src/main/java/org/raddatz/familienarchiv/search/`.
Test sentences for manual evaluation are in `test_sentences.md`.

@@ -1,24 +0,0 @@
FROM python:3.11.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
&& chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD curl -f http://localhost:8001/health || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]

@@ -1,310 +0,0 @@
"""Rule-based NLP pipeline: dates via regex, persons via DB-backed matcher."""
from __future__ import annotations
import re
from datetime import date
from models import ParseResponse
from person_matcher import PersonMatcher
# ── Module-level PersonMatcher and fuzzy threshold (set at startup) ──────────
_matcher: PersonMatcher | None = None
_fuzzy_threshold: int = 80
def set_person_matcher(m: PersonMatcher) -> None:
    global _matcher
    _matcher = m


def get_person_matcher() -> PersonMatcher | None:
    return _matcher


def set_fuzzy_threshold(threshold: int) -> None:
    global _fuzzy_threshold
    _fuzzy_threshold = threshold
# ── Preposition sets ──────────────────────────────────────────────────────────
_SENDER_PREPS: dict[str, frozenset[str]] = {
"de": frozenset({"von", "vom"}),
"en": frozenset({"from", "by"}),
"es": frozenset({"de", "por"}),
}
_RECEIVER_PREPS: dict[str, frozenset[str]] = {
"de": frozenset({"an", "nach", "für"}),
"en": frozenset({"to", "for"}),
"es": frozenset({"para", "a"}),
}
_ALL_PERSON_PREPS: dict[str, frozenset[str]] = {
lang: _SENDER_PREPS[lang] | _RECEIVER_PREPS[lang]
for lang in ("de", "en", "es")
}
# ── Date direction tokens ─────────────────────────────────────────────────────
_DATE_BEFORE: dict[str, frozenset[str]] = {
"de": frozenset({"vor"}),
"en": frozenset({"before"}),
"es": frozenset({"antes"}),
}
_DATE_AFTER: dict[str, frozenset[str]] = {
"de": frozenset({"nach"}),
"en": frozenset({"after"}),
"es": frozenset({"después", "despues"}),
}
_DATE_BETWEEN: dict[str, frozenset[str]] = {
"de": frozenset({"zwischen"}),
"en": frozenset({"between"}),
"es": frozenset({"entre"}),
}
# ── Extra span-termination tokens (function words that cannot be in a name) ──
_EXTRA_SPAN_STOPS: dict[str, frozenset[str]] = {
# German articles, possessives, and particles that end a name span
"de": frozenset({
"im", "am", "beim", "zum", "zur",
"dem", "den", "des",
"sein", "seine", "seinen", "seiner",
"ihr", "ihre", "ihrem", "ihren", "ihrer",
"unser", "unsere", "unseren",
"über", "auch", "oder", "und",
}),
"en": frozenset(),
"es": frozenset({"el", "la", "los", "las", "su", "sus", "mi"}),
}
# ── Stopword lists ────────────────────────────────────────────────────────────
_STOPWORDS: dict[str, frozenset[str]] = {
"de": frozenset({
"der", "die", "das", "des", "dem", "den",
"ein", "eine", "einem", "einen", "einer", "eines",
"er", "sie", "es", "wir", "ihr", "ich", "du",
"und", "oder", "aber", "doch", "auch", "noch", "nur",
"in", "an", "auf", "aus", "bei", "mit", "nach", "von", "vom",
"vor", "zu", "zur", "zum", "durch", "für", "über", "unter",
"zwischen", "gegen", "ohne", "um", "bis", "seit", "wegen",
"ist", "sind", "war", "waren", "wird", "werden",
"hat", "haben", "hatte", "hatten",
"sein", "seine", "seinen", "seiner", "seines",
"ihre", "ihren", "ihrer", "ihrem", "ihres",
"nicht", "kein", "keine", "keinen", "keinem", "keines",
"so", "wie", "als", "da", "hier", "dort", "wo", "wer", "was",
"im", "am", "beim", "ins", "ans",
"ja", "nein", "denn", "wenn", "weil", "dass", "ob", "damit",
"alle", "alles", "mehr", "sehr", "viel", "wenig",
"diesem", "dieser", "dieses", "diese", "diesen",
"jetzt", "dann", "nun", "schon", "wohl", "wurde", "wurden",
"worden", "geschrieben",
# Tokens are lowercased before lookup, so entries must be lowercase.
"jahr", "jahre", "jahren",
}),
"en": frozenset({
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
"for", "of", "with", "by", "from", "about", "as", "into",
"through", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "do", "does", "did", "will", "would",
"could", "should", "may", "might", "must", "shall", "can",
"i", "you", "he", "she", "it", "we", "they", "their", "our",
"his", "her", "its", "my", "your",
"this", "that", "these", "those", "all", "not", "no", "nor",
"very", "more", "most", "much", "many", "some", "any",
"before", "after", "between", "during", "since", "until",
"when", "where", "who", "which", "what", "how",
}),
"es": frozenset({
"el", "la", "los", "las", "un", "una", "unos", "unas",
"y", "o", "pero", "sin", "con", "en", "de", "del", "al",
"a", "ante", "bajo", "desde", "entre", "hacia", "hasta",
"para", "por", "sobre", "tras",
"es", "son", "era", "eran", "fue", "fueron", "ser", "estar",
"ha", "han", "he", "tener", "tiene",
"yo", "su", "sus", "mi", "tu",
"este", "esta", "estos", "estas", "ese", "esa",
"no", "muy", "todo", "todos", "toda",
"que", "cuando", "donde", "como",
"antes", "después", "durante", "desde", "hasta",
}),
}
# ── Year regex ────────────────────────────────────────────────────────────────
_YEAR_RE = re.compile(r"\b(\d{4})\b")
_WORD_RE = re.compile(r"\b[^\W\d_]{3,}\b", re.UNICODE)
# ── Step 1 + 2: Person extraction and role detection ─────────────────────────
def _extract_persons_and_role(
    query: str,
    lang: str,
) -> tuple[list[str], str]:
    """Return (person_names, role) using the DB-backed PersonMatcher."""
    m = _matcher
    if m is None or len(m) == 0:
        return [], "any"

    preps = _ALL_PERSON_PREPS[lang]
    stops = preps | _DATE_BEFORE[lang] | _DATE_AFTER[lang] | _DATE_BETWEEN[lang] | _EXTRA_SPAN_STOPS[lang]
    matches = m.find_in_query(query, preps, stop_tokens=stops, threshold=_fuzzy_threshold)
    person_names = [text for text, _ in matches]

    # A role is only assigned when exactly one person was found.
    if len(matches) != 1:
        return person_names, "any"
    _, prep = matches[0]
    if prep is None:
        return person_names, "any"
    if prep in _SENDER_PREPS[lang]:
        return person_names, "sender"
    if prep in _RECEIVER_PREPS[lang]:
        return person_names, "receiver"
    return person_names, "any"
# ── Step 3: Date extraction ───────────────────────────────────────────────────
def _find_years(query: str) -> list[tuple[int, int, int]]:
    """Return list of (start, end, year_int) for valid 4-digit year tokens."""
    return [
        (m.start(), m.end(), int(m.group()))
        for m in _YEAR_RE.finditer(query)
        if 1000 < int(m.group()) < 3000
    ]


def _direction_before_year(
    query: str,
    year_start: int,
    lang: str,
    person_names: list[str],
) -> str:
    """Classify direction of the date span as 'before', 'after', or 'bare'.

    Looks at the two tokens immediately preceding the year. If the closer
    token is a matched person name part, the direction word belongs to that
    person — not to the year — so we return 'bare'.
    """
    prefix_words = query[:year_start].split()
    if not prefix_words:
        return "bare"

    person_tokens = {w.lower() for name in person_names for w in name.split()}
    recent = [w.lower() for w in prefix_words[-2:]]
    before_set = _DATE_BEFORE[lang]
    after_set = _DATE_AFTER[lang]

    for direction_tok in reversed(recent):  # closest first
        if direction_tok in before_set:
            # Only use this if the word immediately before the year is not a person
            if recent[-1] in person_tokens:
                return "bare"
            return "before"
        if direction_tok in after_set:
            if recent[-1] in person_tokens:
                return "bare"
            return "after"
    return "bare"


def extract_dates(
    query: str,
    lang: str,
    person_names: list[str] | None = None,
) -> tuple[str | None, str | None]:
    """Return (date_from, date_to) as ISO strings or None."""
    if person_names is None:
        person_names = []

    year_spans = _find_years(query)
    if not year_spans:
        return None, None

    # "zwischen X und Y" / "between X and Y" — two years form a range
    query_lower = query.lower()
    if any(w in query_lower.split() for w in _DATE_BETWEEN[lang]) and len(year_spans) >= 2:
        years = sorted([y for _, _, y in year_spans[:2]])
        return date(years[0], 1, 1).isoformat(), date(years[1], 12, 31).isoformat()

    start, end, year = year_spans[0]
    direction = _direction_before_year(query, start, lang, person_names)
    if direction == "before":
        return None, date(year, 12, 31).isoformat()
    if direction == "after":
        return date(year, 1, 1).isoformat(), None
    # bare year → closed year range
    return date(year, 1, 1).isoformat(), date(year, 12, 31).isoformat()
# ── Step 4: Keyword extraction ────────────────────────────────────────────────
def extract_keywords(
    query: str,
    lang: str,
    person_spans: list[str],
    year_strings: list[str],
) -> list[str]:
    """Return lowercased content words after removing persons, years, stopwords."""
    text = query

    # Remove matched person spans (longest first to avoid partial replacements)
    for span in sorted(person_spans, key=len, reverse=True):
        text = re.sub(
            r"(?<!\w)" + re.escape(span) + r"(?!\w)",
            " ",
            text,
            flags=re.IGNORECASE,
        )

    # Remove year tokens
    for yr in year_strings:
        text = re.sub(r"\b" + re.escape(yr) + r"\b", " ", text)

    stopwords = _STOPWORDS.get(lang, frozenset())
    seen: set[str] = set()
    result: list[str] = []
    for tok in _WORD_RE.findall(text):
        lower = tok.lower()
        if lower in stopwords or lower in seen:
            continue
        seen.add(lower)
        result.append(lower)
    return result
# ── Step 5: Assembly ──────────────────────────────────────────────────────────
def extract(query: str, lang: str) -> ParseResponse:
    """Run the full rule-based pipeline and return a ParseResponse."""
    person_names, person_role = _extract_persons_and_role(query, lang)
    year_strings = [str(y) for _, _, y in _find_years(query)]
    date_from, date_to = extract_dates(query, lang, person_names)
    keywords = extract_keywords(query, lang, person_names, year_strings)
    return ParseResponse(
        personNames=person_names,
        personRole=person_role,
        dateFrom=date_from,
        dateTo=date_to,
        keywords=keywords,
        rawQuery=query,
    )

@@ -1,79 +0,0 @@
"""FastAPI app — /parse and /health endpoints."""
from __future__ import annotations

import logging
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

from extractor import extract, get_person_matcher, set_fuzzy_threshold, set_person_matcher
from models import ParseRequest, ParseResponse
from person_matcher import PersonMatcher

logger = logging.getLogger(__name__)

_DEFAULT_FUZZY_THRESHOLD = 80


def _parse_fuzzy_threshold(val: str) -> int:
    """Parse and validate NLP_FUZZY_THRESHOLD — must be integer in [0, 100]."""
    try:
        n = int(val)
    except ValueError as exc:
        raise ValueError(f"NLP_FUZZY_THRESHOLD must be an integer, got: {val!r}") from exc
    if not (0 <= n <= 100):
        raise ValueError(f"NLP_FUZZY_THRESHOLD must be between 0 and 100, got: {n}")
    return n


def _load_persons_from_db(db_url: str) -> list[tuple[str | None, str | None]]:
    import psycopg2  # deferred — not available in test environments without a DB

    conn = psycopg2.connect(db_url)
    try:
        cur = conn.cursor()
        cur.execute("SELECT first_name, last_name FROM persons")
        return cur.fetchall()
    finally:
        conn.close()


@asynccontextmanager
async def lifespan(app: FastAPI):
    threshold_raw = os.environ.get("NLP_FUZZY_THRESHOLD", str(_DEFAULT_FUZZY_THRESHOLD))
    threshold = _parse_fuzzy_threshold(threshold_raw)
    set_fuzzy_threshold(threshold)

    # Only initialise the matcher when nothing was pre-seeded (e.g., by tests).
    if get_person_matcher() is None:
        m = PersonMatcher()
        db_url = os.environ.get("DATABASE_URL")
        if db_url:
            try:
                rows = _load_persons_from_db(db_url)
                m.load(rows)
                logger.info("PersonMatcher loaded %d name variants from DB", len(m))
            except Exception:
                logger.error("Failed to load persons from DB — person matching disabled", exc_info=True)
        else:
            logger.warning("DATABASE_URL not set — person matching disabled")
        set_person_matcher(m)
    yield


app = FastAPI(lifespan=lifespan)


@app.get("/health")
def health() -> dict:
    m = get_person_matcher()
    return {"status": "ok", "persons_loaded": len(m) if m else 0}


@app.post("/parse", response_model=ParseResponse)
def parse(request: ParseRequest) -> ParseResponse:
    try:
        return extract(request.query, request.lang)
    except Exception as exc:
        # Log the cause server-side; the client only ever sees a generic message.
        logger.exception("Unhandled error while parsing query")
        raise HTTPException(status_code=500, detail="internal error") from exc

@@ -1,17 +0,0 @@
from __future__ import annotations

from typing import Literal

from pydantic import BaseModel, Field


class ParseRequest(BaseModel):
    query: str = Field(max_length=500)
    lang: Literal["de", "en", "es"]


class ParseResponse(BaseModel):
    personNames: list[str]
    personRole: Literal["sender", "receiver", "any"]
    dateFrom: str | None
    dateTo: str | None
    keywords: list[str]
    rawQuery: str

@@ -1,184 +0,0 @@
"""DB-backed person name matcher with fuzzy search."""
from __future__ import annotations
import re
from rapidfuzz import fuzz, process
_PUNCT_RE = re.compile(r"[^\w\s\-]", re.UNICODE)
_YEAR_PAT = re.compile(r"^\d{4}$")
# Tokens that cannot appear in a real person's first name — used to filter DB
# records that are annotations or descriptions masquerading as persons.
_NON_NAME_TOKENS: frozenset[str] = frozenset({
# German prepositions
"an", "in", "im", "am", "aus", "von", "vom", "nach", "zu", "zum", "zur",
"für", "bei", "beim", "mit", "über", "unter", "durch", "gegen", "ohne",
"bis", "seit", "des", "dem", "den",
# German possessives / pronouns
"sein", "seine", "seinen", "seiner",
"ihr", "ihre", "ihren", "ihrem",
# English prepositions
"for", "from", "by", "of",
# Spanish prepositions
"del", "por", "para",
})
class PersonMatcher:
    """Match person name fragments from free-text queries against known persons.

    Loaded once at startup from (first_name, last_name) DB rows. At query
    time, scans for tokens following person-indicator prepositions and fuzzy-
    matches them against the loaded name variants. Returns the original query
    text (not the resolved DB name) so the Java resolveNames() mechanism can
    do its own disambiguation.
    """

    def __init__(self) -> None:
        self._names: list[str] = []  # lowercase name variants

    # ── Loading ───────────────────────────────────────────────────────────────
    def load(self, rows: list[tuple[str | None, str | None]]) -> None:
        """Populate from DB rows of (first_name, last_name)."""
        seen: set[str] = set()
        for first, last in rows:
            first = (first or "").strip()
            last = (last or "").strip()
            # Skip records whose first_name contains function words — these are
            # annotations or descriptions in the DB, not real person names.
            if any(w in _NON_NAME_TOKENS for w in first.lower().split()):
                continue
            for variant in _name_variants(first, last):
                key = variant.lower()
                if key not in seen:
                    seen.add(key)
                    self._names.append(key)

    def __len__(self) -> int:
        return len(self._names)

    # ── Query-time matching ───────────────────────────────────────────────────
    def find_in_query(
        self,
        query: str,
        prepositions: frozenset[str],
        stop_tokens: frozenset[str] | None = None,
        threshold: int = 80,
    ) -> list[tuple[str, str | None]]:
        """Find person name spans in *query*.

        Returns a list of ``(original_query_text, anchoring_prep_or_None)``
        in left-to-right order.

        Parameters
        ----------
        prepositions:
            Person-indicator prepositions for the query language (triggers a
            scan for the tokens that follow).
        stop_tokens:
            Tokens that terminate a name span (prepositions + date-direction
            words). "de" is a special exception: when immediately followed by
            a capitalised word it is treated as a name connector (e.g.
            "de Gruyter") rather than a stop.
        threshold:
            Minimum rapidfuzz token_sort_ratio score to accept a match.

        Strategy
        --------
        Pass 1 — prep-anchored: for each person-indicator preposition found in
        the token list, collect up to 3 consecutive non-stop, non-year tokens
        after it and fuzzy-match the resulting span against loaded names.
        Longest match wins.

        Pass 2 — full-name scan: scan positions not yet consumed for exact
        multi-word full-name matches (no preposition anchor required).
        """
        tokens = query.split()
        clean = [_PUNCT_RE.sub("", t) for t in tokens]
        lower = [t.lower() for t in clean]
        # Prepositions always terminate a name span, even without explicit stop_tokens.
        stops = (stop_tokens or frozenset()) | prepositions
        consumed: set[int] = set()
        hits: list[tuple[int, str, str | None]] = []  # (position, text, prep)

        # Pass 1 — prep-anchored
        for i, ltok in enumerate(lower):
            if ltok not in prepositions or i + 1 >= len(tokens):
                continue
            # Build candidate span — stop at stop tokens or 4-digit years.
            # Exception: "de" before a capitalised word is a name connector.
            span_indices: list[int] = []
            j = i + 1
            while j < len(tokens) and len(span_indices) < 3:
                if j in consumed:
                    break
                t = lower[j]
                if t in stops or _YEAR_PAT.match(clean[j]):
                    # Allow "de" when the *next* token starts with a capital —
                    # e.g. "Walter de Gruyter".
                    next_clean = clean[j + 1] if j + 1 < len(tokens) else ""
                    if t == "de" and next_clean[:1].isupper():
                        pass  # connector — keep going
                    else:
                        break
                span_indices.append(j)
                j += 1
            # Try longest match first, then shorter spans
            for span_len in range(len(span_indices), 0, -1):
                idx = span_indices[:span_len]
                span_lower = " ".join(lower[k] for k in idx)
                if self._is_match(span_lower, threshold):
                    hits.append((idx[0], " ".join(tokens[k] for k in idx), ltok))
                    consumed.update(idx)
                    break

        # Pass 2 — full multi-word name scan (exact only, no preposition needed)
        for span_len in (3, 2):
            for i in range(len(tokens) - span_len + 1):
                span_idx = range(i, i + span_len)
                if any(j in consumed for j in span_idx):
                    continue
                span_lower = " ".join(lower[i : i + span_len])
                if span_lower in self._names:
                    hits.append((i, " ".join(tokens[i : i + span_len]), None))
                    consumed.update(span_idx)

        hits.sort(key=lambda h: h[0])
        return [(text, prep) for _, text, prep in hits]

    # ── Internal helpers ──────────────────────────────────────────────────────
    def _is_match(self, text: str, threshold: int) -> bool:
        """Return True if *text* fuzzy-matches any loaded name at >= threshold."""
        if not self._names or len(text.strip()) < 3:
            return False
        text_lower = text.strip().lower()
        if text_lower in self._names:
            return True  # exact match — fast path
        result = process.extractOne(
            text_lower,
            self._names,
            scorer=fuzz.token_sort_ratio,
            score_cutoff=threshold,
        )
        return result is not None
# ── helpers ───────────────────────────────────────────────────────────────────
def _name_variants(first: str, last: str) -> list[str]:
    """Return the name variants to index for a single person."""
    variants = []
    if first and last:
        variants.append(f"{first} {last}")
    if first:
        variants.append(first)
    if last:
        variants.append(last)
    return variants

@@ -1,6 +0,0 @@
fastapi[standard]==0.115.6
uvicorn[standard]==0.34.0
rapidfuzz>=3.0,<4.0
psycopg2-binary>=2.9,<3.0
pytest>=8.0,<9.0
httpx>=0.28,<1.0

@@ -1,335 +0,0 @@
"""Tests for the rule-based extractor and PersonMatcher."""
import pytest

from extractor import extract, extract_dates, extract_keywords, set_person_matcher
from person_matcher import PersonMatcher

# ── Shared test fixture ───────────────────────────────────────────────────────
_TEST_PERSONS = [
    ("Clara", "Cram"),
    ("Herbert", "Cram"),
    ("Eugenie", "de Gruyter"),
    ("Walter", "de Gruyter"),
    ("Marie", "Cram"),
    ("Juan", "Cram"),
    ("Hilde", "de Gruyter"),
    ("Hans", "de Gruyter"),
    ("Albert", "de Gruyter"),
    ("Anita", "Wöhler"),
    ("Else", "Bohrmann"),
    ("Lili", "Duvenbeck"),
]


@pytest.fixture(scope="session", autouse=True)
def seeded_matcher():
    """Load test persons into the global matcher before any test runs."""
    m = PersonMatcher()
    m.load(_TEST_PERSONS)
    set_person_matcher(m)
    return m
# ── PersonMatcher unit tests ──────────────────────────────────────────────────
class TestPersonMatcher:
    DE_PREPS = frozenset({"von", "vom", "an", "nach", "für"})

    def test_load_populates_names(self, seeded_matcher):
        assert len(seeded_matcher) > 0

    def test_exact_full_name_match(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Briefe von Clara Cram", self.DE_PREPS)
        assert hits == [("Clara Cram", "von")]

    def test_exact_first_name_only(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Briefe von Eugenie", self.DE_PREPS)
        assert hits == [("Eugenie", "von")]

    def test_exact_first_name_receiver(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Briefe an Herbert", self.DE_PREPS)
        assert hits == [("Herbert", "an")]

    def test_fuzzy_typo(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Briefe von Herrbert Cram", self.DE_PREPS)
        assert len(hits) == 1
        assert hits[0][1] == "von"

    def test_two_persons_extracted(self, seeded_matcher):
        hits = seeded_matcher.find_in_query(
            "Briefe von Clara Cram an Herbert Cram", self.DE_PREPS
        )
        assert len(hits) == 2
        assert hits[0][0] == "Clara Cram"
        assert hits[0][1] == "von"
        assert hits[1][0] == "Herbert Cram"
        assert hits[1][1] == "an"

    def test_no_match_for_place_name(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Reise nach Mexiko", self.DE_PREPS)
        assert hits == []

    def test_no_match_for_topic_word(self, seeded_matcher):
        hits = seeded_matcher.find_in_query("Briefe aus dem Krieg", self.DE_PREPS)
        assert hits == []

    def test_first_name_eugenie_regression(self, seeded_matcher):
        # spaCy NER missed standalone first names
        hits = seeded_matcher.find_in_query("Briefe von Eugenie", self.DE_PREPS)
        assert len(hits) == 1

    def test_merged_names_regression(self, seeded_matcher):
        # spaCy NER merged "Herbert an Eugenie de Gruyter" into one PER span
        hits = seeded_matcher.find_in_query(
            "Briefe von Herbert an Eugenie de Gruyter nach 1914", self.DE_PREPS
        )
        assert len(hits) == 2
        names = [h[0] for h in hits]
        assert "Herbert" in names
        assert "Eugenie de Gruyter" in names

    def test_english_preps(self, seeded_matcher):
        en_preps = frozenset({"from", "by", "to", "for"})
        hits = seeded_matcher.find_in_query(
            "Letters from Clara Cram to Walter de Gruyter in 1920", en_preps
        )
        assert len(hits) == 2
        assert hits[0][0] == "Clara Cram"
        assert hits[1][0] == "Walter de Gruyter"

    def test_double_preposition_de(self, seeded_matcher):
        hits = seeded_matcher.find_in_query(
            "Briefe von Clara nach Herbert", self.DE_PREPS
        )
        assert len(hits) == 2
        names = [h[0] for h in hits]
        assert "Clara" in names
        assert "Herbert" in names
# ── Date extraction tests ─────────────────────────────────────────────────────
class TestExtractDates:
    def test_bare_year_gives_range(self):
        assert extract_dates("Briefe 1920", "de") == ("1920-01-01", "1920-12-31")

    def test_im_jahr(self):
        assert extract_dates("Schriften im Jahr 1905", "de") == (
            "1905-01-01", "1905-12-31"
        )

    def test_vor_year(self):
        assert extract_dates("Briefe vor 1920", "de") == (None, "1920-12-31")

    def test_nach_year(self):
        assert extract_dates("Schriften nach 1920", "de") == ("1920-01-01", None)

    def test_zwischen(self):
        assert extract_dates("Dokumente zwischen 1914 und 1918", "de") == (
            "1914-01-01", "1918-12-31"
        )

    def test_before_en(self):
        assert extract_dates("Letters before 1918", "en") == (None, "1918-12-31")

    def test_after_en(self):
        assert extract_dates("Letters after 1939", "en") == ("1939-01-01", None)

    def test_between_en(self):
        assert extract_dates("Letters between 1914 and 1918", "en") == (
            "1914-01-01", "1918-12-31"
        )

    def test_antes_de_es(self):
        assert extract_dates("Cartas antes de 1900", "es") == (None, "1900-12-31")

    def test_entre_es(self):
        assert extract_dates("entre 1915 y 1920", "es") == (
            "1915-01-01", "1920-12-31"
        )

    def test_no_year(self):
        assert extract_dates("Briefe aus dem Krieg", "de") == (None, None)

    def test_nach_before_person_then_year(self):
        # "nach Marie 1920" — "nach" belongs to person, not date
        date_from, date_to = extract_dates("Briefe nach Marie 1920", "de", ["Marie"])
        assert date_from == "1920-01-01"
        assert date_to == "1920-12-31"

    def test_bare_year_alone(self):
        assert extract_dates("1918", "de") == ("1918-01-01", "1918-12-31")
# ── Keyword extraction tests ──────────────────────────────────────────────────
class TestExtractKeywords:
    def test_basic_topic_words(self):
        kws = extract_keywords("Briefe aus dem Krieg", "de", [], [])
        assert "krieg" in kws

    def test_stopwords_excluded(self):
        kws = extract_keywords("von der nach dem aus", "de", [], [])
        for sw in ("von", "der", "nach", "dem", "aus"):
            assert sw not in kws

    def test_person_spans_excluded(self):
        kws = extract_keywords(
            "Briefe von Clara Cram nach Herbert", "de",
            ["Clara Cram", "Herbert"], []
        )
        assert "clara" not in kws
        assert "cram" not in kws
        assert "herbert" not in kws

    def test_years_excluded(self):
        kws = extract_keywords("Schriften 1920 über Reise", "de", [], ["1920"])
        assert "1920" not in kws

    def test_deduplication(self):
        kws = extract_keywords("Krieg Krieg Krieg", "de", [], [])
        assert kws.count("krieg") == 1

    def test_en_stopwords(self):
        kws = extract_keywords("Letters about the war", "en", [], [])
        assert "the" not in kws
        assert "war" in kws

    def test_short_words_excluded(self):
        kws = extract_keywords("ab cd ef xy", "de", [], [])
        assert all(len(k) >= 3 for k in kws)
# ── Full pipeline integration tests ──────────────────────────────────────────
class TestExtract:
    def test_full_sentence_de(self):
        r = extract("Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "de")
        assert "Clara Cram" in r.personNames
        assert "Walter de Gruyter" in r.personNames
        assert r.personRole == "any"
        assert r.dateFrom == "1920-01-01"
        assert r.dateTo == "1920-12-31"

    def test_sender_role_de(self):
        r = extract("Briefe von Clara Cram vor 1910", "de")
        assert r.personNames == ["Clara Cram"]
        assert r.personRole == "sender"
        assert r.dateTo == "1910-12-31"
        assert r.dateFrom is None

    def test_receiver_role_de(self):
        r = extract("Briefe an Walter de Gruyter", "de")
        assert r.personNames == ["Walter de Gruyter"]
        assert r.personRole == "receiver"

    def test_first_name_only_eugenie(self):
        r = extract("Briefe von Eugenie", "de")
        assert "Eugenie" in r.personNames
        assert r.personRole == "sender"

    def test_first_name_only_herbert(self):
        r = extract("Kriegsbriefe von Herbert", "de")
        assert "Herbert" in r.personNames

    def test_merged_names_bug_fixed(self):
        r = extract("Briefe von Herbert an Eugenie de Gruyter nach 1914", "de")
        assert "Herbert" in r.personNames
        assert "Eugenie de Gruyter" in r.personNames
        assert r.dateFrom == "1914-01-01"

    def test_topic_only_krieg(self):
        r = extract("Briefe aus dem Krieg", "de")
        assert r.personNames == []
        assert "krieg" in r.keywords

    def test_topic_only_single_word(self):
        r = extract("Kriegspost", "de")
        assert r.personNames == []

    def test_date_range_only(self):
        r = extract("Dokumente zwischen 1914 und 1918", "de")
        assert r.personNames == []
        assert r.dateFrom == "1914-01-01"
        assert r.dateTo == "1918-12-31"

    def test_colloquial_von(self):
        r = extract("von Clara", "de")
        assert r.personNames == ["Clara"]
        assert r.personRole == "sender"

    def test_colloquial_an(self):
        r = extract("an Walter", "de")
        assert r.personNames == ["Walter"]
        assert r.personRole == "receiver"

    def test_bare_year_alone(self):
        r = extract("1918", "de")
        assert r.dateFrom == "1918-01-01"
        assert r.dateTo == "1918-12-31"
        assert r.personNames == []

    def test_english_full_sentence(self):
        r = extract("Letters from Clara Cram to Walter de Gruyter in 1920", "en")
        assert "Clara Cram" in r.personNames
        assert "Walter de Gruyter" in r.personNames
        assert r.dateFrom == "1920-01-01"

    def test_english_receiver_with_date(self):
        r = extract("Letters to Herbert Cram after 1939", "en")
        assert "Herbert Cram" in r.personNames
        assert r.personRole == "receiver"
        assert r.dateFrom == "1939-01-01"

    def test_english_birthday(self):
        r = extract("Birthday greetings from Anita Wöhler", "en")
        assert "Anita Wöhler" in r.personNames
        assert r.personRole == "sender"

    def test_english_between_dates(self):
        r = extract("Letters between 1914 and 1918", "en")
        assert r.dateFrom == "1914-01-01"
        assert r.dateTo == "1918-12-31"

    def test_spanish_full_sentence(self):
        r = extract("Cartas de Clara Cram a Walter de Gruyter en 1920", "es")
        assert "Clara Cram" in r.personNames
        assert "Walter de Gruyter" in r.personNames
        assert r.dateFrom == "1920-01-01"

    def test_spanish_before(self):
        r = extract("Cartas antes de 1900", "es")
        assert r.dateTo == "1900-12-31"
        assert r.dateFrom is None

    def test_rawquery_echoed(self):
        q = "test query"
        r = extract(q, "de")
        assert r.rawQuery == q

    def test_false_positive_compound_noun_regression(self):
        # spaCy tagged "Geburtstagsglückwünsche" as a PER entity
        r = extract("Geburtstagsglückwünsche", "de")
        assert r.personNames == []

    def test_question_phrasing(self):
        r = extract("Wer hat an Herbert Cram 1918 geschrieben?", "de")
        assert "Herbert Cram" in r.personNames
        assert r.personRole == "receiver"
        assert r.dateFrom == "1918-01-01"

    def test_lowercase_query(self):
        r = extract("briefe von clara cram an herbert 1920", "de")
        # Should still find persons despite lowercase
        assert len(r.personNames) >= 1

    def test_empty_matcher_returns_no_persons(self, seeded_matcher):
        from extractor import get_person_matcher, set_person_matcher

        original = get_person_matcher()
        try:
            set_person_matcher(PersonMatcher())
            r = extract("Briefe von Clara Cram", "de")
            assert r.personNames == []
        finally:
            set_person_matcher(original)

@@ -1,117 +0,0 @@
"""Integration tests for the FastAPI app."""
import pytest
from fastapi.testclient import TestClient
from extractor import set_person_matcher
from person_matcher import PersonMatcher
_TEST_PERSONS = [
("Clara", "Cram"),
("Herbert", "Cram"),
("Eugenie", "de Gruyter"),
("Walter", "de Gruyter"),
("Marie", "Cram"),
("Anita", "Wöhler"),
]
@pytest.fixture(scope="session")
def client():
# Pre-seed the matcher so the lifespan doesn't overwrite it with an empty one.
m = PersonMatcher()
m.load(_TEST_PERSONS)
set_person_matcher(m)
from main import app
with TestClient(app) as c:
yield c


def test_health(client):
    r = client.get("/health")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"
    assert r.json()["persons_loaded"] > 0


def test_parse_returns_200_with_all_fields(client):
    r = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
    assert r.status_code == 200
    d = r.json()
    assert "personNames" in d
    assert d["personRole"] in ("sender", "receiver", "any")
    assert "dateFrom" in d
    assert "dateTo" in d
    assert "keywords" in d
    assert d["rawQuery"] == "Briefe vor 1920"
    assert d["dateTo"] == "1920-12-31"


def test_parse_person_with_date(client):
    r = client.post(
        "/parse",
        json={"query": "Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "lang": "de"},
    )
    assert r.status_code == 200
    d = r.json()
    assert "Clara Cram" in d["personNames"]
    assert "Walter de Gruyter" in d["personNames"]
    assert d["dateFrom"] == "1920-01-01"
    assert d["dateTo"] == "1920-12-31"


def test_parse_unknown_lang_returns_422(client):
    r = client.post("/parse", json={"query": "test", "lang": "fr"})
    assert r.status_code == 422


def test_parse_missing_query_returns_422(client):
    r = client.post("/parse", json={"lang": "de"})
    assert r.status_code == 422


def test_parse_all_languages(client):
    cases = [
        ("de", "Briefe vor 1920"),
        ("en", "letters before 1920"),
        ("es", "cartas antes de 1920"),
    ]
    for lang, query in cases:
        r = client.post("/parse", json={"query": query, "lang": lang})
        assert r.status_code == 200, f"Failed for lang={lang}"
        assert r.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"


def test_fuzzy_threshold_valid_range():
    from main import _parse_fuzzy_threshold

    assert _parse_fuzzy_threshold("80") == 80
    assert _parse_fuzzy_threshold("0") == 0
    assert _parse_fuzzy_threshold("100") == 100


def test_fuzzy_threshold_out_of_range_raises():
    from main import _parse_fuzzy_threshold

    with pytest.raises(ValueError):
        _parse_fuzzy_threshold("101")
    with pytest.raises(ValueError):
        _parse_fuzzy_threshold("-1")
    with pytest.raises(ValueError):
        _parse_fuzzy_threshold("abc")


def test_parse_exceeds_max_length_returns_422(client):
    r = client.post("/parse", json={"query": "a" * 501, "lang": "de"})
    assert r.status_code == 422


def test_parse_internal_exception_does_not_leak_detail(client, monkeypatch):
    """500 errors must return generic message — never expose internal details."""
    import main as main_module

    def _boom(query, lang):
        # Simulate a failure whose message contains a secret-bearing connection string.
        raise RuntimeError("postgresql://archive_user:s3cr3t@db:5432/family_archive_db")

    monkeypatch.setattr(main_module, "extract", _boom)
    r = client.post("/parse", json={"query": "test", "lang": "de"})
    assert r.status_code == 500
    assert "s3cr3t" not in r.text
    assert r.json()["detail"] == "internal error"


@@ -1,126 +0,0 @@
# NLP Service — Test Sentences
Real data drawn from the Familienarchiv DB (2026-06-07).
Top persons: Clara Cram, Herbert Cram, Eugenie de Gruyter, Walter de Gruyter, Marie Cram,
Juan Cram, Albert de Gruyter, Hilde de Gruyter, Else Bohrmann, Anita Wöhler, Lili Duvenbeck.
Date range: ~1895–1945. Key tags: Krieg, Hochzeit, Reise, Geburtstag, Tod, Alltag, Briefwechsel.

---
## German — full sentences
```json
{"query": "Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "lang": "de"}
{"query": "Briefe von Herbert an Eugenie de Gruyter nach 1914", "lang": "de"}
{"query": "Schreiben von Albert de Gruyter an seine Kinder vor 1900", "lang": "de"}
{"query": "Briefe von Juan Cram an Marie zwischen 1915 und 1918", "lang": "de"}
{"query": "Telegramm von Walter de Gruyter an Clara im Jahr 1930", "lang": "de"}
{"query": "Briefe von Else Bohrmann an Herbert Cram nach 1939", "lang": "de"}
```
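
The fenced blocks in this file hold one request body per line (JSON Lines rather than a single JSON document), so they can be replayed mechanically. A minimal sketch, assuming the file is read as plain text; `iter_payloads` and `FENCE` are hypothetical names and not part of the service:

```python
import json
import re
from typing import Dict, Iterator

# Built at runtime so this example does not contain a literal fence marker.
FENCE = "`" * 3
_BLOCK = re.compile(FENCE + r"json\n(.*?)" + FENCE, re.DOTALL)


def iter_payloads(markdown: str) -> Iterator[Dict[str, str]]:
    """Yield one request body per non-empty line inside each json fence."""
    for block in _BLOCK.findall(markdown):
        for line in block.splitlines():
            if line.strip():
                yield json.loads(line)
```

Each yielded dict has the same shape the integration tests post to `/parse`, so a replay loop could pass it as `client.post("/parse", json=payload)`.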
## German — medium (person + date, no strong role signal)
```json
{"query": "Briefe von Clara Cram vor 1910", "lang": "de"}
{"query": "Dokumente über Walter de Gruyter aus den 1920er Jahren", "lang": "de"}
{"query": "Briefe an Herbert Cram nach dem Krieg", "lang": "de"}
{"query": "Schriften von Eugenie de Gruyter im Jahr 1905", "lang": "de"}
```
## German — short (person only)
```json
{"query": "Briefe an Walter de Gruyter", "lang": "de"}
{"query": "Dokumente über Clara Cram", "lang": "de"}
{"query": "Herbert Cram", "lang": "de"}
{"query": "Anita Wöhler", "lang": "de"}
```
## German — topic only (keywords → tag resolution on Java side)
```json
{"query": "Briefe aus dem Krieg", "lang": "de"}
{"query": "Kriegspost", "lang": "de"}
{"query": "Hochzeitsbriefe", "lang": "de"}
{"query": "Reisebriefe", "lang": "de"}
{"query": "Geburtstagsglückwünsche", "lang": "de"}
{"query": "Briefe über die Hochzeitsreise", "lang": "de"}
{"query": "Kinderbriefe", "lang": "de"}
{"query": "Familienbriefe aus dem Alltag", "lang": "de"}
{"query": "Brautbriefe", "lang": "de"}
{"query": "Kondolenzbriefe nach dem Tod von Eugenie", "lang": "de"}
```
## German — date range only
```json
{"query": "Briefe aus dem Ersten Weltkrieg", "lang": "de"}
{"query": "Dokumente zwischen 1914 und 1918", "lang": "de"}
{"query": "Briefe vor 1900", "lang": "de"}
{"query": "Schriften nach 1920", "lang": "de"}
```
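
The date expectations asserted by the integration tests (`Briefe vor 1920` yields `dateTo == "1920-12-31"`, and `im Jahr 1920` yields the full calendar year) can be illustrated with a small regex sketch. This is not the service's extractor: `parse_date_range` and the keyword tables are hypothetical, and the handling of "after" and of two-year ranges is assumed by symmetry rather than confirmed by the tests.

```python
import re
from typing import Optional, Tuple

YEAR = r"(1[89]\d{2}|20\d{2})"
# Hypothetical keyword tables; the real extractor's vocabulary is not shown here.
BEFORE = {"de": "vor", "en": "before", "es": "antes de"}
AFTER = {"de": "nach", "en": "after", "es": "después de"}


def parse_date_range(query: str, lang: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (dateFrom, dateTo) as ISO dates; None means unbounded."""
    before = re.search(rf"\b{BEFORE[lang]}\s+{YEAR}\b", query, re.IGNORECASE)
    if before:
        # Inclusive upper bound, matching the dateTo the integration tests expect.
        return None, f"{before.group(1)}-12-31"
    after = re.search(rf"\b{AFTER[lang]}\s+{YEAR}\b", query, re.IGNORECASE)
    if after:
        # Assumption: mirrors the "before" case; not confirmed by the tests.
        return f"{after.group(1)}-01-01", None
    years = re.findall(rf"\b{YEAR}\b", query)
    if len(years) == 1:
        # A bare year covers the whole calendar year.
        return f"{years[0]}-01-01", f"{years[0]}-12-31"
    if len(years) >= 2:
        # Assumption: several years span from the earliest to the latest.
        return f"{min(years)}-01-01", f"{max(years)}-12-31"
    return None, None
```

Named periods such as `Ersten Weltkrieg` and decade forms such as `1920er` are out of scope for this sketch and yield `(None, None)`.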
## German — combined (all fields)
```json
{"query": "Briefe von Clara Cram an ihre Kinder über die Reise nach Mexiko im Jahr 1925", "lang": "de"}
{"query": "Kriegspost von Herbert Cram an Eugenie de Gruyter zwischen 1916 und 1918", "lang": "de"}
{"query": "Glückwünsche von Hilde de Gruyter zur Hochzeit im Jahr 1910", "lang": "de"}
{"query": "Kondolenzschreiben an Walter de Gruyter nach dem Tod von Eugenie", "lang": "de"}
```
## English
```json
{"query": "Letters from Clara Cram to Walter de Gruyter in 1920", "lang": "en"}
{"query": "Letters about the war before 1918", "lang": "en"}
{"query": "Letters to Herbert Cram after 1939", "lang": "en"}
{"query": "Birthday greetings from Anita Wöhler", "lang": "en"}
{"query": "Letters between 1914 and 1918", "lang": "en"}
```
## Spanish
```json
{"query": "Cartas de Clara Cram a Walter de Gruyter en 1920", "lang": "es"}
{"query": "Cartas antes de 1900", "lang": "es"}
{"query": "Cartas después de la guerra", "lang": "es"}
{"query": "Cartas de Juan Cram a sus hijos entre 1915 y 1920", "lang": "es"}
```
---
## Edge cases — lazy / missing words / typos
```json
{"query": "Clara", "lang": "de"}
{"query": "Eugenie", "lang": "de"}
{"query": "Herbert", "lang": "de"}
{"query": "de Gruyter", "lang": "de"}
{"query": "Briefe von Klara Kram an Herbert", "lang": "de"}
{"query": "briefe von clara cram an herbert 1920", "lang": "de"}
{"query": "1918", "lang": "de"}
{"query": "1914 1918", "lang": "de"}
{"query": "Krieg", "lang": "de"}
{"query": "Briefe von Eugenie", "lang": "de"}
{"query": "Clara Cram Herbert Cram 1920", "lang": "de"}
{"query": "Wer hat an Herbert Cram 1918 geschrieben?", "lang": "de"}
{"query": "von Clara", "lang": "de"}
{"query": "an Walter", "lang": "de"}
{"query": "Clara 1920", "lang": "de"}
{"query": "Kriegsbriefe von Herbert", "lang": "de"}
{"query": "Briefe von Clara nach Herbert", "lang": "de"}
{"query": "Briefe von Herrbert Cram", "lang": "de"}
```
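
The typo cases above (`Klara Kram`, `Herrbert Cram`) only resolve if names are compared approximately against the known persons. A minimal sketch of that idea, using the standard library's `difflib` as a stand-in; the similarity function the service actually used is not shown here, and `best_fuzzy_match` is a hypothetical name. The 0 to 100 threshold scale follows the range accepted by `_parse_fuzzy_threshold` in the tests.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def best_fuzzy_match(
    candidate: str, known_names: Iterable[str], threshold: int = 80
) -> Optional[str]:
    """Return the most similar known name, or None if it scores below threshold."""
    best_name: Optional[str] = None
    best_score = -1.0
    for name in known_names:
        # ratio() is in [0, 1]; scale to 0-100 to match the threshold's range.
        score = 100 * SequenceMatcher(None, candidate.casefold(), name.casefold()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

A lower threshold tolerates more typos but also raises the risk of mapping an unrelated word to a person, which is the false-positive class listed in the table below.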
---
## Known spaCy failures now fixed by DB-backed matcher
| Query | spaCy result | Expected |
|---|---|---|
| `Briefe von Eugenie` | persons=[] | persons=["Eugenie"] |
| `Kriegsbriefe von Herbert` | keywords=["herbert"] | persons=["Herbert"] |
| `Briefe von Herbert an Eugenie de Gruyter nach 1914` | persons=["Herbert an Eugenie de Gruyter"] (merged!) | persons=["Herbert", "Eugenie de Gruyter"] |
| `Letters from Clara Cram to Walter de Gruyter` | persons=[] (EN model doesn't know German names) | persons=["Clara Cram", "Walter de Gruyter"] |
| `Geburtstagsglückwünsche` | persons=["Geburtstagsglückwünsche"] (false positive!) | persons=[] |
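
The rows above share one pattern: a lookup against known names avoids both merged spans and false positives, because only token sequences that exactly equal a known name are accepted. A minimal sketch of greedy longest-match lookup under that assumption; `build_aliases` and `match_persons` are hypothetical and do not reproduce the service's `PersonMatcher`:

```python
import re
from typing import List, Sequence, Set, Tuple


def build_aliases(persons: Sequence[Tuple[str, str]]) -> Set[str]:
    """Index each person by full name and by first name (case-insensitive)."""
    aliases: Set[str] = set()
    for first, last in persons:
        aliases.add(f"{first} {last}".casefold())
        aliases.add(first.casefold())
    return aliases


def match_persons(query: str, persons: Sequence[Tuple[str, str]]) -> List[str]:
    """Return known names found in the query, preferring the longest span."""
    aliases = build_aliases(persons)
    max_len = max((len(a.split()) for a in aliases), default=0)
    tokens = re.findall(r"\w+", query)
    found: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so "Eugenie de Gruyter"
        # wins over the bare first name "Eugenie".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span.casefold() in aliases:
                found.append(span)
                i += n
                break
        else:
            i += 1
    return found
```

Matched names keep the casing used in the query; mapping them back to a canonical spelling would be a separate step.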