diff --git a/docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md b/docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md
new file mode 100644
index 00000000..cc051314
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md
@@ -0,0 +1,188 @@
+# spaCy NLP Service — Design Spec
+
+**Date:** 2026-06-07
+**Status:** Prototype
+
+## Problem
+
+The current NL search uses Ollama (`qwen2.5:7b-instruct-q4_K_M`) to parse free-text queries into structured extractions (person names, dates, role, keywords). Inference takes 5–15 seconds per query, which makes the feature too slow to be useful compared with filling in the filter UI manually.
+
+## Goal
+
+Build a standalone `nlp-service/` prototype that replaces Ollama with spaCy for query parsing. The prototype is scoped to **extraction quality evaluation**: run it locally, curl it with real archive queries, and measure whether spaCy extracts names, dates, and keywords well enough to justify a full migration. No Java-side changes in this iteration.
+
+## Extraction Contract
+
+The service must produce output compatible with the existing `OllamaExtraction` Java record:
+
+| Field | Type | Description |
+|---|---|---|
+| `personNames` | `string[]` | Names of persons mentioned, left-to-right order |
+| `personRole` | `"sender"` \| `"receiver"` \| `"any"` | Role of the person(s) in the document |
+| `dateFrom` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
+| `dateTo` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
+| `keywords` | `string[]` | Content words — fuzzy-matched against tags by Java |
+| `rawQuery` | `string` | Echo of the input query |
+
+**Two-person ordering:** `personNames` must be in left-to-right span order. Java maps `[0]` → sender, `[1]` → receiver.
+
+**`rawQuery` note:** In the current Java code `rawQuery` is set by the caller, not parsed from Ollama. The service echoes the input for convenience; the eventual `RestClientSpacyClient` will set it from the input directly, same as today.
+
+## Architecture
+
+```
+nlp-service/
+├── main.py           # FastAPI app — /parse and /health endpoints
+├── extractor.py      # NLP pipeline: NER → role → dates → keywords
+├── models.py         # Pydantic request/response types
+├── requirements.txt
+├── Dockerfile
+└── CLAUDE.md
+```
+
+The service sits alongside `ocr-service/` in the repo. For the prototype it runs standalone (no docker-compose wiring).
+
+## Extraction Pipeline (`extractor.py`)
+
+Five steps run in sequence on each query.
+
+### Step 1 — NER pass
+
+Run spaCy on the query using the model for the requested language. Collect:
+- All `PER` spans → candidates for `personNames`
+- All `DATE` spans → raw text strings for step 3
+
+Entity labels differ by model: `en_core_web_sm` tags persons as `PERSON`, not `PER`, and the `de`/`es` news models ship only `LOC`, `MISC`, `ORG`, and `PER`, so they produce no `DATE` spans. Step 3 therefore needs a fallback date detector (for example a year regex) for those two languages.
+
+### Step 2 — Role detection
+
+Only relevant when exactly **one** PER entity is found. Walk the dependency tree of the PER span's root token; check if a governing `case` or `prep` token matches the sender or receiver preposition set for the language:
+
+| Language | Sender prepositions | Receiver prepositions |
+|---|---|---|
+| `de` | von, vom | an, nach, für |
+| `en` | from, by | to, for |
+| `es` | de, por | para, a |
+
+- One person + sender preposition → `personRole = "sender"`
+- One person + receiver preposition → `personRole = "receiver"`
+- One person + no match / two or more persons → `personRole = "any"`
+
+Two-person queries always return `"any"` — Java derives direction from position.
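The role rule in Step 2 could be sketched roughly as follows. This is a simplified stand-in, not the planned implementation: it checks the token directly before the person span instead of walking spaCy's dependency tree, and the names `SENDER_PREPS`, `RECEIVER_PREPS`, and `detect_role` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Preposition sets copied from the Step 2 table.
SENDER_PREPS: Dict[str, FrozenSet[str]] = {
    "de": frozenset({"von", "vom"}),
    "en": frozenset({"from", "by"}),
    "es": frozenset({"de", "por"}),
}
RECEIVER_PREPS: Dict[str, FrozenSet[str]] = {
    "de": frozenset({"an", "nach", "für"}),
    "en": frozenset({"to", "for"}),
    "es": frozenset({"para", "a"}),
}


def detect_role(
    tokens: List[str], person_spans: List[Tuple[int, int]], lang: str
) -> str:
    """Return "sender", "receiver", or "any" for a tokenized query.

    person_spans holds (start, end) token indices, end exclusive.
    Assumption: the preposition sits directly before the name. The spec
    walks the dependency tree instead, which also copes with intervening words.
    """
    # Zero or several persons: role is always "any" (Java uses position).
    if len(person_spans) != 1:
        return "any"
    start, _ = person_spans[0]
    if start == 0:
        return "any"  # nothing precedes the name
    previous = tokens[start - 1].lower()
    if previous in SENDER_PREPS[lang]:
        return "sender"
    if previous in RECEIVER_PREPS[lang]:
        return "receiver"
    return "any"


if __name__ == "__main__":
    query_tokens = "Briefe von Opa Hermann".split()
    print(detect_role(query_tokens, [(2, 4)], "de"))  # prints: sender
```

Keeping the preposition sets as per-language lookup tables means adding a language only touches data, not the decision logic.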
+
+### Step 3 — Date parsing
+
+For each DATE span, inspect the token (or two-token phrase, as in "antes de") immediately before the span to detect range direction:
+
+| Direction token | Effect |
+|---|---|
+| vor / before / antes de | Span → `dateTo` |
+| nach / after / después de | Span → `dateFrom` |
+| zwischen…und / between…and / entre…y | Earlier span → `dateFrom`, later → `dateTo` |
+| No direction token (bare year/date) | Span → both `dateFrom` and `dateTo`; a bare year expands to the full year (Jan 1–Dec 31) |
+
+`dateparser.parse()` with the `PREFER_DAY_OF_MONTH` setting set to `first` converts the span text to a Python `datetime`, from which the date part is taken. When a `dateTo` value comes from a year-only span, set it to Dec 31 of that year (mirrors `RestClientOllamaClient.parseDate()` behaviour).
+
+Output as ISO strings (`YYYY-MM-DD`) or `null`.
+
+### Step 4 — Keyword extraction
+
+Collect tokens that satisfy all of:
+- POS tag is `NOUN` or `PROPN`
+- Not a stopword
+- Not inside any NER span (PER or DATE)
+- Lemma length ≥ 3
+
+Output as lowercased lemmas. These are fuzzy-matched against the tags table by `NlQueryParserService.resolveTags()` on the Java side — no tag lookup in the Python service.
+
+Examples:
+- "Briefe aus dem Krieg" → `keywords: ["brief", "krieg"]`
+- "Texte über Weihnachten" → `keywords: ["text", "weihnachten"]`
+
+### Step 5 — Assembly
+
+```json
+{
+  "personNames": ["Opa Hermann", "Marie"],
+  "personRole": "any",
+  "dateFrom": null,
+  "dateTo": "1920-12-31",
+  "keywords": ["brief"],
+  "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
+}
+```
+
+## API
+
+### `POST /parse`
+
+**Request:**
+```json
+{ "query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de" }
+```
+
+`lang` is a required enum: `"de"` | `"en"` | `"es"`. Unknown values → HTTP 422 (FastAPI validation).
+
+**Response:** extraction object as above, HTTP 200.
+
+**Error:** pipeline crash → HTTP 500 `{"detail": "..."}`.
+
+### `GET /health`
+
+Returns HTTP 200 `{"status": "ok"}` when all three models are loaded.
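The direction rules from Step 3 can be illustrated for the simplest case, a year-only span. This is a sketch under stated assumptions rather than the service's code: it skips `dateparser` entirely, handles only plain years, assumes a year-only `dateFrom` resolves to Jan 1, and the names `resolve_year` and `resolve_between` are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

# Direction tokens copied from the Step 3 table.
BEFORE_TOKENS = {"vor", "before", "antes de"}
AFTER_TOKENS = {"nach", "after", "después de"}

DateRange = Tuple[Optional[str], Optional[str]]  # (dateFrom, dateTo)


def resolve_year(year: int, direction: Optional[str]) -> DateRange:
    """Map a year plus an optional direction token to ISO date strings."""
    first_day = date(year, 1, 1).isoformat()
    last_day = date(year, 12, 31).isoformat()
    token = (direction or "").lower()
    if token in BEFORE_TOKENS:
        # "vor 1920" fills dateTo only, using Dec 31 as the spec describes.
        return None, last_day
    if token in AFTER_TOKENS:
        # Assumption: a year-only dateFrom starts on Jan 1.
        return first_day, None
    # Bare year: the whole year becomes the range.
    return first_day, last_day


def resolve_between(year_a: int, year_b: int) -> DateRange:
    """Handle "zwischen A und B": earlier year opens, later year closes."""
    earlier, later = sorted((year_a, year_b))
    return date(earlier, 1, 1).isoformat(), date(later, 12, 31).isoformat()


if __name__ == "__main__":
    print(resolve_year(1920, "vor"))    # (None, '1920-12-31')
    print(resolve_year(1914, None))     # ('1914-01-01', '1914-12-31')
    print(resolve_between(1918, 1914))  # ('1914-01-01', '1918-12-31')
```

The `vor 1920` result matches the `dateTo` value in the Step 5 assembly example.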
+
+## Language Models
+
+| `lang` | spaCy model |
+|---|---|
+| `de` | `de_core_news_sm` |
+| `en` | `en_core_web_sm` |
+| `es` | `es_core_news_sm` |
+
+All three models are loaded at startup and held in memory. Routing is by the `lang` field on the request.
+
+## Dockerfile
+
+The Dockerfile mirrors `ocr-service/`: `python:3.11-slim`, non-root user, models baked into the image:
+
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+RUN python -m spacy download de_core_news_sm \
+    && python -m spacy download en_core_web_sm \
+    && python -m spacy download es_core_news_sm
+COPY . .
+RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
+    && chown -R nlp:nlp /app
+USER nlp
+EXPOSE 8001
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
+```
+
+Image size: ~350 MB. No volume needed — models live in the image layer.
+
+## Local Dev
+
+```bash
+cd nlp-service
+pip install -r requirements.txt
+# spacy download takes one model per invocation
+python -m spacy download de_core_news_sm
+python -m spacy download en_core_web_sm
+python -m spacy download es_core_news_sm
+uvicorn main:app --reload --port 8001
+
+# In a second terminal, once the server is up:
+curl -X POST http://localhost:8001/parse \
+  -H "Content-Type: application/json" \
+  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
+```
+
+## Known Limitations
+
+- **Historical names:** spaCy models are trained on modern news corpora. Unusual 1899–1950 German names may not score as `PER`. Mitigation: the Java `resolveNames()` already does fuzzy matching against the persons table, so partial name extraction is recoverable.
+- **Role detection:** the preposition sets are a fixed enumeration (~12 tokens across 3 languages). Sentences that express direction without one of these prepositions will fall through to `personRole = "any"`. This is acceptable — `"any"` is the safe default and searches both sender and receiver positions.
+- **"über Oma" ambiguity:** if spaCy recognises "Oma" as a PER entity it lands in `personNames` (person search); if not, it lands in `keywords` (tag search via Java). Both paths return relevant results. The prototype evaluation will reveal which path dominates for real archive queries.
+
+## Out of Scope (prototype)
+
+- docker-compose integration (Ollama replacement)
+- Java-side changes (`RestClientSpacyClient`, rename `OllamaClient` → `NlParserClient`)
+- Tag lookup inside the Python service
+- Automated test suite (pytest fixtures) — evaluation is done by curling the running service
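Because evaluation is done by hand against the running service, the curl call from the Local Dev section could also be scripted for batches of queries. The following is only a sketch using the standard library; it assumes the service listens on `localhost:8001` as in Local Dev, and `build_parse_request` and `parse_query` are hypothetical helper names.

```python
import json
import urllib.request

SUPPORTED_LANGS = ("de", "en", "es")
# Port taken from the Local Dev section; adjust if the service runs elsewhere.
PARSE_URL = "http://localhost:8001/parse"


def build_parse_request(
    query: str, lang: str, url: str = PARSE_URL
) -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    if lang not in SUPPORTED_LANGS:
        # The service would answer 422; failing early saves a round trip.
        raise ValueError(f"unsupported lang: {lang!r}")
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str) -> dict:
    """Send one query and return the extraction; needs the service running."""
    request = build_parse_request(query, lang)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the service up, `parse_query("Briefe von Opa Hermann an Marie vor 1920", "de")` should return the extraction object described in the spec.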