docs(nlp): add spaCy NLP service prototype design spec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-06-07 09:40:00 +02:00
parent 6878419156
commit 03e22a2f26


@@ -0,0 +1,188 @@
# spaCy NLP Service — Design Spec
**Date:** 2026-06-07
**Status:** Prototype
## Problem
The current NL search uses Ollama (`qwen2.5:7b-instruct-q4_K_M`) to parse free-text queries into structured extractions (person names, dates, role, keywords). Inference takes 5–15 seconds per query, making the feature too slow to be useful compared to filling in the filter UI manually.
## Goal
Build a standalone `nlp-service/` prototype that replaces Ollama with spaCy for query parsing. The prototype is scoped to **extraction quality evaluation** — run it locally, curl it with real archive queries, and measure whether spaCy extracts names/dates/keywords well enough to justify a full migration. No Java-side changes in this iteration.
## Extraction Contract
The service must produce an output compatible with the existing `OllamaExtraction` Java record:
| Field | Type | Description |
|---|---|---|
| `personNames` | `string[]` | Names of persons mentioned, left-to-right order |
| `personRole` | `"sender"` \| `"receiver"` \| `"any"` | Role of the person(s) in the document |
| `dateFrom` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
| `dateTo` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
| `keywords` | `string[]` | Content words — fuzzy-matched against tags by Java |
| `rawQuery` | `string` | Echo of the input query |
**Two-person ordering:** `personNames` must be in left-to-right span order. Java maps `[0]` → sender, `[1]` → receiver.
**`rawQuery` note:** In the current Java code `rawQuery` is set by the caller, not parsed from Ollama. The service echoes the input for convenience; the eventual `RestClientSpacyClient` will set it from the input directly, same as today.
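A minimal sketch of the contract as a plain Python type, using only the standard library. The class name `Extraction` and the helper `order_names` are hypothetical; the prototype's `models.py` is expected to use Pydantic instead, and the `(start_char, text)` tuple shape is an assumption for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple

ROLES = {"sender", "receiver", "any"}


@dataclass
class Extraction:
    """Mirrors the fields of the table above (hypothetical Python type)."""
    rawQuery: str
    personNames: List[str] = field(default_factory=list)
    personRole: str = "any"
    dateFrom: Optional[str] = None  # ISO 8601 YYYY-MM-DD or None
    dateTo: Optional[str] = None    # ISO 8601 YYYY-MM-DD or None
    keywords: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.personRole not in ROLES:
            raise ValueError(f"invalid personRole: {self.personRole!r}")


def order_names(spans: List[Tuple[int, str]]) -> List[str]:
    """Return person names in left-to-right order of their start offset."""
    return [text for _, text in sorted(spans)]


extraction = Extraction(
    rawQuery="Briefe von Opa Hermann an Marie vor 1920",
    personNames=order_names([(26, "Marie"), (11, "Opa Hermann")]),
    dateTo="1920-12-31",
    keywords=["brief"],
)
payload = json.dumps(asdict(extraction), ensure_ascii=False)
```

Sorting by start offset keeps the `[0]` → sender, `[1]` → receiver mapping stable even if entities are collected out of order.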
## Architecture
```
nlp-service/
├── main.py # FastAPI app — /parse and /health endpoints
├── extractor.py # NLP pipeline: NER → role → dates → keywords
├── models.py # Pydantic request/response types
├── requirements.txt
├── Dockerfile
└── CLAUDE.md
```
Sits alongside `ocr-service/` in the repo. For the prototype it runs standalone (no docker-compose wiring).
## Extraction Pipeline (`extractor.py`)
Five steps run in sequence on each query.
### Step 1 — NER pass
Run spaCy on the query using the model for the requested language. Collect:
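- All `PER` spans (`PERSON` in `en_core_web_sm`, which uses OntoNotes labels) → candidates for `personNames`
- All `DATE` spans → raw text strings for step 3. Of the three stock models only `en_core_web_sm` emits `DATE`; `de_core_news_sm` and `es_core_news_sm` label only `LOC`, `MISC`, `ORG` and `PER`, so German and Spanish dates would need a rule-based fallback (for example a spaCy `EntityRuler` pattern for years)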
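The collection step can be sketched as below. To stay runnable without a model download, entities are modelled as a hypothetical `Ent` tuple standing in for spaCy spans (`span.text`, `span.label_`, `span.start_char`); accepting both `PER` and `PERSON` is an assumption that covers the English pipeline's label set.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Ent(NamedTuple):
    """Stand-in for a spaCy entity span (hypothetical helper type)."""
    text: str
    label: str
    start_char: int


PERSON_LABELS = {"PER", "PERSON"}


def split_entities(ents: Iterable[Ent]) -> Tuple[List[str], List[str]]:
    """Return (person texts, date texts), each in left-to-right order."""
    ordered = sorted(ents, key=lambda e: e.start_char)
    persons = [e.text for e in ordered if e.label in PERSON_LABELS]
    dates = [e.text for e in ordered if e.label == "DATE"]
    return persons, dates


persons, dates = split_entities([
    Ent("1920", "DATE", 36),
    Ent("Marie", "PERSON", 26),
    Ent("Opa Hermann", "PER", 11),
    Ent("Berlin", "LOC", 50),
])
```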
### Step 2 — Role detection
Only relevant when exactly **one** PER entity is found. Walk the dependency tree of the PER span's root token; check if a governing `case` or `prep` token matches the sender or receiver preposition set for the language:
| Language | Sender prepositions | Receiver prepositions |
|---|---|---|
| `de` | von, vom | an, nach, für |
| `en` | from, by | to, for |
| `es` | de, por | para, a |
- One person + sender preposition → `personRole = "sender"`
- One person + receiver preposition → `personRole = "receiver"`
- One person + no match / two or more persons → `personRole = "any"`
Two-person queries always return `"any"` — Java derives direction from position.
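The decision table above reduces to a small lookup. This sketch assumes the dependency walk has already produced the text of the preposition attached to the person span (the `preposition` parameter is hypothetical, as is the function name `detect_role`).

```python
from typing import Optional

SENDER_PREPS = {
    "de": {"von", "vom"},
    "en": {"from", "by"},
    "es": {"de", "por"},
}
RECEIVER_PREPS = {
    "de": {"an", "nach", "für"},
    "en": {"to", "for"},
    "es": {"para", "a"},
}


def detect_role(lang: str, person_count: int, preposition: Optional[str] = None) -> str:
    """Map one person plus its preposition to sender/receiver, else 'any'."""
    if person_count != 1 or preposition is None:
        return "any"
    prep = preposition.lower()
    if prep in SENDER_PREPS[lang]:
        return "sender"
    if prep in RECEIVER_PREPS[lang]:
        return "receiver"
    return "any"
```

For example, `detect_role("de", 1, "von")` yields `"sender"`, while any two-person query yields `"any"` regardless of the preposition.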
### Step 3 — Date parsing
For each DATE span, inspect the token immediately before the span to detect range direction:
| Direction token | Effect |
|---|---|
| vor / before / antes de | Span → `dateTo` |
| nach / after / después de | Span → `dateFrom` |
| zwischen…und / between…and / entre…y | Earlier span → `dateFrom`, later → `dateTo` |
| No direction token (bare year/date) | Span → both `dateFrom` and `dateTo` set to that year (year-range, Jan 1–Dec 31) |
`dateparser.parse()` with the setting `PREFER_DAY_OF_MONTH` set to `first` converts the span text to a Python `datetime`, of which only the date part is kept. For `dateTo` results that resolve to a year boundary, set the value to Dec 31 of that year (mirrors `RestClientOllamaClient.parseDate()` behaviour).
Output as ISO strings (`YYYY-MM-DD`) or `null`.
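A sketch of the range logic for the common case where each DATE span is a bare year. It deliberately leaves out `dateparser` and takes already-parsed years; the function name `resolve_range`, the direction sets, and the choice of Jan 1 for an "after" bound are assumptions made for illustration.

```python
from datetime import date
from typing import List, Optional, Tuple

BEFORE = {"vor", "before", "antes de"}
AFTER = {"nach", "after", "después de"}


def resolve_range(years: List[int], direction: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (dateFrom, dateTo) as ISO strings or None."""
    if not years:
        return None, None
    if direction in BEFORE:
        # Upper bound only: end of the named year.
        return None, date(years[0], 12, 31).isoformat()
    if direction in AFTER:
        # Lower bound only: start of the named year (assumption).
        return date(years[0], 1, 1).isoformat(), None
    # "between ... and", or a bare year: earliest start to latest end.
    first, last = min(years), max(years)
    return date(first, 1, 1).isoformat(), date(last, 12, 31).isoformat()
```

With this sketch, "vor 1920" gives `dateTo = "1920-12-31"` and no `dateFrom`, which matches the assembly example in Step 5.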
### Step 4 — Keyword extraction
Collect tokens that satisfy all of:
- POS tag is `NOUN` or `PROPN`
- Not a stopword
- Not inside any NER span (PER or DATE)
- Lemma length ≥ 3
Output as lowercased lemmas. These are fuzzy-matched against the tags table by `NlQueryParserService.resolveTags()` on the Java side — no tag lookup in the Python service.
Examples:
- "Briefe aus dem Krieg" → `keywords: ["brief", "krieg"]`
- "Texte über Weihnachten" → `keywords: ["text", "weihnachten"]`
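The four filter conditions can be sketched as follows. Tokens are modelled as a hypothetical `Tok` tuple so the example runs without a model; in spaCy the corresponding attributes would be `token.lemma_`, `token.pos_`, `token.is_stop`, and a non-empty `token.ent_type_`.

```python
from typing import Iterable, List, NamedTuple


class Tok(NamedTuple):
    """Stand-in for a spaCy token (hypothetical helper type)."""
    lemma: str
    pos: str
    is_stop: bool
    in_entity: bool


KEYWORD_POS = {"NOUN", "PROPN"}


def extract_keywords(tokens: Iterable[Tok]) -> List[str]:
    """Keep lowercased lemmas of content nouns outside any entity span."""
    return [
        t.lemma.lower()
        for t in tokens
        if t.pos in KEYWORD_POS
        and not t.is_stop
        and not t.in_entity
        and len(t.lemma) >= 3
    ]


# "Briefe aus dem Krieg", with lemmas as a German pipeline might assign them
keywords = extract_keywords([
    Tok("Brief", "NOUN", False, False),
    Tok("aus", "ADP", True, False),
    Tok("der", "DET", True, False),
    Tok("Krieg", "NOUN", False, False),
])
```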
### Step 5 — Assembly
```json
{
"personNames": ["Opa Hermann", "Marie"],
"personRole": "any",
"dateFrom": null,
"dateTo": "1920-12-31",
"keywords": ["brief"],
"rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
}
```
## API
### `POST /parse`
**Request:**
```json
{ "query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de" }
```
`lang` is a required enum: `"de"` | `"en"` | `"es"`. Unknown values → HTTP 422 (FastAPI validation).
**Response:** extraction object as above, HTTP 200.
**Error:** pipeline crash → HTTP 500 `{"detail": "..."}`.
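For evaluation scripts, a standard-library client might look like the sketch below. The function names and the default base URL (taken from the local dev port) are assumptions; the request is built separately from sending so it can be inspected without a running service.

```python
import json
import urllib.request

SUPPORTED_LANGS = ("de", "en", "es")


def build_parse_request(query: str, lang: str, base_url: str = "http://localhost:8001") -> urllib.request.Request:
    """Build the POST /parse request without sending it."""
    if lang not in SUPPORTED_LANGS:
        # The service would answer 422; fail early on the client side.
        raise ValueError(f"unsupported lang: {lang!r}")
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse(query: str, lang: str) -> dict:
    """Send the request; urllib raises HTTPError on 422 and 500 responses."""
    with urllib.request.urlopen(build_parse_request(query, lang), timeout=10) as resp:
        return json.load(resp)
```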
### `GET /health`
Returns HTTP 200 `{"status": "ok"}` when all three models are loaded.
## Language Models
| `lang` | spaCy model |
|---|---|
| `de` | `de_core_news_sm` |
| `en` | `en_core_web_sm` |
| `es` | `es_core_news_sm` |
All three models are loaded at startup and held in memory. Routing is by the `lang` field on the request.
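Startup loading and routing could be sketched as a small registry. The names `MODEL_NAMES`, `load_models`, and `all_loaded` are hypothetical; the `loader` parameter exists only so the sketch can be exercised without the models installed, and defaults to `spacy.load`.

```python
from typing import Any, Callable, Dict, Optional

MODEL_NAMES = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}


def load_models(loader: Optional[Callable[[str], Any]] = None) -> Dict[str, Any]:
    """Load every pipeline once at startup, keyed by the request's `lang`."""
    if loader is None:
        import spacy  # imported lazily so the sketch runs without spaCy
        loader = spacy.load
    return {lang: loader(name) for lang, name in MODEL_NAMES.items()}


def all_loaded(models: Dict[str, Any]) -> bool:
    """Condition behind GET /health returning {"status": "ok"}."""
    return all(models.get(lang) is not None for lang in MODEL_NAMES)


# Fake loader for illustration; the service would call load_models() as is.
models = load_models(loader=lambda name: f"<{name}>")
```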
## Dockerfile
Mirrors `ocr-service/`: `python:3.11-slim`, non-root user, models baked into the image:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download de_core_news_sm \
&& python -m spacy download en_core_web_sm \
&& python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
&& chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```
Image size: ~350 MB. No volume needed — models live in the image layer.
## Local Dev
```bash
cd nlp-service
pip install -r requirements.txt
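# spacy download takes one model per call; extra arguments are forwarded to pip
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm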
uvicorn main:app --reload --port 8001
curl -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
```
## Known Limitations
- **Historical names:** spaCy models are trained on modern news corpora. Unusual 1899–1950 German names may not score as `PER`. Mitigation: the Java `resolveNames()` already does fuzzy matching against the persons table, so partial name extraction is recoverable.
- **Role detection:** the preposition sets are a fixed enumeration (~12 tokens across 3 languages). Sentences that express direction without one of these prepositions will fall through to `personRole = "any"`. This is acceptable — `"any"` is the safe default and searches both sender and receiver positions.
- **"über Oma" ambiguity:** if spaCy recognises "Oma" as a PER entity it lands in `personNames` (person search); if not, it lands in `keywords` (tag search via Java). Both paths return relevant results. The prototype evaluation will reveal which path dominates for real archive queries.
## Out of Scope (prototype)
- docker-compose integration (Ollama replacement)
- Java-side changes (`RestClientSpacyClient`, rename `OllamaClient` → `NlParserClient`)
- Tag lookup inside the Python service
- Automated test suite (pytest fixtures) — evaluation is done by curling the running service