spaCy NLP Service — Design Spec
Date: 2026-06-07 Status: Prototype
Problem
The current NL search uses Ollama (qwen2.5:7b-instruct-q4_K_M) to parse free-text queries into structured extractions (person names, dates, role, keywords). Inference takes 5–15 seconds per query, making the feature too slow to be useful compared to filling in the filter UI manually.
Goal
Build a standalone nlp-service/ prototype that replaces Ollama with spaCy for query parsing. The prototype is scoped to extraction quality evaluation — run it locally, curl it with real archive queries, and measure whether spaCy extracts names/dates/keywords well enough to justify a full migration. No Java-side changes in this iteration.
Extraction Contract
The service must produce an output compatible with the existing OllamaExtraction Java record:
| Field | Type | Description |
|---|---|---|
| personNames | string[] | Names of persons mentioned, left-to-right order |
| personRole | "sender" \| "receiver" \| "any" | Role of the person(s) in the document |
| dateFrom | string \| null | ISO 8601 date YYYY-MM-DD or null |
| dateTo | string \| null | ISO 8601 date YYYY-MM-DD or null |
| keywords | string[] | Content words, fuzzy-matched against tags by Java |
| rawQuery | string | Echo of the input query |
Two-person ordering: personNames must be in left-to-right span order. Java maps [0] → sender, [1] → receiver.
rawQuery note: In the current Java code rawQuery is set by the caller, not parsed from Ollama. The service echoes the input for convenience; the eventual RestClientSpacyClient will set it from the input directly, same as today.
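The contract above can be sketched as a plain Python dataclass. This is a minimal illustration using only the standard library; the spec places Pydantic types in models.py, and the class name Extraction and the method to_json are hypothetical names chosen for this example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Literal, Optional

Role = Literal["sender", "receiver", "any"]


@dataclass
class Extraction:
    """Hypothetical stand-in for the response type; field names follow the contract table."""

    rawQuery: str
    personNames: List[str] = field(default_factory=list)
    personRole: Role = "any"
    dateFrom: Optional[str] = None  # ISO 8601 YYYY-MM-DD, or None for JSON null
    dateTo: Optional[str] = None
    keywords: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict preserves the field names, so the JSON keys match the contract table.
        return json.dumps(asdict(self), ensure_ascii=False)


example = Extraction(
    rawQuery="Briefe von Opa Hermann an Marie vor 1920",
    personNames=["Opa Hermann", "Marie"],
    dateTo="1920-12-31",
    keywords=["brief"],
)
print(example.to_json())
```

Keeping the field names identical to the JSON keys avoids an alias layer; a Pydantic model could declare the same fields if request validation is wanted.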
Architecture
nlp-service/
├── main.py # FastAPI app — /parse and /health endpoints
├── extractor.py # NLP pipeline: NER → role → dates → keywords
├── models.py # Pydantic request/response types
├── requirements.txt
├── Dockerfile
└── CLAUDE.md
Sits alongside ocr-service/ in the repo. For the prototype it runs standalone (no docker-compose wiring).
Extraction Pipeline (extractor.py)
Five steps run in sequence on each query.
Step 1 — NER pass
Run spaCy on the query using the model for the requested language. Collect:
- All PER spans → candidates for personNames
- All DATE spans → raw text strings for step 3
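The collection step can be sketched without spaCy by using a small stand-in for entity spans. The Span class and collect_entities function are hypothetical; in the service the spans would come from doc.ents. Note that spaCy's stock English pipelines label persons PERSON rather than PER, and the stock German and Spanish news pipelines emit only PER, LOC, ORG and MISC, so the sketch accepts both person labels and simply assumes DATE spans are supplied by some component.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Both spellings are accepted because label names differ between pipelines.
PERSON_LABELS = {"PER", "PERSON"}


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a spaCy entity span (text, label, token offset)."""

    text: str
    label: str
    start: int


def collect_entities(spans: Iterable[Span]) -> Tuple[List[str], List[str]]:
    """Return (person names, raw date texts), each in left-to-right order."""
    ordered = sorted(spans, key=lambda s: s.start)
    persons = [s.text for s in ordered if s.label in PERSON_LABELS]
    dates = [s.text for s in ordered if s.label == "DATE"]
    return persons, dates


# Spans for "Briefe von Opa Hermann an Marie vor 1920", deliberately out of order.
spans = [
    Span("Marie", "PER", 5),
    Span("1920", "DATE", 7),
    Span("Opa Hermann", "PER", 2),
]
persons, dates = collect_entities(spans)
print(persons, dates)  # ['Opa Hermann', 'Marie'] ['1920']
```

Sorting by start offset makes the left-to-right ordering required by the contract explicit rather than relying on the order in which spans arrive.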
Step 2 — Role detection
Only relevant when exactly one PER entity is found. Walk the dependency tree of the PER span's root token; check if a governing case or prep token matches the sender or receiver preposition set for the language:
| Language | Sender prepositions | Receiver prepositions |
|---|---|---|
| de | von, vom | an, nach, für |
| en | from, by | to, for |
| es | de, por | para, a |
- One person + sender preposition → personRole = "sender"
- One person + receiver preposition → personRole = "receiver"
- One person + no match / two or more persons → personRole = "any"
Two-person queries always return "any" — Java derives direction from position.
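The decision rules above reduce to a table lookup once the governing preposition is known. The following is a minimal sketch: detect_role and its parameters are hypothetical, and the dependency-tree walk that finds governing_prep is assumed to happen elsewhere.

```python
from typing import Optional

# Preposition sets copied from the table above.
SENDER_PREPS = {"de": {"von", "vom"}, "en": {"from", "by"}, "es": {"de", "por"}}
RECEIVER_PREPS = {"de": {"an", "nach", "für"}, "en": {"to", "for"}, "es": {"para", "a"}}


def detect_role(lang: str, person_count: int, governing_prep: Optional[str]) -> str:
    """Return "sender", "receiver" or "any" for a query in the given language."""
    # Zero or several persons: direction is not decided here.
    if person_count != 1 or governing_prep is None:
        return "any"
    prep = governing_prep.lower()
    if prep in SENDER_PREPS[lang]:
        return "sender"
    if prep in RECEIVER_PREPS[lang]:
        return "receiver"
    return "any"


print(detect_role("de", 1, "von"))   # sender
print(detect_role("en", 1, "to"))    # receiver
print(detect_role("de", 2, "von"))   # any
```

Every unmatched case falls through to "any", which matches the safe default described in this spec.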
Step 3 — Date parsing
For each DATE span, inspect the token or tokens immediately before the span (some direction markers, such as antes de, span two tokens) to detect range direction:
| Direction token | Effect |
|---|---|
| vor / before / antes de | Span → dateTo |
| nach / after / después de | Span → dateFrom |
| zwischen…und / between…and / entre…y | Earlier span → dateFrom, later → dateTo |
| No direction token (bare year/date) | Span → both dateFrom and dateTo set to that year (year-range, Jan 1–Dec 31) |
dateparser.parse() with the PREFER_DAY_OF_MONTH setting set to "first" converts the span text to a Python datetime. When a dateTo value resolves to the start of a year (a bare year such as 1920), set it to Dec 31 of that year (mirrors RestClientOllamaClient.parseDate() behaviour).
Output as ISO strings (YYYY-MM-DD) or null.
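The direction table can be illustrated for the bare-year case with the standard library alone. This sketch deliberately leaves out dateparser and full dates; resolve_dates and year_range are hypothetical names, the direction marker is assumed to be passed in already lowercased, and treating any two-year input as a between-range is an assumption of the example.

```python
from datetime import date
from typing import List, Optional, Tuple

BEFORE = {"vor", "before", "antes de"}
AFTER = {"nach", "after", "después de"}


def year_range(year: int) -> Tuple[date, date]:
    """Jan 1 and Dec 31 of the given year."""
    return date(year, 1, 1), date(year, 12, 31)


def resolve_dates(direction: Optional[str], years: List[int]) -> Tuple[Optional[str], Optional[str]]:
    """Map a direction marker and bare-year spans to (dateFrom, dateTo) ISO strings."""
    if not years:
        return None, None
    if len(years) >= 2:
        # Assumed between...and case: earlier year opens the range, later year closes it.
        first, last = min(years), max(years)
        return year_range(first)[0].isoformat(), year_range(last)[1].isoformat()
    start, end = year_range(years[0])
    if direction in BEFORE:
        return None, end.isoformat()
    if direction in AFTER:
        return start.isoformat(), None
    # No direction marker: the whole year.
    return start.isoformat(), end.isoformat()


print(resolve_dates("vor", [1920]))   # (None, '1920-12-31')
print(resolve_dates(None, [1914]))    # ('1914-01-01', '1914-12-31')
```

The first call reproduces the dateTo value shown in the assembly example for "vor 1920".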
Step 4 — Keyword extraction
Collect tokens that satisfy all of:
- POS tag is NOUN or PROPN
- Not a stopword
- Not inside any NER span (PER or DATE)
- Lemma length ≥ 3
Output as lowercased lemmas. These are fuzzy-matched against the tags table by NlQueryParserService.resolveTags() on the Java side — no tag lookup in the Python service.
Examples:
- "Briefe aus dem Krieg" → keywords: ["brief", "krieg"]
- "Texte über Weihnachten" → keywords: ["text", "weihnachten"]
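The four filter conditions can be sketched as one list comprehension over token attributes. The Token class is a hypothetical stand-in for the spaCy token fields the filter reads (lemma_, pos_, is_stop, and entity membership), and the lemmas and POS tags below are assigned by hand for illustration rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Iterable, List

KEYWORD_POS = {"NOUN", "PROPN"}


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the spaCy token attributes used by the filter."""

    lemma: str
    pos: str
    is_stop: bool = False
    in_entity: bool = False  # True when the token lies inside a PER or DATE span


def extract_keywords(tokens: Iterable[Token]) -> List[str]:
    """Return lowercased lemmas of tokens that pass all four conditions."""
    return [
        t.lemma.lower()
        for t in tokens
        if t.pos in KEYWORD_POS
        and not t.is_stop
        and not t.in_entity
        and len(t.lemma) >= 3
    ]


# Hand-annotated tokens for "Briefe aus dem Krieg".
tokens = [
    Token("Brief", "NOUN"),
    Token("aus", "ADP", is_stop=True),
    Token("der", "DET", is_stop=True),
    Token("Krieg", "NOUN"),
]
print(extract_keywords(tokens))  # ['brief', 'krieg']
```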
Step 5 — Assembly
{
"personNames": ["Opa Hermann", "Marie"],
"personRole": "any",
"dateFrom": null,
"dateTo": "1920-12-31",
"keywords": ["brief"],
"rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
}
API
POST /parse
Request:
{ "query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de" }
lang is a required enum: "de" | "en" | "es". Unknown values → HTTP 422 (FastAPI validation).
Response: extraction object as above, HTTP 200.
Error: pipeline crash → HTTP 500 {"detail": "..."}.
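For scripted evaluation, the request can also be built from Python instead of curl. The sketch below uses only the standard library; build_parse_request is a hypothetical helper, and the base URL assumes the local dev port 8001 used elsewhere in this spec. It checks lang on the client side so that a typo fails before a round trip; the service would still answer unknown values with HTTP 422.

```python
import json
import urllib.request

SUPPORTED_LANGS = ("de", "en", "es")


def build_parse_request(
    query: str, lang: str, base_url: str = "http://localhost:8001"
) -> urllib.request.Request:
    """Build (but do not send) a POST /parse request."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang: {lang!r}")
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_parse_request("Briefe von Opa Hermann an Marie vor 1920", "de")
print(req.get_method(), req.full_url)

# With the service running, the request could be sent like this:
#   with urllib.request.urlopen(req) as resp:
#       extraction = json.load(resp)
```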
GET /health
Returns HTTP 200 {"status": "ok"} when all three models are loaded.
Language Models
| lang | spaCy model |
|---|---|
| de | de_core_news_sm |
| en | en_core_web_sm |
| es | es_core_news_sm |
All three models are loaded at startup and held in memory. Routing is by the lang field on the request.
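One way to hold the models and route by lang is a small registry. This is a sketch under assumptions: ModelRegistry, load_all, get and is_ready are hypothetical names, and the loader is injected so the example runs without spaCy installed. In the service the loader would presumably be spacy.load.

```python
from typing import Callable, Dict

# Mapping copied from the table above.
MODEL_NAMES = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}


class ModelRegistry:
    """Hypothetical holder for the per-language pipelines."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader
        self._models: Dict[str, object] = {}

    def load_all(self) -> None:
        # Called once at startup so every request finds its model in memory.
        for lang, name in MODEL_NAMES.items():
            self._models[lang] = self._loader(name)

    def get(self, lang: str) -> object:
        # Raises KeyError for a language that has not been loaded.
        return self._models[lang]

    def is_ready(self) -> bool:
        # A /health handler could report "ok" only when this is True.
        return all(lang in self._models for lang in MODEL_NAMES)


# A fake loader stands in for spacy.load in this example.
registry = ModelRegistry(loader=lambda name: f"pipeline:{name}")
print(registry.is_ready())  # False
registry.load_all()
print(registry.is_ready(), registry.get("de"))  # True pipeline:de_core_news_sm
```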
Dockerfile
Mirrors ocr-service/ — python:3.11-slim, non-root user, models baked into the image:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download de_core_news_sm \
&& python -m spacy download en_core_web_sm \
&& python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
&& chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
Image size: ~350 MB. No volume needed — models live in the image layer.
Local Dev
cd nlp-service
pip install -r requirements.txt
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
uvicorn main:app --reload --port 8001
curl -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
Known Limitations
- Historical names: spaCy models are trained on modern news corpora. Unusual 1899–1950 German names may not be tagged as PER. Mitigation: the Java resolveNames() already does fuzzy matching against the persons table, so partial name extraction is recoverable.
- Role detection: the preposition sets are a fixed enumeration (~12 tokens across 3 languages). Sentences that express direction without one of these prepositions will fall through to personRole = "any". This is acceptable: "any" is the safe default and searches both sender and receiver positions.
- "über Oma" ambiguity: if spaCy recognises "Oma" as a PER entity it lands in personNames (person search); if not, it lands in keywords (tag search via Java). Both paths return relevant results. The prototype evaluation will reveal which path dominates for real archive queries.
Out of Scope (prototype)
- docker-compose integration (Ollama replacement)
- Java-side changes (RestClientSpacyClient, rename OllamaClient → NlParserClient)
- Tag lookup inside the Python service
- Automated test suite (pytest fixtures); evaluation is done by curling the running service