Marcel 03e22a2f26 docs(nlp): add spaCy NLP service prototype design spec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-07 09:40:00 +02:00

spaCy NLP Service — Design Spec

Date: 2026-06-07 Status: Prototype

Problem

The current NL search uses Ollama (qwen2.5:7b-instruct-q4_K_M) to parse free-text queries into structured extractions (person names, dates, role, keywords). Inference takes 5–15 seconds per query, making the feature too slow to be useful compared to filling in the filter UI manually.

Goal

Build a standalone nlp-service/ prototype that replaces Ollama with spaCy for query parsing. The prototype is scoped to extraction quality evaluation — run it locally, curl it with real archive queries, and measure whether spaCy extracts names/dates/keywords well enough to justify a full migration. No Java-side changes in this iteration.

Extraction Contract

The service must produce an output compatible with the existing OllamaExtraction Java record:

| Field | Type | Description |
|---|---|---|
| `personNames` | `string[]` | Names of persons mentioned, left-to-right order |
| `personRole` | `"sender"` \| `"receiver"` \| `"any"` | Role of the person(s) in the document |
| `dateFrom` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
| `dateTo` | `string \| null` | ISO 8601 date `YYYY-MM-DD` or null |
| `keywords` | `string[]` | Content words, fuzzy-matched against tags by Java |
| `rawQuery` | `string` | Echo of the input query |

Two-person ordering: personNames must be in left-to-right span order. Java maps [0] → sender, [1] → receiver.

rawQuery note: In the current Java code rawQuery is set by the caller, not parsed from Ollama. The service echoes the input for convenience; the eventual RestClientSpacyClient will set it from the input directly, same as today.
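
As a rough illustration of the contract, the sketch below models the response with a stdlib dataclass instead of the Pydantic types that models.py is expected to hold. The class name `Extraction`, the `VALID_ROLES` tuple and the validation rules are assumptions made for this example; only the field names and types come from the table above.

```python
from __future__ import annotations

import json
import re
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Optional

VALID_ROLES = ("sender", "receiver", "any")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


@dataclass
class Extraction:
    """Stdlib stand-in for the response model (hypothetical name)."""

    rawQuery: str
    personNames: list[str] = field(default_factory=list)
    personRole: str = "any"
    dateFrom: Optional[str] = None
    dateTo: Optional[str] = None
    keywords: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.personRole not in VALID_ROLES:
            raise ValueError(f"personRole must be one of {VALID_ROLES}")
        for name in ("dateFrom", "dateTo"):
            value = getattr(self, name)
            if value is None:
                continue
            # Enforce the YYYY-MM-DD shape, then let date() reject impossible days.
            if not ISO_DATE.fullmatch(value):
                raise ValueError(f"{name} must be YYYY-MM-DD or null")
            date.fromisoformat(value)

    def to_json(self) -> str:
        # ensure_ascii=False keeps umlauts and accents readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Because `personNames` is a plain list, the left-to-right ordering that Java relies on is preserved as long as the extractor appends names in span order.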

Architecture

nlp-service/
├── main.py        # FastAPI app — /parse and /health endpoints
├── extractor.py   # NLP pipeline: NER → role → dates → keywords
├── models.py      # Pydantic request/response types
├── requirements.txt
├── Dockerfile
└── CLAUDE.md

Sits alongside ocr-service/ in the repo. For the prototype it runs standalone (no docker-compose wiring).

Extraction Pipeline (`extractor.py`)

Five steps run in sequence on each query.

Step 1 — NER pass

Run spaCy on the query using the model for the requested language. Collect:

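All person spans → candidates for personNames. The label is PER in the German and Spanish models and PERSON in en_core_web_sm, so both must be accepted.
All DATE spans → raw text strings for step 3. The stock de_core_news_sm and es_core_news_sm NER components emit only LOC, MISC, ORG and PER, so for de and es the date spans would need a rule-based fallback (for example an entity ruler or a year regex).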
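
A minimal sketch of the collection step, written against a hypothetical `Span` tuple rather than real spaCy objects so that it runs without a model. With spaCy, the same three values would come from `ent.label_`, `ent.start_char` and `ent.text` for each entity in `doc.ents`. The function name `collect_entities` is an assumption.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a spaCy entity span."""

    label: str
    start: int
    text: str


# German and Spanish news models label persons PER; the English web model uses PERSON.
PERSON_LABELS = {"PER", "PERSON"}


def collect_entities(spans):
    """Return (person_names, date_texts), both in left-to-right span order."""
    ordered = sorted(spans, key=lambda s: s.start)
    person_names = [s.text for s in ordered if s.label in PERSON_LABELS]
    date_texts = [s.text for s in ordered if s.label == "DATE"]
    return person_names, date_texts
```

Sorting by start offset is what guarantees the left-to-right order that the Java side maps to sender and receiver.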

Step 2 — Role detection

Only relevant when exactly one PER entity is found. Walk the dependency tree of the PER span's root token; check if a governing case or prep token matches the sender or receiver preposition set for the language:

| Language | Sender prepositions | Receiver prepositions |
|---|---|---|
| `de` | von, vom | an, nach, für |
| `en` | from, by | to, for |
| `es` | de, por | para, a |

One person + sender preposition → personRole = "sender"
One person + receiver preposition → personRole = "receiver"
One person + no match / two or more persons → personRole = "any"

Two-person queries always return "any" — Java derives direction from position.
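
The decision rules above can be sketched as a small pure function. How the governing tokens are gathered from the dependency parse is left out and assumed to happen in the caller; the names `detect_role`, `SENDER_PREPS` and `RECEIVER_PREPS` are hypothetical.

```python
SENDER_PREPS = {
    "de": {"von", "vom"},
    "en": {"from", "by"},
    "es": {"de", "por"},
}
RECEIVER_PREPS = {
    "de": {"an", "nach", "für"},
    "en": {"to", "for"},
    "es": {"para", "a"},
}


def detect_role(lang, person_count, governing_tokens):
    """Return "sender", "receiver" or "any".

    governing_tokens: texts of the case/prep tokens attached to the single
    person span's root token (collected by the caller from the parse).
    """
    if person_count != 1:
        # Zero or several persons: Java derives direction from position.
        return "any"
    for tok in governing_tokens:
        tok = tok.lower()
        if tok in SENDER_PREPS[lang]:
            return "sender"
        if tok in RECEIVER_PREPS[lang]:
            return "receiver"
    return "any"
```

Keeping the preposition sets as data makes it cheap to extend them once the evaluation shows which phrasings fall through to "any".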

Step 3 — Date parsing

For each DATE span, inspect the token (or two-token phrase, as in Spanish antes de) immediately before the span to detect range direction:

| Direction token | Effect |
|---|---|
| vor / before / antes de | Span → `dateTo` |
| nach / after / después de | Span → `dateFrom` |
| zwischen…und / between…and / entre…y | Earlier span → `dateFrom`, later → `dateTo` |
| No direction token (bare year/date) | Span → both `dateFrom` and `dateTo`, covering that year (Jan 1 to Dec 31) |

dateparser.parse() with settings={"PREFER_DAY_OF_MONTH": "first"} converts the span text to a Python datetime; the service keeps only its date part. For dateTo results that resolve to a year boundary, set to Dec 31 of that year (mirrors RestClientOllamaClient.parseDate() behaviour).

Output as ISO strings (YYYY-MM-DD) or null.
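
To make the direction rules concrete, here is a minimal sketch restricted to whole-year spans. It skips dateparser entirely and takes the years as integers, so it illustrates only the mapping to dateFrom and dateTo. The function name `resolve_year_range` is hypothetical, and using Jan 1 as the lower bound for "after" queries is an assumption based on the first-of-month preference.

```python
from datetime import date

BEFORE = {"vor", "before", "antes de"}
AFTER = {"nach", "after", "después de"}


def resolve_year_range(years, direction=None):
    """Map year spans plus an optional direction phrase to (dateFrom, dateTo)."""
    if not years:
        return None, None
    first, last = min(years), max(years)
    start = date(first, 1, 1).isoformat()
    end = date(last, 12, 31).isoformat()
    if direction in BEFORE:
        return None, end  # upper bound snaps to Dec 31 of the year
    if direction in AFTER:
        return start, None
    # Bare year, or a between/and pair: earlier year opens, later year closes.
    return start, end
```

For the running example, `resolve_year_range([1920], "vor")` yields `(None, "1920-12-31")`, which matches the dateTo value in the assembled response.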

Step 4 — Keyword extraction

Collect tokens that satisfy all of:

POS tag is NOUN or PROPN
Not a stopword
Not inside any NER span (PER or DATE)
Lemma length ≥ 3

Output as lowercased lemmas. These are fuzzy-matched against the tags table by NlQueryParserService.resolveTags() on the Java side — no tag lookup in the Python service.
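
The four conditions translate into a single filter. The sketch below uses a hypothetical `Tok` tuple in place of spaCy tokens; with spaCy the fields would map to `token.lemma_`, `token.pos_`, `token.is_stop`, and a check of whether the token lies inside a PER or DATE entity span.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for the token attributes the filter reads."""

    lemma: str
    pos: str
    is_stop: bool = False
    in_entity: bool = False


def extract_keywords(tokens):
    """Return lowercased lemmas of content nouns outside any entity span."""
    return [
        t.lemma.lower()
        for t in tokens
        if t.pos in {"NOUN", "PROPN"}
        and not t.is_stop
        and not t.in_entity
        and len(t.lemma) >= 3
    ]
```

Duplicates are kept in this sketch; whether the service should deduplicate keywords before returning them is not specified here.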

Examples:

"Briefe aus dem Krieg" → keywords: ["brief", "krieg"]
"Texte über Weihnachten" → keywords: ["text", "weihnachten"]

Step 5 — Assembly

{
  "personNames": ["Opa Hermann", "Marie"],
  "personRole": "any",
  "dateFrom": null,
  "dateTo": "1920-12-31",
  "keywords": ["brief"],
  "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
}

API

`POST /parse`

Request:

{ "query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de" }

lang is a required enum: "de" | "en" | "es". Unknown values → HTTP 422 (FastAPI validation).

Response: extraction object as above, HTTP 200.

Error: pipeline crash → HTTP 500 {"detail": "..."}.

`GET /health`

Returns HTTP 200 {"status": "ok"} when all three models are loaded.

Language Models

| `lang` | spaCy model |
|---|---|
| `de` | `de_core_news_sm` |
| `en` | `en_core_web_sm` |
| `es` | `es_core_news_sm` |

All three models are loaded at startup and held in memory. Routing is by the lang field on the request.
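
One possible shape for the startup loading and routing is a small registry object, sketched below. The class name `ModelRegistry` and the injectable `loader` parameter are assumptions; the loader defaults to `spacy.load` and exists only so the sketch can be exercised without the models installed.

```python
MODEL_NAMES = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}


def _spacy_loader(name):
    # Imported lazily so a fake loader can be used where spaCy is absent.
    import spacy

    return spacy.load(name)


class ModelRegistry:
    def __init__(self, loader=_spacy_loader):
        # Load every model eagerly, once, when the service starts.
        self._models = {lang: loader(name) for lang, name in MODEL_NAMES.items()}

    def get(self, lang):
        """Return the pipeline for lang; raises KeyError if unsupported."""
        return self._models[lang]

    def healthy(self):
        """True when a model is held for every supported language."""
        return all(self._models.get(lang) is not None for lang in MODEL_NAMES)
```

A /health handler could then return {"status": "ok"} whenever `healthy()` is true, and /parse would call `get(lang)` after request validation has already rejected unknown languages.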

Dockerfile

Mirrors ocr-service/ — python:3.11-slim, non-root user, models baked into the image:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download de_core_news_sm \
 && python -m spacy download en_core_web_sm \
 && python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
    && chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]

Image size: ~350 MB. No volume needed — models live in the image layer.

Local Dev

cd nlp-service
pip install -r requirements.txt
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
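
For evaluating many queries in a row, a small stdlib client may be more convenient than repeated curl calls. The sketch below mirrors the request above; the function names `build_parse_request` and `parse` are hypothetical, and the default base URL assumes the service runs locally on port 8001.

```python
import json
import urllib.request


def build_parse_request(query, lang, base_url="http://localhost:8001"):
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse(query, lang, base_url="http://localhost:8001"):
    """Send one query to a running service and return the decoded extraction."""
    request = build_parse_request(query, lang, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Looping `parse()` over a list of real archive queries and printing each result gives a quick side-by-side view of what was extracted for each one.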

Known Limitations

Historical names: spaCy models are trained on modern news corpora. Unusual 1899–1950 German names may not score as PER. Mitigation: the Java resolveNames() already does fuzzy matching against the persons table, so partial name extraction is recoverable.
Role detection: the preposition sets are a fixed enumeration (13 tokens across 3 languages). Sentences that express direction without one of these prepositions will fall through to personRole = "any". This is acceptable: "any" is the safe default and searches both sender and receiver positions.
"über Oma" ambiguity: if spaCy recognises "Oma" as a PER entity it lands in personNames (person search); if not, it lands in keywords (tag search via Java). Both paths return relevant results. The prototype evaluation will reveal which path dominates for real archive queries.

Out of Scope (prototype)

docker-compose integration (Ollama replacement)
Java-side changes (RestClientSpacyClient, rename OllamaClient → NlParserClient)
Tag lookup inside the Python service
Automated test suite (pytest fixtures) — evaluation is done by curling the running service
