spaCy NLP Service — Design Spec

Date: 2026-06-07. Status: Prototype.

Problem

The current NL search uses Ollama (qwen2.5:7b-instruct-q4_K_M) to parse free-text queries into structured extractions (person names, dates, role, keywords). Inference takes 5 to 15 seconds per query, which makes the feature slower than filling in the filter UI manually.

Goal

Build a standalone nlp-service/ prototype that replaces Ollama with spaCy for query parsing. The prototype is scoped to extraction quality evaluation — run it locally, curl it with real archive queries, and measure whether spaCy extracts names/dates/keywords well enough to justify a full migration. No Java-side changes in this iteration.

Extraction Contract

The service must produce an output compatible with the existing OllamaExtraction Java record:

| Field | Type | Description |
|---|---|---|
| personNames | string[] | Names of persons mentioned, left-to-right order |
| personRole | "sender" \| "receiver" \| "any" | Role of the person(s) in the document |
| dateFrom | string \| null | ISO 8601 date YYYY-MM-DD or null |
| dateTo | string \| null | ISO 8601 date YYYY-MM-DD or null |
| keywords | string[] | Content words, fuzzy-matched against tags by Java |
| rawQuery | string | Echo of the input query |

Two-person ordering: personNames must be in left-to-right span order. Java maps [0] → sender, [1] → receiver.
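
The ordering rule can be illustrated with a small sketch. It assumes each PER span is available as a (start offset, text) pair; `order_person_names` is a hypothetical helper name, not something the spec defines.

```python
from typing import List, Tuple


def order_person_names(spans: List[Tuple[int, str]]) -> List[str]:
    """Return person names sorted by start offset (left-to-right in the query).

    `spans` is a hypothetical list of (start_char, text) pairs, one per PER span.
    """
    return [text for _, text in sorted(spans, key=lambda span: span[0])]


query = "Briefe von Opa Hermann an Marie vor 1920"
# Deliberately listed out of order to show the sort.
spans = [
    (query.index("Marie"), "Marie"),
    (query.index("Opa Hermann"), "Opa Hermann"),
]
names = order_person_names(spans)
# names[0] is what Java would treat as the sender, names[1] as the receiver.
```

spaCy normally yields entities in document order already, so an explicit sort like this is mostly a guard in case spans are collected from more than one source.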

rawQuery note: In the current Java code rawQuery is set by the caller, not parsed from Ollama. The service echoes the input for convenience; the eventual RestClientSpacyClient will set it from the input directly, same as today.

Architecture

nlp-service/
├── main.py        # FastAPI app — /parse and /health endpoints
├── extractor.py   # NLP pipeline: NER → role → dates → keywords
├── models.py      # Pydantic request/response types
├── requirements.txt
├── Dockerfile
└── CLAUDE.md

Sits alongside ocr-service/ in the repo. For the prototype it runs standalone (no docker-compose wiring).

Extraction Pipeline (extractor.py)

Five steps run in sequence on each query.

Step 1 — NER pass

Run spaCy on the query using the model for the requested language. Collect:

  • All PER spans → candidates for personNames
  • All DATE spans → raw text strings for step 3
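
A minimal sketch of this collection step, under some assumptions: `collect_entities` is a hypothetical helper, and the label sets should be checked against each model's label scheme (the English model, for example, tags persons as PERSON rather than PER). The stand-in document only mimics the `ents`, `text`, and `label_` attributes of a spaCy `Doc`, so the sketch runs without a model installed.

```python
from types import SimpleNamespace
from typing import List, Tuple

# Assumed label sets: the spec names PER and DATE; some models tag persons
# as PERSON instead, so both are accepted here.
PERSON_LABELS = {"PER", "PERSON"}
DATE_LABELS = {"DATE"}


def collect_entities(doc) -> Tuple[List[str], List[str]]:
    """Split doc.ents into person-name candidates and raw date strings."""
    persons = [ent.text for ent in doc.ents if ent.label_ in PERSON_LABELS]
    dates = [ent.text for ent in doc.ents if ent.label_ in DATE_LABELS]
    return persons, dates


# Stand-in for a spaCy Doc; a real call would pass nlp(query) instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Opa Hermann", label_="PER"),
    SimpleNamespace(text="Marie", label_="PER"),
    SimpleNamespace(text="1920", label_="DATE"),
])
persons, dates = collect_entities(fake_doc)
```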

Step 2 — Role detection

Only relevant when exactly one PER entity is found. Walk the dependency tree of the PER span's root token; check if a governing case or prep token matches the sender or receiver preposition set for the language:

| Language | Sender prepositions | Receiver prepositions |
|---|---|---|
| de | von, vom | an, nach, für |
| en | from, by | to, for |
| es | de, por | para, a |

  • One person + sender preposition → personRole = "sender"
  • One person + receiver preposition → personRole = "receiver"
  • One person + no match / two or more persons → personRole = "any"

Two-person queries always return "any" — Java derives direction from position.
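
The decision rules above might look like this in code. This is a sketch only: it assumes the dependency-tree walk has already produced the governing preposition (or None), and `detect_role` is a hypothetical function name.

```python
from typing import Optional

# Preposition sets copied from the table in this section.
SENDER_PREPS = {"de": {"von", "vom"}, "en": {"from", "by"}, "es": {"de", "por"}}
RECEIVER_PREPS = {"de": {"an", "nach", "für"}, "en": {"to", "for"}, "es": {"para", "a"}}


def detect_role(lang: str, person_count: int, governing_prep: Optional[str]) -> str:
    """Return "sender", "receiver", or "any" for a parsed query."""
    # Zero or several persons, or no governing preposition: safe default.
    if person_count != 1 or governing_prep is None:
        return "any"
    prep = governing_prep.lower()
    if prep in SENDER_PREPS[lang]:
        return "sender"
    if prep in RECEIVER_PREPS[lang]:
        return "receiver"
    return "any"


role_single = detect_role("de", 1, "von")  # one person governed by "von"
role_pair = detect_role("de", 2, "von")    # two persons: direction left to Java
```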

Step 3 — Date parsing

For each DATE span, inspect the token (or tokens, for multi-word markers such as "antes de") immediately before the span to detect range direction:

| Direction token | Effect |
|---|---|
| vor / before / antes de | Span → dateTo |
| nach / after / después de | Span → dateFrom |
| zwischen…und / between…and / entre…y | Earlier span → dateFrom, later → dateTo |
| No direction token (bare year/date) | Span → both dateFrom and dateTo set to that year (year range, Jan 1 to Dec 31) |

dateparser.parse() with the PREFER_DAY_OF_MONTH setting set to "first" converts the span text to a Python date. When a dateTo value comes from a span that only resolves to a year, set it to Dec 31 of that year (mirrors RestClientOllamaClient.parseDate() behaviour).

Output as ISO strings (YYYY-MM-DD) or null.
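
For year-only spans, the direction rules can be sketched with the standard library alone. This is a simplification: it takes already-parsed years instead of calling dateparser, `resolve_range` is a hypothetical name, and starting a year-only dateFrom on Jan 1 is an assumption the spec does not state explicitly.

```python
from datetime import date
from typing import List, Optional, Tuple

BEFORE = {"vor", "before", "antes de"}
AFTER = {"nach", "after", "después de"}


def resolve_range(
    direction: Optional[str], years: List[int]
) -> Tuple[Optional[str], Optional[str]]:
    """Map a direction marker and year-only spans to (dateFrom, dateTo)."""
    if not years:
        return None, None
    first, last = min(years), max(years)
    if direction in BEFORE:
        # Year-only dateTo is pushed to Dec 31, as described above.
        return None, date(first, 12, 31).isoformat()
    if direction in AFTER:
        # Assumption: a year-only dateFrom starts on Jan 1.
        return date(first, 1, 1).isoformat(), None
    # "between" with two spans, or a bare year:
    # Jan 1 of the earliest year to Dec 31 of the latest.
    return date(first, 1, 1).isoformat(), date(last, 12, 31).isoformat()


before_1920 = resolve_range("vor", [1920])
bare_year = resolve_range(None, [1920])
between = resolve_range("zwischen", [1918, 1914])
```

Full dates (day and month present) would need the dateparser result rather than a bare year, which this sketch leaves out.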

Step 4 — Keyword extraction

Collect tokens that satisfy all of:

  • POS tag is NOUN or PROPN
  • Not a stopword
  • Not inside any NER span (PER or DATE)
  • Lemma length ≥ 3

Output as lowercased lemmas. These are fuzzy-matched against the tags table by NlQueryParserService.resolveTags() on the Java side — no tag lookup in the Python service.

Examples:

  • "Briefe aus dem Krieg" → keywords: ["brief", "krieg"]
  • "Texte über Weihnachten" → keywords: ["text", "weihnachten"]
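
The four filter conditions translate fairly directly into a token loop. The sketch below assumes spaCy's token attributes (`pos_`, `is_stop`, `ent_type_`, `lemma_`) and uses stand-in tokens with hand-written lemmas so it runs without a model; `extract_keywords` and `tok` are hypothetical names.

```python
from types import SimpleNamespace
from typing import List

KEYWORD_POS = {"NOUN", "PROPN"}
# Assumed entity labels to exclude; PERSON covers models that do not use PER.
EXCLUDED_ENT_TYPES = {"PER", "PERSON", "DATE"}


def extract_keywords(doc) -> List[str]:
    """Return lowercased lemmas of content tokens outside PER/DATE spans."""
    keywords = []
    for token in doc:
        lemma = token.lemma_.lower()
        if (
            token.pos_ in KEYWORD_POS
            and not token.is_stop
            and token.ent_type_ not in EXCLUDED_ENT_TYPES
            and len(lemma) >= 3
        ):
            keywords.append(lemma)
    return keywords


def tok(lemma, pos, is_stop=False, ent_type=""):
    """Build a stand-in token exposing the attributes used above."""
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_stop=is_stop, ent_type_=ent_type)


# Hand-annotated stand-in for "Briefe aus dem Krieg".
fake_doc = [
    tok("Brief", "NOUN"),
    tok("aus", "ADP", is_stop=True),
    tok("der", "DET", is_stop=True),
    tok("Krieg", "NOUN"),
]
keywords = extract_keywords(fake_doc)
```

Whether the real models lemmatise "Briefe" to "Brief" is something the prototype evaluation would have to confirm; the lemmas here are supplied by hand.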

Step 5 — Assembly

{
  "personNames": ["Opa Hermann", "Marie"],
  "personRole": "any",
  "dateFrom": null,
  "dateTo": "1920-12-31",
  "keywords": ["brief"],
  "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
}
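
As an illustration of the assembly step, the sketch below builds the same object from the outputs of the earlier steps. It uses a standard-library dataclass as a stand-in for the Pydantic response type that models.py is meant to hold; the class name `Extraction` is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Extraction:
    """Stand-in for the response model; field names follow the contract."""
    rawQuery: str
    personNames: List[str] = field(default_factory=list)
    personRole: str = "any"
    dateFrom: Optional[str] = None
    dateTo: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


result = Extraction(
    rawQuery="Briefe von Opa Hermann an Marie vor 1920",
    personNames=["Opa Hermann", "Marie"],
    dateTo="1920-12-31",
    keywords=["brief"],
)
# ensure_ascii=False keeps umlauts readable in the serialized output.
payload = json.dumps(asdict(result), ensure_ascii=False)
```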

API

POST /parse

Request:

{ "query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de" }

lang is a required enum: "de" | "en" | "es". Unknown values → HTTP 422 (FastAPI validation).

Response: extraction object as above, HTTP 200.

Error: pipeline crash → HTTP 500 {"detail": "..."}.

GET /health

Returns HTTP 200 {"status": "ok"} when all three models are loaded.

Language Models

| lang | spaCy model |
|---|---|
| de | de_core_news_sm |
| en | en_core_web_sm |
| es | es_core_news_sm |

All three models are loaded at startup and held in memory. Routing is by the lang field on the request.
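
One way to sketch the load-once-and-route behaviour, together with the health condition from GET /health. The function names are hypothetical, and the loader is injected so the example runs without spaCy; in the service it would presumably be `spacy.load`.

```python
from typing import Any, Callable, Dict

MODEL_NAMES = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}


def load_models(loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Load every model once at startup; `loader` would be spacy.load."""
    return {lang: loader(name) for lang, name in MODEL_NAMES.items()}


def pipeline_for(models: Dict[str, Any], lang: str) -> Any:
    """Route a request to the pipeline for its lang field."""
    try:
        return models[lang]
    except KeyError:
        raise ValueError(f"unsupported lang: {lang!r}") from None


def is_healthy(models: Dict[str, Any]) -> bool:
    """True only when all three models are present."""
    return all(models.get(lang) is not None for lang in MODEL_NAMES)


# Fake loader so the sketch runs without any model installed.
models = load_models(lambda name: f"<pipeline {name}>")
german = pipeline_for(models, "de")
```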

Dockerfile

Mirrors ocr-service/: python:3.11-slim base image, non-root user, models baked into the image:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -m spacy download de_core_news_sm \
 && python -m spacy download en_core_web_sm \
 && python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
    && chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]

Image size: ~350 MB. No volume needed — models live in the image layer.

Local Dev

cd nlp-service
pip install -r requirements.txt
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
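
For evaluating many archive queries in a row, the same request can be issued from Python. This is a rough standard-library equivalent of the curl call above; the helper names are hypothetical, and `parse_query` only works while the service is running on port 8001.

```python
import json
import urllib.request


def build_parse_request(
    query: str, lang: str, base_url: str = "http://localhost:8001"
) -> urllib.request.Request:
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(build_parse_request(query, lang), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_parse_request("Briefe von Opa Hermann an Marie vor 1920", "de")
```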

Known Limitations

  • Historical names: spaCy models are trained on modern news corpora. Unusual German names from 1899 to 1950 may not score as PER. Mitigation: the Java resolveNames() already does fuzzy matching against the persons table, so partial name extraction is recoverable.
  • Role detection: the preposition sets are a fixed enumeration (~12 tokens across 3 languages). Sentences that express direction without one of these prepositions will fall through to personRole = "any". This is acceptable — "any" is the safe default and searches both sender and receiver positions.
  • "über Oma" ambiguity: if spaCy recognises "Oma" as a PER entity it lands in personNames (person search); if not, it lands in keywords (tag search via Java). Both paths return relevant results. The prototype evaluation will reveal which path dominates for real archive queries.

Out of Scope (prototype)

  • docker-compose integration (Ollama replacement)
  • Java-side changes (RestClientSpacyClient, rename OllamaClient → NlParserClient)
  • Tag lookup inside the Python service
  • Automated test suite (pytest fixtures) — evaluation is done by curling the running service