spaCy NLP Service Prototype — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Build nlp-service/ — a FastAPI service that parses free-text search queries into structured extractions (person names, role, dates, keywords) using spaCy, as a drop-in replacement for the current Ollama service.
Architecture: Five-step pipeline (NER → role detection → date parsing → keyword extraction → assembly) in extractor.py. main.py exposes /parse and /health via FastAPI. Models baked into the Docker image at build time — no volume needed.
Tech Stack: Python 3.11, FastAPI 0.115, spaCy 3.8 (de_core_news_sm / en_core_web_sm / es_core_news_sm), dateparser 1.2, pytest
File Map
| File | Responsibility |
|---|---|
| nlp-service/models.py | Pydantic request/response types (the extraction contract) |
| nlp-service/extractor.py | NLP pipeline: model loading + 5 extraction steps |
| nlp-service/main.py | FastAPI app: /parse, /health, lifespan model loading |
| nlp-service/requirements.txt | Python dependencies |
| nlp-service/Dockerfile | Image: python:3.11-slim, models baked in, non-root user |
| nlp-service/CLAUDE.md | Service-level docs |
| nlp-service/test_extractor.py | Unit + integration tests for the pipeline |
| nlp-service/test_main.py | HTTP contract tests for the FastAPI endpoints |
Task 1: Scaffold — requirements.txt, CLAUDE.md, models.py
Files:
- Create: nlp-service/requirements.txt
- Create: nlp-service/CLAUDE.md
- Create: nlp-service/models.py
- Create: nlp-service/test_extractor.py (skeleton only)
- Step 1: Create nlp-service/requirements.txt
fastapi[standard]==0.115.6
uvicorn[standard]==0.34.0
spacy>=3.8,<4.0
dateparser>=1.2,<2.0
pytest>=8.0,<9.0
httpx>=0.28,<1.0
- Step 2: Create nlp-service/CLAUDE.md
# NLP Service
Lightweight FastAPI service that parses free-text search queries into structured extractions,
replacing Ollama for the Familienarchiv NL search feature.
## Stack
- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2
## Endpoints
- `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract
- `GET /health` — returns `{"status": "ok"}` when all models are loaded
## Running locally
\`\`\`bash
pip install -r requirements.txt
python -m spacy download de_core_news_sm en_core_web_sm es_core_news_sm
uvicorn main:app --reload --port 8001
curl -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
\`\`\`
## Testing
\`\`\`bash
pytest -v
\`\`\`
## Design spec
See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
## Notes
This is a **prototype** for extraction quality evaluation. No docker-compose integration or
Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in
`backend/src/main/java/org/raddatz/familienarchiv/search/`.
- Step 3: Write the failing test for Pydantic models
Create nlp-service/test_extractor.py:
import pytest
from pydantic import ValidationError

# ── Models ──────────────────────────────────────────────────────────────────

def test_parse_request_valid():
    from models import ParseRequest
    req = ParseRequest(query="Briefe von Opa", lang="de")
    assert req.query == "Briefe von Opa"
    assert req.lang == "de"

def test_parse_request_rejects_unknown_lang():
    from models import ParseRequest
    with pytest.raises(ValidationError):
        ParseRequest(query="Letters from grandpa", lang="fr")

def test_parse_response_serializes_nulls():
    from models import ParseResponse
    resp = ParseResponse(
        personNames=["Opa"],
        personRole="sender",
        dateFrom=None,
        dateTo="1920-12-31",
        keywords=["brief"],
        rawQuery="Briefe von Opa",
    )
    data = resp.model_dump()
    assert data["dateFrom"] is None
    assert data["dateTo"] == "1920-12-31"
    assert data["personRole"] == "sender"
- Step 4: Run to confirm failure
cd nlp-service
pip install -r requirements.txt
pytest test_extractor.py::test_parse_request_valid -v
Expected: ModuleNotFoundError: No module named 'models'
- Step 5: Create nlp-service/models.py
from __future__ import annotations
from typing import Literal
from pydantic import BaseModel

class ParseRequest(BaseModel):
    query: str
    lang: Literal["de", "en", "es"]

class ParseResponse(BaseModel):
    personNames: list[str]
    personRole: Literal["sender", "receiver", "any"]
    dateFrom: str | None
    dateTo: str | None
    keywords: list[str]
    rawQuery: str
- Step 6: Run tests to confirm they pass
pytest test_extractor.py::test_parse_request_valid \
test_extractor.py::test_parse_request_rejects_unknown_lang \
test_extractor.py::test_parse_response_serializes_nulls -v
Expected: 3 passed
- Step 7: Commit
git add nlp-service/
git commit -m "feat(nlp-service): scaffold — models, requirements, CLAUDE.md"
Task 2: spaCy model loading
Files:
- Create: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
Before running these tests, the three spaCy models must be installed:
python -m spacy download de_core_news_sm en_core_web_sm es_core_news_sm
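Because spaCy models install as ordinary Python packages, a quick pre-flight check can report which ones are missing before the test run starts. The following is a minimal sketch using only the standard library; the helper name `missing_models` and the `finder` parameter are illustrative and not part of the planned service.

```python
from importlib.util import find_spec

# Package names taken from the download command above.
REQUIRED_MODELS = ("de_core_news_sm", "en_core_web_sm", "es_core_news_sm")


def missing_models(names=REQUIRED_MODELS, finder=find_spec):
    """Return the model packages that cannot be found in this environment.

    `finder` is injectable so the check can be exercised without spaCy installed.
    """
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing spaCy models:", ", ".join(missing))
    else:
        print("All spaCy models installed.")
```

Running this before `pytest` turns a confusing `OSError` from `spacy.load` into an explicit list of packages to download.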
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Model loading ────────────────────────────────────────────────────────────
import pytest

@pytest.fixture(scope="session")
def nlp_de():
    from extractor import get_nlp
    return get_nlp("de")

@pytest.fixture(scope="session")
def nlp_en():
    from extractor import get_nlp
    return get_nlp("en")

@pytest.fixture(scope="session")
def nlp_es():
    from extractor import get_nlp
    return get_nlp("es")

def test_get_nlp_de_loads(nlp_de):
    doc = nlp_de("Test")
    assert doc is not None

def test_get_nlp_en_loads(nlp_en):
    doc = nlp_en("Test")
    assert doc is not None

def test_get_nlp_es_loads(nlp_es):
    doc = nlp_es("Prueba")
    assert doc is not None

def test_get_nlp_unknown_lang_raises():
    from extractor import get_nlp
    with pytest.raises(ValueError, match="Unsupported language"):
        get_nlp("fr")
- Step 2: Run to confirm failure
pytest test_extractor.py::test_get_nlp_de_loads -v
Expected: ModuleNotFoundError: No module named 'extractor'
- Step 3: Create nlp-service/extractor.py with model loading
from __future__ import annotations
import re
from datetime import date
import dateparser
import spacy
from spacy.language import Language
from models import ParseResponse

# ── Language model registry ──────────────────────────────────────────────────
_MODEL_NAMES: dict[str, str] = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}
_nlp_cache: dict[str, Language] = {}

def get_nlp(lang: str) -> Language:
    if lang not in _MODEL_NAMES:
        raise ValueError(f"Unsupported language: {lang!r}. Valid: {list(_MODEL_NAMES)}")
    if lang not in _nlp_cache:
        _nlp_cache[lang] = spacy.load(_MODEL_NAMES[lang])
    return _nlp_cache[lang]

def load_all_models() -> None:
    for lang in _MODEL_NAMES:
        get_nlp(lang)
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_get_nlp_de_loads \
test_extractor.py::test_get_nlp_en_loads \
test_extractor.py::test_get_nlp_es_loads \
test_extractor.py::test_get_nlp_unknown_lang_raises -v
Expected: 4 passed
- Step 5: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): spaCy model loading with get_nlp/load_all_models"
Task 3: Person name extraction (NER)
Files:
- Modify: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Person name extraction ───────────────────────────────────────────────────

def _make_doc_with_ents(nlp, text: str, char_ents: list[tuple[int, int, str]]):
    """Create a Doc with manually injected entity spans (no NER model needed)."""
    doc = nlp.make_doc(text)
    spans = [doc.char_span(s, e, label=lbl) for s, e, lbl in char_ents]
    doc.ents = [sp for sp in spans if sp is not None]
    return doc

def test_extract_person_names_two_persons(nlp_de):
    from extractor import extract_person_names
    # Character offsets (tens digit on the last line):
    #   Briefe von Opa Hermann an Marie
    #   0123456789012345678901234567890
    #             111111111122222222223
    # "Opa Hermann" = 11..22, "Marie" = 26..31
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa Hermann an Marie", [
        (11, 22, "PER"),
        (26, 31, "PER"),
    ])
    assert extract_person_names(doc) == ["Opa Hermann", "Marie"]

def test_extract_person_names_preserves_order(nlp_de):
    from extractor import extract_person_names
    # Reversed: "Marie von Opa", so Marie comes first in the text
    # "Marie" = 0..5, "Opa" = 10..13
    doc = _make_doc_with_ents(nlp_de, "Marie von Opa", [
        (0, 5, "PER"),
        (10, 13, "PER"),
    ])
    assert extract_person_names(doc) == ["Marie", "Opa"]

def test_extract_person_names_empty(nlp_de):
    from extractor import extract_person_names
    doc = _make_doc_with_ents(nlp_de, "Briefe aus dem Krieg", [])
    assert extract_person_names(doc) == []

def test_extract_person_names_ignores_non_per(nlp_de):
    from extractor import extract_person_names
    # DATE entity should not appear in personNames
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    assert extract_person_names(doc) == []
- Step 2: Run to confirm failure
pytest test_extractor.py::test_extract_person_names_two_persons -v
Expected: ImportError: cannot import name 'extract_person_names' from 'extractor'
- Step 3: Add extract_person_names to extractor.py
Add after the model loading section:
# ── Step 1: Person name extraction ──────────────────────────────────────────

def extract_person_names(doc) -> list[str]:
    """Return person entity texts in left-to-right span order."""
    # The German and Spanish models label people "PER"; en_core_web_sm uses "PERSON".
    return [ent.text for ent in doc.ents if ent.label_ in ("PER", "PERSON")]
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_extract_person_names_two_persons \
test_extractor.py::test_extract_person_names_preserves_order \
test_extractor.py::test_extract_person_names_empty \
test_extractor.py::test_extract_person_names_ignores_non_per -v
Expected: 4 passed
- Step 5: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): NER person name extraction"
Task 4: Role detection
Files:
- Modify: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
Role is only meaningful when exactly one PER entity is found. The function checks:
- Dependency-tree children of the PER span's root with dep_ in ("case", "prep", "mo")
- Fallback: the token immediately before the span
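The fallback rule can be illustrated without spaCy. The sketch below works on a plain list of words and uses the same preposition sets that this plan defines later in Task 4; the function name `role_from_previous_word` and its parameters are illustrative assumptions, not the service's API.

```python
# Preposition sets mirroring the ones defined for extractor.py in this plan.
SENDER_PREPS = {"de": {"von", "vom"}, "en": {"from", "by"}, "es": {"de", "por"}}
RECEIVER_PREPS = {"de": {"an", "nach", "für"}, "en": {"to", "for"}, "es": {"para", "a"}}


def role_from_previous_word(words, person_starts, lang):
    """Classify the role of a person mention from the word directly before it.

    `words` is the tokenised query; `person_starts` holds the index of the
    first word of each person mention. Anything other than exactly one
    mention yields "any", as does a mention with no matching preposition.
    """
    if len(person_starts) != 1:
        return "any"
    start = person_starts[0]
    if start == 0:
        return "any"
    prev = words[start - 1].lower()
    if prev in SENDER_PREPS[lang]:
        return "sender"
    if prev in RECEIVER_PREPS[lang]:
        return "receiver"
    return "any"


print(role_from_previous_word(["Briefe", "von", "Marie"], [2], "de"))       # sender
print(role_from_previous_word(["letters", "to", "Marie"], [2], "en"))       # receiver
print(role_from_previous_word(["von", "Opa", "an", "Marie"], [1, 3], "de")) # any
```

Keeping the two-person case at "any" matches the rule stated above: with two names, direction is derived from list position rather than from a role flag.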
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Role detection ───────────────────────────────────────────────────────────

def test_role_sender_von(nlp_de):
    from extractor import detect_person_role
    # "Briefe von Marie": "von" immediately before "Marie"
    # B=0..6, ' '=6, v=7..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "Briefe von Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "sender"

def test_role_receiver_an(nlp_de):
    from extractor import detect_person_role
    # "Briefe an Marie": "an" immediately before "Marie"
    # B=0..6, ' '=6, a=7..9, ' '=9, M=10..15
    doc = _make_doc_with_ents(nlp_de, "Briefe an Marie", [(10, 15, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "receiver"

def test_role_two_persons_returns_any(nlp_de):
    from extractor import detect_person_role
    # "von Opa an Marie": two PER spans → always "any"
    # v=0..3, ' '=3, O=4..7, ' '=7, a=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "von Opa an Marie", [
        (4, 7, "PER"),
        (11, 16, "PER"),
    ])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"

def test_role_no_prep_returns_any(nlp_de):
    from extractor import detect_person_role
    # "Briefe Marie": no preposition
    # B=0..6, ' '=6, M=7..12
    doc = _make_doc_with_ents(nlp_de, "Briefe Marie", [(7, 12, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"

def test_role_empty_returns_any(nlp_de):
    from extractor import detect_person_role
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [])
    assert detect_person_role(doc, [], "de") == "any"

def test_role_sender_from_english(nlp_en):
    from extractor import detect_person_role
    # "letters from Marie": "from" before "Marie"
    # l=0..7, ' '=7, f=8..12, ' '=12, M=13..18
    doc = _make_doc_with_ents(nlp_en, "letters from Marie", [(13, 18, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "sender"

def test_role_receiver_to_english(nlp_en):
    from extractor import detect_person_role
    # "letters to Marie"
    # l=0..7, ' '=7, t=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_en, "letters to Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "receiver"
- Step 2: Run to confirm failure
pytest test_extractor.py::test_role_sender_von -v
Expected: ImportError: cannot import name 'detect_person_role' from 'extractor'
- Step 3: Add role detection constants and function to extractor.py
Add after extract_person_names:
# ── Step 2: Role detection ───────────────────────────────────────────────────
_SENDER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"von", "vom"}),
    "en": frozenset({"from", "by"}),
    "es": frozenset({"de", "por"}),
}
_RECEIVER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"an", "nach", "für"}),
    "en": frozenset({"to", "for"}),
    "es": frozenset({"para", "a"}),
}

def detect_person_role(doc, per_spans: list, lang: str) -> str:
    """Return 'sender', 'receiver', or 'any'.

    Only meaningful for single-PER queries. Two-person queries always return
    'any' because Java derives direction from list position.
    """
    if len(per_spans) != 1:
        return "any"
    span = per_spans[0]
    root = span.root
    sender = _SENDER_PREPS[lang]
    receiver = _RECEIVER_PREPS[lang]
    # Primary: dependency-tree children of the PER root
    for child in root.children:
        if child.dep_ in ("case", "prep", "mo"):
            if child.lower_ in sender:
                return "sender"
            if child.lower_ in receiver:
                return "receiver"
    # Fallback: token immediately before the span start
    if span.start > 0:
        prev = doc[span.start - 1]
        if prev.lower_ in sender:
            return "sender"
        if prev.lower_ in receiver:
            return "receiver"
    return "any"
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_role_sender_von \
test_extractor.py::test_role_receiver_an \
test_extractor.py::test_role_two_persons_returns_any \
test_extractor.py::test_role_no_prep_returns_any \
test_extractor.py::test_role_empty_returns_any \
test_extractor.py::test_role_sender_from_english \
test_extractor.py::test_role_receiver_to_english -v
Expected: 7 passed
- Step 5: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): role detection (sender/receiver/any)"
Task 5: Date parsing
Files:
- Modify: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
Direction is detected from the token immediately before each DATE span. For "zwischen/between/entre", both DATE spans form the range (sorted so earlier = dateFrom). A bare year with no direction token produces a closed year-range (dateFrom = Jan 1, dateTo = Dec 31).
Note: "nach" appears in both _RECEIVER_PREPS["de"] and the date-after set. This is safe — role detection only examines tokens before PER spans; date parsing only examines tokens before DATE spans. They operate on different span types.
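The mapping from direction word to date range described above can be sketched with plain years and the standard library. The helper names `year_range` and `between_years` are illustrative only; the real function in this plan works on spaCy spans rather than integers.

```python
from datetime import date


def year_range(year, direction=None):
    """Map a bare year plus an optional direction to (date_from, date_to).

    direction is "before", "after", or None (no direction word found).
    """
    first = date(year, 1, 1).isoformat()
    last = date(year, 12, 31).isoformat()
    if direction == "before":
        return None, last      # open start, inclusive end of that year
    if direction == "after":
        return first, None     # inclusive start of that year, open end
    return first, last         # bare year: the whole year


def between_years(year_a, year_b):
    """Build a closed range from two years given in any order."""
    low, high = sorted((year_a, year_b))
    return date(low, 1, 1).isoformat(), date(high, 12, 31).isoformat()


print(year_range(1920, "before"))   # (None, '1920-12-31')
print(year_range(1900, "after"))    # ('1900-01-01', None)
print(year_range(1920))             # ('1920-01-01', '1920-12-31')
print(between_years(1920, 1900))    # ('1900-01-01', '1920-12-31')
```

Note that "before 1920" keeps the whole of 1920 in range, which is what the tests in this task expect.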
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Date parsing ─────────────────────────────────────────────────────────────

def test_date_vor_1920(nlp_de):
    from extractor import extract_dates
    # "Briefe vor 1920": "1920" at chars 11..15
    doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [(11, 15, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to == "1920-12-31"

def test_date_nach_1900(nlp_de):
    from extractor import extract_dates
    # "Briefe nach 1900": "1900" at chars 12..16
    doc = _make_doc_with_ents(nlp_de, "Briefe nach 1900", [(12, 16, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to is None

def test_date_zwischen_1900_und_1920(nlp_de):
    from extractor import extract_dates
    # "zwischen 1900 und 1920"
    # z=0..8, ' '=8, 1900=9..13, ' '=13, u=14..17, ' '=17, 1920=18..22
    doc = _make_doc_with_ents(nlp_de, "zwischen 1900 und 1920", [
        (9, 13, "DATE"),
        (18, 22, "DATE"),
    ])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to == "1920-12-31"

def test_date_bare_year_makes_range(nlp_de):
    from extractor import extract_dates
    # "Briefe 1920": no direction token → year-range
    # B=0..6, ' '=6, 1920=7..11
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1920-01-01"
    assert date_to == "1920-12-31"

def test_date_no_date_entity(nlp_de):
    from extractor import extract_dates
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa", [])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to is None

def test_date_before_english(nlp_en):
    from extractor import extract_dates
    # "letters before 1920": "1920" at chars 15..19
    doc = _make_doc_with_ents(nlp_en, "letters before 1920", [(15, 19, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from is None
    assert date_to == "1920-12-31"

def test_date_after_english(nlp_en):
    from extractor import extract_dates
    # "letters after 1900": "1900" at chars 14..18
    doc = _make_doc_with_ents(nlp_en, "letters after 1900", [(14, 18, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from == "1900-01-01"
    assert date_to is None
- Step 2: Run to confirm failure
pytest test_extractor.py::test_date_vor_1920 -v
Expected: ImportError: cannot import name 'extract_dates' from 'extractor'
- Step 3: Add date parsing to extractor.py
Add after detect_person_role:
# ── Step 3: Date parsing ─────────────────────────────────────────────────────
_YEAR_RE = re.compile(r"^\d{4}$")
_DATE_BEFORE: dict[str, frozenset[str]] = {
    "de": frozenset({"vor"}),
    "en": frozenset({"before"}),
    "es": frozenset({"antes"}),
}
_DATE_AFTER: dict[str, frozenset[str]] = {
    "de": frozenset({"nach"}),
    "en": frozenset({"after"}),
    "es": frozenset({"después", "despues"}),
}
_DATE_BETWEEN: dict[str, frozenset[str]] = {
    "de": frozenset({"zwischen"}),
    "en": frozenset({"between"}),
    "es": frozenset({"entre"}),
}

def _parse_date_text(text: str, lang: str) -> date | None:
    text = text.strip()
    if _YEAR_RE.match(text):
        year = int(text)
        if 1000 < year < 3000:
            return date(year, 1, 1)
    parsed = dateparser.parse(
        text,
        languages=[lang],
        settings={"PREFER_DAY_OF_MONTH": "first", "RETURN_AS_TIMEZONE_AWARE": False},
    )
    return parsed.date() if parsed else None

def _year_end(d: date) -> date:
    """If d is Jan 1, return Dec 31 of the same year (year-only boundary)."""
    if d.month == 1 and d.day == 1:
        return date(d.year, 12, 31)
    return d

def extract_dates(doc, lang: str) -> tuple[str | None, str | None]:
    """Return (date_from, date_to) as ISO strings or None."""
    date_spans = [ent for ent in doc.ents if ent.label_ == "DATE"]
    if not date_spans:
        # de_core_news_sm and es_core_news_sm have no DATE label (only LOC,
        # MISC, ORG, PER), so fall back to bare four-digit year tokens.
        date_spans = [doc[tok.i : tok.i + 1] for tok in doc if _YEAR_RE.match(tok.text)]
    if not date_spans:
        return None, None
    between_tokens = _DATE_BETWEEN[lang]
    before_tokens = _DATE_BEFORE[lang]
    after_tokens = _DATE_AFTER[lang]
    # "zwischen X und Y" / "between X and Y": two DATE spans form a range
    has_between = any(tok.lower_ in between_tokens for tok in doc)
    if has_between and len(date_spans) >= 2:
        parsed = []
        for span in date_spans[:2]:
            d = _parse_date_text(span.text, lang)
            if d:
                parsed.append(d)
        if len(parsed) == 2:
            parsed.sort()
            return parsed[0].isoformat(), _year_end(parsed[1]).isoformat()
    # Single DATE span: use the direction token
    span = date_spans[0]
    d = _parse_date_text(span.text, lang)
    if not d:
        return None, None
    prev_i = span.start - 1
    # Spanish puts "de" between the direction word and the date ("antes de 1920")
    if lang == "es" and prev_i > 0 and doc[prev_i].lower_ == "de":
        prev_i -= 1
    prev_lower = doc[prev_i].lower_ if prev_i >= 0 else ""
    if prev_lower in before_tokens:
        return None, _year_end(d).isoformat()
    if prev_lower in after_tokens:
        return d.isoformat(), None
    # Bare year/date: closed year-range
    return d.isoformat(), _year_end(d).isoformat()
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_date_vor_1920 \
test_extractor.py::test_date_nach_1900 \
test_extractor.py::test_date_zwischen_1900_und_1920 \
test_extractor.py::test_date_bare_year_makes_range \
test_extractor.py::test_date_no_date_entity \
test_extractor.py::test_date_before_english \
test_extractor.py::test_date_after_english -v
Expected: 7 passed
- Step 5: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): date range extraction with direction detection"
Task 6: Keyword extraction
Files:
- Modify: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
Keywords are POS-filtered content words (NOUN or PROPN, non-stop, length ≥ 3, not inside any NER span). These are passed to Java's resolveTags() which fuzzy-matches them against the tag table — no tag lookup in Python.
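The filter chain can be shown on hand-tagged tokens, without loading a spaCy model. In this sketch the `Tok` dataclass is a hypothetical stand-in for the few token attributes the filter reads, and the POS tags and lemmas are written by hand rather than produced by a tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for the spaCy token attributes the filter reads."""
    i: int
    lemma: str
    pos: str
    is_stop: bool = False


def pick_keywords(tokens, excluded_indices=frozenset(), min_len=3):
    """Apply the filters in order and return unique lowercased lemmas."""
    seen = set()
    keywords = []
    for tok in tokens:
        if tok.i in excluded_indices:          # inside an NER span
            continue
        if tok.pos not in ("NOUN", "PROPN"):   # not a content word
            continue
        if tok.is_stop:
            continue
        lemma = tok.lemma.lower()
        if len(lemma) < min_len or lemma in seen:
            continue
        seen.add(lemma)
        keywords.append(lemma)
    return keywords


# Hand-tagged tokens for "Briefe von Hermann aus dem Krieg";
# "Hermann" (index 2) is treated as a PER span and therefore excluded.
tokens = [
    Tok(0, "Brief", "NOUN"),
    Tok(1, "von", "ADP", True),
    Tok(2, "Hermann", "PROPN"),
    Tok(3, "aus", "ADP", True),
    Tok(4, "der", "DET", True),
    Tok(5, "Krieg", "NOUN"),
]
print(pick_keywords(tokens, excluded_indices={2}))  # ['brief', 'krieg']
```

Excluding NER spans first keeps person names out of the keyword list even though they are tagged PROPN, so they are not matched against tags a second time on the Java side.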
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Keyword extraction ───────────────────────────────────────────────────────

def test_keywords_extracts_nouns(nlp_de):
    from extractor import extract_keywords
    # Use real NLP for POS tags; disable NER to control entities manually
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "Brief" (NOUN, lemma "Brief") and "Krieg" (NOUN) should appear
    assert "brief" in keywords
    assert "krieg" in keywords

def test_keywords_excludes_stopwords(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "dem" is a stopword article (DET) and must not appear
    assert "dem" not in keywords

def test_keywords_excludes_per_ner_spans(nlp_de):
    from extractor import extract_keywords
    # Run full NLP so POS tagger fires, then inject PER span over "Hermann"
    doc = nlp_de("Briefe von Hermann")
    per_span = doc.char_span(11, 18, label="PER")  # "Hermann" = 11..18
    if per_span:
        doc.ents = [per_span]
    keywords = extract_keywords(doc, list(doc.ents))
    assert "hermann" not in keywords

def test_keywords_excludes_short_lemmas(nlp_de):
    from extractor import extract_keywords
    # "an" is below the 3-character minimum; "ihn" has three letters but is a
    # pronoun, so neither may appear
    doc = nlp_de("Briefe an ihn", disable=["ner"])
    keywords = extract_keywords(doc, [])
    assert "an" not in keywords
    assert "ihn" not in keywords

def test_keywords_deduplicates(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Brief Brief Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    assert keywords.count("brief") == 1
- Step 2: Run to confirm failure
pytest test_extractor.py::test_keywords_extracts_nouns -v
Expected: ImportError: cannot import name 'extract_keywords' from 'extractor'
- Step 3: Add keyword extraction to extractor.py
Add after extract_dates:
# ── Step 4: Keyword extraction ───────────────────────────────────────────────

def extract_keywords(doc, excluded_spans: list) -> list[str]:
    """Return lowercased lemmas of content words not inside any NER span."""
    excluded_indices: set[int] = set()
    for span in excluded_spans:
        excluded_indices.update(range(span.start, span.end))
    seen: set[str] = set()
    keywords: list[str] = []
    for token in doc:
        if token.i in excluded_indices:
            continue
        if token.pos_ not in ("NOUN", "PROPN"):
            continue
        if token.is_stop:
            continue
        lemma = token.lemma_.lower()
        if len(lemma) < 3:
            continue
        if lemma not in seen:
            seen.add(lemma)
            keywords.append(lemma)
    return keywords
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_keywords_extracts_nouns \
test_extractor.py::test_keywords_excludes_stopwords \
test_extractor.py::test_keywords_excludes_per_ner_spans \
test_extractor.py::test_keywords_excludes_short_lemmas \
test_extractor.py::test_keywords_deduplicates -v
Expected: 5 passed
- Step 5: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas)"
Task 7: Full extract() function
Files:
- Modify: nlp-service/extractor.py
- Modify: nlp-service/test_extractor.py
This assembles all steps. Tests here use real NLP (no synthetic docs) to validate actual extraction quality.
- Step 1: Write the failing tests
Append to nlp-service/test_extractor.py:
# ── Full extract() pipeline ──────────────────────────────────────────────────

def test_extract_dates_de():
    from extractor import extract
    result = extract("Briefe vor 1920", "de")
    assert result.dateFrom is None
    assert result.dateTo == "1920-12-31"
    assert result.rawQuery == "Briefe vor 1920"
    assert result.personNames == []
    assert result.personRole == "any"

def test_extract_keywords_from_topic_de():
    from extractor import extract
    result = extract("Briefe aus dem Krieg", "de")
    assert "krieg" in result.keywords
    assert result.dateFrom is None
    assert result.dateTo is None

def test_extract_dates_en():
    from extractor import extract
    result = extract("letters before 1920", "en")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None

def test_extract_dates_es():
    from extractor import extract
    result = extract("cartas antes de 1920", "es")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None

def test_extract_rawquery_echoed():
    from extractor import extract
    q = "Texte über Weihnachten"
    result = extract(q, "de")
    assert result.rawQuery == q

def test_extract_response_fields_are_complete():
    from extractor import extract
    result = extract("Briefe 1900", "de")
    assert isinstance(result.personNames, list)
    assert result.personRole in ("sender", "receiver", "any")
    assert isinstance(result.keywords, list)
    assert result.rawQuery == "Briefe 1900"
- Step 2: Run to confirm failure
pytest test_extractor.py::test_extract_dates_de -v
Expected: ImportError: cannot import name 'extract' from 'extractor'
- Step 3: Add extract() to extractor.py
Add at the bottom of extractor.py:
# ── Step 5: Assembly ─────────────────────────────────────────────────────────

def extract(query: str, lang: str) -> ParseResponse:
    """Run the full NLP pipeline and return a ParseResponse."""
    nlp = get_nlp(lang)
    doc = nlp(query)
    # The German and Spanish models label people "PER"; en_core_web_sm uses "PERSON".
    per_spans = [ent for ent in doc.ents if ent.label_ in ("PER", "PERSON")]
    person_names = extract_person_names(doc)
    person_role = detect_person_role(doc, per_spans, lang)
    date_from, date_to = extract_dates(doc, lang)
    keywords = extract_keywords(doc, list(doc.ents))
    return ParseResponse(
        personNames=person_names,
        personRole=person_role,
        dateFrom=date_from,
        dateTo=date_to,
        keywords=keywords,
        rawQuery=query,
    )
- Step 4: Run tests to confirm they pass
pytest test_extractor.py::test_extract_dates_de \
test_extractor.py::test_extract_keywords_from_topic_de \
test_extractor.py::test_extract_dates_en \
test_extractor.py::test_extract_dates_es \
test_extractor.py::test_extract_rawquery_echoed \
test_extractor.py::test_extract_response_fields_are_complete -v
Expected: 6 passed
- Step 5: Run the full test suite to confirm no regressions
pytest test_extractor.py -v
Expected: all tests pass
- Step 6: Commit
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): full extract() pipeline — assembles all steps"
Task 8: FastAPI app
Files:
- Create: nlp-service/main.py
- Create: nlp-service/test_main.py
- Step 1: Write the failing tests
Create nlp-service/test_main.py:
import pytest
from fastapi.testclient import TestClient

@pytest.fixture(scope="session")
def client():
    from main import app
    with TestClient(app) as c:
        yield c

def test_health(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}

def test_parse_returns_200_with_all_fields(client):
    response = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
    assert response.status_code == 200
    data = response.json()
    assert "personNames" in data
    assert "personRole" in data
    assert data["personRole"] in ("sender", "receiver", "any")
    assert "dateFrom" in data
    assert "dateTo" in data
    assert "keywords" in data
    assert "rawQuery" in data
    assert data["rawQuery"] == "Briefe vor 1920"
    assert data["dateTo"] == "1920-12-31"

def test_parse_unknown_lang_returns_422(client):
    response = client.post("/parse", json={"query": "test", "lang": "fr"})
    assert response.status_code == 422

def test_parse_missing_query_returns_422(client):
    response = client.post("/parse", json={"lang": "de"})
    assert response.status_code == 422

def test_parse_all_languages(client):
    cases = [
        ("de", "Briefe vor 1920"),
        ("en", "letters before 1920"),
        ("es", "cartas antes de 1920"),
    ]
    for lang, query in cases:
        response = client.post("/parse", json={"query": query, "lang": lang})
        assert response.status_code == 200, f"Failed for lang={lang}"
        assert response.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"
- Step 2: Run to confirm failure
pytest test_main.py::test_health -v
Expected: ModuleNotFoundError: No module named 'main'
- Step 3: Create nlp-service/main.py
import logging
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from extractor import extract, load_all_models
from models import ParseRequest, ParseResponse

logger = logging.getLogger(__name__)

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Loading spaCy models...")
    load_all_models()
    logger.info("All models ready.")
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/parse", response_model=ParseResponse)
def parse(request: ParseRequest) -> ParseResponse:
    try:
        return extract(request.query, request.lang)
    except Exception as exc:
        logger.exception("Extraction pipeline failed")
        raise HTTPException(status_code=500, detail=str(exc)) from exc
- Step 4: Run tests to confirm they pass
pytest test_main.py -v
Expected: 5 passed
- Step 5: Run the full test suite
pytest -v
Expected: all tests pass
- Step 6: Smoke-test the running service
uvicorn main:app --reload --port 8001 &
sleep 2
curl -s http://localhost:8001/health
# Expected: {"status":"ok"}
curl -s -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' | python3 -m json.tool
# Expected (spaCy may or may not catch "Opa Hermann"/"Marie" as PER):
# {
# "personNames": [...],
# "personRole": "any",
# "dateFrom": null,
# "dateTo": "1920-12-31",
# "keywords": ["brief"],
# "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
# }
kill %1
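The same smoke test can be driven from Python, which is convenient when comparing many queries during extraction-quality evaluation. This is a minimal sketch using only the standard library; `build_parse_request` and `parse_query` are illustrative helper names, and the base URL assumes the service is running locally on port 8001 as in the command above.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8001"  # assumed local dev address


def build_parse_request(query, lang, base_url=BASE_URL):
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query, lang, base_url=BASE_URL, timeout=5.0):
    """Send the request and return the decoded JSON response as a dict."""
    with urlopen(build_parse_request(query, lang, base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the service to be running; prints the extraction as JSON.
    result = parse_query("Briefe von Opa Hermann an Marie vor 1920", "de")
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Separating request construction from sending keeps the payload shape checkable without a running server.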
- Step 7: Commit
git add nlp-service/main.py nlp-service/test_main.py
git commit -m "feat(nlp-service): FastAPI app with /parse and /health endpoints"
Task 9: Dockerfile
Files:
- Create: nlp-service/Dockerfile
- Step 1: Create nlp-service/Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Bake models into the image — no volume needed, ~350 MB total
RUN python -m spacy download de_core_news_sm \
&& python -m spacy download en_core_web_sm \
&& python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
&& chown -R nlp:nlp /app
USER nlp
EXPOSE 8001
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD curl -f http://localhost:8001/health || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
- Step 2: Build the image
cd nlp-service
docker build -t nlp-service:prototype .
Expected: build completes, image ~350 MB
- Step 3: Run and smoke-test the container
docker run --rm -d -p 8001:8001 --name nlp-test nlp-service:prototype
sleep 5
curl -s http://localhost:8001/health
# Expected: {"status":"ok"}
curl -s -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \
-d '{"query": "Briefe aus dem Krieg", "lang": "de"}' | python3 -m json.tool
docker stop nlp-test
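A fixed `sleep 5` may be too short on a slow machine, since the three models load before the app starts serving. One alternative is to poll `/health` until it reports ready. The sketch below uses only the standard library; `wait_until_healthy` and `fetch_status` are illustrative names, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a running container.

```python
import json
import time
from urllib.request import urlopen

HEALTH_URL = "http://localhost:8001/health"  # assumed published port


def fetch_status(url, timeout=2.0):
    """Return the "status" field of the health response."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("status")


def wait_until_healthy(url=HEALTH_URL, attempts=30, delay=1.0,
                       fetch=fetch_status, sleep=time.sleep):
    """Poll the health endpoint; return True once it reports "ok"."""
    for _ in range(attempts):
        try:
            if fetch(url) == "ok":
                return True
        except (OSError, ValueError):
            # OSError covers refused connections and URLError;
            # ValueError covers a body that is not valid JSON.
            pass
        sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "service did not become healthy")
```

Called between `docker run` and the first `curl`, this replaces the guess about startup time with an explicit readiness check.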
- Step 4: Commit
git add nlp-service/Dockerfile
git commit -m "feat(nlp-service): Dockerfile — python:3.11-slim, models baked in"